A brief introduction to htdig

written by Antonio Bonifati <antonio.bonifati@libero.it>
This text describes how to install htdig from source and how to use its basic features. The details refer to FreeBSD, but everything should be very similar on other kinds of Unix.

Introduction

htdig, or more precisely ht://Dig, is a program that indexes web sites and searches them. It suits a small domain or an intranet and is distributed under the GPL. htdig has three main executables:
htdig
creates the database needed for searching (digging)
htmerge
creates the search indexes and, during incremental indexing, merges the documents that have changed into the search database (merging)
htsearch
CGI program that performs the search (searching)
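
The order in which the first two programs run can be sketched as a small script. This is only an illustration, not the rundig script that htdig installs: the `run` helper, the `DRY_RUN` variable and the `index_site` function are hypothetical names introduced here, and the default path assumes the installation prefix used later in this text. The only options used are `-i` (initial dig, ignoring old databases) and `-c` (configuration file).

```shell
#!/bin/sh
# Hypothetical helper: print each command, and execute it only
# when DRY_RUN is unset or empty.
run() {
    echo "+ $*"
    if [ -z "${DRY_RUN:-}" ]; then
        "$@"
    fi
}

# Assumed location of the htdig binaries (the prefix used in this text).
HTDIG_BIN=${HTDIG_BIN:-/usr/local/htdig/bin}

# Dig, then merge, for one configuration file.
index_site() {
    conf=$1
    # Digging: build the document database from scratch.
    run "$HTDIG_BIN/htdig" -i -c "$conf" &&
    # Merging: build the search indexes from what htdig collected.
    run "$HTDIG_BIN/htmerge" -c "$conf"
}
```

With `DRY_RUN=1` set, calling `index_site /usr/local/htdig/conf/htdig.conf` only prints the two commands, which is a safe way to see the sequence before running it.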

Download

Download the latest stable version from http://www.htdig.org, e.g. htdig-3.1.6.tar.gz.

Unpacking the sources

# tar zxvf htdig-3.1.6.tar.gz -C /usr/local/src
# cd /usr/local/src/htdig-3.1.6 

Reading the documentation

# less README
# lynx htdoc/install.html

Configuration

# ./configure --help | less
# ./configure --prefix=/usr/local/htdig \
--with-cgi-bin-dir=/usr/local/apache/cgi-bin \
--with-image-dir=/usr/local/apache/htdocs/htdig \
--with-search-dir=/usr/local/apache/htdocs/htdig 

Compilation

# make 

Installation

# make install
Installing ht://Dig
  
Creating directories (if needed)...
 mkdir /usr/local/htdig
 mkdir /usr/local/htdig/bin
 mkdir /usr/local/htdig/conf
 mkdir /usr/local/htdig/common
 mkdir /usr/local/htdig/db
 mkdir /usr/local/apache/htdocs/htdig
  
Installing individual programs...
 transform=s,x,x,
 /usr/bin/install -c htfuzzy /usr/local/htdig/bin/`echo htfuzzy | sed ''`
 transform=s,x,x,
 /usr/bin/install -c htdig /usr/local/htdig/bin/`echo htdig | sed ''`
 /usr/bin/install -c htdig /usr/local/htdig/bin/`echo htdump | sed ''`
 /usr/bin/install -c htdig /usr/local/htdig/bin/`echo htload | sed ''`
 transform=s,x,x,
 /usr/bin/install -c htsearch /usr/local/apache/cgi-bin/`echo htsearch | sed ''`
 transform=s,x,x,
 /usr/bin/install -c htmerge /usr/local/htdig/bin/`echo htmerge | sed ''`
 transform=s,x,x,
 /usr/bin/install -c htnotify /usr/local/htdig/bin/`echo htnotify | sed ''`
  
Installing default configuration files...
 /usr/local/htdig/conf/htdig.conf
 /usr/local/apache/htdocs/htdig/search.html
 /usr/local/htdig/common/header.html
 /usr/local/htdig/common/footer.html
 /usr/local/htdig/common/wrapper.html
 /usr/local/htdig/common/nomatch.html
 /usr/local/htdig/common/syntax.html
 /usr/local/htdig/common/long.html
 /usr/local/htdig/common/short.html
 /usr/local/htdig/common/bad_words
 /usr/local/htdig/common/english.0
 /usr/local/htdig/common/english.aff
 /usr/local/htdig/common/synonyms

Installing images...
 /usr/local/apache/htdocs/htdig/button1.gif
 /usr/local/apache/htdocs/htdig/button2.gif
 /usr/local/apache/htdocs/htdig/button3.gif
 /usr/local/apache/htdocs/htdig/button4.gif
 /usr/local/apache/htdocs/htdig/button5.gif
 /usr/local/apache/htdocs/htdig/button6.gif
 /usr/local/apache/htdocs/htdig/button7.gif
 /usr/local/apache/htdocs/htdig/button8.gif
 /usr/local/apache/htdocs/htdig/button9.gif
 /usr/local/apache/htdocs/htdig/buttonl.gif
 /usr/local/apache/htdocs/htdig/buttonr.gif
 /usr/local/apache/htdocs/htdig/button10.gif
 /usr/local/apache/htdocs/htdig/htdig.gif
 /usr/local/apache/htdocs/htdig/star.gif
 /usr/local/apache/htdocs/htdig/star_blank.gif
 /usr/local/apache/htdocs/htdig/button1.png
 /usr/local/apache/htdocs/htdig/button2.png
 /usr/local/apache/htdocs/htdig/button3.png
 /usr/local/apache/htdocs/htdig/button4.png
 /usr/local/apache/htdocs/htdig/button5.png
 /usr/local/apache/htdocs/htdig/button6.png
 /usr/local/apache/htdocs/htdig/button7.png
 /usr/local/apache/htdocs/htdig/button8.png
 /usr/local/apache/htdocs/htdig/button9.png
 /usr/local/apache/htdocs/htdig/buttonl.png
 /usr/local/apache/htdocs/htdig/buttonr.png
 /usr/local/apache/htdocs/htdig/button10.png
 /usr/local/apache/htdocs/htdig/htdig.png
 /usr/local/apache/htdocs/htdig/star.png
 /usr/local/apache/htdocs/htdig/star_blank.png
 Creating rundig script...
 Installation done.
  
 Before you can start searching, you will need to create a
 search database.  A sample script to do this has been
 installed as  /usr/local/htdig/bin/rundig 

Checking the installed CGI

# ls -l /usr/local/apache/cgi-bin/htsearch
-rwxr-xr-x  1 root  wheel  1066412 May 21 18:38 /usr/local/apache/cgi-bin/htsearch 

Indexing test

Let us try indexing htdig's own documentation, using the sample script /usr/local/htdig/bin/rundig. We install the htdig manual under the document root so that it is visible via the web:
# cp -R /usr/local/src/htdig-3.1.6/htdoc /usr/local/apache/htdocs/htdig/
# links http://localhost/htdig/htdoc
We start from the default configuration file /usr/local/htdig/conf/htdig.conf. Rather than modifying it, we make a copy, calling it htdoc.conf for example:
# cd /usr/local/htdig/conf
# cp htdig.conf htdoc.conf
and then modify the copy according to the following patch:
htdig.diff
----------
18c18
< database_dir:        /usr/local/htdig/db
---
> database_dir:        /usr/local/htdig/db/htdoc
28c28,30
< start_url:           http://www.htdig.org/
---
> local_urls:          http://localhost/htdig/htdoc/=/usr/local/apache/htdocs/htdig/htdoc/
> local_urls_only:     true
> start_url:           http://localhost/htdig/htdoc/

# patch htdoc.conf htdig.diff
Hmm...  Looks like a normal diff to me...
Patching file htdoc.conf using Plan A...
Hunk #1 succeeded at 18.
Hunk #2 succeeded at 28.
done
This way indexing goes through the local filesystem only, and no web server needs to be running in order to do it.
# mkdir /usr/local/htdig/db/htdoc/
# /usr/local/htdig/bin/rundig -c /usr/local/htdig/conf/htdoc.conf
If you want more detail, use the command:
# /usr/local/htdig/bin/rundig -v -c /usr/local/htdig/conf/htdoc.conf | less
The -vv option raises the verbosity level further, and -vvv further still.
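
The prefix mapping expressed by the local_urls attribute in htdoc.conf can be illustrated with a few lines of shell. This is only a sketch of the idea (a URL prefix replaced by a directory prefix), not htdig's own code; the `url_to_local` function and the `MAP` variable are hypothetical names, and the mapping value is the one from the patch above.

```shell
#!/bin/sh
# The same "URL prefix=directory prefix" pair used in htdoc.conf.
MAP='http://localhost/htdig/htdoc/=/usr/local/apache/htdocs/htdig/htdoc/'

# Print the local file that a URL would be read from, or fail
# if the URL does not start with the mapped prefix.
url_to_local() {
    url=$1
    prefix=${MAP%%=*}   # everything before the first "="
    dir=${MAP#*=}       # everything after the first "="
    case $url in
        "$prefix"*) printf '%s\n' "$dir${url#"$prefix"}" ;;
        *) return 1 ;;
    esac
}
```

For instance, `url_to_local http://localhost/htdig/htdoc/main.html` prints the path of main.html under /usr/local/apache/htdocs/htdig/htdoc/, while a URL outside the prefix makes the function return a non-zero status.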

We then apply a patch to the file http://localhost/htdig/htdoc/contents.html so that it uses the local search CGI instead of the one at http://www.htdig.org/:
contents.diff
-----------------
50c50
<         <form action="http://www.htdig.org/cgi-bin/htsearch" target=body>
---
>         <form action="http://localhost/cgi-bin/htsearch" target=body>
55c55
<         <input type=hidden name=config value=htdig>
---
>         <input type=hidden name=config value=htdoc>


# cd /usr/local/apache/htdocs/htdig/htdoc/
# patch contents.html contents.diff
Hmm...  Looks like a normal diff to me...
Patching file contents.html using Plan A...
Hunk #1 succeeded at 50.
Hunk #2 succeeded at 55.
done
We do the same for the other search form, in the page: http://ninux.rett.polimi.it/htdig/htdoc/main.html
main.diff
---------
73c73
<       <form action="http://cgi.htdig.org/cgi-bin/htsearch" method="post">
---
>       <form action="http://localhost/cgi-bin/htsearch" method="post">
76c76
<              <input type="hidden" name="config" value="htdig">
---
>              <input type="hidden" name="config" value="htdoc">


# cd /usr/local/apache/htdocs/htdig/htdoc/
# patch main.html main.diff
Hmm...  Looks like a normal diff to me...
Patching file main.html using Plan A...
Hunk #1 succeeded at 73.
Hunk #2 succeeded at 76.
done
Finally, try a few searches by opening one of these addresses in a local web browser (or substitute your machine's host name if you use a remote browser):

http://localhost/htdig/htdoc/
http://localhost/htdig/search.html
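
A search can also be expressed as a plain URL, which is handy for a quick check without filling in the form. The sketch below only builds such a URL; it assumes that the CGI accepts in a query string the same `config` and `words` fields that the search forms send, and that the search words are plain alphanumeric strings needing no further URL encoding. The `search_url` function and the `HTSEARCH_URL` variable are hypothetical names.

```shell
#!/bin/sh
# Assumed address of the CGI, as installed in this text.
HTSEARCH_URL=${HTSEARCH_URL:-http://localhost/cgi-bin/htsearch}

# Usage: search_url CONFIG WORD [WORD...]
search_url() {
    config=$1
    shift
    # Join the remaining arguments with "+", the form encoding of a space.
    words=$(printf '%s+' "$@")
    words=${words%+}
    printf '%s?config=%s&words=%s\n' "$HTSEARCH_URL" "$config" "$words"
}
```

The printed URL can be pasted into a browser or handed to a command-line HTTP client.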

The database files:
# ls -l /usr/local/htdig/db/htdoc
total 1254
-rw-r--r--  1 root  wheel  207872 Jun  5 16:48 db.docdb
-rw-r--r--  1 root  wheel    6144 Jun  5 16:48 db.docs.index
-rw-r--r--  1 root  wheel  421690 Jun  5 16:48 db.wordlist
-rw-r--r--  1 root  wheel  588800 Jun  5 16:48 db.words.db
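
A quick way to confirm that an indexing run produced its output is to test for the four files listed above. This is a minimal sketch: the file names are simply those from the listing for this version and may differ in other releases, and `check_db` is a hypothetical helper name.

```shell
#!/bin/sh
# Report, for a database directory, whether each expected file
# exists and is non-empty. Returns non-zero if any is missing.
check_db() {
    dir=$1
    status=0
    for f in db.docdb db.docs.index db.wordlist db.words.db; do
        if [ -s "$dir/$f" ]; then
            echo "ok       $f"
        else
            echo "missing  $f"
            status=1
        fi
    done
    return "$status"
}
```

For the database built in this section the call would be `check_db /usr/local/htdig/db/htdoc`.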

Working with multiple databases

Now we will index other local documentation, creating a separate database and a separate configuration file for each set of documents. For example, we can index the PHP documentation, which on my system is visible via the web and lives in /usr/local/apache/htdocs/doc/php_manual_en. Copy the configuration file htdoc.conf to php.conf:
# cd /usr/local/htdig/conf
# cp htdoc.conf php.conf
then modify it according to the following patch:
htdoc.diff
----------
18c18
< database_dir:        /usr/local/htdig/db/htdoc
---
> database_dir:        /usr/local/htdig/db/php
28c28
< local_urls:          http://localhost/htdig/htdoc/=/usr/local/apache/htdocs/htdig/htdoc/
---
> local_urls:          http://localhost/doc/php_manual_en/=/usr/local/apache/htdocs/doc/php_manual_en/
30c30
< start_url:           http://localhost/htdig/htdoc/
---
> start_url:           http://localhost/doc/php_manual_en/


# patch php.conf htdoc.diff
Hmm...  Looks like a normal diff to me...
Patching file php.conf using Plan A...
Hunk #1 succeeded at 18.
Hunk #2 succeeded at 28.
Hunk #3 succeeded at 30.
done
Next, create the directory that will hold the databases and start the indexing:
# mkdir /usr/local/htdig/db/php
# /usr/local/htdig/bin/rundig -v -c /usr/local/htdig/conf/php.conf | less
Prepare an HTML search form with a single-selection control for choosing the documentation to search, e.g. by modifying the default search.html file supplied with htdig:
search.diff
-----------
33a34,37
> Manual to search: <select name="config">
> <option>htdoc</option>
> <option>php</option>
> </select>
35d38
< <input type="hidden" name="config" value="htdig">


# cd /usr/local/apache/htdocs/htdig
# cp search.html search.html.orig
# patch search.html search.diff
Hmm...  Looks like a normal diff to me...
Patching file search.html using Plan A...
Hunk #1 succeeded at 34.
Hunk #2 succeeded at 39.
done
Then go to the URL: http://localhost/htdig/search.html
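
As the number of indexed manuals grows, the option lines of the select control can be generated from the configuration directory instead of being typed by hand. The following is a sketch under the assumption that every .conf file in the directory corresponds to a searchable database; `list_options` is a hypothetical helper name.

```shell
#!/bin/sh
# Print one <option> element per configuration file found in a directory,
# using the file name without its .conf suffix as the option text.
list_options() {
    conf_dir=$1
    for f in "$conf_dir"/*.conf; do
        # If no .conf file exists the pattern is left unexpanded; skip it.
        [ -e "$f" ] || continue
        printf '<option>%s</option>\n' "$(basename "$f" .conf)"
    done
}
```

Running `list_options /usr/local/htdig/conf` would also list the default htdig.conf, so an entry with no database behind it has to be removed from the output by hand.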

The procedure can be repeated for every manual you want to index.
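
When several manuals have to be added, the copy-and-patch step can be replaced by a small function that derives a new configuration file with sed. This is only a sketch of an alternative to the patch files shown above, not something htdig provides: `make_conf` is a hypothetical name, and it assumes the base file already contains database_dir, local_urls and start_url lines, as htdoc.conf does.

```shell
#!/bin/sh
# Derive a new configuration file from an existing one.
# Usage: make_conf BASE_CONF NEW_CONF DB_DIR URL LOCAL_DIR
# The arguments must not contain "|" or "&", which are special to sed here.
make_conf() {
    base=$1
    new=$2
    db=$3
    url=$4
    dir=$5
    # Rewrite only the three attributes that differ between manuals;
    # every other line (local_urls_only included) is copied unchanged.
    sed -e "s|^database_dir:.*|database_dir:        $db|" \
        -e "s|^local_urls:.*|local_urls:          $url=$dir|" \
        -e "s|^start_url:.*|start_url:           $url|" \
        "$base" > "$new"
}
```

For the PHP manual of this section the call would be `make_conf htdoc.conf php.conf /usr/local/htdig/db/php http://localhost/doc/php_manual_en/ /usr/local/apache/htdocs/doc/php_manual_en/`, followed by the same mkdir and rundig commands as before.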