This guide describes usage of Hyper Estraier's web crawler. If you haven't read the user's guide and the P2P guide yet, now is a good moment to do so.
estcmd can index files on the local file system only. Files on remote hosts can be indexed by mounting them via NFS or SMB, but arbitrary web sites on the Internet cannot be mounted that way. Web crawlers such as wget can prefetch such files to local disk, but doing so involves high overhead and wastes much disk space.
The command estwaver crawls arbitrary web sites and indexes their documents directly. estwaver supports not only depth-first and width-first order but also similarity-oriented order: it preferentially crawls documents similar to specified seed documents.
The first step is to create the crawler root directory, which contains a configuration file and some databases. The following command creates casket, the crawler root directory:
estwaver init casket
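The generated configuration can be reviewed and edited before the first crawl. Assuming the default layout, in which the configuration file is a plain text file named _conf inside the root directory, it can be displayed like this:

cat casket/_conf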
By default, the configuration starts crawling at the project page of Hyper Estraier. Let's try it as it is:
estwaver crawl casket
Documents are then fetched one after another and indexed. To stop the operation, press Ctrl-C on the terminal.
When the operation finishes, a directory _index exists in the crawler root directory. It is an index that can be handled with estcmd and so on. Let's search the index with the following command:
estcmd search -vs casket/_index "hyper estraier"
If you want to resume the crawling operation, run estwaver crawl again:
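estwaver crawl casket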
This section describes the specification of estwaver, whose purpose is to index documents on the Web.
estwaver is an aggregation of sub commands. The name of a sub command is specified by the first argument; the remaining arguments are parsed according to that sub command. The argument rootdir specifies the crawler root directory, which contains the configuration file and so on.
All sub commands return 0 if the operation succeeds, or 1 otherwise. A running crawler closes the database and finishes when it catches signal 1 (SIGHUP), 2 (SIGINT), 3 (SIGQUIT), or 15 (SIGTERM).
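For example, a crawler started in the background can be stopped cleanly by sending one of those signals to the process; pgrep is used here only as one convenient way to find its process ID:

kill -TERM `pgrep estwaver`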
When crawling finishes, the directory _index exists in the crawler root directory. It is an index usable with estcmd and so on.
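For example, the finished index can be examined and maintained with ordinary estcmd sub commands, such as inform (which reports meta information like the number of documents and words) and optimize:

estcmd inform casket/_index
estcmd optimize casket/_index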
The crawler root directory contains the configuration file, the log file (_log by default), some internal databases, and, once crawling has run, the index directory _index.
The configuration file is composed of lines; each line contains the name of a variable and its value, separated by ":". By default, the configuration is as follows.
seed: 1.5|http://hyperestraier.sourceforge.net/uguide-en.html
seed: 1.0|http://hyperestraier.sourceforge.net/pguide-en.html
seed: 1.0|http://hyperestraier.sourceforge.net/nguide-en.html
seed: 0.0|http://qdbm.sourceforge.net/
proxyhost:
proxyport:
interval: 500
timeout: 30
strategy: 0
inherit: 0.4
seeddepth: 0
maxdepth: 20
masscheck: 500
queuesize: 50000
replace: ^http://127.0.0.1/{{!}}http://localhost/
allowrx: ^http://
denyrx: \.(css|js|csv|tsv|log|md5|crc|conf|ini|inf|lnk|sys|tmp|bak)$
denyrx: \.(zip|tar|tgz|gz|bz2|tbz2|z|lha|lzh)(\?.*)?$
denyrx: ://(localhost|[a-z]*\.localdomain|127\.0\.0\.1)/
noidxrx: /\?[a-z]=[a-z](;|$)
urlrule: \.est${{!}}text/x-estraier-draft
urlrule: \.(eml|mime|mht|mhtml)${{!}}message/rfc822
typerule: ^text/x-estraier-draft${{!}}[DRAFT]
typerule: ^text/plain${{!}}[TEXT]
typerule: ^(text/html|application/xhtml+xml)${{!}}[HTML]
typerule: ^message/rfc822${{!}}[MIME]
language: 0
textlimit: 128
seedkeynum: 256
savekeynum: 32
threadnum: 10
docnum: 10000
period: 10000s
revisit: 7d
cachesize: 256
#nodeserv: 1|http://admin:admin@localhost:1978/node/node1
#nodeserv: 2|http://admin:admin@localhost:1978/node/node2
#nodeserv: 3|http://admin:admin@localhost:1978/node/node3
logfile: _log
loglevel: 2
draftdir:
entitydir:
postproc:
Most variable names indicate their roles. seed specifies a starting URL preceded by a number and "|"; proxyhost and proxyport select a proxy server; interval and timeout set the wait between fetches and the timeout of each fetch; strategy selects the crawling order described above; and maxdepth limits the depth of crawling. replace rewrites URLs; allowrx and denyrx allow or deny URLs matching a regular expression; and noidxrx keeps matching URLs from being indexed. urlrule maps URL patterns to media types, and typerule maps media types to built-in handlers or filter commands. threadnum sets the number of crawler threads; docnum and period limit how many documents are fetched and how long a session runs; and revisit sets how soon a document may be fetched again. nodeserv registers crawled documents to node servers; logfile and loglevel control logging; draftdir and entitydir name directories in which to save document drafts and raw entities; and postproc specifies a post-processing command. seed, replace, allowrx, denyrx, noidxrx, urlrule, typerule, and nodeserv can be specified more than once. allowrx, denyrx, and noidxrx are evaluated in the order of description, and alphabetical characters in them are case-insensitive.
Arbitrary filter commands can be specified with typerule. The interface of a filter command is the same as with the -fx option of estcmd gather. For example, the following specifies a filter to process PDF documents.
typerule: ^application/pdf${{!}}H@/usr/local/share/hyperestraier/filter/estfxpdftohtml
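The "H@" prefix indicates that the filter's output is treated as HTML, following the same convention as the -fx option of estcmd gather; "T@" would indicate plain text. As a further, purely illustrative example, a hypothetical filter script msword-to-html could be hooked up the same way for MS Word documents:

typerule: ^application/msword${{!}}H@/usr/local/share/hyperestraier/filter/msword-to-html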