Keith L. Jackson, Friday September 10, 1999
Slightly modified and published on SourceForge by Hans Fr. Nordhaug, January 2001
Is your web site HouseSpider-ready? HouseSpider requires your site to use lower-case tags. Earlier versions were case-insensitive; this version is not, because XHTML will also be case-sensitive, so get used to it. Specifically, HouseIndexer requires three tags to be in lower case, and character entities are case-sensitive:
HouseSpider 3.1 offers an international interface.
Finally, your site needs to have key words. These are added to your page by the meta tag shown above. Key words must be separated by commas and may be multi-word phrases (e.g., "Ichthyology, news letters, fish"). If you do not want key words associated with a certain page, the page still requires an empty meta tag (like the one shown above).
If your site meets these requirements, you are ready to move on to the next stage...
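As a rough illustration of the key word format, the following Python sketch splits a comma-separated key word string into individual phrases. It is not HouseIndexer's own parsing code: the function name `split_keywords` is hypothetical, and trimming surrounding whitespace and skipping empty entries are assumptions about reasonable behaviour.

```python
def split_keywords(content):
    """Split a comma-separated key word string into trimmed phrases.

    Empty entries (for example, from an empty meta tag) are skipped.
    """
    return [phrase.strip() for phrase in content.split(",") if phrase.strip()]


# Multi-word phrases survive because only commas act as separators.
print(split_keywords("Ichthyology, news letters, fish"))
# ['Ichthyology', 'news letters', 'fish']
```

An empty string yields an empty list, which corresponds to a page that carries the meta tag but no key words.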
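You need to invoke the HouseIndexer application from a command line, with something like "jre -cp HouseIndexer.jar HouseIndexer" (the latest Java Runtime Environment is required). On Java 2 and later runtimes the launcher is named "java", so the equivalent command is "java -cp HouseIndexer.jar HouseIndexer". If all goes well, you will see a window that looks something like this: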
In the "Start URL" text field, enter the URL of the main page of your web site. This URL may point to a local or remote copy of your site, but you will probably want to index a local copy before you upload it to the server (that is why "file:///" is already entered for you). The "Max..Per Key" pop-up menu sets how many pages are allowed per key word. For now, leave it set at 20. Click "Index" to index your site. Any error and warning messages will be shown in the text box.
If all goes well, two files will be generated, "HouseSpider.index" and "HouseSpider.txt". If you indexed a local copy of your site, these two files will be automatically placed in the appropriate directory, the same directory as the "Start URL". If you indexed a remote site, the two files will be placed in the same directory as the HouseIndexer application and you will have to move them to the same directory as the "Start URL".
The HouseSpider applet is in a jar that must be copied to the same host as your web site. Make sure that you transfer the file in binary mode. A sample HTML tag is as follows:
<applet code="HouseSpider.class" archive="../java/HouseSpider.jar" width="90%" height="200">
  <param name="URLbase" value="http://www.biology.ualberta.ca/jackson.hp/IWR/">
  <param name="lang" value="de">
  <param name="bgcolour" value="EEA1F0">
</applet>
The code="HouseSpider.class" attribute is required to identify the applet to run. The applet must be named "HouseSpider.class" (case matters); it will not work if its name is altered. The archive="../java/HouseSpider.jar" attribute is required and may include a path if you are calling the applet from a different directory (or substitute "/java" with whatever directory you choose to put the HouseSpider.class file in). The width and height should be set to a minimum of 450 x 200; I prefer width="90%" and height="200" or more. The applet tag may be followed by three optional parameters:
Check out the applet tag in this page.
Now test your site and see how it works! The index file written by HouseIndexer uses relative URLs, so you can test your HouseSpider page online or offline.
In HouseIndexer, you can set the maximum number of pages per key word. Initially this value is set at 20. This seems to me to be a good number. Think about it: how many book indexes have more than 20 pages per index entry? If, for some reason, you should want to override this limit, you may set it as high as 100. Any key words that occur on more pages than the limit you set will not be indexed. This is useful in avoiding key words that occur on all your pages (i.e., key words for other web spiders like AltaVista).
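The effect of the limit can be sketched in Python as follows. This is an illustration only, not HouseIndexer's actual code: the function name `apply_page_limit` and the dictionary layout are assumptions made for the example.

```python
def apply_page_limit(pages_by_keyword, limit=20):
    """Separate key words within the per-key page limit from those over it.

    pages_by_keyword maps each key word to the list of pages it appears on
    (a hypothetical layout, chosen for illustration).
    """
    indexed = {}
    skipped = []
    for keyword, pages in pages_by_keyword.items():
        if len(pages) > limit:
            skipped.append(keyword)  # on too many pages: left out of the index
        else:
            indexed[keyword] = pages
    return indexed, skipped


# A key word found on 25 pages is dropped; a more specific one is kept.
site = {
    "fish": ["page%d.html" % i for i in range(25)],
    "Ichthyology": ["index.html", "news.html"],
}
indexed, skipped = apply_page_limit(site, limit=20)
print(sorted(indexed))  # ['Ichthyology']
print(skipped)          # ['fish']
```

A key word that appears on exactly as many pages as the limit is still kept in this sketch; only key words over the limit are skipped.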
If you have HouseIndexer working your site over like a finely tuned machine, consider enabling the auto feature. As you may have guessed, it makes HouseIndexer begin indexing immediately after launch and quit when indexing is finished. If there are any errors during indexing, there will be a pause of a few seconds before it quits so you can read the error message(s).
Speaking of error messages, here are the ones you may encounter and their solutions:
After indexing, a key word statistics report is generated that lists all key words that were indexed, how many pages they occurred in, and key words that were not indexed because they occurred in more than the specified number of pages.
HouseIndexer is capable of indexing 10000 web pages and 10000 key words. Memory is not the issue, but the size of the index file is. My Ichthyology site contained over 400 pages at the time of this writing (1999). Its index file is only 30k. I figure that an index file of 100k will still be acceptable, but anything larger will probably be too slow in loading. If you are pushing the 100k index file, and still find performance acceptable, I would like to hear about it!
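For a back-of-the-envelope estimate, the figures above (roughly 400 pages producing a 30k index) can be scaled to the 100k ceiling. The sketch below assumes the index grows linearly with page count, which is a simplification: the real size depends on how many key words each page carries. The function name `estimate_max_pages` is hypothetical.

```python
def estimate_max_pages(known_pages, known_index_kb, target_index_kb):
    """Linearly scale a known pages-to-index-size ratio to a target size."""
    # Integer arithmetic keeps the result free of floating-point rounding.
    return target_index_kb * known_pages // known_index_kb


# 400 pages -> 30k index, so a 100k index corresponds to about 1333 pages.
print(estimate_max_pages(400, 30, 100))  # 1333
```

Under that assumption, a site would reach the 100k mark at somewhere over a thousand pages, well below the 10000-page capacity.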
HouseSpider and HouseIndexer are free.
HouseSpider and HouseIndexer are open source and are distributed under the GNU General Public License. If you decide to modify HouseSpider and/or HouseIndexer, and your modifications are improvements (as opposed to simple GUI modifications), please submit a patch on the SourceForge Project page.
Would you like to see the HouseSpider Applet interface in another language? Translate the following and submit it on the SourceForge Project page:
... and remember, special characters are usable! (German, English, Norwegian, Spanish, and Dutch are already supported.)
I thank Rutger Hemrika, Nicole Josan, and Horacio Salazar for Dutch, German, and Spanish translations.