HouseSpider installation for version 4.5

Developed by Keith L. Jackson - up to and including version 4.0 - and later by Hans Fr. Nordhaug. This version was released 21st of May 2004.

Required Files

The required files (the HouseSpider applet jar and at least one of the button jars) are found on the SourceForge project page, but most likely you have already downloaded the jars.

Installing the HouseSpider Applet

The HouseSpider applet is in a jar that must be copied to the same host as your web site. In addition, you need to copy one (or all) of the button jars to the same location. Make sure that you transfer the files in binary mode. A sample HTML tag is as follows:

<applet code="HouseSpider.class" archive="HouseSpider.jar,buttons.bevel.jar,i18n.jar" width="90%" height="200">
 <param name="URLStart" value="http://housespider.sourceforge.net/index.html">
 <param name="URLExclude" value="http://housespider.sourceforge.net/doc/ver40/">
 <param name="URLHelp" value="http://housespider.sourceforge.net/doc/ver45/">
 <param name="bgcolour" value="FFFFFF">
 <param name="fgcolour" value="666666">
 <param name="bgtextcolour" value="FFFFFF">
 <param name="textcolour" value="666666">
</applet>

code="HouseSpider.class" is required to identify the applet to run. The applet must be named "HouseSpider.class" (case matters); it will not work if its name is altered. archive="HouseSpider.jar" is required and may include a path if you are calling the applet from a different directory. The width and height should be set to a minimum of 450 x 200; I prefer width="90%" and height="200" or more. The applet (main) tag may be followed by the parameter tags listed below.

Applet parameters (tags)

URLStart Optional. This sets the base URL for your site. If the "URLStart" parameter is not passed, the page with the applet tag is used for resolving relative URLs (this may be useful for testing online and offline copies of your search page).
URLExclude Optional. Comma-separated list of URLs (which should point to files in a subdirectory of URLStart) to exclude. Defaults to null. Read "Excluding URLs or files" for more information.
FileExclude Optional. Comma-separated list of filenames to exclude. Defaults to null. Read "Excluding URLs or files" for more information.
URLHelp Optional. Location of the documentation. Defaults to the documentation on the SourceForge website.
bgcolour Optional. The background colour. Use a standard six-digit hexadecimal string (like in the <BODY BGCOLOR="FFFFFF"> tag) or HouseSpider will default to a white background.
fgcolour Optional. The foreground colour. Use a standard six-digit hexadecimal string (like in the <BODY BGCOLOR="FFFFFF"> tag) or HouseSpider will default to a black foreground.
bgtextcolour Optional. The background colour of the text. Use a standard six-digit hexadecimal string or HouseSpider will default to a white background.
textcolour Optional. The colour of the text. Use a standard six-digit hexadecimal string or HouseSpider will default to a black foreground.
urlcolour Optional. The colour of the URL label. Use a standard six-digit hexadecimal string or HouseSpider will default to black.
urlhcolour Optional. The colour of the URL label when hovering. Use a standard six-digit hexadecimal string or HouseSpider will default to blue.
urlacolour Optional. The colour of the URL label when active. Use a standard six-digit hexadecimal string or HouseSpider will default to red.
fontsize Optional. The font size of the text in the applet. Defaults to 12.
fontname Optional. The font name of the text in the applet. Defaults to "Dialog". Other standard choices are "Courier", "TimesRoman" or "Helvetica".
Lang Optional. Sets the language for the applet. Defaults to "en" (English).
IgnoreWords Sets the words to be ignored when indexing/searching.
UseMetaKey Search meta keywords too. Defaults to "yes".
UseMetaDesc Display the meta description in the result list. Defaults to "no".
ShowURL Display the (relative) URL in the result list. Defaults to "no".
IndexPath Relative path to the index file. Used only for reading, and only if the index file is not in the same directory as URLStart. (For saving, use SaveDir; see "Indexing Your Web Site".)
StatusLine Optional. Show the status line, with status information below the site list. Defaults to "yes".
StatusLabel Show the status label. Defaults to "yes".
Action Default action performed by the applet. Defaults to "index". The other choice is "noindex". (See "Advanced indexing commands".)
MaxSearch Optional. Maximum number of hits for a search. Defaults to 100.
Target Optional. The frame into which any page is opened. Defaults to "self".
TargetHelp Optional. The frame into which the help page is opened. Defaults to "self".
Targetsearch Optional. The frame into which the goto page is opened. Defaults to "self".
InitInput Optional. Initial input/search string. (May be used for batch indexing.) Defaults to "yes".
Debug Optional. Debugging level. (0 - no debugging, 4 - full debugging.) Defaults to 0. Debugging information is written to standard output or the Java console. Debugging is only useful when spidering/indexing.
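
As an illustration of how several of the optional parameters combine, here is a sketch of an applet tag. The parameter names are taken from the list above; every value (the URL, font, colours and limits) is an illustrative assumption, not a recommended setting.

```html
<!-- Sketch only: parameter names come from the list above, all values are examples. -->
<applet code="HouseSpider.class" archive="HouseSpider.jar,buttons.bevel.jar,i18n.jar" width="90%" height="200">
 <param name="URLStart" value="http://www.yourserver.com/somedir/start.html">
 <!-- Text appearance -->
 <param name="fontname" value="Helvetica">
 <param name="fontsize" value="14">
 <!-- URL label colours: normal and hovering -->
 <param name="urlcolour" value="000000">
 <param name="urlhcolour" value="0000FF">
 <!-- Result list: fewer hits, with URL and meta description shown -->
 <param name="MaxSearch" value="50">
 <param name="ShowURL" value="yes">
 <param name="UseMetaDesc" value="yes">
</applet>
```

Any parameter left out falls back to the default stated in its description.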

Now test your site and see how it works!

By default HouseSpider uses cache-searching, i.e., it searches the index file. If it can't find the index file (compressed or uncompressed), it will do a spider-search.

Excluding URLs or files

I will illustrate the usage of URLExclude and FileExclude with some examples. URLStart points to www.yourserver.com/somedir/start.html, which contains the following links:

www.yourserver.com/somedir/dir1/page1.shtml
www.yourserver.com/somedir/dir1/page2.html
www.yourserver.com/somedir/dir2/page2.html
www.yourserver.com/somedir/dir2/page3.shtml
  1. Exclude www.yourserver.com/somedir/dir1/page1.shtml and www.yourserver.com/somedir/dir2/page3.shtml:
    <param name="URLExclude" value="http://www.yourserver.com/somedir/dir1/page1.shtml,
                                    http://www.yourserver.com/somedir/dir2/page3.shtml">
    

  2. Exclude all files in www.yourserver.com/somedir/dir1/, i.e., www.yourserver.com/somedir/dir1/page1.shtml and www.yourserver.com/somedir/dir1/page2.html:
    <param name="URLExclude" value="http://www.yourserver.com/somedir/dir1/">
    

  3. Exclude all files named page2.html, i.e., www.yourserver.com/somedir/dir1/page2.html and www.yourserver.com/somedir/dir2/page2.html:
    <param name="FileExclude" value="page2.html">
    

  4. Exclude all shtml files, i.e., www.yourserver.com/somedir/dir1/page1.shtml and www.yourserver.com/somedir/dir2/page3.shtml:
    <param name="FileExclude" value=".shtml">
    

  5. Exclude all files/URLs linked to in www.yourserver.com/somedir/start.html:
    <param name="FileExclude" value="page2.html,.shtml">
    
    or
    <param name="URLExclude" value="http://www.yourserver.com/somedir/dir1/,
                                    http://www.yourserver.com/somedir/dir2/">
    

Indexing Your Web Site

You need to invoke the HouseSpider applet from a command line using appletviewer, for example "appletviewer indexsetup.html". (The appletviewer ships with the Java SDK (JDK).) In theory it should be possible to index from within the web browser, but because of the security restrictions on Java applets in web browsers it may not work. The file "indexsetup.html" should contain the following tags:

<html>
<body>
 <applet code="HouseSpider.class" codebase="/local/www/dir/jars/"
   archive="HouseSpider.jar,buttons.bevel.jar" width="400" height="200">
  <param name="URLStart" value="http://your.web.server/dir/startfile">
  <param name="URLExclude" value="http://your.web.server/dir/somedir/">
  <param name="SaveDir" value="/local/www/dir/">
 </applet>
</body>
</html>

where "SaveDir" specifies the directory in which to save the index file "HouseSpider.index", and "codebase" specifies where the jar files are stored. "URLStart" may point to a local or a remote copy of your site, but you will probably want to index a local copy before you upload it to the server. Note that "URLExclude" is optional.

Type "houseindex" in the text input box to index your site. You may need to set the security mode for the appletviewer to unrestricted. Try the following settings:

If all goes well, two files will be generated: "HouseSpider.index" and "HouseSpider.log". Put these files in the "URLStart" directory, or in the same location as the applet page if "URLStart" is undefined. You may edit the log file, but keep all the text on one line. The contents of the log file are displayed in the status field of the applet at startup.

After indexing, a statistics report is generated that lists how many pages were indexed and the pages with missing title tags.

HouseSpider is capable of indexing 10000 web pages and 10000 keywords. Memory is not the issue, but the size of the index file is. I figure that an index file of 100k will still be acceptable, but anything larger will probably be too slow to load.
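
The InitInput parameter is described as usable for batch indexing. A minimal sketch of an "indexsetup.html" that relies on it follows. It assumes that pre-filling the input with "houseindex" starts indexing just as typing the command would; that behaviour is not confirmed here, so test it on a small site first. The paths and URLs are the placeholders used in the example above.

```html
<html>
<body>
 <applet code="HouseSpider.class" codebase="/local/www/dir/jars/"
   archive="HouseSpider.jar,buttons.bevel.jar" width="400" height="200">
  <param name="URLStart" value="http://your.web.server/dir/startfile">
  <param name="SaveDir" value="/local/www/dir/">
  <!-- Assumption: this pre-fills the input box with the indexing command. -->
  <param name="InitInput" value="houseindex">
  <!-- Debug output (levels 0 to 4) goes to standard output or the Java console. -->
  <param name="Debug" value="1">
 </applet>
</body>
</html>
```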

Compressed index file

To generate a compressed index file, type "houseindex-zip" in the text input box (instead of "houseindex"). If all goes well, one file will be generated, "HouseSpider.index.zip". You may of course also zip an existing index file (with your favourite compression tool).

HouseSpider will automatically use a compressed index file if it exists.

Advanced indexing commands

The Action parameter was introduced in the list of applet parameters. By using

<param name="Action" value="noindex">

nothing on a page is indexed until the following tag is found: <!-- housespider index --> HouseSpider then indexes the rest of the page.
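
As a sketch of what this means for a page on your site, assume the applet is configured with Action set to "noindex". The page content below is invented for illustration; only the marker comment comes from the description above.

```html
<html>
<head><title>Example page</title></head>
<body>
 <!-- With Action set to "noindex", the text before the marker is skipped. -->
 <p>Home | Products | Contact</p>
 <!-- housespider index -->
 <p>This paragraph and the rest of the page are indexed.</p>
</body>
</html>
```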

If you don't set the Action parameter, or set it to "index" (the default), pages are indexed as follows:

This text will be indexed
<!-- housespider noindex -->
This text will not be indexed
<!-- housespider index -->
And this will be indexed.

This is very useful if you don't want to index a menu (for example) that is common to all pages.
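
For instance, a shared menu could be wrapped in the two marker comments as sketched below. The menu and page content are hypothetical. Whether links inside the excluded region are still followed by the spider is not stated here, so check the generated index if that matters for your site.

```html
<body>
 <!-- housespider noindex -->
 <ul>
  <li><a href="index.html">Home</a></li>
  <li><a href="products.html">Products</a></li>
  <li><a href="contact.html">Contact</a></li>
 </ul>
 <!-- housespider index -->
 <h1>Products</h1>
 <p>Only the text from this heading onwards ends up in the index.</p>
</body>
```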

Index file format

From version 4.5 the index file contains four lines per page on your site:

  page contents
  title
  meta keywords##meta description
  relative URL

where "##" is a separator (and must be included even if the meta description is empty).

You may of course build this file manually or with a script of your own. You may also edit the file after HouseSpider has generated it, which is useful if just a few characters are wrong or if you want to add a page that HouseSpider didn't find.
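
A hand-written entry for two pages might look like the sketch below. The page texts, titles, keywords and URLs are invented. How HouseSpider itself writes the page contents line (case, punctuation, ignored words) is not described here, so compare with a generated index file before relying on this layout.

```text
Welcome to our shop. We sell garden tools and seeds.
Garden Shop - Home
garden,tools,seeds##Garden tools and seeds for every season
index.html
Opening hours and how to find us.
Garden Shop - Contact
##
contact.html
```

The second page has neither meta keywords nor a meta description, so its third line contains only the "##" separator.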

Miscellaneous

License

HouseSpider is free - not as in beer, but as in freedom.

HouseSpider is open source and is distributed under the GNU General Public License. If you decide to modify HouseSpider, and your modifications are improvements (as opposed to simple GUI modifications), please submit a patch on the SourceForge project page. Source code for this and later versions is found on the SourceForge project page.

Limitations

Known Problems

Acknowledgments

I thank Tim Tyler for the ImageLoader class.
