HouseSpider
Developed by Keith L. Jackson - up to and including version 4.0 - and later by Hans Fr. Nordhaug. This version was released 22nd of May 2005.
The HouseSpider applet is in a jar that must be copied to the same host as your web site. In addition you need to copy one (or all) of the button jars to the same location. Make sure that you transfer the files in binary mode. A sample HTML tag is as follows:
<applet code="HouseSpider.class" archive="HouseSpider.jar,buttons.bevel.jar" width="90%" height="200">
  <param name="URLStart" value="http://housespider.sourceforge.net/index.html">
  <param name="URLExclude" value="http://housespider.sourceforge.net/doc/ver40/">
  <param name="URLHelp" value="http://housespider.sourceforge.net/doc/ver44/">
  <param name="bgcolour" value="FFFFFF">
  <param name="fgcolour" value="666666">
  <param name="bgtextcolour" value="FFFFFF">
  <param name="textcolour" value="666666">
</applet>
The code="HouseSpider.class" is required to identify the applet to run. The applet must be named "HouseSpider.class" (case matters); it will not work if its name is altered. The archive="HouseSpider.jar" is required and may include a path if you are calling the applet from a different directory. The width and height should be set to a minimum of 450 x 200; I prefer width="90%" and height="200" or more. The applet (main) tag may be followed by:
URLStart | Optional. This sets the base URL for your site. If the "URLStart" parameter is not passed, the page with the applet tag is used for resolving relative URLs (this may be useful for testing online and offline copies of your search page). |
URLExclude | Optional. Comma separated list of URLs (which should point to files in a subdirectory of URLStart) to exclude. Defaults to null. Read "Excluding URLs or Files" for more information. |
FileExclude | Optional. Comma separated list of filenames to exclude. Defaults to null. Read "Excluding URLs or Files" for more information. |
KeepIdxPages | Optional. Keep or remove default index pages from URLs. Read "Writing Good URLs" for more information. |
URLHelp | Optional. Location of the documentation. Defaults to the documentation on the SourceForge website. |
bgcolour | Optional. The background colour. Use a standard six-digit hexadecimal string (like in the <BODY BGCOLOR="FFFFFF"> tag) or HouseSpider will default to a white background. |
fgcolour | Optional. The foreground colour. Use a standard six-digit hexadecimal string (like in the <BODY BGCOLOR="FFFFFF"> tag) or HouseSpider will default to a black foreground. |
bgtextcolour | Optional. The background colour of the text. Use a standard six-digit hexadecimal string or HouseSpider will default to a white background. |
textcolour | Optional. The colour of the text. Use a standard six-digit hexadecimal string or HouseSpider will default to a black foreground. |
urlcolour | Optional. The colour of the label URL. Use a standard six-digit hexadecimal string or HouseSpider will default to black. |
urlhcolour | Optional. The colour of the label URL when hovering. Use a standard six-digit hexadecimal string or HouseSpider will default to blue. |
urlacolour | Optional. The colour of the label URL when active. Use a standard six-digit hexadecimal string or HouseSpider will default to red. |
fontsize | Optional. The font size of the text in the applet. Defaults to 12. |
fontname | Optional. The font name of the text in the applet. Defaults to "Dialog". Other standard choices are "Courier", "TimesRoman" or "Helvetica". |
Lang | Optional. This sets the language for the applet. Defaults to "en" (English). |
Charset | Optional. This sets the charset used for reading the web pages (and hence the index file). Defaults to "ISO-8859-1". |
IgnoreWords | This sets the words to be ignored when indexing/searching. |
UseMetaKey | Search meta keywords too. Defaults to "yes". |
UseMetaDesc | Display meta description in result list. Defaults to "no". |
ShowURL | Display the (relative) url in result list. Defaults to "no". |
IndexPath | Relative path to the index file. Used only for reading, and only if the index file is not in the same directory as URLStart. (For saving use SaveDir - see Indexing.) |
StatusLine | Show statusline. Defaults to "yes". |
StatusLabel | Show statuslabel. Defaults to "yes". |
Action | Default action performed by the applet. Defaults to "index" of course. The other choice is "noindex". |
MaxSearch | Optional. Maximum number of hits for search. Defaults to 100. |
Target | Optional. The frame into which any page is opened. Defaults to "self". |
TargetHelp | Optional. The frame into which the help page is opened. Defaults to "self". |
Targetsearch | Optional. The frame into which the goto page is opened. Defaults to "self". |
StatusLine | Optional. Show status information below site list - defaults to "yes". |
InitInput | Optional. Initial input/search string. (May be used for batch indexing.) Defaults to "yes". If the string starts with "wait:" the input will be put in the text input box, but the search is not started until the user clicks the magnifying glass or presses the return key. |
Debug | Optional. Debugging level. (0 - no debugging, 4 - full debugging.) Defaults to 0. Debugging information is written to standard output or the Java console. Debugging is only useful when spidering/indexing. |
Now test your site and see how it works!
HouseSpider will by default use cache-searching, i.e., search the index file. If it can't find the index file (compressed or uncompressed) it will do a spider-search.
I'll only focus on two aspects - default index pages and missing trailing slashes - because they affect the behaviour of HouseSpider.
http://www.google.com/index.html
is equivalent to (the much nicer and shorter URL)
http://www.google.com/
I prefer to always use the shorter and nicer version, but the real problem is that HouseSpider doesn't know that the two URLs are equivalent, and will treat them as two different pages if you happen to write your URLs inconsistently. However, as long as KeepIdxPages is set to "no" (which is the default), HouseSpider will strip all default index pages from the URLs it finds.
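The stripping of default index pages can be pictured with a minimal Java sketch. This is an illustration only, not HouseSpider's actual code: the class name `IndexPageStripper` and the list of index page names are assumptions, and query strings or fragments are ignored.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class IndexPageStripper {
    // Hypothetical list of default index page names (an assumption;
    // the names HouseSpider really recognises are not listed here).
    private static final List<String> INDEX_PAGES =
            Arrays.asList("index.html", "index.htm", "index.shtml", "index.php");

    /** Returns the URL with a trailing default index page removed, if present. */
    static String strip(String url) {
        int lastSlash = url.lastIndexOf('/');
        if (lastSlash < 0) {
            return url; // nothing that looks like a path
        }
        String file = url.substring(lastSlash + 1);
        if (INDEX_PAGES.contains(file.toLowerCase(Locale.ROOT))) {
            return url.substring(0, lastSlash + 1); // keep the trailing slash
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(strip("http://www.google.com/index.html")); // http://www.google.com/
        System.out.println(strip("http://www.google.com/about.html")); // unchanged
    }
}
```

With both spellings normalised to the short form, the two URLs compare equal and the page is only recorded once.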
http://housespider.sf.net/doc/ver47
instead of the correct
http://housespider.sf.net/doc/ver47/
What actually happens is that when a web server gets a request for a URL with a missing trailing slash, it redirects to the correct URL with the trailing slash. HouseSpider tries to catch URLs with missing trailing slashes and correct them. (If HouseSpider fails to catch such missing trailing slashes, you'll get wrong URLs on pages that use relative URLs - one directory will be missing in the URLs.)
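The "one directory will be missing" effect follows from how relative URLs are resolved, which the standard `java.net.URI` class can demonstrate. In the sketch below, `usage.html` is a made-up page name, and `addTrailingSlash` is only an assumed heuristic (a last path segment without a dot is treated as a directory); how HouseSpider itself detects missing slashes is not described here.

```java
import java.net.URI;

final class TrailingSlashDemo {
    /** Resolves a relative link against the URL of the page that contains it. */
    static String resolve(String pageUrl, String relativeLink) {
        return URI.create(pageUrl).resolve(relativeLink).toString();
    }

    /** Assumed heuristic: a last path segment without a dot is a directory. */
    static String addTrailingSlash(String url) {
        if (url.endsWith("/")) {
            return url;
        }
        String lastSegment = url.substring(url.lastIndexOf('/') + 1);
        return lastSegment.contains(".") ? url : url + "/";
    }

    public static void main(String[] args) {
        String wrong = "http://housespider.sf.net/doc/ver47";
        // Without the slash, "ver47" is treated as a file and dropped:
        System.out.println(resolve(wrong, "usage.html"));
        // http://housespider.sf.net/doc/usage.html

        // With the slash restored, the directory is kept:
        System.out.println(resolve(addTrailingSlash(wrong), "usage.html"));
        // http://housespider.sf.net/doc/ver47/usage.html
    }
}
```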
I will illustrate the usage of URLExclude and FileExclude with some examples. URLStart points to www.yourserver.com/somedir/start.html, which contains the following links:
www.yourserver.com/somedir/dir1/page1.shtml
www.yourserver.com/somedir/dir1/page2.html
www.yourserver.com/somedir/dir2/page2.html
www.yourserver.com/somedir/dir2/page3.shtml
<param name="URLExclude" value="http://www.yourserver.com/somedir/dir1/page1.shtml, http://www.yourserver.com/somedir/dir2/page3.shtml">
<param name="URLExclude" value="http://www.yourserver.com/somedir/dir1/">
<param name="FileExclude" value="page2.html">
<param name="FileExclude" value=".shtml">
<param name="FileExclude" value="page2.html,.shtml">
or
<param name="URLExclude" value="http://www.yourserver.com/somedir/dir1/, http://www.yourserver.com/somedir/dir2/">
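One way to read the parameter examples above is as a pair of simple string tests. The Java sketch below assumes that URLExclude entries are matched as URL prefixes and FileExclude entries as URL suffixes; both rules are guesses that fit the examples, and the class `ExcludeFilter` is hypothetical rather than part of HouseSpider.

```java
import java.util.ArrayList;
import java.util.List;

final class ExcludeFilter {
    private final List<String> urlPrefixes;
    private final List<String> fileSuffixes;

    /** Both arguments are comma separated lists, as in the applet parameters. */
    ExcludeFilter(String urlExclude, String fileExclude) {
        this.urlPrefixes = split(urlExclude);
        this.fileSuffixes = split(fileExclude);
    }

    private static List<String> split(String value) {
        List<String> parts = new ArrayList<>();
        if (value == null) {
            return parts; // parameter not set
        }
        for (String part : value.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }

    /** Assumed rule: prefix match for URLExclude, suffix match for FileExclude. */
    boolean isExcluded(String url) {
        for (String prefix : urlPrefixes) {
            if (url.startsWith(prefix)) {
                return true;
            }
        }
        for (String suffix : fileSuffixes) {
            if (url.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ExcludeFilter filter = new ExcludeFilter(
                "http://www.yourserver.com/somedir/dir1/", ".shtml");
        System.out.println(filter.isExcluded(
                "http://www.yourserver.com/somedir/dir1/page2.html"));  // true
        System.out.println(filter.isExcluded(
                "http://www.yourserver.com/somedir/dir2/page2.html"));  // false
        System.out.println(filter.isExcluded(
                "http://www.yourserver.com/somedir/dir2/page3.shtml")); // true
    }
}
```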
To index your web site, run HouseSpider as a stand-alone application. Start the application as follows:
java -cp HouseSpider.jar;buttons.bevel.jar;i18n.jar HouseSpider <name>=<value> ...
where the legal name-value pairs are selected from the list of applet parameters. Values containing spaces must be enclosed in quotes. The only parameter that must be set is URLStart. Other useful parameters are SaveDir, Debug, the file and URL excluding parameters and finally InitInput for batch indexing. (You may of course replace "buttons.bevel.jar" with "buttons.flat.jar" or "buttons.glass.jar" in the above command.) Notice that on Linux you should use a colon, not a semicolon, between the jar files. (If the jar files aren't in the current directory, add the path before each jar file.)
Type "houseindex" in the text input box to index your site. If all goes well, two files will be generated: "HouseSpider.index" and "HouseSpider.log". Put these files in the "URLStart" directory, or in the same location as the applet page if "URLStart" is undefined. You may edit the log file, but keep all the text on one line. (Also remember that the log file uses, and must use, the UTF-8 charset.) The contents of the log file are displayed in the status field of the applet at start up.
After indexing, a statistics report is generated that lists how many pages were indexed and which pages have missing title tags.
HouseSpider is capable of indexing 10000 web pages and 10000 key words. Memory is not the issue, but the size of the index file is. I figure that an index file of 100k will still be acceptable, but anything larger will probably be too slow to load.
To generate a compressed index file, type "houseindex-zip" in the text input box (instead of "houseindex"). If all goes well, one file will be generated, "HouseSpider.index.zip". You may of course also zip an existing index file (with your favourite compression tool).
HouseSpider will automatically use a compressed index file if it exists.
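If you would rather zip an existing index file from Java than with a separate tool, the standard `java.util.zip` classes are enough. The sketch below is an illustration under one assumption: that the entry inside the archive should carry the same name as the index file ("HouseSpider.index"). The class `IndexZipper` is hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class IndexZipper {
    /** Writes "<name>.zip" next to the given index file and returns its path. */
    static Path zip(Path indexFile) throws IOException {
        Path zipFile = indexFile.resolveSibling(indexFile.getFileName() + ".zip");
        try (OutputStream out = Files.newOutputStream(zipFile);
             ZipOutputStream zos = new ZipOutputStream(out)) {
            // Assumption: the entry keeps the original file name.
            zos.putNextEntry(new ZipEntry(indexFile.getFileName().toString()));
            Files.copy(indexFile, zos);
            zos.closeEntry();
        }
        return zipFile;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java IndexZipper <path to HouseSpider.index>");
            return;
        }
        System.out.println("Wrote " + zip(Paths.get(args[0])));
    }
}
```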
In the list of applet tags/parameters the Action tag was presented. By using
<param name="Action" value="noindex">
nothing is indexed on a page before the following tag is found:
<!-- housespider index -->
Then HouseSpider will index the rest of the page.
If you don't set the Action parameter or set it to "index" which is the default, the following is the case:
This text will be indexed <!-- housespider noindex --> This text will not be indexed <!-- housespider index --> And this will be indexed.
This is very useful if you don't want to index a menu (for example) that is common on all pages of your website.
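The effect of the two comment tags can be sketched as a small state machine that switches indexing on and off while walking through the page. This is an illustration only: the class `IndexMarkerFilter` is hypothetical, and it assumes the tags must match exactly as written above (same case and spacing).

```java
final class IndexMarkerFilter {
    private static final String INDEX = "<!-- housespider index -->";
    private static final String NOINDEX = "<!-- housespider noindex -->";

    /**
     * Returns the parts of the page that would be indexed.
     * indexByDefault is true for Action "index", false for "noindex".
     */
    static String indexable(String page, boolean indexByDefault) {
        StringBuilder out = new StringBuilder();
        boolean indexing = indexByDefault;
        int pos = 0;
        while (pos < page.length()) {
            int on = page.indexOf(INDEX, pos);
            int off = page.indexOf(NOINDEX, pos);
            // Pick whichever tag comes first; -1 means "not found".
            boolean onFirst = on >= 0 && (off < 0 || on < off);
            int next = onFirst ? on : off;
            if (next < 0) {
                next = page.length(); // no more tags on this page
            }
            if (indexing) {
                out.append(page, pos, next);
            }
            if (next == page.length()) {
                break;
            }
            indexing = onFirst;
            pos = next + (onFirst ? INDEX.length() : NOINDEX.length());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String page = "This text will be indexed <!-- housespider noindex --> "
                + "This text will not be indexed <!-- housespider index --> "
                + "And this will be indexed.";
        System.out.println(indexable(page, true));  // first and last sentence
        System.out.println(indexable(page, false)); // last sentence only
    }
}
```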
From version 4.5 the index file contains four lines per page on your site:
page contents
title
meta keywords##meta description
relative URL
where "##" is a separator (and must be included even if the meta description is empty).
You may build this file manually or using any other script of course. You may also edit the file after HouseSpider has generated it - useful if just a few characters are wrong or if you want to add a page that HouseSpider didn't find.
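For those who want to build or check the file with their own script, the four-lines-per-page layout can be modelled as below. This is a sketch of the layout described above only; the class `IndexRecord` is hypothetical, and details such as character escaping or extra header lines are not covered because they are not described here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class IndexRecord {
    static final String SEPARATOR = "##";

    final String contents;
    final String title;
    final String keywords;
    final String description;
    final String relativeUrl;

    IndexRecord(String contents, String title, String keywords,
                String description, String relativeUrl) {
        this.contents = contents;
        this.title = title;
        this.keywords = keywords;
        this.description = description;
        this.relativeUrl = relativeUrl;
    }

    /** The four lines for this page; "##" is written even if the description is empty. */
    List<String> toLines() {
        return Arrays.asList(contents, title, keywords + SEPARATOR + description, relativeUrl);
    }

    /** Reads records from the lines of an index file, four lines per page. */
    static List<IndexRecord> parse(List<String> lines) {
        if (lines.size() % 4 != 0) {
            throw new IllegalArgumentException("Expected four lines per page");
        }
        List<IndexRecord> records = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += 4) {
            String meta = lines.get(i + 2);
            int sep = meta.indexOf(SEPARATOR);
            if (sep < 0) {
                throw new IllegalArgumentException("Missing ## separator on line " + (i + 3));
            }
            records.add(new IndexRecord(lines.get(i), lines.get(i + 1),
                    meta.substring(0, sep), meta.substring(sep + SEPARATOR.length()),
                    lines.get(i + 3)));
        }
        return records;
    }

    public static void main(String[] args) {
        IndexRecord page = new IndexRecord("some page text", "A title", "spider,search", "", "doc/page.html");
        for (String line : page.toLines()) {
            System.out.println(line);
        }
    }
}
```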
HouseSpider is a little picky about the quotes in links in your HTML files:
<a href="URL">...</a> is OK
<a href='URL'>...</a> is also OK
<a href=URL>...</a> is not OK
The above links are just examples - the same applies to frames and iframes. By "not OK" I mean that HouseSpider will simply ignore these links. Style 3 is not recommended by W3C, so I won't support it.
HouseSpider just scans the text of your HTML files for patterns that represent links. As a result, links hidden inside or defined by script languages (such as JavaScript) are not found. The following types are supported:
<a href="URL">Some text</a>
<frame src="URL">
<iframe src="URL">
You may of course add the standard parameters to these tags in any order - for example
<a class="some_class" href="URL">Some text</a> is perfectly fine. Note that HouseSpider will index the files included in (i)frames as stand alone files - not as a part of the page that contains the (i)frame.
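A pattern-based link scan of this kind can be approximated with a regular expression. The Java sketch below is not HouseSpider's actual pattern; it is an assumed one that accepts double or single quoted values, skips unquoted ones, and allows other attributes before href/src. The class `LinkScanner` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkScanner {
    // <a ... href="URL"> or <a ... href='URL'>, and the same for
    // <frame ... src=...> and <iframe ... src=...>. The backreferences
    // (\1 and \3) require the closing quote to match the opening one.
    private static final Pattern LINK = Pattern.compile(
            "<a\\b[^>]*?\\shref\\s*=\\s*([\"'])(.*?)\\1"
            + "|<i?frame\\b[^>]*?\\ssrc\\s*=\\s*([\"'])(.*?)\\3",
            Pattern.CASE_INSENSITIVE);

    /** Returns the quoted link targets found in the HTML text, in order. */
    static List<String> findLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            // Group 2 is set for <a>, group 4 for <frame>/<iframe>.
            links.add(m.group(2) != null ? m.group(2) : m.group(4));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"a.html\">x</a> <a href='b.html'>y</a> "
                + "<a href=c.html>z</a> <frame src=\"d.html\"> "
                + "<a class=\"some_class\" href=\"e.html\">t</a>";
        System.out.println(findLinks(html)); // [a.html, b.html, d.html, e.html]
    }
}
```

The unquoted link (c.html) is skipped, which matches the "not OK" case above; links built by JavaScript never appear as such patterns in the text and are therefore not found either.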
HouseSpider is free - not as in beer, but as in freedom.
HouseSpider is open source and is distributed under the GNU General Public License. If you decide to modify HouseSpider, and your modifications are improvements (as opposed to simple GUI modifications), please submit a patch on the SourceForge project page.
I thank Tim Tyler for the ImageLoader class.