Domesday is a tool that analyses HTML (and other) files, extracting titles and metadata from them to create either a hierarchical site map or an ordered-list index (such as an A-Z index).
Domesday has been around for a long time in the Windows world and has received great reviews. We are now completely rewriting it in Java to make it cross-platform.
Domesday is organised into projects. When a user wishes to create an index, they have to create a project settings file. This is a simple text file containing name/value pairs. Project settings include the directories to scan for files, filters, the output location, and templates for the final index.
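As a rough illustration of a name/value settings file, the sketch below reads such pairs with the standard java.util.Properties class. The key names (scan.dir, filter.include, output.file) and the class SettingsSketch are hypothetical, chosen only for the example; the real setting names are whatever {@link Project} defines.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sketch: key names are illustrative, not Domesday's actual settings.
class SettingsSketch {

    /** Parses name/value pairs from text in java.util.Properties format. */
    static Properties parse(String text) {
        Properties settings = new Properties();
        try {
            settings.load(new StringReader(text));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return settings;
    }

    public static void main(String[] args) {
        String example =
                "scan.dir=docs\n"
              + "filter.include=*.html\n"
              + "output.file=index.html\n";
        Properties settings = parse(example);
        System.out.println(settings.getProperty("scan.dir"));
        // A default can be supplied for settings the user left out.
        System.out.println(settings.getProperty("template.file", "(none)"));
    }
}
```

Because the format is plain text, a file like this stays easy to edit by hand.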
Eventually, we will have a GUI to aid creation of the settings files. This will always be an optional extra: it will always be possible to edit the files by hand. Currently, I want to write the GUI in Java, using the Java-Gnome Gtk bindings. I have done some work on this, but am being held back while some major changes are made in the Java-Gnome project (porting to Gtk2 and changing much of the API to make it more 'java-like').
{@link Domesday} is the main Domesday application. This takes the settings file as a command line argument and generates the index.
{@link IGLog} is for logging progress, warnings and errors. It is used by every section of the program. Details of the log levels are in the IGLog source. Additionally, many classes make use of a LOGLEVEL variable. This lets you emit detailed debugging data during development and remove it easily for production releases by increasing the LOGLEVEL.
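The LOGLEVEL idea can be sketched as a simple threshold check. This is only an assumption about how such a variable might work: the level constants, the class LogLevelSketch and its methods are hypothetical, and the real levels are those documented in the IGLog source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: level values and method names are assumptions, not IGLog's API.
class LogLevelSketch {
    static final int DEBUG = 1;
    static final int WARNING = 5;
    static final int ERROR = 9;

    // Threshold for this logger; raising it suppresses lower-level messages.
    private final int logLevel;
    private final List<String> lines = new ArrayList<>();

    LogLevelSketch(int logLevel) {
        this.logLevel = logLevel;
    }

    /** Records the message only if its level reaches the threshold. */
    void add(int level, String message) {
        if (level >= logLevel) {
            lines.add(level + ": " + message);
        }
    }

    List<String> lines() {
        return lines;
    }

    public static void main(String[] args) {
        LogLevelSketch production = new LogLevelSketch(WARNING);
        production.add(DEBUG, "parsed 3 of 10 files"); // dropped: below threshold
        production.add(ERROR, "cannot read settings file");
        production.lines().forEach(System.out::println);
    }
}
```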
{@link Project} is for handling the project settings files. It is responsible for loading, saving, validating and retrieving the data.
{@link FileList} deals with the list of files to be indexed. These are all stored as {@link IGFile} objects, along with all their extracted details as needed. FileList generates the list from the folders and filters in the settings, uses the parsers to extract the required details, and sorts the list (this is non-trivial for hierarchical site maps).
{@link FileParser} is the base class for parsing files to extract details such as titles and descriptions. The main implementation will be {@link HTMLParser}, for dealing with HTML files. Later, more parsers will be added, possibly including PDF files (often requested), doc files, email (Unix mailbox), ogg, mp3...
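To make the parser idea concrete, here is a minimal sketch of a base class with one HTML implementation that pulls out the page title. The names ParserSketch and HtmlTitleSketch and the method signature are hypothetical, and the regular expression is a deliberate simplification; this is not the actual FileParser or HTMLParser design.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical base class: one method per detail a parser can extract.
abstract class ParserSketch {
    abstract Optional<String> title(String content);
}

// Hypothetical HTML implementation using a simple regular expression.
class HtmlTitleSketch extends ParserSketch {
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    @Override
    Optional<String> title(String content) {
        Matcher m = TITLE.matcher(content);
        if (!m.find()) {
            return Optional.empty();
        }
        // Collapse line breaks and runs of spaces inside the title.
        return Optional.of(m.group(1).replaceAll("\\s+", " ").trim());
    }

    public static void main(String[] args) {
        ParserSketch parser = new HtmlTitleSketch();
        String html = "<html><head><title>Site Map</title></head></html>";
        System.out.println(parser.title(html).orElse("(untitled)"));
    }
}
```

Keeping the extraction behind a base class means a later PDF or mailbox parser only has to supply its own implementation.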
{@link AnyIndexGen} is the abstract base class for index generation. We currently only have detailed plans for two derived classes: {@link SitemapGen} for generating sitemap-style indices, and {@link OrderedListGen} for generating ordered lists.
Basically, the IndexGen classes take the file lists, join the details together with the templates in the settings and produce the output. {@link IGReplace} is a class which will be used by the index generators to replace the placeholders in the templates with the details from the file list.
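The placeholder replacement step might look something like the sketch below. The %name% placeholder syntax, the class TemplateSketch and the detail names are all assumptions made for illustration; the source does not specify the template syntax that IGReplace uses.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: the %name% placeholder syntax is an assumption.
class TemplateSketch {
    private static final Pattern PLACEHOLDER = Pattern.compile("%(\\w+)%");

    /** Replaces each known placeholder with the matching detail of a file. */
    static String fill(String template, Map<String, String> details) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = details.get(m.group(1));
            // Unknown placeholders are left untouched.
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement stops '$' and '\' in values being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> details = new HashMap<>();
        details.put("path", "guide/intro.html");
        details.put("title", "Introduction");
        System.out.println(fill("<li><a href=\"%path%\">%title%</a></li>", details));
    }
}
```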
This application will have full i18n support. The main implication for
programming is that for any strings to be displayed (e.g. in the log file), we
need to use messages.getString("Name"), and add the string to the
*.properties file.
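A small sketch of the lookup pattern follows. In the application the bundle would normally come from something like ResourceBundle.getBundle with a *.properties file on the classpath; to keep the example self-contained, the bundle here is built from an in-memory string. The key FilesFound, its text and the class MessagesSketch are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.MessageFormat;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

// Hypothetical sketch: the key and message text are illustrative only.
class MessagesSketch {

    /** Builds a bundle from text in *.properties format. */
    static ResourceBundle fromText(String propertiesText) {
        try {
            return new PropertyResourceBundle(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ResourceBundle messages = fromText("FilesFound=Found {0} files in {1}\n");
        // Look the string up by name instead of hard-coding it in the source.
        String pattern = messages.getString("FilesFound");
        System.out.println(MessageFormat.format(pattern, "12", "docs"));
    }
}
```

Using MessageFormat for the variable parts lets translators reorder the arguments in their own *.properties file.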
We aim to make this work on as many platforms as possible. Only cross-platform Java code should be used (e.g. don't hard code the path separator '/'). I will be developing using Debian GNU/Linux i386, and we will certainly want to support MS Windows. Eventually, I will be submitting this to the Debian project, which will require it to build on 11 architectures (or more as time progresses). Debian is also now working on supporting a number of alternative kernels (Linux, Hurd, BSD, Win32), but that is all for the future.
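As an example of avoiding hard-coded separators, the sketch below builds a path from its components with the standard java.nio.file API and lets the platform supply the separator. The class PathSketch and the directory names are hypothetical.

```java
import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch: directory and file names are illustrative only.
class PathSketch {

    /** Joins path components using the separator of the current platform. */
    static String join(String first, String... more) {
        Path path = Paths.get(first, more);
        return path.toString();
    }

    public static void main(String[] args) {
        // Avoid: "docs" + "/" + "guide" + "/" + "index.html"
        System.out.println(join("docs", "guide", "index.html"));
        // File.separator is available where a path is built by hand.
        System.out.println("output" + File.separator + "index.html");
    }
}
```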