The query syntax in ICE
Subject to change
Although the traversal of hypertext links is the main navigation paradigm
of the WWW, the HTTP protocol also allows for supplying keywords and passing
them to the server.
This mechanism can be used to implement free text searches on a web archive.
The ICE indexing extension is a lean, easy to install free text indexer
for web servers.
It not only implements simple keyword searches, but has a complex query language which can be used for retrieving documents and definitions.
The use of thesauri also allows for operating on "concepts".
The easiest query involves specifying a keyword: a query with the keyword "picture" will simply retrieve a list of documents containing the string "picture". To speed up the process of searching through documents, the ICE indexer uses an inverted index of words. It lists every word found in the documents together with its frequency. Searches are case insensitive, so it does not matter if e.g. a word starts with a capital letter. Also, a query matches strings only as whole words, which prevents results like retrieving "cartoon" when the query is supposed to look for "car".
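The idea of an inverted index can be sketched in a few lines of Python. This is only an illustration of the technique, not ICE's actual index format: the function names, the sample documents and the rule for splitting text into words are all assumptions.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each lowercased word to {document name: occurrence count}."""
    index = defaultdict(dict)
    for name, text in documents.items():
        # Assumption: a "word" is a run of letters or digits.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word][name] = index[word].get(name, 0) + 1
    return index

def lookup(index, keyword):
    """Return the sorted names of documents containing the whole word."""
    return sorted(index.get(keyword.lower(), {}))

# Hypothetical sample documents.
docs = {
    "a.html": "A picture of a car",
    "b.html": "Cartoon Picture gallery",
}
index = build_index(docs)
print(lookup(index, "Picture"))  # case does not matter: both documents
print(lookup(index, "car"))      # whole words only: "cartoon" is not a hit
```

Because the index is keyed by whole words, a lookup never has to scan the document text, and "car" cannot accidentally match "cartoon".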
Keywords may also contain wildcards that allow for selecting all strings matching a specific pattern. The ICE indexer implements "ed"-style regular
expressions. The regular expression syntax is the following:
- Any character except a special character matches
itself. Special characters are the regular expression
delimiter plus \[. and sometimes ^*$.
- A . matches any character.
- A \ followed by any character except a digit or ()
matches that character.
- A nonempty string s bracketed [s] (or [^s]) matches any
character in (or not in) s. In s, \ has no special
meaning, and ] may only appear as the first letter. A
substring a-b, with a and b in ascending ASCII order,
stands for the inclusive range of ASCII characters.
- A regular expression of any of the forms above followed
by * matches a sequence of 0 or more matches of that
regular expression.
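A few patterns of this kind can be tried out with Python's re module. Note that Python's regular expressions are not identical to "ed" expressions (for example, \ keeps its special meaning inside brackets), so the sketch below sticks to constructs that behave the same in both. The word list is made up, and matching the pattern against the whole word is an assumption about how ICE applies it.

```python
import re

# Hypothetical vocabulary taken from an index.
words = ["picture", "pictures", "pixel", "image", "imaging", "car", "cat"]

def match_words(pattern, vocabulary):
    """Return the words that the pattern matches in full, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [w for w in vocabulary if regex.fullmatch(w)]

print(match_words(r"pi.*", words))       # . and *: any word starting with "pi"
print(match_words(r"ca[rt]", words))     # [s]: "car" and "cat"
print(match_words(r"ca[^r]", words))     # [^s]: "cat" only
print(match_words(r"pictures*", words))  # * applies to the preceding "s"
```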
Combining keywords
The combination of two or more keywords can be used to create more complex queries. Connecting the keywords using the "AND" operator will limit the search to those documents containing both strings: by issuing the query "computer AND picture", only documents containing both words will be retrieved. Such a query will retrieve mostly documents that deal with computer generated or manipulated images (although it will also find text fragments like "The book gives a detailed picture of what's going on inside a computer"). Using the "OR" operator will retrieve documents containing either word. Queries can be built up using multiple keywords connected with AND and OR; the AND operator has precedence, so "computer AND picture OR image" retrieves documents that contain both "computer" and "picture" as well as documents that contain "image".
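The precedence rule can be illustrated with a small evaluator. This is a sketch under simplifying assumptions (plain keywords only, operators written in upper case and separated by single spaces) and not ICE's own query parser; the function name is hypothetical.

```python
def matches(query, words):
    """Evaluate keywords joined by AND / OR against a document's words.

    AND binds tighter than OR, so the query is first split into
    OR-alternatives, and each alternative requires all of its AND-terms.
    """
    words = {w.lower() for w in words}
    for alternative in query.split(" OR "):
        terms = [t.strip().lower() for t in alternative.split(" AND ")]
        if all(t in words for t in terms):
            return True
    return False

print(matches("computer AND picture", {"computer", "picture"}))  # True
print(matches("computer AND picture", {"computer"}))             # False
# Read as: image OR (computer AND picture)
print(matches("image OR computer AND picture", {"image"}))       # True
```

If OR had precedence instead, the last query would be read as "(image OR computer) AND picture" and a document containing only "image" would not be retrieved.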
Using the Thesaurus
Another important feature is the use of a built-in thesaurus. Usually, a thesaurus is used to find synonyms for words, but a thesaurus is much more than a synonym list. It is a semantic network containing concepts that are related to one another in various ways. Since thesauri allow for dealing with a concept instead of just a single keyword, they can be very useful in free text queries. A simple example is the fact that the words "image" and "picture" are often used synonymously, so a query for "picture" should also retrieve files containing the string "image". Other relations between concepts are "is an abbreviation for" and "is a broader term for". Being able to identify terms related to a certain concept makes it possible to find information that is difficult or impossible to select using simple keyword search.
Since there can be situations where the automatic use of a thesaurus is not wanted, it can be switched on and off by using a special notation in the query string. For example, typing "{picture}" will retrieve all documents containing either the word picture or related concepts (it could be read as "retrieve the set of all concepts related to picture"), whereas the query string "picture" only retrieves documents that contain the literal string.
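The effect of the brace notation can be pictured as expanding one query term into a set of search strings. The sketch below assumes a thesaurus stored as a simple dictionary of related terms; the entry "photo" and the function name are made up for illustration, and ICE's real thesaurus format may look quite different.

```python
# Hypothetical thesaurus: concept -> related terms.
THESAURUS = {"picture": {"image", "photo"}}

def expand(term, thesaurus):
    """Return the set of strings to search for a single query term."""
    if term.startswith("{") and term.endswith("}"):
        # "{picture}": the concept itself plus everything related to it.
        concept = term[1:-1].lower()
        return {concept} | thesaurus.get(concept, set())
    # "picture": the literal string only.
    return {term.lower()}

print(sorted(expand("{picture}", THESAURUS)))  # ['image', 'photo', 'picture']
print(sorted(expand("picture", THESAURUS)))    # ['picture']
```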
The thesaurus can be switched on and off for all words in the query by
adding a "-T" to the keyword expression.
Other options
By default, ICE only matches whole words (so that searching
for "ice" will not match "rice"). By supplying a "-S" after the
keyword expression, matching is extended to substrings.
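The difference between the two matching modes can be shown directly on a piece of text. This sketch scans the text itself rather than an index, purely to keep the example short; the function name and its substring parameter are assumptions.

```python
import re

def contains(text, keyword, substring=False):
    """Case-insensitive test for a keyword, as a whole word or a substring."""
    if substring:
        # Corresponds to the "-S" behaviour: any occurrence counts.
        return keyword.lower() in text.lower()
    # Default behaviour: the keyword must stand alone as a word.
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

print(contains("rice pudding", "ice"))                  # False
print(contains("rice pudding", "ice", substring=True))  # True
print(contains("Ice cream", "ice"))                     # True
```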
Searches can be limited to documents that have changed
during a specific time interval. By adding "-D" followed by
a number n, only documents that have changed during the last
n days will be found. Example: "-D 7" finds all files that
have changed during the last week.
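The date restriction amounts to comparing each document's modification time with a cutoff. A minimal sketch, assuming the modification times are already known; the file names and dates are invented for the example.

```python
from datetime import datetime, timedelta

def changed_within(documents, days, now):
    """Keep documents whose modification time lies within the last `days` days."""
    cutoff = now - timedelta(days=days)
    return sorted(name for name, modified in documents.items()
                  if modified >= cutoff)

# Hypothetical documents with their last modification times.
now = datetime(1995, 6, 15)
documents = {
    "new.html": datetime(1995, 6, 12),
    "old.html": datetime(1995, 5, 1),
}
print(changed_within(documents, 7, now))  # ['new.html']
```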
The options "-T", "-D n" and "-S" have an easy to use equivalent
in the forms interface "ice-form.pl", so you only have to type them
if your browser does not support forms.
Defining a context
Another means of refining a search query is by using a document relative context, which defines a specific subset of documents.
This context is based on URL hierarchies, since they are often
(and should be) used to reflect logical structures.
Issuing the query
"picture AND digital -T -D 7 @ /icib/video"
will limit the search to those documents with a URL starting in /icib/video.
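To show how the parts of such a query string fit together, the sketch below splits it into keyword expression, options and context, and checks a URL path against the context. It is an illustration only: the function names and the returned structure are assumptions, and ICE's own parser may treat details such as whitespace or option order differently.

```python
def parse_query(query):
    """Split a query string into (keyword expression, options, context)."""
    expression, _, context = query.partition("@")
    tokens = expression.split()
    options = {"toggle_thesaurus": False, "substring": False, "days": None}
    keywords = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token == "-T":
            options["toggle_thesaurus"] = True
        elif token == "-S":
            options["substring"] = True
        elif token == "-D" and i + 1 < len(tokens):
            options["days"] = int(tokens[i + 1])
            i += 1  # skip the number that belongs to -D
        else:
            keywords.append(token)
        i += 1
    return " ".join(keywords), options, context.strip() or None

def in_context(path, context):
    """True if the URL path starts with the context prefix."""
    return context is None or path.startswith(context)

expr, opts, ctx = parse_query("picture AND digital -T -D 7 @ /icib/video")
print(expr)  # picture AND digital
print(opts)  # {'toggle_thesaurus': True, 'substring': False, 'days': 7}
print(ctx)   # /icib/video
print(in_context("/icib/video/intro.html", ctx))  # True
```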