Inverted Index
Digital Creations, L.C.
910 Princess Anne Street, Suite 300
Fredericksburg, VA 22405
http://www.digicool.com/
Abstract
Text-search engines provide the ability to quickly search for
documents that contain terms in textual data. Commercial
text-search engines provide many capabilities but often have
large overhead in process size, application complexity, and
licensing cost. We have developed a Python module,
InvertedIndex, that has a simple Python Application Programming
Interface and that provides support for incremental indexing
and for stopword, synonym, and stemming databases.
Introduction
The problem of document storage and retrieval has always been a major
issue in Computer Science. One of the most popular techniques of
information retrieval has been the use of inverted indices, the method
used by most commercial indexing software companies.
An inverted index catalogs a collection of objects in their
textual representations. Given a set of documents, keywords and other
attributes (possibly including relevance ranking) are assigned
to each document. The inverted index is the list of keywords and
links to the corresponding document. Frequently there are several
restrictions which limit the keywords in an index. Examples include:
- A collection of stopwords--keywords that are not considered
relevant. This collection normally contains words considered
too common to function as keywords (articles, prepositions,
conjunctions, etc.) or words outside the context of the index.
- A set of rules that define keywords in a document. This controls
the manner in which keywords are found. For instance, keywords
might be defined as character sequences surrounded by white-
space.
- A set of rules that restrict indexable words. For instance, a
rule that causes keywords containing numbers not to be indexed.
- A collection of "synonyms"--keywords that should be indexed using
a different keyword.
The InvertedIndex module provides simple tools for creating and
maintaining inverted indices with support for incremental indexing and
for stopword, synonym and stemming databases.
Features
Lexical Analysis for Indexing
Lexical analysis is the process of converting text input into
a group of words or tokens. Common issues include handling of
hyphenated words, punctuation, digits and case. InvertedIndex
provides basic functionality for addressing these issues and
is configurable for special behavior. By default, InvertedIndex
de-hyphenates all words and treats groups of alphabetic
characters as words, using whitespace and punctuation as a
keyword delimiter. The words are converted to lowercase as
they are indexed.
Stopwords
The stopword list is a group of keywords which should not be
indexed. It is comprised of keywords that are considered irrelevant
to the index--keywords considered too common or non-specific
to be useful as search items. Common stopwords include articles,
prepositions and one-letter words.
At index time, keywords are compared the words in the stopword list;
any keyword found in the stopword list is not indexed. Stopword
lists can dramatically reduce the size of the index, decreasing
the overall search time. Since irrelevant words are not included
in the index, search results are likely to be more meaningful in
the context of the user's query.
Stemming
Stemming is a method of providing users with the ability to
execute a search on a word using an alternate grammatical form,
such as tense and person. Rather than using an automatic stemming
algorithm to create the stems, InvertedIndex uses a table-lookup
method, storing the stems along with the index in a Python
dictionary. While the table-lookup method is costly in terms of
storage space, it allows for quicker indexing and greater control
over the formation of stems.
The following is an example of a mapping of keywords to their
corresponding stem keywords:
- running -> run
- runner -> run
- swims -> swim
- swam -> swim
At index time, keywords for which a stem is given will be indexed
as the stem keyword, not as the original keyword. Stemming is
helpful in reducing the size of the index files. However, it
can also negatively affect the speed of retrieval if the stemming
database is large.
Synonyms
A synonym database can be constructed to search on words with
equivalent meanings. Synonyms are implemented exactly as stems,
the only distinction is that keywords are mapped to words with
the same meaning as opposed to different grammatical forms of
the keyword.
The following is an example of a mapping of keywords to their
corresponding synonyms:
- car -> vehicle
- automobile -> vehicle
- plane -> vehicle
Keywords for which a synonym is given will be indexed as the
synonym keyword, not as the original keyword. As with stemming,
synonyms can reduce the size of the index, but can also slow
the indexing and searching time.
Persistence
Persistent indices allow for quick retrieval of previously
indexed information. Persistence is achieved through use of
the Python pickle module. This feature allows an index
to be saved to disk and later restored, avoiding having to
reindex data.
Persistent indices are provided through the
InvertedIndex module's Persistent and Transactional classes.
The Persistent class keeps track of whether the index has
changed, saving the object's data when necessary. The
Transactional class provides a transactional model for
persistency: changes to an index are only saved at the
completion of a 'transaction', a single, complete operation on the
index. The value of the transactional model is that is insures
that no portion of an aborted indexing attempt is saved.
Relevance Ranking
Keywords are assigned a relevance ranking, which is the calculation
of a keyword's frequency relative to the total number of words in
the document. InvertedIndex assigns ranking based on the following
formula:
O / log(T)
where O represents the number of occurrences of a single keyword
and T represents the total number of keywords in the document.
The relevance rank aids the end-user in locating the desired
information by indicating which results (especially in a large
result set) are more likely yield pertinent information.
Query Language
A separate module, InvertedIndexQuery, provides a simple query
language for use with InvertedIndex. InvertedIndexQuery supports
"and", "or", and "not" operators on search terms as well as the use
of parentheses for association. The following examples illustrate
several inverted index queries along with explanations with their
functions.
Query | Result
|
"Python"
| Documents containing the word "Python"
|
"Python or spam"
| Documents containing either the word "Python" or the
word "spam"
|
"Python and not Perl"
| Documents containing the word "Python", but not the word
"Perl"
|
"not (Pascal or Perl)"
| Documents that contain neither of the words "Pascal" or
"Perl"
|
"not ((Pascal or Perl) and useful)
| Documents that do not contain both the words "useful"
and "Pascal" or the words "useful" and "Perl"
|
Since InvertedIndexQuery is a separate module, implementing
an alternate query parser is a simple matter.
Incremental Indexing
Documents may be added to an existing index without the need
to reindex the original data. This allows indexing to be
split into smaller jobs that require less time.
Architecture
Index Object
The Index object provides the basic functionality for indexing
and searching. Indexing is done using the object's index()
method. The Index object behaves as a dictionary, mapping
keywords to their corresponding search results, stopwords to
None, and stems and synonyms to their corresponding keywords.
Compiled regular expressions may be used as keys; the result of
a search done with a regular expression is a series of searches
performed on each matching keyword, their results joined by
an "or" operation.
Pickle Dictionary
Persistent indices make use of a "pickle dictionary", which
is contained in the Index object. The pickle dictionary
intelligently manages saving and loading of index values, only
pickling and unpickling individual index items as is necessary.
The pickle dictionary minimizes memory usage and load time
by using a specialized unpickling scheme[1].
Result List Object
The ResultList object is the instance object returned as a result
of a successful search. It contains a list of pairs that consist
of the relevance ranking and id of each document returned by a
search. The ResultList objects behave as sequences returning a
document pair corresponding to the index. ResultList objects
support "and", "or", and "not" not operations as well as sorting
of their document pairs by relevance. When an "and" operation
is performed on ResultList objects, corresponding relevance
rankings are added. The following shows how composite relevance
rankings are calculated when the above operators are used.
Operation | Method of Calculation
|
"or"
| sum of rankings
|
"and"
| geometric mean of rankings
|
"not"
| sum of rankings
|
ResultList objects can be passed into other search mechanisms to
allow for specialized searching, such as mixed full-text and
fielded searching.
Sample Usage
Below is example code which instantiates an Index object,indexes
two files, and prints the result of a search on the index.
import InvertedIndex
import InvertedIndexQuery
i = InvertedIndex.Index()
filename = 'document1.txt'
file_to_index = open(filename).read()
document_key = filename
# index the document, using document_key as the document's
# id.
i.index(file_to_index, document_key)
filename = 'document2.txt'
file_to_index = open(filename).read()
document_key = filename
i.index(file_to_index, document_key)
search_results = InvertedIndexQuery.query('Python and spam', i)
search_results.sort()
cnt = 0
for document in search_results:
cnt = cnt + 1
print '%d) %s' % (cnt, document[1])
Current Status
InvertedIndex is freely-available. It has been tested on the
following platforms: Solaris 2.4 and Digital UNIX 3.2.
It is being used in Digital Creations' new Classified product.
Future
The next release of InvertedIndex will contain support for
proximity searching, which is the ability to search for keywords
"near" each other in the indexed document. For instance,
a search for the words "Python" and "rules" would return
a document containing,
- Everybody knows that Python rules!
but not a document containing,
- Python! Everybody knows that rules!
References
- Providing Persistence for World Wide Web Applications
http://www.digicool.com/papers/Persistence.html