Merquery: Python full text indexing

There have been a few posts recently related to the creation of a Python abstraction layer for full text indexers, named Merquery.

Merquery is particularly interesting to me, as I started a similar module (named pyndexter, pronounced "poindexter") after my experiences writing the Trac RepoSearchPlugin. The idea was that I would eventually port the plugin to this API in order to benefit from the efficiency of existing indexers.

The initial design I came up with for pyndexter consisted of the following high-level concepts:

URI
Each document is uniquely identified by a URI, e.g. file:///home/athomas/doc/some_doc.txt, mysql://username:password@host/database/table/, etc.
Document
A document is essentially just a block of text with a number of associated attributes, uniquely identified by its URI. Depending on the source of the document, it could contain additional attributes such as database column information, etc.
Document Source
A document source is a class that knows how to retrieve documents for a specific URI scheme, determine whether a document needs to be reindexed, and traverse documents within the scheme, e.g. a FileSource object for file://, a MySQLSource for mysql://, and so on.
Indexer
The indexer, of course, performs the indexing of documents. It accepts a document object when indexing, and returns a set of URIs matching a search term when searching. Each indexing engine would have its own subclass of a base indexer class, customising its behaviour appropriately. Some indexers may have limitations on the way they accept data, in which case only a subset of this functionality would be available, e.g. an indexer that can only index local files would only accept document objects using the file:// scheme.
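
To make the four concepts concrete, here is a rough, self-contained sketch of how they could fit together. It is not pyndexter's actual API: MemorySource, ToyIndexer, FileOnlyIndexer and the mem:// scheme are all made up for illustration, and the "index" is just a dictionary of words.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """A block of text plus attributes, identified by its URI."""
    uri: str
    content: str
    attributes: dict = field(default_factory=dict)


class MemorySource:
    """Toy document source for a made-up mem:// scheme."""
    scheme = "mem"

    def __init__(self, texts):
        self.texts = texts  # maps name -> text

    def __iter__(self):
        # Traverse every document in the source, yielding its URI.
        for name in sorted(self.texts):
            yield "mem:///" + name

    def fetch(self, uri):
        # Retrieve the document behind a URI of this scheme.
        text = self.texts[uri[len("mem:///"):]]
        return Document(uri, text, {"size": len(text)})


class ToyIndexer:
    """Naive inverted index: lowercased word -> set of URIs."""
    schemes = None  # None means documents of any scheme are accepted

    def __init__(self):
        self.words = {}

    def index(self, document):
        # Reject documents whose URI scheme this indexer cannot handle.
        scheme = document.uri.split(":", 1)[0]
        if self.schemes is not None and scheme not in self.schemes:
            raise ValueError("unsupported scheme: " + scheme)
        for word in document.content.lower().split():
            self.words.setdefault(word, set()).add(document.uri)

    def search(self, term):
        # Return the set of URIs whose text contains the single word `term`.
        return set(self.words.get(term.lower(), ()))


class FileOnlyIndexer(ToyIndexer):
    """An indexer restricted to file:// documents."""
    schemes = ("file",)


source = MemorySource({"a.txt": "Cheese is good", "b.txt": "Bread is fine"})
indexer = ToyIndexer()
for uri in source:
    indexer.index(source.fetch(uri))

hits = indexer.search("cheese")
```

Here hits contains only the URI of a.txt, while handing the same document to FileOnlyIndexer raises an error, which is the kind of scheme restriction described above.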

The common case? Searching files

In the common case where you just want to index some files, simply instantiate a FileSource and pass it to an indexer:

import os
from pyndexter import *
from pyndexter.hyperestraier import HyperestraierIndexer
from pyndexter.file import FileSource

docs = FileSource(os.getcwd(), include=['*.py'])
indexer = HyperestraierIndexer('indexer.idx', docs)
indexer.update()

search = indexer.search(u'HyperestraierIndexer')
print len(search), 'hits'
for hit in search:
    doc = hit.document
    print hit.uri, doc.size,
    if hit.score:
        print hit.score,
    print doc.attributes.keys()

indexer.close()

Extensibility

An application that needs to index custom data can either instantiate an indexer and feed it documents generated on the fly, or subclass the base DocumentSource class and implement its own URI scheme. In the former case the application can use its own unique document identifiers; the indexer doesn't care.
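
The second option, a source with its own URI scheme, might look roughly like the sketch below. The DocumentSource base class here is a stand-in written for this example (pyndexter's real interface may differ), and TicketSource, the ticket:// scheme and the needs_reindex method name are all hypothetical.

```python
from urllib.parse import urlsplit


class DocumentSource:
    """Stand-in base class; pyndexter's real interface may differ."""
    scheme = None

    def __iter__(self):
        raise NotImplementedError

    def fetch(self, uri):
        raise NotImplementedError

    def needs_reindex(self, uri, indexed_at):
        raise NotImplementedError


class TicketSource(DocumentSource):
    """Hypothetical source exposing tickets as ticket://<project>/<id>."""
    scheme = "ticket"

    def __init__(self, project, tickets):
        self.project = project
        # tickets maps id -> (text, last-modified timestamp)
        self.tickets = tickets

    def _ticket_id(self, uri):
        # Parse the URI and refuse anything outside this source's scheme.
        parts = urlsplit(uri)
        if parts.scheme != self.scheme or parts.netloc != self.project:
            raise ValueError("not handled by this source: %s" % uri)
        return int(parts.path.lstrip("/"))

    def __iter__(self):
        # Traverse all tickets, yielding one URI per ticket.
        for ticket_id in sorted(self.tickets):
            yield "%s://%s/%d" % (self.scheme, self.project, ticket_id)

    def fetch(self, uri):
        text, modified = self.tickets[self._ticket_id(uri)]
        return {"uri": uri, "content": text, "modified": modified}

    def needs_reindex(self, uri, indexed_at):
        # A ticket is stale if it changed after it was last indexed.
        _, modified = self.tickets[self._ticket_id(uri)]
        return modified > indexed_at


source = TicketSource("demo", {1: ("Search is slow", 100), 2: ("Add Xapian", 250)})
uris = list(source)
stale = [uri for uri in source if source.needs_reindex(uri, indexed_at=200)]
```

With the sample data, only ticket 2 was modified after timestamp 200, so it is the only URI in stale. The first option, feeding documents to the indexer directly, needs no source class at all.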

Something like the following might be sufficient for a mythical RepoSearchPlugin replacement:

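from pyndexter import Document
from pyndexter.hyperestraier import HyperestraierIndexer
from trac.versioncontrol.api import Node

# Assumes this runs inside a Trac component's request handler,
# where self.env and req are available.
repo = self.env.get_repository(req.authname)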

def walk_repo(node):
    if node.kind == Node.FILE:
        yield node
    elif node.kind == Node.DIRECTORY:
        for subnode in node.get_entries():
            for result in walk_repo(subnode):
                yield result

# Index the repository
hype = HyperestraierIndexer('/some/path/to/index/store')
for node in walk_repo(repo.get_node('/')):
    doc = Document(node.path, node.get_content())
    hype.index(doc)

# Search for some terms
for path in hype.search(u'cheese is good'):
    print path

As documents are being passed to the indexer manually, the caller will have to take care of purging invalid documents from the indexer.
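
One simple way to do that purging is to compare the identifiers already in the index with the identifiers that currently exist, and delete the difference. The sketch below assumes the caller can list what it has indexed and that the indexer offers some way to delete a single document; purge_stale and its remove callback are placeholders, not known pyndexter methods, and a plain dictionary stands in for the index.

```python
def purge_stale(indexed_ids, current_ids, remove):
    """Remove every identifier that was indexed but no longer exists.

    `remove` is whatever callable deletes one document from the index;
    it is a placeholder here, not a known pyndexter method.
    """
    stale = set(indexed_ids) - set(current_ids)
    for doc_id in sorted(stale):
        remove(doc_id)
    return stale


# Toy index standing in for a real indexer: identifier -> text.
index = {"/trunk/a.py": "alpha", "/trunk/b.py": "beta", "/trunk/old.py": "gone"}
# Identifiers that still exist, e.g. paths yielded by a repository walk.
current = ["/trunk/a.py", "/trunk/b.py"]

# list(index) snapshots the keys so the dictionary can shrink safely.
removed = purge_stale(list(index), current, remove=index.pop)
```

After the call, removed holds the one path that disappeared from the repository and the toy index no longer contains it.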

Code

You can browse the source here, download a ZIP from here or check out the source with:

svn co http://swapoff.org/svn/pyndexter/trunk pyndexter-trunk

The example above should work fine.

It has adapters for Hyperestraier (via Hype) and Xapian (via Xapwrap).

To use the Xapian adapter, just s/Hyperestraier/Xapian/g and s/hyperestraier/xapian/g, then make sure you clear out the previous indexer.idx.

For the record, I much prefer Hype, both for its intuitive, well-designed API and for its indexing speed.