pyndexter Documentation

Query Syntax

Pyndexter presents the user with a uniform query syntax, regardless of the backend indexer. The syntax is very similar to Google's.

Currently the following syntax is supported:

<term> <term>
Match documents containing all terms.
<term> or <term>
Match documents containing any of the terms.
<term1> -<term2>
Match documents containing <term1> and not <term2>.
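
As a rough illustration of these three forms, the following standalone sketch evaluates a query against a single piece of text. It is not Pyndexter's parser: the function name `matches`, the whitespace tokenisation, and the assumption that `or` binds more loosely than the implicit AND between adjacent terms are all choices made for the example.

```python
def matches(query, text):
    """Return True if `text` satisfies `query` under the simplified rules."""
    words = set(text.lower().split())
    # Split the query into OR-alternatives; each one is an implicit AND.
    alternatives = [[]]
    for token in query.lower().split():
        if token == 'or':
            alternatives.append([])
        else:
            alternatives[-1].append(token)
    for terms in alternatives:
        required = [t for t in terms if not t.startswith('-')]
        excluded = [t[1:] for t in terms if t.startswith('-')]
        if not required:
            continue  # nothing positive to match in this alternative
        if all(t in words for t in required) and \
                not any(t in words for t in excluded):
            return True
    return False


print(matches('linux kernel', 'the linux kernel source'))     # True
print(matches('linux or windows', 'windows release notes'))   # True
print(matches('linux -windows', 'linux and windows notes'))   # False
```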

API Documentation

Core API

Sources

Document sources maintain state about documents under their domain. They are used by the Framework to automatically update documents that have changed.

Indexers

Following is a list of the full-text indexing engines supported by Pyndexter.

I've placed them roughly in the order that I would recommend they be used, with the exception of the builtin indexer which is first only because it has no external dependencies and is thus very easy to use.

Stemmers

Examples

Indexing in 9 lines

To begin with, we'll index all .txt and .html files in /usr/share/doc, then search for linux OR windows.

import os
from pyndexter import Framework

framework = Framework('builtin:///tmp/pyndexter.idx')
framework.add_source('file:///usr/share/doc?include=*.txt&include=*.html')
framework.update()

for hit in framework.search('linux or windows'):
	print(hit)

framework.close()
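
The source URI above carries two `include` parameters. How Pyndexter interprets them internally is not shown here; the sketch below uses only the Python 3 standard library to show one plausible reading, in which repeated `include` values are treated as shell-style glob patterns. The helper name `included` is made up for the example.

```python
from fnmatch import fnmatch
from urllib.parse import parse_qs, urlsplit

uri = 'file:///usr/share/doc?include=*.txt&include=*.html'
parts = urlsplit(uri)
# parse_qs collects repeated parameters into a list.
patterns = parse_qs(parts.query).get('include', [])


def included(filename):
    """Assumed semantics: a file is indexed if it matches any pattern."""
    return any(fnmatch(filename, pattern) for pattern in patterns)


print(parts.path)
print(patterns)
print(included('README.txt'), included('logo.png'))
```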

Stemming

Stemming is easily enabled, although some indexers don't yet support stemming (or implement their own).

from pyndexter import Framework

framework = Framework('builtin:///tmp/pyndexter.idx', stemmer='porter://')

...

Excerpts

Once you've done a search and have a Hit object, a common requirement is to display an excerpt from the document. Hit objects provide this facility by extracting text containing as many terms as possible:

...

query = Query('linux or windows')
for hit in framework.search(query):
	print(hit.uri)
	print(hit.excerpt(query.terms()))

...
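
To make the idea of "text containing as many terms as possible" concrete, here is a standalone sketch that slides a fixed-size window of words over a document and keeps the window covering the most distinct terms. It is not the algorithm behind Hit.excerpt(); the function name `best_excerpt` and the `width` parameter are hypothetical.

```python
def best_excerpt(text, terms, width=8):
    """Return the `width`-word window of `text` with the most distinct terms."""
    words = text.split()
    wanted = set(t.lower() for t in terms)
    best_start, best_score = 0, -1
    for start in range(max(1, len(words) - width + 1)):
        window = words[start:start + width]
        # Score a window by how many distinct query terms it contains.
        found = set(w.lower().strip('.,;:!?') for w in window)
        score = len(wanted & found)
        if score > best_score:
            best_start, best_score = start, score
    return ' '.join(words[best_start:best_start + width])


text = ('This guide covers installation. It was tested on Linux first, '
        'and later on Windows and several BSD variants.')
print(best_excerpt(text, ['linux', 'windows'], width=6))
```

Ties go to the earliest window, so the excerpt starts as early in the document as possible.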

Retrieving a subset of the search results

A search can potentially return a very large set of results, and iterating over the entire set can be time-consuming. Pyndexter therefore provides result set slicing and indexing, although, again, not all indexers support these operations.

...

results = framework.search('linux or windows')
print(list(results[4:10]))
print(results[5])

...
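
For indexers that do not support slicing, one possible fallback is to take the slice from the result iterator with `itertools.islice`. This is a general Python technique rather than a Pyndexter feature; the generator `fake_search` below merely stands in for framework.search(). The skipped hits are still consumed, but iteration stops as soon as the slice is filled.

```python
from itertools import islice


def fake_search():
    """Stand-in for framework.search(): yields hits lazily, one at a time."""
    for n in range(1000):
        yield 'hit-%d' % n


# Equivalent of results[4:10] for any iterable of hits.
page = list(islice(fake_search(), 4, 10))
print(page)

# Equivalent of results[5]; None if there are fewer than six hits.
sixth = next(islice(fake_search(), 5, None), None)
print(sixth)
```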

Quoting

Because URIs are used to describe resources and components, characters such as #, ? and & are reserved. If these characters appear in a path component, or in any other value passed to Pyndexter, trouble may arise.

For example, if you have IRC logs in a directory named #channel and you wish to index these documents, you might naively think the following would work:

framework.add_source('file:///var/log/irc/#channel')

But of course, the #channel is treated as a URI fragment and Pyndexter explodes. To get around this you will need to quote the directory name:

from pyndexter.util import quote

...

framework.add_source('file://%s' % quote('/var/log/irc/#channel'))

...

or construct the Source object explicitly:

from pyndexter.sources.file import FileSource

...

source = FileSource(framework, '/var/log/irc/#channel')
framework.add_source(source)

...
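
The fragment problem described in this section can be reproduced with the Python 3 standard library alone. In the sketch below, `urllib.parse.quote` stands in for pyndexter.util.quote, whose exact behaviour may differ; only the general percent-encoding mechanism is being shown.

```python
from urllib.parse import quote, unquote, urlsplit

# Unquoted: everything after '#' is split off as the URI fragment.
raw_parts = urlsplit('file:///var/log/irc/#channel')
print(raw_parts.path, raw_parts.fragment)

# Quoted: '#' becomes %23 and stays inside the path component.
quoted = 'file://%s' % quote('/var/log/irc/#channel')
quoted_parts = urlsplit(quoted)
print(quoted)
print(unquote(quoted_parts.path), repr(quoted_parts.fragment))
```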

Filters

Framework.update() normally grinds away silently, giving no feedback. A callback mechanism is provided to allow both filtering of result sets and progress feedback.

The callback is a callable with the signature (framework, context, stream). It must pass stream, which is an iterable, through to the caller as an iterable.

A filter for generating and displaying timing information is included:

from pyndexter.util import TimingFilter

...

timer = TimingFilter(progressive=True)
framework.update(filter=timer)

...
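
A custom callback following the signature described in this section could be written as a generator. The sketch below counts items as they pass through and reports progress on standard error; the name `make_progress_filter`, the `every` parameter, and the reporting format are hypothetical, and the items in the stream are treated as opaque because their type is not specified here.

```python
import sys


def make_progress_filter(every=100, out=sys.stderr):
    """Build a filter callback that reports after every `every` items."""
    def progress_filter(framework, context, stream):
        count = 0
        for item in stream:
            count += 1
            if count % every == 0:
                out.write('%d items processed\n' % count)
            yield item  # pass every item through unchanged
        out.write('done: %d items\n' % count)
    return progress_filter


# Exercise the callback on a dummy stream, without a real Framework.
callback = make_progress_filter(every=2)
passed = list(callback(None, None, ['a', 'b', 'c']))
print(passed)
```

Assuming a custom callable is accepted in the same way as TimingFilter, it would be passed as framework.update(filter=callback).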

Indexing Arbitrary Content

Indexing arbitrary content is pretty easy: all you need is the content and a unique identifier to use as its URI.

The only minor thing you lose is the Framework's ability to automatically determine what has changed in the index.

from pyndexter import Framework, Document

framework = Framework('builtin:///tmp/pyndexter.idx')

for unique_id, content in some_document_source:
	doc = Document(uri=unique_id, content=content)
	framework.index(doc)

A fairly complete example of a custom document source plus manual indexing is available in the Trac RepoSearch plugin.
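
In the indexing loop above, some_document_source can be any iterable of (unique_id, content) pairs. As a sketch, the generator below produces such pairs from an in-memory list of records; the `note://` URI scheme, the record layout, and the function name `notes_source` are invented for the example.

```python
def notes_source(records):
    """Yield (unique_id, content) pairs suitable for the indexing loop."""
    for record_id, title, body in records:
        # The identifier only needs to be unique; it serves as the URI.
        unique_id = 'note://%d' % record_id
        content = '%s\n\n%s' % (title, body)
        yield unique_id, content


records = [
    (1, 'Install notes', 'Tested on linux.'),
    (2, 'Porting', 'Windows support is incomplete.'),
]
some_document_source = notes_source(records)
for unique_id, content in some_document_source:
    print(unique_id, len(content))
```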