pyndexter.__init__
Pyndexter provides a uniform API for accessing a variety of full-text indexing engines. It is similar in purpose to the Python DB API.
The main class users will be dealing with is Framework. This presents a convenient interface to the backend indexers.
An example of indexing all .txt files underneath /usr/share/doc:
    import os
    from pyndexter import Framework, Document

    framework = Framework('hyperestraier:///tmp/hyperestraier.idx')

    path = '/usr/share/doc'
    # os.listdir() returns bare names, so join each one to the directory.
    for file in [os.path.join(path, f) for f in os.listdir(path)
                 if f.endswith('.txt')]:
        doc = Document(file, open(file).read())
        framework.index(doc)

    # Find all documents with Linus and Torvalds in them
    for hit in framework.search('Linus Torvalds'):
        print hit.uri

    framework.close()
class Error()
Base of all pyndexter exceptions.
class DocumentNotFound()
Raised when a document could not be found, usually by the fetch() methods.
class InvalidURI()
The URI provided was invalid in that context.
class SourceError()
Base of all exceptions raised exclusively by Sources.
class InvalidState()
The state provided to a source was invalid.
class IndexerError()
Base of all exceptions raised exclusively by Indexers.
class InvalidMode()
The mode (READONLY or READWRITE) of the indexer is an invalid state for a particular operation.
class InvalidQuery()
Invalid query string.
class FrameworkError()
Base of Framework errors.
class InvalidModule()
The module provided was not loadable.
__init__(self, module, exception=None)
(Not documented)
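The descriptions above suggest a layered hierarchy in which a caller can catch either a specific error or one of the bases. The following is a minimal sketch written for Python 3 that uses stand-in classes instead of importing pyndexter. The parent chosen for each class (for example InvalidState under SourceError) is an assumption inferred from the wording above, and classify(), run() and raise_invalid_query() are hypothetical helpers.

```python
# Stand-in classes that mirror the documented names. The inheritance
# shown here is assumed from the descriptions, not taken from pyndexter.
class Error(Exception):
    """Base of all exceptions in this sketch."""

class DocumentNotFound(Error):
    pass

class SourceError(Error):
    pass

class InvalidState(SourceError):
    pass

class IndexerError(Error):
    pass

class InvalidQuery(IndexerError):
    pass


def classify(exc):
    """Name the documented family that an exception belongs to."""
    for base in (SourceError, IndexerError):
        if isinstance(exc, base):
            return base.__name__
    return 'Error' if isinstance(exc, Error) else 'unrelated'


def run(action):
    """Call action() and describe any sketch exception it raises."""
    try:
        action()
    except Error as exc:  # one handler covers every subclass
        return classify(exc)
    return 'ok'


def raise_invalid_query():
    """Hypothetical operation that fails with an indexer-level error."""
    raise InvalidQuery('unbalanced quote')
```

Catching the common base keeps calling code independent of the backend in use, while the intermediate bases still allow source and indexer failures to be told apart.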
class Document()
A Document represents an indexable object in pyndexter. All string attributes must be unicode, including the content.
- content: Optional. If not provided, it will be fetched from the source. If it is a callable, it will be called to fetch the content, with the uri passed as the only argument.
- change: A numeric value representing the current point in the document's lifetime. Typically a timestamp, but it could be a revision number, etc.
- source: The Source object from which this document's content can be lazily fetched.
__init__(self, uri, content=None, source=None, quality=1.0, **attributes)
(Not documented)
__repr__(self)
(Not documented)
__getattr__(self, key)
(Not documented)
__contains__(self, key)
(Not documented)
__hash__(self)
(Not documented)
get(self, key, default=None)
(Not documented)
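The callable form of content described above can be pictured with a small stand-in class. This is a sketch for Python 3 under stated assumptions, not pyndexter's Document: LazyDocument, loader and calls are hypothetical names, and caching the fetched content after the first access is an assumption.

```python
class LazyDocument(object):
    """Hypothetical stand-in, not pyndexter's Document."""

    def __init__(self, uri, content=None, **attributes):
        self.uri = uri
        self._content = content
        self.attributes = attributes

    @property
    def content(self):
        # A callable is invoked with the uri as its only argument.
        # Keeping the result for later reads is an assumption.
        if callable(self._content):
            self._content = self._content(self.uri)
        return self._content

    def get(self, key, default=None):
        """Return an extra attribute, or default if it is absent."""
        return self.attributes.get(key, default)


calls = []  # records each fetch so the laziness is observable

def loader(uri):
    calls.append(uri)
    return 'body of ' + uri

doc = LazyDocument('file:///tmp/a.txt', content=loader, change=1700000000)
```

Nothing is fetched when the document is constructed. The loader runs on the first read of doc.content only.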
class QueryNode()
A query parse node.
    >>> QueryNode(QueryNode.TERM, 'one')
    ("one")
    >>> QueryNode(QueryNode.AND,
    ...           left=QueryNode(QueryNode.TERM, 'one'),
    ...           right=QueryNode(QueryNode.TERM, 'two'))
    (and ("one") ("two"))
    >>> QueryNode(QueryNode.NOT, left=QueryNode(QueryNode.TERM, 'one'))
    (not ("one") nil)
class Query()
Query parser. Converts a simple query language into a parse tree which Indexers can then convert into their own implementation-specific representation.
The query language is in the following form:
    <term> <term>       document must contain all of these terms
    "some term"         return documents matching this exact phrase
    -<term>             exclude documents containing this term
    <term> or <term>    return documents matching either term
    <attr>:<term>       return documents with term in the specified attribute
For example:
    >>> Query('lettuce tomato -cheese')
    (and ("lettuce") (and ("tomato") (not ("cheese") nil)))
    >>> Query('"mint slices" -timtams')
    (and ("mint slices") (not ("timtams") nil))
    >>> Query('"brie cheese" or "camembert cheese"')
    (or ("brie cheese") ("camembert cheese"))
    >>> Query('one two:three')
    (and ("one") (two:"three"))
__init__(self, phrase)
(Not documented)
parse(self, tokens)
(Not documented)
parse_unary(self, tokens)
Parse a unary operator. Currently only NOT.
    >>> q = Query('')
    >>> q.parse_unary(q._tokenise('-foo'))
    (not ("foo") nil)
parse_terminal(self, tokens)
Parse a terminal token.
    >>> q = Query('')
    >>> q.parse_terminal(q._tokenise('foo'))
    ("foo")
terms(self, exclude_not=True)
A generator returning the terms contained in the Query.
__call__(self, text)
Match the query against a block of text. The Query will be lazily compiled to Python code.
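Matching a parse tree against text can be illustrated without the library. The sketch below, written for Python 3, walks a tree of nested tuples directly instead of compiling it to Python code as described above, so it is a swapped-in technique and not the library's method. The tuple layout, the matches() helper and the case-insensitive substring test are all assumptions.

```python
def matches(node, text):
    """Evaluate a tuple parse tree against text.

    Terms are tested as case-insensitive substrings (an assumption).
    """
    kind = node[0]
    if kind == 'term':
        return node[1].lower() in text.lower()
    if kind == 'not':
        return not matches(node[1], text)
    if kind == 'and':
        return matches(node[1], text) and matches(node[2], text)
    if kind == 'or':
        return matches(node[1], text) or matches(node[2], text)
    raise ValueError('unknown node type: %r' % (kind,))


# Tree shaped like the documented parse of 'lettuce tomato -cheese'.
tree = ('and', ('term', 'lettuce'),
        ('and', ('term', 'tomato'), ('not', ('term', 'cheese'))))
```

With this tree, a text that mentions lettuce and tomato but not cheese matches, and any text that mentions cheese is rejected.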
as_string(self, and_=' AND ', or_=' OR ', not_='NOT ')
Convert Query to a boolean expression. Useful for indexers with "typical" boolean query syntaxes, e.g. "term AND term OR term AND NOT term".
The expanded operators can be customised for syntactical variations.
    >>> Query('foo bar').as_string()
    'foo AND bar'
    >>> Query('foo bar or baz').as_string()
    'foo AND bar OR baz'
    >>> Query('foo -bar or baz').as_string()
    'foo AND NOT bar OR baz'
reduce(self, reduce)
Pass each TERM node through Reducer.
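The as_string() conversion with customisable operators can be sketched as a recursive walk over a parse tree. This is a Python 3 illustration using nested tuples as a hypothetical stand-in for QueryNode. It reuses the documented default operator strings, but the function itself is not pyndexter's implementation.

```python
def as_string(node, and_=' AND ', or_=' OR ', not_='NOT '):
    """Render a tuple parse tree as a flat boolean expression."""
    kind = node[0]
    if kind == 'term':
        return node[1]
    if kind == 'not':
        return not_ + as_string(node[1], and_, or_, not_)
    # Binary nodes join their rendered children with the operator string.
    joiner = {'and': and_, 'or': or_}[kind]
    return joiner.join(as_string(child, and_, or_, not_)
                       for child in node[1:])


tree = ('and', ('term', 'foo'), ('not', ('term', 'bar')))
default_form = as_string(tree)
symbol_form = as_string(tree, and_=' & ', not_='!')
```

Like the documented output, the sketch emits no parentheses, so any grouping in the tree is flattened in the resulting string.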
class Reducer()
Compact all words in a block of text.
__init__(self, words_re='\\w+', stemmer=w0w, min_word_length=3, max_word_length=64, unique=False, split=False, lower=True)
words_re is a regular expression object or string.
stemmer is a callable that stems a single word.
If unique is true, return a string of unordered words with duplicates removed.
If split is true, return words in a collection rather than joining them into a single string.
If lower is true, lowercase text.
__call__(self, text, unique=None, split=None)
(Not documented)
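The options above can be read as a small text pipeline. The following Python 3 function is a sketch of that reading, not the Reducer itself: reduce_text() is a hypothetical name, the identity default for stemmer is an assumption, and so are the order of the steps (lowercase, extract words, filter by length, stem) and the sorting of unique words.

```python
import re


def reduce_text(text, words_re=r'\w+', stemmer=lambda w: w,
                min_word_length=3, max_word_length=64,
                unique=False, split=False, lower=True):
    """Compact the words in text according to the documented options."""
    if lower:
        text = text.lower()
    # Accept either a pattern string or a precompiled regular expression.
    pattern = re.compile(words_re) if isinstance(words_re, str) else words_re
    words = [stemmer(w) for w in pattern.findall(text)
             if min_word_length <= len(w) <= max_word_length]
    if unique:
        # The documented result is unordered; sorting only makes this
        # example deterministic.
        words = sorted(set(words))
    return words if split else ' '.join(words)
```

For instance, reduce_text('spam Spam eggs', unique=True, split=True) collapses the two spellings of "spam" into one entry because lowercasing happens before duplicates are removed.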
class Indexer()
An Indexer performs document indexing and searching. This base object provides a framework for indexers.
__init__(self, framework)
Initialise indexer.
close(self)
Close the indexer. The object is subsequently not usable.
flush() is automatically called by the Framework prior to close().
index(self, document)
Index a single Document object.
discard(self, uri)
Discard a document.
search(self, query)
Search with the given Query.
__iter__(self)
Iterate over all URIs in the index.
fetch(self, uri)
Attempt to fetch the indexer's representation of the document.
Must return a Document object with a quality attribute between 0.0 and 1.0, representing the quality of the document in comparison to the original.
replace(self, document)
Replace a document in the index. Default is to discard() and index().
optimise(self)
Optimise the indexer.
flush(self)
Flush indexer state to disk.
state_store(self)
If this Indexer is capable of storing framework state, return a StateStore object. By default, if the indexer has a state_path attribute, a new StateStore object will be returned on that path.
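The core of the interface above (index, discard, replace, search, iteration) can be sketched with an in-memory stand-in. This Python 3 class is hypothetical and does not subclass pyndexter's Indexer. For brevity it takes a uri and a content string where the documented methods take Document and Query objects, and its search simply requires every term to appear in the content.

```python
class MemoryIndexer(object):
    """Hypothetical in-memory stand-in for the Indexer interface."""

    def __init__(self):
        self.docs = {}  # maps uri -> content

    def index(self, uri, content):
        self.docs[uri] = content

    def discard(self, uri):
        # Discarding an unknown uri is silently ignored in this sketch.
        self.docs.pop(uri, None)

    def replace(self, uri, content):
        # Same default as documented: discard(), then index().
        self.discard(uri)
        self.index(uri, content)

    def search(self, terms):
        """Return the URIs whose content contains every term."""
        return sorted(uri for uri, content in self.docs.items()
                      if all(term in content for term in terms))

    def __iter__(self):
        return iter(sorted(self.docs))


idx = MemoryIndexer()
idx.index('a', 'linus torvalds')
idx.index('b', 'linus pauling')
first = idx.search(['linus', 'torvalds'])
idx.replace('b', 'torvalds linus')
second = idx.search(['linus', 'torvalds'])
```

A backend that can update a document in place would override replace() instead of relying on the discard-then-index default.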
class PluginFactory()
Factory for translating URL-style query parameters into a standard plugin constructor call.
    >>> class C:
    ...     def __init__(self, one, two, three=3):
    ...         print one, two, three
    >>> f = PluginFactory(C, three=int, four="three")
    >>> c = f(one=1, two=2, three=3)
    1 2 3
    >>> c = f(uri='scheme://?one=1&two=2&three=three')
    Traceback (most recent call last):
    ...
    ValueError: could not coerce argument "three" with value "three" to type "<type 'int'>": invalid literal for int(): three
    >>> c = f(uri='scheme://?one=1&two=2&four=3')
    1 2 3
class List()
Translate a parameter that is a list of elements of type, optionally splitting on commas.
__init__(self, plugin, **arg_types)
Create a new factory.
arg_types is a dictionary of <arg>:<type> mappings. If <type> is a string, <arg> will be renamed to this before calling the plugin constructor.
__call__(self, uri=None, **kwargs)
(Not documented)
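The renaming and coercion rules for arg_types can be sketched with the standard library alone. The function below is a Python 3 illustration, not PluginFactory: build_kwargs() is a hypothetical name, and applying renames before coercion is an assumption chosen to be consistent with the doctest above, where four=3 ends up as the integer argument three.

```python
from urllib.parse import urlsplit, parse_qsl


def build_kwargs(uri, **arg_types):
    """Turn a URI query string into constructor keyword arguments."""
    kwargs = dict(parse_qsl(urlsplit(uri).query))
    # Pass 1: a string entry renames an argument.
    for arg, target in arg_types.items():
        if isinstance(target, str) and arg in kwargs:
            kwargs[target] = kwargs.pop(arg)
    # Pass 2: a callable entry coerces the (possibly renamed) value.
    for arg, convert in arg_types.items():
        if not isinstance(convert, str) and arg in kwargs:
            try:
                kwargs[arg] = convert(kwargs[arg])
            except ValueError as exc:
                raise ValueError('could not coerce argument "%s" with '
                                 'value "%s": %s' % (arg, kwargs[arg], exc))
    return kwargs


params = build_kwargs('scheme://?one=1&two=2&four=3', three=int, four='three')
```

Arguments without an entry in arg_types stay as strings, which is why one and two come back as '1' and '2' while three is the integer 3.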
class Framework()
The glue. Ties Indexer and Source together, performs housekeeping tasks and provides a convenient interface to it all.
If the Indexer is not capable of storing state and automatic updates are desired, a StateStore object should be passed to the Framework.
__init__(self, indexer=None, mode=READWRITE, reduce=None, stemmer=None)
indexer is a URI used to construct an indexer, or an Indexer object.
reduce is a Reducer object. If reduce is not specified, a default Reducer object will be instantiated using stemmer (a URI or callable). NOTE: Use of the reducer is optional; some indexers may implement stemming and reduction internally.
set_indexer(self, indexer)
Set the Framework indexer. Can either be a URI or an Indexer object.
get_indexer(self)
(Not documented)
fetch(self, uri)
Fetch a document.
__iter__(self)
Iterate over all URIs in the indexer.
index(self, document)
Index a single document, specified as a Document object.
discard(self, document)
Discard the specified document from the index, specified as either a Document object or a URI.
replace(self, document)
Replace document in the index, specified as a Document object.
search(self, query)
Search the index for documents matching the given query. This method is guaranteed to work across all indexers.
query is a pyndexter-compatible search string.
Returns a Result object.
close(self)
Sync and close the indexer. The object is subsequently not usable.
optimise(self)
Optimise the indexer.
flush(self)
Flush indexer state to disk.
class Result()
Represents the result of a search. Each hit is returned as a Hit object.
__init__(self, indexer, query, context)
(Not documented)
__iter__(self)
Return an iterator over the result set, returning a Hit object for each matching document.
__len__(self)
Return the length of the result set.
__getitem__(self, index)
Return a Hit object for a specific index in the search result. Not necessarily implemented by all Indexers.
__getslice__(self, i, j)
Return an iterator over a slice of the search set.
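The sequence protocol above can be sketched with a list-backed stand-in. ListResult is a hypothetical class written for Python 3, where __getslice__ is no longer consulted and slices arrive at __getitem__ as slice objects. The sketch therefore handles both cases in one method while keeping the documented behaviour of returning an iterator for a slice.

```python
class ListResult(object):
    """Hypothetical result wrapper backed by a plain list of hits."""

    def __init__(self, hits):
        self._hits = list(hits)

    def __iter__(self):
        return iter(self._hits)

    def __len__(self):
        return len(self._hits)

    def __getitem__(self, index):
        # Python 3 passes result[i:j] here as a slice object;
        # __getslice__ is only used by Python 2.
        if isinstance(index, slice):
            return iter(self._hits[index])
        return self._hits[index]


result = ListResult(['a', 'b', 'c', 'd'])
page = list(result[1:3])
```

Slicing is a natural way to page through a large result set without materialising every hit.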
class Hit()
Wrapper around a search hit. If current is a callable, it should be a function that fetches the Document associated with uri, which is passed as the only argument.
__init__(self, uri, current=None, indexed=None, **attributes)
(Not documented)
get(self, key, default=None)
Get an attribute, but if it doesn't exist return a default value.
get_document(self)
Fetch the active document, preferring to fetch a fresh document from the source, but falling back on the indexed version.
__getattr__(self, key)
Access hit attributes.
__contains__(self, key)
Determine whether a Hit contains an attribute.
__repr__(self)
(Not documented)
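The fallback order described for get_document() can be sketched as follows. SketchHit is a hypothetical Python 3 stand-in, not pyndexter's Hit. Treating LookupError as the signal that the fresh fetch failed is an assumption (the library presumably uses its own exceptions), as is treating indexed as a ready-made document.

```python
class SketchHit(object):
    """Hypothetical stand-in for Hit; not pyndexter's implementation."""

    def __init__(self, uri, current=None, indexed=None, **attributes):
        self.uri = uri
        self.current = current
        self.indexed = indexed
        self.attributes = attributes

    def get(self, key, default=None):
        """Return an attribute, or default if it does not exist."""
        return self.attributes.get(key, default)

    def get_document(self):
        # Prefer a fresh copy; a callable receives the uri as its
        # only argument, as documented.
        if callable(self.current):
            try:
                return self.current(self.uri)
            except LookupError:
                pass  # assumed failure signal; fall back below
        elif self.current is not None:
            return self.current
        # Fall back on the version stored by the indexer.
        return self.indexed


def missing(uri):
    """Hypothetical fetcher for a document that no longer exists."""
    raise LookupError(uri)


fresh = SketchHit('u', current=lambda uri: 'fresh ' + uri, indexed='stale')
stale = SketchHit('u', current=missing, indexed='stale', score=0.5)
```

Here fresh.get_document() returns the newly fetched text, while stale.get_document() falls back to the indexed copy because the fetcher fails.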
