Changeset 332

Timestamp:
30/04/06 14:09:58 (4 years ago)
Author:
athomas
Message:

pyndexter:

  • Added a horribly inefficient built-in indexer, default.DefaultIndexer. There seems to be a memory leak somewhere, so on large datasets the indexer will consume large amounts of memory.
  • Added a util module with CacheDict, currently used by the default indexer.
  • Added some more CAP_ bits.
  • Source state data is now accumulated by the Source class's __iter__() method, no longer requiring a full walk of the source to collect state. This means an Indexer.update() will automagically do the right thing.
  • Factored out some common environment initialisation code into Indexer._init_env().
  • Factored FileSource include/exclude/predicate code into base Source class so it can be reused.
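
The state-accumulation change described in the message can be illustrated with a minimal sketch. It is written for Python 3, whereas the changeset itself is Python 2, and `SketchSource` with its sample data is a hypothetical stand-in, not the real pyndexter `Source` class.

```python
class SketchSource:
    """Hypothetical stand-in for a pyndexter Source (assumption, not the real API)."""

    def __init__(self, documents):
        # documents maps uri -> modification time; illustrative data only
        self._documents = dict(documents)
        self._state = {}

    def __iter__(self):
        # Record each document's state as a side effect of walking the source,
        # so no second full walk is needed just to collect state.
        for uri, mtime in sorted(self._documents.items()):
            self._state[uri] = mtime
            yield uri

    def state(self):
        # Before any iteration there is nothing to report, so return None.
        return dict(self._state) or None


source = SketchSource({"file:///a.txt": 10.0, "file:///b.txt": 20.0})
state_before = source.state()   # nothing walked yet
walked = list(source)           # the walk an indexer would do anyway
state_after = source.state()    # state is now available for free
```

Under this assumption, whichever code iterates the source to index it also leaves the state ready to be saved afterwards.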
Location:
pyndexter/trunk
Files:
3 added
6 modified

  • pyndexter/trunk/.todo

    r329 r332  
    99        Finish PyLucene adapter 
    1010    </note> 
    11     <note priority="medium" time="1145854608"> 
     11    <note priority="medium" time="1145854608" done="1146296772"> 
    1212        Finish MetaSource 
    1313    </note> 
     14    <note priority="medium" time="1146296806"> 
     15        Optimise on disk format for DefaultIndexer. Use URI/word "ids" rather than full word. 
     16    </note> 
     17    <note priority="medium" time="1146321654"> 
     18        I think it might need a MIME filter system, for translating known content types to plain text for indexing. eg. Just the content of HTML pages. This could get out of hand. 
     19    </note> 
     20    <note priority="medium" time="1146328561" done="1146368244"> 
     21        state() is being called, which in the naive implementation simply walks the entire source. Need some way around this. Should the state() be accumulated somehow when the source is being walked? 
     22    </note> 
     23    <note priority="medium" time="1146331225" done="1146368238"> 
     24        HTTPSource should be able to handle multiple iterations, but self._traversed renders this impossible. 
     25    </note> 
    1426</todo> 
  • pyndexter/trunk/pyndexter/__init__.py

    r331 r332  
    2323 
    2424CAP_READONLY CAP_ORDERING CAP_CONTENT CAP_ATTRIBUTES CAP_RELEVANCE CAP_HITCOUNT 
    25 CAP_LIST CAP_ITERATION 
     25CAP_LIST CAP_ITERATION CAP_ASTERISK CAP_QUESTION CAP_WHOLEWORD CAP_UNION 
     26CAP_INTERSECTION 
    2627 
    2728Document Source Indexer Search Hit 
     
    4647CAP_LIST = 64           # Search result supports list-style lookup 
    4748CAP_ITERATION = 128     # Supports index iteration 
     49CAP_ASTERISK = 256      # Supports the asterisk wildcard (*<term>*) 
     50CAP_QUESTION = 512      # Supports the single character wildcard (a?c) 
     51CAP_WHOLEWORD = 1024    # Performs whole word searches by default 
     52CAP_UNION = 2048        # Supports unions (ie. matches documents with any word) 
     53CAP_INTERSECTION = 4096 # Supports intersections (ie. matches documents with 
     54                        # all words) 
    4855 
    4956 
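
Capability flags of this kind only work if every flag owns a distinct bit. The following Python 3 sketch shows one way to define and check such flags. The names mirror the hunk above, but the shifted values and the helpers `supports` and `shared_bits` are assumptions made for illustration, not pyndexter definitions.

```python
# Capability flags as single bits. Values are this sketch's assumption.
CAP_ASTERISK = 1 << 8       # 256
CAP_QUESTION = 1 << 9       # 512
CAP_WHOLEWORD = 1 << 10     # 1024
CAP_UNION = 1 << 11         # 2048
CAP_INTERSECTION = 1 << 12  # 4096


def supports(capabilities, flag):
    """Return True if every bit of flag is set in capabilities."""
    return (capabilities & flag) == flag


def shared_bits(flags):
    """Return the bits claimed by more than one flag (0 if none collide)."""
    seen = 0
    collisions = 0
    for flag in flags:
        collisions |= seen & flag
        seen |= flag
    return collisions


# An indexer advertises its features by OR-ing flags together.
indexer_caps = CAP_WHOLEWORD | CAP_ASTERISK | CAP_INTERSECTION
```

Writing the values as shifts makes it harder to reuse a bit by accident, and a check like `shared_bits` could run in a unit test.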
     
    97104            return self.attributes[key] 
    98105        except KeyError, e: 
    99             raise AttributeError(str(e)) 
     106            raise AttributeError(unicode(e)) 
    100107 
    101108    def __hash__(self): 
     
    116123    """ A source of indexable documents. A Source object is responsible for not 
    117124    only fetching documents and iterating over them, but for determining what 
    118     has changed in the source. This is achieved with the state() and 
    119     difference() methods. The ''state'' of a source is the minimum information 
    120     required to be able to determine what has changed. For FileSource this is a 
    121     list of all files and their modification times, for a SubversionSource it 
    122     would be as simple as the changeset number. 
     125    has changed in the source. 
     126     
     127    Determining what has changed is achieved with the state() and difference() 
     128    methods. The ''state'' of a source is the minimum information required to 
     129    be able to determine what has changed. For FileSource this is a list of all 
     130    files and their modification times, for a SubversionSource it would be as 
     131    simple as the changeset number. The default state() and difference() 
     132    methods use the data in self._state. 
    123133 
    124134    (All attributes, including document contents and URI's must be in unicode) 
    125135    """ 
     136 
     137    def __init__(self, include=['*'], exclude=[], predicate=None): 
     138        self.include = include 
     139        self.exclude = exclude 
     140        self.predicate = predicate or self._glob_predicate 
     141        self._state = {} 
    126142 
    127143    def matches(self, uri): 
     
    158174        """ Return a raw byte string representing the current state of this 
    159175        source.  Storage and retrieval of this byte string is typically handled 
    160         by the Indexer. """ 
    161         state = {} 
    162         for uri in self: 
    163             state[uri] = self.fetch(uri).changed 
    164         state = pickle.dumps(state, 2) 
    165         compressed = StringIO() 
    166         gzip.GzipFile(filename='pyndexer source state', fileobj=compressed, 
    167                       mode='wb', compresslevel=1).write(state) 
    168         return compressed.getvalue() 
     176        by the Indexer. If this method returns false, the Indexer will assume 
     177        that state information is not available, and do nothing. """ 
     178        if not self._state: 
     179            return None 
     180        return self._marshal_state(self._state) 
    169181 
    170182    def difference(self, state): 
     
    173185        tuple is in the form `(<transition>, uri)`, where <transition> is one 
    174186        of ADDED, REMOVED or MODIFIED. """ 
    175         state = StringIO(state) 
    176         try: 
    177             ungzipped = gzip.GzipFile(fileobj=state, mode='rb').read() 
    178             state = pickle.loads(ungzipped) 
    179         except Exception, e: 
    180             raise InvalidState('Invalid state provided to document source. ' 
    181                                'Exception was %s: %s' % (e.__class__.__name__, e)) 
    182187        current = set() 
     188        state = self._unmarshal_state(state) 
    183189        for uri in self: 
    184190            current.add(uri) 
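
The body of difference() is mostly elided in this hunk, so the following is only a sketch of the general idea in Python 3: compare a previously stored state with the current one and report transitions. The constant values and the standalone function signature are assumptions, not pyndexter's actual code.

```python
# Placeholder transition constants; pyndexter defines its own values.
ADDED, REMOVED, MODIFIED = "added", "removed", "modified"


def difference(old_state, new_state):
    """Yield (transition, uri) tuples describing how new_state differs from old_state.

    Both arguments map uri -> change marker (for example a modification time).
    """
    for uri, changed in sorted(new_state.items()):
        if uri not in old_state:
            yield (ADDED, uri)
        elif old_state[uri] != changed:
            yield (MODIFIED, uri)
    # Anything known before but absent now has been removed.
    for uri in sorted(set(old_state) - set(new_state)):
        yield (REMOVED, uri)


old = {"a": 1, "b": 2, "c": 3}
new = {"a": 1, "b": 5, "d": 4}
changes = list(difference(old, new))
```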
     
    190196            yield (REMOVED, removed) 
    191197 
     198    # Useful helper methods 
     199    def _glob_predicate(self, uri): 
     200        """ Given a list of include and exclude pattern lists, return whether 
     201        the given uri matches. """ 
     202        from fnmatch import fnmatch 
     203        for pattern in self.exclude: 
     204            if fnmatch(uri, pattern): 
     205                return False 
     206        for pattern in self.include: 
     207            if fnmatch(uri, pattern): 
     208                return True 
     209        return False 
     210 
     211    def _marshal_state(self, state): 
     212        """ Pickle and compress state. This is used by the default state() 
     213        implementation, but can be reused. """ 
     214        state = pickle.dumps(state, 2) 
     215        compressed = StringIO() 
     216        gzip.GzipFile(filename='pyndexer source state', fileobj=compressed, 
     217                      mode='wb', compresslevel=1).write(state) 
     218        return compressed.getvalue() 
     219 
     220    def _unmarshal_state(self, state): 
     221        """ Uncompress and unpickle state. Used by the default difference() 
     222        method, but can be reused. """ 
     223        state = StringIO(state) 
     224        try: 
     225            ungzipped = gzip.GzipFile(fileobj=state, mode='rb').read() 
     226            return pickle.loads(ungzipped) 
     227        except Exception, e: 
     228            raise InvalidState('Invalid state provided to document source. ' 
     229                               'Exception was %s: %s' % (e.__class__.__name__, e)) 
    192230 
    193231class Indexer(object): 
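
The marshal/unmarshal helpers above can be exercised as a round trip. This is a Python 3 sketch, so it uses `io.BytesIO` where the Python 2 original uses `StringIO`. The function names and the `InvalidState` class here are local stand-ins. The sketch closes the `GzipFile` explicitly so the gzip trailer is written before the buffer is read.

```python
import gzip
import io
import pickle


class InvalidState(Exception):
    """Stand-in for pyndexter's InvalidState exception."""


def marshal_state(state):
    """Pickle and gzip-compress a state mapping into a byte string."""
    raw = pickle.dumps(state, 2)
    buffer = io.BytesIO()
    # Leaving the with-block closes the GzipFile (flushing the trailer)
    # but does not close the underlying buffer.
    with gzip.GzipFile(fileobj=buffer, mode="wb", compresslevel=1) as gz:
        gz.write(raw)
    return buffer.getvalue()


def unmarshal_state(blob):
    """Reverse marshal_state(), raising InvalidState on any failure."""
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(blob), mode="rb") as gz:
            return pickle.loads(gz.read())
    except Exception as e:
        raise InvalidState("Invalid state provided to document source. "
                           "Exception was %s: %s" % (e.__class__.__name__, e)) from e


blob = marshal_state({"file:///a.txt": 10.0})
restored = unmarshal_state(blob)
```

Unpickling is only safe on data the application wrote itself, which holds here as long as the state file is not shared with untrusted parties.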
     
    209247        on the Source copy, if available. """ 
    210248        if not self.source: 
    211             raise IndexerError("This indexer has no Source object associated " 
    212                                "with and as such can not fetch() documents.") 
     249            raise IndexerError("This indexer has no associated Source object " 
     250                               "and as such can not fetch() documents.") 
    213251        return self.source.fetch(uri) 
    214252 
     
    226264                               "capable of automatic updates.") 
    227265        if os.path.exists(self.state_path): 
    228             state = open(self.state_path).read() 
     266            try: 
     267                state = open(self.state_path).read() 
     268            except Exception, e: 
     269                raise IndexerError("Source state '%s' is not readable. " 
     270                                   "Exception was %s: %s" %  
     271                                   (self.state_path, e.__class__.__name__, 
     272                                    unicode(e))) 
     273 
    229274            for transition, uri in self.source.difference(state): 
    230275                if transition == REMOVED: 
     
    282327        constructor. """ 
    283328        if self.mode == READWRITE and self.source and self.state_path: 
    284             open(self.state_path, 'w').write(self.source.state()) 
     329            state = self.source.state() 
     330            if state: 
     331                open(self.state_path, 'w').write(state) 
     332 
     333    def _init_env(self, path): 
     334        """ Create a default environment with a <path> base directory. """ 
     335        if not os.path.exists(path): 
     336            if self.mode != READWRITE: 
     337                raise IndexerError("Indexer environment has not been initialised") 
     338            os.makedirs(path) 
    285339 
    286340class Search(object): 
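
The behaviour intended for _init_env() can be tried in isolation with a small Python 3 sketch. The mode constants and the `IndexerError` class below are placeholders for the pyndexter originals, and the free function `init_env` is an assumption standing in for the method.

```python
import os
import tempfile

READONLY, READWRITE = "r", "rw"  # placeholder mode constants for this sketch


class IndexerError(Exception):
    """Stand-in for pyndexter's IndexerError."""


def init_env(path, mode):
    """Create the base directory, but only when opened for writing."""
    if not os.path.exists(path):
        if mode != READWRITE:
            raise IndexerError("Indexer environment has not been initialised")
        os.makedirs(path)


with tempfile.TemporaryDirectory() as base:
    env = os.path.join(base, "index")
    try:
        init_env(env, READONLY)      # missing directory, read-only: error
        readonly_failed = False
    except IndexerError:
        readonly_failed = True
    init_env(env, READWRITE)         # missing directory, read-write: created
    created = os.path.isdir(env)
    init_env(env, READONLY)          # existing directory: fine in either mode
```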
     
    329383            return self.attributes[key] 
    330384        except KeyError, e: 
    331             raise AttributeError(str(e)) 
     385            raise AttributeError(unicode(e)) 
    332386 
    333387    def _get_document(self): 
  • pyndexter/trunk/pyndexter/file.py

    r331 r332  
    11import sys 
    22import codecs 
    3 import os.path 
    4 from fnmatch import fnmatch 
    5 from dircache import listdir 
     3import os 
     4from stat import * 
    65from urlparse import urlsplit, urlunsplit 
    76 
     
    1110    def __init__(self, root, include=['*'], exclude=[], predicate=None): 
    1211        """ Expose a subset of the file system for searching. """ 
     12        Source.__init__(self, include, exclude, predicate) 
    1313        self.root = os.path.normpath(root) 
    14         self.include = include 
    15         self.exclude = exclude 
    16         self.predicate = predicate or self._glob_predicate 
    1714        self.encoding = sys.getfilesystemencoding() 
    1815 
     
    2118            path = path.strip(os.path.sep) 
    2219            root_path = os.path.join(self.root, path) 
    23             for file in listdir(root_path): 
     20            for file in os.listdir(root_path): 
    2421                full_path = os.path.join(root_path, file) 
    25                 if os.path.isdir(full_path): 
     22                try: 
     23                    stat = os.lstat(full_path) 
     24                except OSError: 
     25                    continue 
     26                if not self.predicate(full_path) or not os.access(full_path, os.R_OK): 
     27                    continue 
     28                if S_ISDIR(stat.st_mode): 
    2629                    for file in walk_path(os.path.join(path, file)): 
    2730                        yield file 
    28                 elif self.predicate(full_path) and os.path.exists(full_path): 
    29                     # TODO Stat for normal files + readability 
    30                     yield self._file2uri(full_path) 
     31                elif S_ISREG(stat.st_mode): 
     32                    yield (self._file2uri(full_path).decode(self.encoding), stat) 
    3133 
    32         for file in walk_path('/'): 
    33             yield file.decode(self.encoding) 
     34        for file, stat in walk_path('/'): 
     35            self._state[file] = stat.st_mtime 
     36            yield file 
    3437 
    3538    def matches(self, uri): 
    36         scheme, netloc, path, query, fragment = urlsplit(uri) 
     39        scheme, netloc, path, query, fragment = urlsplit(uri, 'file') 
    3740        path = os.path.normpath(path) 
    38         return scheme in ('file', '') and \ 
     41        return scheme == 'file' and \ 
    3942               path.startswith(self.root) and \ 
    4043               self.predicate(path) 
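
The rewritten walk stats each entry once with lstat() and branches on the file type. A self-contained Python 3 sketch of that pattern follows. `walk_files` is a hypothetical helper that omits the include/exclude predicate, and the temporary directory layout is made up for the demonstration.

```python
import os
import stat
import tempfile


def walk_files(root):
    """Yield (path, lstat result) for each readable regular file under root."""
    for name in sorted(os.listdir(root)):
        full_path = os.path.join(root, name)
        try:
            st = os.lstat(full_path)      # one stat call per entry
        except OSError:
            continue                      # entry vanished or is inaccessible
        if not os.access(full_path, os.R_OK):
            continue
        if stat.S_ISDIR(st.st_mode):
            yield from walk_files(full_path)
        elif stat.S_ISREG(st.st_mode):
            yield full_path, st


with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(root, rel), "w") as handle:
            handle.write("x")
    # The caller can record modification times while consuming the walk.
    state = {os.path.relpath(p, root): st.st_mtime for p, st in walk_files(root)}
```

Because lstat() does not follow symbolic links, a link is neither a directory nor a regular file in this test and is skipped.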
     
    6770 
    6871    def _uri2file(self, uri): 
    69         scheme, location, path, query, fragment = urlsplit(uri) 
    70         if scheme not in ('file', ''): 
     72        scheme, location, path, query, fragment = urlsplit(uri, 'file') 
     73        if scheme != 'file': 
    7174            raise InvalidURI("URI scheme in '%s' not supported by FileSource" 
    7275                             % scheme) 
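
Passing a default scheme to urlsplit() is what lets these methods accept bare paths. The Python 3 sketch below uses `urllib.parse.urlsplit`, the successor of Python 2's `urlparse.urlsplit`, with made-up example URIs. It also shows why an equality test is the safer comparison for the scheme.

```python
from urllib.parse import urlsplit

# The second argument is used only when the URI carries no scheme of its own.
bare = urlsplit("/var/data/report.txt", "file")
explicit = urlsplit("file:///var/data/report.txt", "file")
other = urlsplit("http://example.com/report.txt", "file")

# `in` against a string tests substring containment, not equality,
# so partial or empty schemes would also pass such a check.
substring_hit = "fi" in "file"
empty_hit = "" in "file"
```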
     
    7679                             % uri) 
    7780        return path.decode(self.encoding) 
    78  
    79     def _glob_predicate(self, file): 
    80         for pattern in self.exclude: 
    81             if fnmatch(file, pattern): 
    82                 return False 
    83         for pattern in self.include: 
    84             if fnmatch(file, pattern): 
    85                 return True 
    86         return False 
    87          
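
The glob predicate that this changeset moves into the base Source class checks exclude patterns first, then include patterns. A standalone Python 3 sketch of that ordering follows. `glob_predicate` as a free function and the sample paths are assumptions. Tuples are used as defaults here to avoid sharing a mutable default list between calls.

```python
from fnmatch import fnmatch


def glob_predicate(uri, include=("*",), exclude=()):
    """Return True if uri matches an include pattern and no exclude pattern."""
    for pattern in exclude:
        if fnmatch(uri, pattern):
            return False          # excludes always win
    for pattern in include:
        if fnmatch(uri, pattern):
            return True
    return False                  # matched nothing: not included


default_ok = glob_predicate("/docs/a.txt")
included = glob_predicate("/docs/a.txt", include=["*.txt"], exclude=["*/private/*"])
excluded = glob_predicate("/docs/private/a.txt", include=["*.txt"], exclude=["*/private/*"])
unmatched = glob_predicate("/docs/a.pdf", include=["*.txt"])
```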
  • pyndexter/trunk/pyndexter/hyperestraier.py

    r331 r332  
    55class HyperestraierIndexer(Indexer): 
    66    capabilities = CAP_READONLY | CAP_CONTENT | CAP_ATTRIBUTES | CAP_ORDERING |\ 
    7                    CAP_HITCOUNT | CAP_LIST | CAP_RELEVANCE 
     7                   CAP_HITCOUNT | CAP_LIST | CAP_RELEVANCE | CAP_WHOLEWORD | \ 
     8                   CAP_ASTERISK | CAP_INTERSECTION 
    89 
    910    def __init__(self, path, source=None, mode=READWRITE, hype_mode=None): 
    1011        Indexer.__init__(self, source, mode, os.path.join(path, 'state.db')) 
    1112        self.path = path 
    12         if not os.path.exists(self.path): 
    13             if mode != READWRITE: 
    14                 raise IndexerError("Index directory has not been initialised") 
    15             os.makedirs(self.path) 
     13        self._init_env(self.path) 
    1614        self.hype_path = os.path.join(self.path, 'hyperestraier.db') 
    1715        if hype_mode is None: 
     
    6058            else: 
    6159                order_type = 'STR' 
    62             print order_ascending 
    6360            order = u'@%s %s%s' % (order_by, order_type, 
    6461                                   order_ascending and 'A' or 'D') 
  • pyndexter/trunk/pyndexter/metasource.py

    r331 r332  
    55class MetaSource(Source): 
    66    """ A collection of sources. If sources serve the same documents the 
    7     results are undefined, and probably not good. """ 
     7    results will be undefined, and probably not good. """ 
    88    def __init__(self, sources=[]): 
    99        self.sources = sources 
     
    3131            if source.matches(uri): 
    3232                return source.fetch(uri) 
    33         raise DocumentNotFound 
     33        raise DocumentNotFound(uri) 
    3434 
    3535    def exists(self, uri): 
     
    4949            state = pickle.loads(state) 
    5050        except Exception, e: 
    51             raise InvalidState('Invalid state provided to document source. ' 
     51            raise InvalidState('Invalid state provided to MetaSource. ' 
    5252                               'Exception was %s: %s' % (e.__class__.__name__, e)) 
    5353        for source in self.sources: 
     
    5858                for change in source.difference(state[hash(source)]): 
    5959                    yield change 
    60  
  • pyndexter/trunk/pyndexter/xapian.py

    r329 r332  
    1616 
    1717class XapianIndexer(Indexer): 
    18     capabilities = CAP_ORDERING | CAP_READONLY | CAP_ATTRIBUTES | CAP_RELEVANCE | \ 
    19                    CAP_HITCOUNT | CAP_LIST 
     18    capabilities = CAP_ORDERING | CAP_READONLY | CAP_ATTRIBUTES | \ 
     19                   CAP_RELEVANCE | CAP_HITCOUNT | CAP_LIST | CAP_WHOLEWORD | \ 
     20                   CAP_INTERSECTION 
    2021 
    2122    def __init__(self, path, source=None, mode=READWRITE): 
    2223        Indexer.__init__(self, source, mode, os.path.join(path, 'state.db')) 
    2324        self.path = path 
    24         if not os.path.exists(self.path): 
    25             if mode != READWRITE: 
    26                 raise IndexerError("Index directory has not been initialised") 
    27             os.makedirs(self.path) 
     25        self._init_env(self.path) 
    2826        self.idx_path = os.path.join(path, 'xapian.db') 
    2927        if mode == READWRITE: 
     
    6866    def search(self, phrase, order_by=None, order_ascending=True, 
    6967               order_type=str, intersection=True): 
     68        phrase = phrase.encode('utf-8') 
    7069        if order_by == 'relevance': 
    7170            order_args = {'sortByRelevence': True}