root/pyndexter/trunk/.todo

Revision 401, 6.8 KB (checked in by athomas, 4 years ago)

pyndexter: Cosmetic tweaks.

  • Property svn:mimetype set to application/xml
<todo version="0.1.19">
    <title>
        Pyndexter, pronounced 'poindexter', a full text indexing abstraction layer
    </title>
    <note priority="medium" time="1145722536">
        Callbacks for index() and discard(), perhaps something similar for Source objects?
        <comment>
            Framework.update() accepts a filter callback. This could be sufficient.
        </comment>
    </note>
    <note priority="medium" time="1145802778" done="1170655322">
        Finish PyLucene adapter
        <comment>
            Functional enough for a first commit.
        </comment>
    </note>
    <note priority="medium" time="1145854608" done="1146296772">
        Finish MetaSource
    </note>
    <note priority="medium" time="1146321654">
        I think it might need a MIME filter system for translating known content types to plain text for indexing, e.g. extracting just the text content of HTML pages. This could get out of hand.
    </note>
    <note priority="medium" time="1146328561" done="1146368244">
        state() is being called, which in the naive implementation simply walks the entire source. Need some way around this. Should the state() be accumulated somehow when the source is being walked?
    </note>
    <note priority="medium" time="1146331225" done="1146368238">
        HTTPSource should be able to handle multiple iterations, but self._traversed renders this impossible.
    </note>
    <note priority="medium" time="1159011350">
        For storing state, perhaps there should be default store_state(store)/restore_state(store) methods. Also need a Store class, or just use a file object...
    </note>
    <note priority="high" time="1159197046" done="1169000053">
        Refactor Indexer into two classes: the Indexer itself, and a class that glues Source and the Indexer together. This would remove the duplication I'm getting in all the stock methods (update, index, fetch, etc.)
        <comment>
            Done as the Framework class.
        </comment>
    </note>
    <note priority="medium" time="1168868728" done="1169000047">
        Add slicing to Result objects. This will allow fast pagination in result displays.
    </note>
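<!--
A minimal sketch of how slicing could support pagination, assuming a result
set that wraps a sequence of hits. The Results class and page() helper here
are hypothetical illustrations, not pyndexter's actual API.

```python
class Results:
    """Hypothetical result set; not pyndexter's actual class."""

    def __init__(self, hits):
        # Any sequence of hits; a real adapter might fetch lazily instead.
        self._hits = hits

    def __len__(self):
        return len(self._hits)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Resolve the slice against the total length, then materialise
            # only the requested page of hits.
            return [self._hits[i] for i in range(*index.indices(len(self)))]
        return self._hits[index]


def page(results, number, size=10):
    """Return page `number` (zero-based) holding up to `size` hits."""
    start = number * size
    return results[start:start + size]
```

Because only the requested range is touched, a display showing page 3 never
needs to build the hits for pages 0 to 2.
-->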
    <note priority="low" time="1168875038" done="1170587379">
        Add some "stock" query translators (e.g. a AND b OR c style, a b or c, +a +b c, etc.)
        <comment>
            Added a general to_boolean() method to the Query object. Operators can be overridden for variants.
        </comment>
    </note>
    <note priority="medium" time="1169007320">
        Incremental updates for the indexer state. Waiting until the end of the index, then writing the state, is bad. A single document error can render the entire index useless.
        <note priority="medium" time="1169007391">
            "Transactions" for state updates?
        </note>
        <note priority="medium" time="1169090428">
            I think an anydbm style interface for storing state could be useful.
        </note>
    </note>
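<!--
A rough sketch of per-document state updates, assuming Python 3's dbm module
(the successor to anydbm). The function name, its parameters and the shape of
the document triples are all hypothetical, chosen only for illustration.

```python
import dbm


def index_incrementally(documents, index_one, state_path):
    """Index (uri, content, state) triples, saving state per document.

    `index_one` is a caller-supplied callable that indexes one document
    and raises on failure. Returns the URIs that failed.
    """
    failed = []
    with dbm.open(state_path, "c") as store:
        for uri, content, state in documents:
            try:
                index_one(uri, content)
            except Exception:
                # Skip the bad document; state saved so far is kept.
                failed.append(uri)
                continue
            # Record state only after the document indexed successfully.
            store[uri] = state
    return failed
```

Writing each entry as soon as its document is indexed means one failing
document costs only its own entry. This is not a real transaction: a crash
between indexing a document and storing its state is still possible.
-->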
    <note priority="medium" time="1169048222" done="1170655393">
        Add a swish-e adapter. The Python module SwishE only appears to expose searching :(
        <comment>
            Done, but only for searching.
        </comment>
    </note>
    <note priority="medium" time="1169086953">
        Why is Xapian not returning all the hits?
    </note>
    <note priority="medium" time="1169116208">
        I'd like to add database Sources, but I can't see a way to handle updated rows without doing a full table scan.
    </note>
    <note priority="medium" time="1169444419">
        Use metakit for pure-Python implementation? (Check out "divmod pyndex" for ideas)
    </note>
    <note priority="medium" time="1170604364" done="1170931795">
        Deprecate Hit and just use Document - they're almost identical in functionality.
        <comment>
            Bad idea. Hit now has indexed and current members, which lazily fetch from the Indexer and Framework, respectively.
        </comment>
        <note priority="medium" time="1170812979" done="0">
            Perhaps Results should use the framework to try and fetch a Document, then "underlay" the hit attributes?
        </note>
    </note>
    <note priority="high" time="1170651530">
        Add generalised "field" indexing.
    </note>
    <note priority="medium" time="1170653876">
        Search result ordering.
    </note>
    <note priority="high" time="1170654664">
        How do we detect when sources have been removed from the index? If file:///tmp changes to file:///usr, the Framework has no real way of detecting which URIs in the index are no longer valid.
    </note>
    <note priority="medium" time="1170685227">
        Default indexer tasks
        <note priority="medium" time="1146296806">
            Optimise on disk format for DefaultIndexer. Use URI/word "ids" rather than full word.
        </note>
        <note priority="medium" time="1170685251">
            Abstract storage mechanism so that sqlite, metakit, anydbm, etc. can be used. This would allow for wide use.
        </note>
        <note priority="medium" time="1170685266">
            Use bigrams same as the current 'default' search? This is a good solution I think. Allows for sub-word searches, start and end of word searches, etc.
        </note>
        <note priority="medium" time="1170685271">
            Optionally use snowball stemmer.
        </note>
        <note priority="medium" time="1170685277">
            Have a built-in stemmer? Porter?
        </note>
        <note priority="medium" time="1170685318">
            Use "nltk" stemmer?
        </note>
    </note>
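<!--
To illustrate why bigrams allow sub-word searches, here is a small in-memory
sketch. BigramIndex and bigrams() are hypothetical names; nothing here is
taken from the DefaultIndexer or the existing 'default' search.

```python
from collections import defaultdict


def bigrams(text):
    """Return the overlapping two-character slices of `text`."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


class BigramIndex:
    """Hypothetical in-memory index; not pyndexter's DefaultIndexer."""

    def __init__(self):
        # Maps each bigram to the set of words containing it.
        self._postings = defaultdict(set)

    def add(self, word):
        word = word.lower()
        for gram in bigrams(word):
            self._postings[gram].add(word)

    def search(self, fragment):
        """Return indexed words containing `fragment` (2+ characters)."""
        fragment = fragment.lower()
        grams = bigrams(fragment)
        if not grams:
            return set()
        candidates = set.intersection(
            *(self._postings.get(gram, set()) for gram in grams)
        )
        # Sharing every bigram does not guarantee adjacency, so verify.
        return {word for word in candidates if fragment in word}
```

Start and end of word searches could be layered on by padding each word with
a boundary marker before splitting it, at the cost of a larger index.
-->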
    <note priority="medium" time="1170686012">
        http://www.biais.org/blog/index.php/2007/01/31/25-spelling-correction-using-the-python-natural-language-toolkit-nltk &lt;- interesting
    </note>
    <note priority="medium" time="1170739349">
        Pyndex adapter.
    </note>
    <note priority="medium" time="1170813131">
        Add utility function for converting attribute dictionary keys to plain strings (common pattern).
    </note>
    <note priority="medium" time="1170829158">
        Normalise URI usage everywhere.
    </note>
    <note priority="veryhigh" time="1170915596">
        Fix port parsing in util.URI.
    </note>
    <note priority="medium" time="1171055477">
        Write a decent test suite.
        <note priority="medium" time="1171271157">
            Test that searches return the right hits. Don't care about order.
        </note>
        <note priority="medium" time="1171271356">
            Test that all interfaces pass and receive unicode correctly.
        </note>
        <note priority="medium" time="1171271371">
            Test that all indexers and sources pass URI objects correctly.
        </note>
    </note>
    <note priority="medium" time="1171530823">
        http://www.liris.org/tech/program/hyperestraier-purepython/ &lt;- Client library for HE server.
    </note>
</todo>