root/pyndexter/trunk/.todo

Revision 401, 6.8 kB (checked in by athomas, 1 year ago)

pyndexter: Cosmetic tweaks.

  • Property svn:mimetype set to application/xml
<todo version="0.1.19">
    <title>
        Pyndexter, pronounced 'poindexter', a full text indexing abstraction layer
    </title>
    <note priority="medium" time="1145722536">
        Callbacks for index() and discard(), perhaps something similar for Source objects?
        <comment>
            Framework.update() accepts a filter callback. This could be sufficient.
        </comment>
    </note>
    <note priority="medium" time="1145802778" done="1170655322">
        Finish PyLucene adapter
        <comment>
            Functional enough for a first commit.
        </comment>
    </note>
    <note priority="medium" time="1145854608" done="1146296772">
        Finish MetaSource
    </note>
    <note priority="medium" time="1146321654">
        I think it might need a MIME filter system, for translating known content types to plain text for indexing. eg. Just the content of HTML pages. This could get out of hand.
    </note>
    <note priority="medium" time="1146328561" done="1146368244">
        state() is being called, which in the naive implementation simply walks the entire source. Need some way around this. Should the state() be accumulated somehow when the source is being walked?
    </note>
    <note priority="medium" time="1146331225" done="1146368238">
        HTTPSource should be able to handle multiple iterations, but self._traversed renders this impossible.
    </note>
    <note priority="medium" time="1159011350">
        For storing state, perhaps there should be default store_state(store)/restore_state(store) methods. Also need a Store class, or just use a file object...
    </note>
    <note priority="high" time="1159197046" done="1169000053">
        Refactor Indexer into two classes: the Indexer itself, and a class that glues Source and the Indexer together. This would remove the duplication I'm getting in all the stock methods (update, index, fetch, etc.)
        <comment>
            Done as the Framework class.
        </comment>
    </note>
    <note priority="medium" time="1168868728" done="1169000047">
        Add slicing to Result objects. This will allow fast pagination in result displays.
    </note>
    <note priority="low" time="1168875038" done="1170587379">
        Add some "stock" query translators (eg. a AND b OR c style, a b or c, +a +b c, etc.)
        <comment>
            Added a general to_boolean() method to the Query object. Operators can be overridden for variants.
        </comment>
    </note>
    <note priority="medium" time="1169007320">
        Incremental updates for the indexer state. Waiting until the end of the index, then writing the state, is bad. A single document error can render the entire index useless.
        <note priority="medium" time="1169007391">
            "Transactions" for state updates?
        </note>
        <note priority="medium" time="1169090428">
            I think an anydbm style interface for storing state could be useful.
        </note>
    </note>
    <note priority="medium" time="1169048222" done="1170655393">
        Add a swish-e adapter. The Python module SwishE only appears to expose searching :(
        <comment>
            Done, but only for searching.
        </comment>
    </note>
    <note priority="medium" time="1169086953">
        Why is Xapian not returning all the hits?
    </note>
    <note priority="medium" time="1169116208">
        I'd like to add database Sources, but I can't see a way to handle updated rows without doing a full table scan.
    </note>
    <note priority="medium" time="1169444419">
        Use metakit for pure-Python implementation? (Check out "divmod pyndex" for ideas)
    </note>
    <note priority="medium" time="1170604364" done="1170931795">
        Deprecate Hit and just use Document - they're almost identical in functionality.
        <comment>
            Bad idea. Hit now has indexed and current members, which lazily fetch from the Indexer and Framework, respectively.
        </comment>
        <note priority="medium" time="1170812979" done="0">
            Perhaps Results should use the framework to try and fetch a Document, then "underlay" the hit attributes?
        </note>
    </note>
    <note priority="high" time="1170651530">
        Add generalised "field" indexing.
    </note>
    <note priority="medium" time="1170653876">
        Search result ordering.
    </note>
    <note priority="high" time="1170654664">
        How do we detect when sources have been removed from the index? If file:///tmp changes to file:///usr, the Framework has no real way of detecting which URI's in the index are no longer valid.
    </note>
    <note priority="medium" time="1170685227">
        Default indexer tasks
        <note priority="medium" time="1146296806">
            Optimise on disk format for DefaultIndexer. Use URI/word "ids" rather than full word.
        </note>
        <note priority="medium" time="1170685251">
            Abstract storage mechanism so that sqlite, metakit, anydbm, etc. can be used. This would allow for wide use.
        </note>
        <note priority="medium" time="1170685266">
            Use bigrams same as the current 'default' search? This is a good solution I think. Allows for sub-word searches, start and end of word searches, etc.
        </note>
        <note priority="medium" time="1170685271">
            Optionally use snowball stemmer.
        </note>
        <note priority="medium" time="1170685277">
            Have a built-in stemmer? Porter?
        </note>
        <note priority="medium" time="1170685318">
            Use "nltk" stemmer?
        </note>
    </note>
    <note priority="medium" time="1170686012">
        http://www.biais.org/blog/index.php/2007/01/31/25-spelling-correction-using-the-python-natural-language-toolkit-nltk &lt;- interesting
    </note>
    <note priority="medium" time="1170739349">
        Pyndex adapter.
    </note>
    <note priority="medium" time="1170813131">
        Add utility function for converting attribute dictionary keys to plain strings (common pattern).
    </note>
    <note priority="medium" time="1170829158">
        Normalise URI usage everywhere.
    </note>
    <note priority="veryhigh" time="1170915596">
        Fix port parsing in util.URI.
    </note>
    <note priority="medium" time="1171055477">
        Write a decent test suite.
        <note priority="medium" time="1171271157">
            Test that searches return the right hits. Don't care about order.
        </note>
        <note priority="medium" time="1171271356">
            Test that all interfaces pass and receive unicode correctly.
        </note>
        <note priority="medium" time="1171271371">
            Test that all indexers and sources pass URI objects correctly.
        </note>
    </note>
    <note priority="medium" time="1171530823">
        http://www.liris.org/tech/program/hyperestraier-purepython/ &lt;- Client library for HE server.
    </note>
</todo>
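
The note about adding slicing to Result objects for fast pagination could be sketched roughly as follows. This is an illustration only, not pyndexter's actual Result class: the class name LazyResult, the fetch callable, and the example URIs are all assumptions. The idea is that a slice fetches just the hits on the requested page.

```python
class LazyResult:
    """Hypothetical stand-in for a Result object: hits are loaded on demand."""

    def __init__(self, uris, fetch):
        self._uris = list(uris)  # cheap identifiers returned by a search
        self._fetch = fetch      # assumed callable that loads one hit by URI

    def __len__(self):
        return len(self._uris)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Only the requested page is fetched, not the whole result set.
            return [self._fetch(uri) for uri in self._uris[index]]
        return self._fetch(self._uris[index])


fetched = []  # records which URIs were actually loaded


def fetch(uri):
    fetched.append(uri)
    return {"uri": uri}


result = LazyResult(["file:///doc%d" % i for i in range(100)], fetch)
page = result[10:20]  # second page of ten hits
```

After taking the slice, only the ten URIs on that page have been passed to fetch, which is what makes pagination cheap when loading a hit is expensive.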
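
The bigram idea in the "Default indexer tasks" notes (sub-word, start-of-word and end-of-word searches) can be illustrated with a small in-memory sketch. It does not reflect how DefaultIndexer or the existing 'default' search is implemented; BigramIndex, its methods, and the sample URIs are hypothetical names chosen for the example.

```python
from collections import defaultdict


def bigrams(word):
    """Return the set of two-character substrings of word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


class BigramIndex:
    """Hypothetical in-memory index keyed on character bigrams."""

    def __init__(self):
        self._postings = defaultdict(set)  # bigram -> URIs containing it
        self._words = defaultdict(set)     # URI -> words, used to verify matches

    def index(self, uri, text):
        for word in text.lower().split():
            self._words[uri].add(word)
            for gram in bigrams(word):
                self._postings[gram].add(uri)

    def search(self, fragment):
        fragment = fragment.lower()
        grams = bigrams(fragment)
        if not grams:
            # Fragments shorter than two characters are not handled in this sketch.
            return set()
        candidates = set.intersection(
            *(self._postings.get(gram, set()) for gram in grams))
        # Shared bigrams do not guarantee a real substring match, so verify.
        return {uri for uri in candidates
                if any(fragment in word for word in self._words[uri])}


idx = BigramIndex()
idx.index("file:///a", "full text indexing")
idx.index("file:///b", "index state")
```

With this data, searching for "dex" matches both documents, while "xing" matches only the first. The verification step trades some speed for exact results; a real on-disk format would need a cheaper way to confirm candidates.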