<todo version="0.1.19">
    <title>
        Pyndexter, pronounced 'poindexter', a full text indexing abstraction layer
    </title>
    <note priority="medium" time="1145722536">
        Callbacks for index() and discard(), perhaps something similar for Source objects?
        <comment>
            Framework.update() accepts a filter callback. This could be sufficient.
        </comment>
    </note>
    <note priority="medium" time="1145802778" done="1170655322">
        Finish PyLucene adapter
        <comment>
            Functional enough for a first commit.
        </comment>
    </note>
    <note priority="medium" time="1145854608" done="1146296772">
        Finish MetaSource
    </note>
    <note priority="medium" time="1146321654">
        I think it might need a MIME filter system for translating known content types to plain text for indexing, e.g. extracting just the text content of HTML pages. This could get out of hand.
    </note>
    <note priority="medium" time="1146328561" done="1146368244">
        state() is being called, which in the naive implementation simply walks the entire source. Need some way around this. Should the state be accumulated while the source is being walked?
    </note>
    <note priority="medium" time="1146331225" done="1146368238">
        HTTPSource should be able to handle multiple iterations, but self._traversed renders this impossible.
    </note>
    <note priority="medium" time="1159011350">
        For storing state, perhaps there should be default store_state(store)/restore_state(store) methods. Also need a Store class, or just use a file object...
    </note>
    <note priority="high" time="1159197046" done="1169000053">
        Refactor Indexer into two classes: the Indexer itself, and a class that glues Source and the Indexer together. This would remove the duplication I'm getting in all the stock methods (update, index, fetch, etc.)
        <comment>
            Done as the Framework class.
        </comment>
    </note>
    <note priority="medium" time="1168868728" done="1169000047">
        Add slicing to Result objects. This will allow fast pagination in result displays.
    </note>
    <note priority="low" time="1168875038" done="1170587379">
        Add some "stock" query translators (e.g. a AND b OR c style, a b or c, +a +b c, etc.)
        <comment>
            Added a general to_boolean() method to the Query object. Operators can be overridden for variants.
        </comment>
    </note>
    <note priority="medium" time="1169007320">
        Incremental updates for the indexer state. Waiting until the end of an indexing run before writing the state is bad: a single document error can render the entire index useless.
        <note priority="medium" time="1169007391">
            "Transactions" for state updates?
        </note>
        <note priority="medium" time="1169090428">
            I think an anydbm-style interface for storing state could be useful.
        </note>
    </note>
    <note priority="medium" time="1169048222" done="1170655393">
        Add a swish-e adapter. The Python module SwishE only appears to expose searching :(
        <comment>
            Done, but only for searching.
        </comment>
    </note>
    <note priority="medium" time="1169086953">
        Why is Xapian not returning all the hits?
    </note>
    <note priority="medium" time="1169116208">
        I'd like to add database Sources, but I can't see a way to handle updated rows without doing a full table scan.
    </note>
    <note priority="medium" time="1169444419">
        Use metakit for a pure-Python implementation? (Check out "divmod pyndex" for ideas)
    </note>
    <note priority="medium" time="1170604364" done="1170931795">
        Deprecate Hit and just use Document - they're almost identical in functionality.
        <comment>
            Bad idea. Hit now has indexed and current members, which lazily fetch from the Indexer and Framework, respectively.
        </comment>
        <note priority="medium" time="1170812979" done="0">
            Perhaps Results should use the framework to try to fetch a Document, then "underlay" the hit attributes?
        </note>
    </note>
    <note priority="high" time="1170651530">
        Add generalised "field" indexing.
    </note>
    <note priority="medium" time="1170653876">
        Search result ordering.
    </note>
    <note priority="high" time="1170654664">
        How do we detect when sources have been removed from the index? If file:///tmp changes to file:///usr, the Framework has no real way of detecting which URIs in the index are no longer valid.
    </note>
    <note priority="medium" time="1170685227">
        Default indexer tasks
        <note priority="medium" time="1146296806">
            Optimise the on-disk format for DefaultIndexer. Use URI/word "ids" rather than the full word.
        </note>
        <note priority="medium" time="1170685251">
            Abstract storage mechanism so that sqlite, metakit, anydbm, etc. can be used. This would allow for wide use.
        </note>
        <note priority="medium" time="1170685266">
            Use bigrams, the same as the current 'default' search? This is a good solution I think. Allows for sub-word searches, start and end of word searches, etc.
        </note>
        <note priority="medium" time="1170685271">
            Optionally use snowball stemmer.
        </note>
        <note priority="medium" time="1170685277">
            Have a built-in stemmer? Porter?
        </note>
        <note priority="medium" time="1170685318">
            Use "nltk" stemmer?
        </note>
    </note>
    <note priority="medium" time="1170686012">
        http://www.biais.org/blog/index.php/2007/01/31/25-spelling-correction-using-the-python-natural-language-toolkit-nltk &lt;- interesting
    </note>
    <note priority="medium" time="1170739349">
        Pyndex adapter.
    </note>
    <note priority="medium" time="1170813131">
        Add utility function for converting attribute dictionary keys to plain strings (common pattern).
    </note>
    <note priority="medium" time="1170829158">
        Normalise URI usage everywhere.
    </note>
    <note priority="veryhigh" time="1170915596">
        Fix port parsing in util.URI.
    </note>
    <note priority="medium" time="1171055477">
        Write a decent test suite.
        <note priority="medium" time="1171271157">
            Test that searches return the right hits. Don't care about order.
        </note>
        <note priority="medium" time="1171271356">
            Test that all interfaces pass and receive unicode correctly.
        </note>
        <note priority="medium" time="1171271371">
            Test that all indexers and sources pass URI objects correctly.
        </note>
    </note>
    <note priority="medium" time="1171530823">
        http://www.liris.org/tech/program/hyperestraier-purepython/ &lt;- Client library for HE server.
    </note>
</todo>