Ticket #26 (closed defect: fixed)
No support for unique IDs in Xapwrap? Use Xapian Python bindings directly
| Reported by: | olly@… | Owned by: | athomas |
|---|---|---|---|
| Priority: | major | Component: | pyndexter |
| Severity: | normal | Keywords: | |
| Cc: |
Description
This probably explains your poor indexing performance from Xapian:
    # XXX Is a numeric ID the only way to uniquely identify documents in Xapian?
    # XXX This seems crazy, and prone to error.
    def uri2id(uri):
        return abs(hash(uri))
This approach risks hash collisions, and it also means that document ids are allocated in essentially random order, which makes adding documents much less efficient.
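To put a rough number on the collision risk, here is a minimal sketch using the standard birthday approximation. It assumes the ids effectively land in a 31-bit space (a 32-bit signed hash folded by abs()) and that the hash is uniform; both are assumptions for illustration, not something measured against pyndexter, and `collision_probability` is a hypothetical helper name.

```python
import math


def collision_probability(n_docs, id_bits=31):
    """Approximate chance that at least two of n_docs URIs get the same id.

    Birthday approximation: p ~= 1 - exp(-n(n-1) / (2N)), where N is the
    number of possible ids. id_bits=31 is an assumed width, not a Xapian
    or pyndexter constant.
    """
    id_space = 2 ** id_bits
    return 1.0 - math.exp(-n_docs * (n_docs - 1) / (2.0 * id_space))


if __name__ == "__main__":
    for n in (1000, 10000, 100000):
        print(n, collision_probability(n))
```

Under those assumptions the chance of at least one collision is about 2% at 10,000 documents and about 90% at 100,000, and a collision here means one document silently overwriting another.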
If you were using the Xapian API directly, you would add the UID as a boolean term using document.add_term("Q" + uid) ("Q" is the conventional prefix for a uniQue term), and then add or replace the document using database.replace_document("Q" + uid, document).
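The two calls above can be sketched roughly as follows with the Xapian Python bindings. The helper names (`unique_term`, `upsert`, `example`), the database path, and the sample URI are hypothetical; `commit()` assumes a reasonably recent Xapian (older releases used `flush()`), and the bindings have to be installed separately.

```python
def unique_term(uid):
    """Build the unique-ID term; "Q" is the conventional prefix."""
    return "Q" + uid


def upsert(database, document, uid):
    """Add the document, or replace the one already carrying the same term.

    Returns whatever replace_document returns (the document id in Xapian).
    """
    term = unique_term(uid)
    document.add_term(term)
    return database.replace_document(term, document)


def example(path="example.db"):
    """Index the same (hypothetical) URI twice against a real database."""
    import xapian  # imported here so the helpers above work without it

    db = xapian.WritableDatabase(path, xapian.DB_CREATE_OR_OPEN)
    uri = "file:///tmp/a.txt"  # hypothetical document URI
    for _ in range(2):
        doc = xapian.Document()
        doc.set_data("contents of " + uri)
        upsert(db, doc, uri)
    db.commit()  # assumption: older Xapian releases call this flush()
    return db.get_doccount()
```

Because the term, not a hashed number, identifies the document, indexing the same URI a second time should replace the first copy rather than add a duplicate, and Xapian stays free to allocate document ids itself.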
As far as I can tell, that's not possible through xapwrap: in 0.3, all the calls to replace_document either explicitly force the uid to be numeric or use a numeric value from somewhere else. The details given for SVN access don't work, so I can't see whether this has been fixed.
(This variant of replace_document was added to Xapian on 2004-09-13; xapwrap 0.3 was released 2005-10-11, judging by the timestamps of the files in the tarball.)
I suspect that for what you're doing, you might be better off using Xapian's Python bindings directly.
