Ticket #26 (closed defect: fixed)

Opened 4 years ago

Last modified 3 years ago

No support for unique IDs in Xapwrap? Use Xapian Python bindings directly

Reported by: olly@…
Owned by: athomas
Priority: major
Component: pyndexter
Severity: normal
Keywords:
Cc:

Description

This probably explains your poor indexing performance from Xapian:

    # XXX Is a numeric ID the only way to uniquely identify documents in Xapian?
    # XXX This seems crazy, and prone to error.
    def uri2id(uri):
        return abs(hash(uri))

This approach risks hash collisions, and it also means that document ids are allocated in essentially random order, which makes adding documents much less efficient.
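
To put a rough number on the collision risk, here is a back-of-the-envelope sketch using the birthday approximation. It assumes hash values are uniformly distributed and that abs() leaves 31 usable bits on a 32-bit Python build (63 on a 64-bit build); the function name is made up for illustration.

```python
import math


def collision_probability(n_docs, bits):
    """Approximate chance that at least two of n_docs URIs share an id.

    Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * 2**bits)),
    assuming ids are uniformly distributed over 2**bits values.
    """
    pairs = n_docs * (n_docs - 1) / 2.0
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-pairs / 2.0 ** bits)


if __name__ == "__main__":
    for n in (1000, 100000):
        print(n, collision_probability(n, 31), collision_probability(n, 63))
```

Under those assumptions, 100,000 documents with 31-bit ids already give a collision probability of roughly 0.9, so this is not just a theoretical concern.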

If you were using the Xapian API directly, you would add the UID as a boolean term using document.add_term("Q" + uid) ("Q" is the conventional prefix for a uniQue term), and then add/replace a document using database.replace_document("Q" + uid, document).
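
A minimal sketch of that approach with the Python bindings might look like the following. The helper names unique_term and replace_by_uid are hypothetical, and `database` is assumed to be an already opened xapian.WritableDatabase; only Document.set_data, Document.add_term and replace_document come from the Xapian API.

```python
def unique_term(uid):
    """Build the boolean term that identifies a document ("Q" prefix)."""
    return "Q" + uid


def replace_by_uid(database, uid, text):
    """Add the document for uid, or replace it if it is already indexed.

    `database` is assumed to be an open xapian.WritableDatabase.
    """
    # Imported lazily so unique_term() can be used without the bindings.
    import xapian

    document = xapian.Document()
    document.set_data(text)
    term = unique_term(uid)
    document.add_term(term)
    # Replaces any document indexed by `term`, or adds a new one if none
    # exists; Xapian allocates the numeric document id itself.
    return database.replace_document(term, document)
```

Because Xapian picks the numeric id, new documents get sequential ids rather than hash-derived ones, and the URI itself never has to be squeezed into an integer.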

As far as I can tell, that's not possible through xapwrap: in 0.3, all the calls to replace_document either explicitly force the uid to be numeric or use a numeric value from somewhere else. The details given for SVN access don't work, so I can't see whether this has been fixed.

This variant of replace_document was added to Xapian on 2004-09-13; xapwrap 0.3 was released on 2005-10-11 (judging by the timestamps of the files in the tarball).

I suspect that for what you're doing, you might be better off using Xapian's Python bindings directly.

Change History

Changed 4 years ago by athomas

This approach risks hash collision,

Yes, the XXX's were for exactly that reason. But if Xapwrap doesn't expose any mechanism for achieving this...

document ids are allocated in essentially random order, which makes adding documents much less efficient.

Hmm okay.

I suspect that for what you're doing, you might be better off using Xapian's Python bindings directly.

I attempted to use the Python bindings initially, but there seemed to be a dearth of decent documentation and examples. If you have links to decent examples and documentation, it would be more than welcome. Even more welcome would be code ;)

Changed 4 years ago by athomas

  • summary changed from Don't hash URIs to produce uids to No support for unique IDs in Xapwrap? Use Xapian Python bindings directly

Okay I am making some progress using your suggestions. Will update the ticket as I progress.

Changed 4 years ago by athomas

Is iterating over Document.termlist() (from the search result) the most efficient way of finding the uniQue term?

Changed 4 years ago by olly@…

Regarding documentation and examples for the Python bindings, currently there's "bindings.html" and four examples, which come with the bindings, plus the C++ API documentation. I realise this isn't ideal for someone wanting to code only in Python (or any of the other languages we have bindings for), so I'm investigating how we can automatically produce pydoc strings from the doxygen documentation comments in the API docs, with manual overrides for methods where this isn't feasible.

I think we really need something automated for most classes and methods or else the effort required to update the documentation for C++ and N languages will be prohibitive.

Meanwhile, if you have specific questions, feel free to ask on the xapian mailing list.

Regarding the question about finding the uniQue term, it would be better to use skip_to("Q") but that doesn't seem to be available via the TermIter wrapper. Just use iteration for now - it's fairly efficient. I'll add skip_to methods to the pythonic iterator wrappers shortly.
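
The plain-iteration approach could look something like the sketch below. It assumes the terms have already been pulled out of Document.termlist() as strings, in the sorted order Xapian returns them; find_unique_term is a hypothetical helper, not part of the bindings.

```python
def find_unique_term(terms, prefix="Q"):
    """Return the first term starting with prefix, or None if there is none.

    `terms` must be sorted, as a Xapian term list is, so the scan can stop
    as soon as it has moved past the range of terms beginning with prefix.
    """
    for term in terms:
        if term.startswith(prefix):
            return term
        if term > prefix:
            # Sorted input: everything from here on sorts after the prefix
            # range, so no matching term can follow.
            return None
    return None


if __name__ == "__main__":
    print(find_unique_term(["Kindex", "Qhttp://example.com/", "Zexampl"]))
```

How each term is extracted from the termlist items depends on the bindings version, so that step is left to the caller.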

Changed 4 years ago by anonymous

Xapian 0.9.7 (the release is spinning as I type) adds a skip_to method to the TermIter wrapper class.

Changed 4 years ago by athomas

  • status changed from new to closed
  • resolution set to fixed

(In [363]) pyndexter: Fixed #26, along with adding query translation.
