Posted to solr-user@lucene.apache.org by James Nolan <no...@gmail.com> on 2012/06/15 16:30:14 UTC
De-duplication of poly-field data using term vector positions

Hello All!

I'm reasonably new to solr, lucene, typing etc. and am trying to create
documents which extend AbstractSubTypeFieldType in order to load multiple
"Activities" into a document.  For example: I have biographical information
about a customer, but also want to store all shopping transactions for a
given user in the same document.  To do this, I load the data as an
extended AbstractSubTypeFieldType and override the getFieldQuery method to
only accept one type of search which I then split and divide amongst the
appropriate sub-types (Range queries for dates and match queries for
everything else).  The initial results match all of the data such that a
search for a book sale on a given date would initially match if the
customer had a matching date and matching book anywhere in the custom
extended AbstractSubTypeFieldType.  (I think) Since lucene knows the actual
positions of the values in each field, I can, after the search is
completed, get the term vector positions and compare those with the terms
searched.  I do this by creating a custom add-on which extends
org.apache.solr.handler.component.TermVectorComponent.  In its process
phase, it calls the org.apache.lucene.index.IndexReader.getTermFreqVector
method from the IndexReader returned by the response builder's
req.getSearcher().getReader() method and creates a list of terms and their
positions using the TVMapper's (extended
org.apache.lucene.index.TermVectorMapper) map method.  I do set
intersections on all values searched to determine if a given record
actually has hit terms which are in the same position.  If they do, then I
return the list of matching documents to the client which passes them along
to the user.
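
A minimal sketch of the set-intersection step described above, in plain
Java with no Lucene calls. It assumes the positions for each searched
clause have already been collected into an int[] (for example by the
mapper's map method); the class and method names here are hypothetical
and are not taken from my actual component.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

final class PositionIntersection {

    // Returns the positions that occur in every clause's position array.
    // An empty result means the clauses never line up in this document.
    static SortedSet<Integer> commonPositions(List<int[]> positionsPerClause) {
        SortedSet<Integer> common = new TreeSet<Integer>();
        if (positionsPerClause.isEmpty()) {
            return common;
        }
        for (int p : positionsPerClause.get(0)) {
            common.add(p);
        }
        // Narrow the candidate set clause by clause; stop early once it is empty.
        for (int i = 1; i < positionsPerClause.size() && !common.isEmpty(); i++) {
            SortedSet<Integer> current = new TreeSet<Integer>();
            for (int p : positionsPerClause.get(i)) {
                current.add(p);
            }
            common.retainAll(current);
        }
        return common;
    }

    public static void main(String[] args) {
        int[] datePositions = {0, 3, 7}; // positions where the date clause hit
        int[] bookPositions = {3, 5};    // positions where the book clause hit
        System.out.println(commonPositions(Arrays.asList(datePositions, bookPositions)));
    }
}
```

A document is kept only when the returned set is non-empty, i.e. when the
date and the book belong to the same "Activity".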

Here is my setup:
I have 3 servers each running 5 cores of solr/lucene 3.6 over tomcat 6.0.29
and I use the shards parameter to split the search between all 15 cores.
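
For illustration, the value of the shards parameter for a layout like
this can be built as sketched below. The host names (solr1..solr3), the
port 8080 and the core names (core0..core4) are placeholders I made up
for the example, not my real configuration.

```java
final class ShardsParam {

    // Builds a comma-separated list of host:port/solr/coreN entries,
    // one entry per core on each host.
    static String build(String[] hosts, int port, int coresPerHost) {
        StringBuilder sb = new StringBuilder();
        for (String host : hosts) {
            for (int c = 0; c < coresPerHost; c++) {
                if (sb.length() > 0) {
                    sb.append(',');
                }
                sb.append(host).append(':').append(port)
                  .append("/solr/core").append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical hosts; 3 hosts x 5 cores gives 15 shard entries.
        String shards = build(new String[] {"solr1", "solr2", "solr3"}, 8080, 5);
        System.out.println("shards=" + shards);
    }
}
```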

My problem is this:
The initial search is extremely fast; however, using the
IndexReader.getTermFreqVector() method is very slow.  I tried to use the
.getTermFreqVector(int arg0, String arg1) method to skip using the
TermVectorMapper, but, in this case, it only returns the term frequencies
and not their actual positions in a given document.  I have gutted every
other part in both the process and finishStage methods of the custom
TermVectorComponent to increase speed and it has helped, but, without
breaking solr and lucene open (which I would prefer not to do), I cannot
seem to get this to go any faster.  I have also tried to override the
distributed process stage, but that was giving me illegal access errors, so
I left it alone.
What I need is to have the terms, positions and query all at the same place
at the same time.
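
To make "same place at the same time" concrete, here is a rough sketch of
the per-document filter I have in mind, again in plain Java. It assumes
that for every candidate document id there is one set of positions per
query clause (a range clause would contribute the union of positions of
all terms inside the range). FalseMatchFilter and its methods are
hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class FalseMatchFilter {

    // True if at least one position is shared by every clause's position set.
    static boolean hasAlignedHit(List<Set<Integer>> clausePositions) {
        if (clausePositions.isEmpty()) {
            return false;
        }
        Set<Integer> common = new TreeSet<Integer>(clausePositions.get(0));
        for (Set<Integer> s : clausePositions.subList(1, clausePositions.size())) {
            common.retainAll(s);
            if (common.isEmpty()) {
                return false;
            }
        }
        return !common.isEmpty();
    }

    // Keeps only the candidate documents whose clauses align on some position.
    static List<Integer> keepAligned(Map<Integer, List<Set<Integer>>> candidates) {
        List<Integer> kept = new ArrayList<Integer>();
        for (Map.Entry<Integer, List<Set<Integer>>> e : candidates.entrySet()) {
            if (hasAlignedHit(e.getValue())) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<Integer, List<Set<Integer>>> candidates =
                new LinkedHashMap<Integer, List<Set<Integer>>>();
        // Doc 1: date and book both hit at position 3, so it is a true match.
        candidates.put(1, Arrays.<Set<Integer>>asList(
                new TreeSet<Integer>(Arrays.asList(0, 3)),
                new TreeSet<Integer>(Arrays.asList(3, 5))));
        // Doc 2: date and book hit at different positions, so it is a false match.
        candidates.put(2, Arrays.<Set<Integer>>asList(
                new TreeSet<Integer>(Arrays.asList(1)),
                new TreeSet<Integer>(Arrays.asList(4))));
        System.out.println(keepAligned(candidates));
    }
}
```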

1) Is there an elegant way to do this at the document level?
2) Can I clear out documents at the process stage in the TermVectorMapper
extension so I don't keep around false matches?
3) Are term vectors stored separately from the documents such that a query
won't know term-vector positions?
4) Has anyone else worked with this problem?

I am aware of BlockJoinQuery and am also exploring it as an alternative,
but I don't want to increase the number of documents stored by 900-2000%
which is what that will do.  Thanks for your time and consideration,

Jim Nolan