Posted to java-user@lucene.apache.org by Paul Allan Hill <pa...@metajure.com> on 2012/02/04 01:09:57 UTC

recording a universal ID from DocID in a CustomScoreQuery

My index does NOT have a simple UID; it uses the file PATH as the unique key.
I was implementing a CustomScoreQuery which not only tweaked the score but also recorded which documents had passed through this part of the overall rebuilt query, so that I could further mess with those particular documents later.
I was hoping to do it without loading all the PATHs from my index into a field cache, but maybe that is a false way to try to save memory.

I thought I could write down the docId provided in the call to customScore:

public float customScore(int doc, float subQueryScore, float valSrcScore) throws IOException {
    docIds.add(doc);  // record the doc number passed to this call
    return ...;
}

private Set<Integer> docIds = new HashSet<Integer>();

While I thought I had this working, apparently I had not taken into consideration the subreader and segment problem.
The int called doc is not the docId for the entire index, just the local reader doc number.  Is that right?
So is there a standard way to convert back to the index-wide docID?

If there is no standard way, I _might_ create a small subclass of IndexSearcher and provide a method to:


(1)    Find the right reader by looping through all IndexSearcher.subReaders[] to find what reader called the CustomScoreQuery

(2)    Add an offset of the proper value from IndexSearcher.docStarts[iReader]

But I am thinking this is prone to the problem that a subreader can be made of more subreaders, etc., so I really don't have a clue where to find the current reader and then how to map back to docStarts.
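A minimal sketch of steps (1) and (2), written in plain Java rather than against the Lucene API. ReaderNode and DocBaseSketch are hypothetical names invented for illustration; they only model a reader tree with per-leaf document counts. The nested-subreader worry is handled here by flattening the tree to its leaves first, so the offsets are computed over leaves only. Whether the reader seen inside customScore is one of those leaves is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a reader tree; this is not a Lucene class.
final class ReaderNode {
    final int maxDoc;                // document count, meaningful for leaves only
    final List<ReaderNode> children; // empty for a leaf (segment-level) reader

    ReaderNode(int maxDoc) {
        this.maxDoc = maxDoc;
        this.children = new ArrayList<ReaderNode>();
    }

    ReaderNode(List<ReaderNode> children) {
        this.maxDoc = 0;
        this.children = children;
    }
}

final class DocBaseSketch {
    // Flatten nested readers depth-first so only leaves remain, in index order.
    static void gatherLeaves(ReaderNode node, List<ReaderNode> leaves) {
        if (node.children.isEmpty()) {
            leaves.add(node);
            return;
        }
        for (ReaderNode child : node.children) {
            gatherLeaves(child, leaves);
        }
    }

    // starts[i] is the index-wide number of the first document in leaf i.
    static int[] docStarts(List<ReaderNode> leaves) {
        int[] starts = new int[leaves.size()];
        int base = 0;
        for (int i = 0; i < leaves.size(); i++) {
            starts[i] = base;
            base += leaves.get(i).maxDoc;
        }
        return starts;
    }

    // Step (1): find the current leaf by identity. Step (2): add its offset.
    static int toGlobal(List<ReaderNode> leaves, int[] starts,
                        ReaderNode current, int localDoc) {
        for (int i = 0; i < leaves.size(); i++) {
            if (leaves.get(i) == current) {
                return starts[i] + localDoc;
            }
        }
        throw new IllegalArgumentException("reader is not a leaf of this tree");
    }
}
```

With leaves holding 10, 5 and 7 documents, the offsets are 0, 10 and 15, so local doc 3 in the third leaf maps to index-wide doc 18.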

I also think I'm doing this wrong, because ReaderUtil has nothing like this?

Is there some way to note for later that a particular document came through this function query or should I just accept the fact of using the field cache?

-Paul





RE: recording a universal ID from DocID in a CustomScoreQuery

Posted by Paul Allan Hill <pa...@metajure.com>.
To complete this thread: I read the document itself with a one-field FieldSelector, so as not to bother with anything but exactly what I needed at this point in the code (particularly not the text body).

Then I saved the primary key (the path) of documents that visited this CustomScoreQuery (function query) in a Set<String> seenDocs:
                seenDocs.add(reader.document(docId, fieldSelector).getFieldable(KEY_FIELD).stringValue());

If we do introduce a short globally unique ID field, the code needs little change to switch to a different field.

When the entire query has rounded up all the results, it determines which ones came through that function query by consulting the set of seenDocs.
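The bookkeeping described here can be sketched in plain Java, independent of Lucene. SeenDocsSketch, markSeen and flagResults are hypothetical names for illustration; the keys stand in for the stored path values, and how each key is read from the index is left out.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: remembers keys seen during scoring, flags results later.
final class SeenDocsSketch {
    private final Set<String> seenDocs = new HashSet<String>();

    // Called while scoring, once per document that reaches the function query.
    void markSeen(String key) {
        seenDocs.add(key);
    }

    // Called after the search: maps each result key to whether it was seen,
    // preserving the order of the result list.
    Map<String, Boolean> flagResults(List<String> resultKeys) {
        Map<String, Boolean> flags = new LinkedHashMap<String, Boolean>();
        for (String key : resultKeys) {
            flags.put(key, seenDocs.contains(key));
        }
        return flags;
    }
}
```

Because the set holds stable keys rather than doc numbers, the lookup does not depend on which segment a document lives in.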

I decided NOT to use the FieldCache for this particular application, because the number of documents that result from this part of the query is very small compared to all documents.
Their rarity was the point of knowing, so that I could mark the result as 'special' for other parts of the application.  Such special documents get different treatment in the UI, but that's not my concern; just identifying which ones was the useful part for the index layer.

As usual thanks for the feedback.

-Paul

> -----Original Message-----
> From: Ian Lea [mailto:ian.lea@gmail.com]
> Sent: Monday, February 06, 2012 3:54 AM
> To: java-user@lucene.apache.org
> Subject: Re: recording a universal ID from DocID in a CustomScoreQuery
> 
> int doc will be for the subreader, not for the entire index.
> oal.search.Collector has setNextReader(IndexReader reader, int
> docBase) which you might somehow be able to use.  Failing that I'd go for FieldCache, or store the
> docids in a Set in a Map keyed by current Reader, if that would give you what you needed for the
> subsequent messing around.
> 
> 
> --
> Ian.
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: recording a universal ID from DocID in a CustomScoreQuery

Posted by Ian Lea <ia...@gmail.com>.
int doc will be for the subreader, not for the entire index.
oal.search.Collector has setNextReader(IndexReader reader, int
docBase) which you might somehow be able to use.  Failing that I'd go
for FieldCache, or store the docids in a Set in a Map keyed by current
Reader, if that would give you what you needed for the subsequent
messing around.
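A minimal sketch of the "Set in a Map keyed by current Reader" idea, in plain Java. PerReaderDocs, record and toGlobal are hypothetical names, and Object stands in for the reader type. It assumes the same reader instance is visible both where the doc number is recorded and where the docBase becomes known (for example in setNextReader), which is why the map compares keys by identity.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: stores reader-local doc numbers per reader instance.
final class PerReaderDocs {
    // Identity-based map: two distinct readers never share an entry.
    private final Map<Object, Set<Integer>> localDocsByReader =
            new IdentityHashMap<Object, Set<Integer>>();

    // Record a reader-local doc number against the reader it belongs to.
    void record(Object reader, int localDoc) {
        Set<Integer> docs = localDocsByReader.get(reader);
        if (docs == null) {
            docs = new HashSet<Integer>();
            localDocsByReader.put(reader, docs);
        }
        docs.add(localDoc);
    }

    // Once the reader's docBase is known, translate to index-wide numbers.
    Set<Integer> toGlobal(Object reader, int docBase) {
        Set<Integer> locals = localDocsByReader.get(reader);
        if (locals == null) {
            return Collections.<Integer>emptySet();
        }
        Set<Integer> global = new TreeSet<Integer>();
        for (int local : locals) {
            global.add(docBase + local);
        }
        return global;
    }
}
```

Keeping the numbers per reader avoids the collision in the original code, where local doc 3 from two different segments would land in the same Set<Integer> as a single entry.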


--
Ian.


