Posted to solr-dev@lucene.apache.org by Ryan McKinley <ry...@gmail.com> on 2007/04/28 22:19:46 UTC

Luke handler help

I have a few things I'd like to check with the Luke handler; if you all 
could check some of the assumptions, that would be great.

* I want to print out the document frequency for a term in a given 
document.  Since that term shows up in the given document, I would think 
the document frequency must be at least 1.  I am using: 
reader.docFreq( t ) [line 236].  The results seem reasonable, but 
*sometimes* it returns zero... is that possible?
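
For concreteness, a minimal sketch of the call in question (Lucene 2.x 
API; the index path, field name, and term text are made-up placeholders):

  IndexReader reader = IndexReader.open("/path/to/index");
  try {
    // docFreq() counts documents containing this exact *indexed* term;
    // it knows nothing about stored-only field values.
    Term t = new Term("title", "solr");
    int df = reader.docFreq(t);  // 0 if the term was never indexed
    System.out.println(t + " df=" + df);
  } finally {
    reader.close();
  }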

* I want to return the lucene field flags for each field.  I run through 
all the field names with: 
reader.getFieldNames(IndexReader.FieldOption.ALL).  Is there a way to 
get any Fieldable for a given name?  IIUC, all fields with the same name 
will have the same flags.  I tried searching for a document with that 
field; that works, but only for stored fields.
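
A sketch of what I'm doing now (Lucene 2.1 Fieldable API; reader is an 
open IndexReader and the document number is a placeholder for one known 
to contain the fields).  As noted, this only surfaces flags for fields 
that happen to be stored:

  Document doc = reader.document(0);  // hypothetical doc with the fields
  Collection fieldNames = reader.getFieldNames(IndexReader.FieldOption.ALL);
  for (Iterator it = fieldNames.iterator(); it.hasNext();) {
    String name = (String) it.next();
    Fieldable f = doc.getFieldable(name);
    if (f == null) continue;  // not stored on this doc; flags unreachable
    System.out.println(name + " stored=" + f.isStored()
        + " indexed=" + f.isIndexed() + " tokenized=" + f.isTokenized());
  }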

* I just realized that I am only returning stored fields from 
getDocumentFieldsInfo() (it uses Document.getFields()).  How can I 
find *all* Fieldables for a given document?  I have tried following the 
luke source, but get a bit lost ;)

* Each field gets a boolean attribute "cacheableFaceting" -- this is true 
if the number of distinct terms is smaller than the filterCacheSize.  I 
get the filterCacheSize from: solrconfig.xml:"query/filterCache/@size" 
and get the distinct term count by counting up the termEnum.  Is this 
logic solid?  I know the cacheability changes if you are faceting 
multiple fields at once, but it's still nice to have a ballpark estimate 
without needing to know the internals.
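
The counting half of that logic looks roughly like this (Lucene 2.x 
TermEnum idiom; reader, fieldName, and filterCacheSize stand for the 
values described above):

  int distinct = 0;
  TermEnum te = reader.terms(new Term(fieldName, ""));  // field's first term
  try {
    while (te.term() != null && fieldName.equals(te.term().field())) {
      distinct++;
      if (!te.next()) break;
    }
  } finally {
    te.close();
  }
  // one cached filter per distinct term -> cacheable if they all fit
  boolean cacheableFaceting = distinct < filterCacheSize;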


thanks for any pointers
ryan

Re: Luke handler help

Posted by Yonik Seeley <yo...@apache.org>.
> > In an inverted index, terms point to documents.   So you have to
> > traverse *all* of the terms of a field across all documents, and keep
> > track of when you run across the document you are interested in.  When
> > you do, then get the positions that the term appeared at, and keep
> > track of them.  After you have covered all the terms, you can put
> > everything in order.  There could be gaps (positionIncrement, stop
> > word removal, etc) and it's also possible for multiple tokens to
> > appear at the same position.
> >
> > For a full-text field with many terms, and a large index, this could
> > take a *long* time.
> > It's probably very useful for debugging though.

I just realized that it's worse... if you specified a field, then you
only have to iterate the terms for that field.  If you want *all* of
the indexed, non-stored fields for a particular document, but don't
know what they are, there is no info to help you.  You need to iterate
over *all* terms in the index.

Luckily, there is a patch in the works in Lucene that will make
skipTo(myDoc) in TermDocs faster.  That should speed things up a
little.
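
In code, the traversal described above would be roughly this sketch 
(Lucene 2.x API; field and docId are hypothetical, reader is an open 
IndexReader):

  Map posToTerms = new TreeMap();  // position -> List of term texts
  TermEnum te = reader.terms(new Term(field, ""));
  TermPositions tp = reader.termPositions();
  try {
    while (te.term() != null && field.equals(te.term().field())) {
      tp.seek(te.term());
      if (tp.skipTo(docId) && tp.doc() == docId) {
        for (int i = 0; i < tp.freq(); i++) {
          Integer pos = new Integer(tp.nextPosition());
          List texts = (List) posToTerms.get(pos);
          if (texts == null) posToTerms.put(pos, texts = new ArrayList());
          texts.add(te.term().text());  // tokens can share a position
        }
      }
      if (!te.next()) break;
    }
  } finally {
    tp.close();
    te.close();
  }
  // iterating posToTerms in key order approximates the original token
  // stream; stop words / positionIncrement show up as missing positions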

> > Remember that df is not updated when a document is marked for deletion
> > in Lucene.
> > So you can have a df of 2, do a search, and only come up with one document.
> >
>
> that would explain why I'm seeing df > 1 for the uniqueKey!

Yep, that's not likely to ever be fixed in Lucene.  Again, it's the
nature of the inverted index... given a particular docid, you really
have no clue what terms in the index point to that docid.

-Yonik

Re: Luke handler help

Posted by Ryan McKinley <ry...@gmail.com>.
Yonik Seeley wrote:
> On 4/28/07, Ryan McKinley <ry...@gmail.com> wrote:
>> I have a few things I'd like to check with the Luke handler; if you all
>> could check some of the assumptions, that would be great.
>>
>> * I want to print out the document frequency for a term in a given
>> document.  Since that term shows up in the given document, I would think
>> the document frequency must be at least 1.  I am using:
>> reader.docFreq( t ) [line 236].  The results seem reasonable, but
>> *sometimes* it returns zero... is that possible?
> 
> Is the field indexed?
> Did you run the field through the analyzer to get the terms (to match
> what's in the index)?
> If both of those are true, it seems like the docFreq should always be
> greater than 0.
> 

aah, that makes sense - now that you mention it, I only see df=0 for 
non-indexed, stored fields.


> 
> In an inverted index, terms point to documents.   So you have to
> traverse *all* of the terms of a field across all documents, and keep
> track of when you run across the document you are interested in.  When
> you do, then get the positions that the term appeared at, and keep
> track of them.  After you have covered all the terms, you can put
> everything in order.  There could be gaps (positionIncrement, stop
> word removal, etc) and it's also possible for multiple tokens to
> appear at the same position.
> 
> For a full-text field with many terms, and a large index, this could
> take a *long* time.
> It's probably very useful for debugging though.
> 

That must be why luke starts a new thread for 'reconstruct and edit'. 
For now, I will leave this out of the handler and leave it open to 
someone with the need/time in the future.


>> * Each field gets a boolean attribute "cacheableFaceting" -- this is true
>> if the number of distinct terms is smaller than the filterCacheSize.  I
>> get the filterCacheSize from: solrconfig.xml:"query/filterCache/@size"
>> and get the distinct term count by counting up the termEnum.  Is this
>> logic solid?  I know the cacheability changes if you are faceting
>> multiple fields at once, but it's still nice to have a ballpark estimate
>> without needing to know the internals.
> 
> It could get trickier... I'm about to hack up a quick patch now that
> will reduce memory usage by only using the filterCache above a
> certain df threshold.  It may increase or decrease the faceting
> speed - TBD.
> 
> Also, other alternate faceting schemes are in the works (a month or two 
> out).
> I'd leave this attribute out and just report on the number of unique terms.

ok, that seems reasonable.


> Some kind of histogram might be really nice though (how many terms
> under varying df values):
>  1=>412  (412 terms have a df of 1)
>  2=>516  (516 terms have a df of 2 or less)
>  4=>600
>  8=>650
> 16=>670
> 32=>680
> 64=>683
> 128=>685
> 256=>686
> 11325=>690  (the maxDf found)
> 

I'll take a look at that.


> Remember that df is not updated when a document is marked for deletion
> in Lucene.
> So you can have a df of 2, do a search, and only come up with one document.
> 

that would explain why I'm seeing df > 1 for the uniqueKey!


Re: Luke handler help

Posted by Yonik Seeley <yo...@apache.org>.
On 4/28/07, Ryan McKinley <ry...@gmail.com> wrote:
> I have a few things I'd like to check with the Luke handler; if you all
> could check some of the assumptions, that would be great.
>
> * I want to print out the document frequency for a term in a given
> document.  Since that term shows up in the given document, I would think
> the document frequency must be at least 1.  I am using:
> reader.docFreq( t ) [line 236].  The results seem reasonable, but
> *sometimes* it returns zero... is that possible?

Is the field indexed?
Did you run the field through the analyzer to get the terms (to match
what's in the index)?
If both of those are true, it seems like the docFreq should always be
greater than 0.
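
Concretely, that second check means something like this (Lucene 2.x 
TokenStream API; the analyzer, field name, and text are stand-ins):

  TokenStream ts = analyzer.tokenStream("title",
      new StringReader("Solr In Action"));
  for (Token tok = ts.next(); tok != null; tok = ts.next()) {
    // query df with the *analyzed* token text, not the raw field value
    Term t = new Term("title", tok.termText());
    System.out.println(t + " df=" + reader.docFreq(t));
  }
  ts.close();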

> * I want to return the lucene field flags for each field.  I run through
> all the field names with:
> reader.getFieldNames(IndexReader.FieldOption.ALL).  Is there a way to
> get any Fieldable for a given name?  IIUC, all fields with the same name
> will have the same flags.  I tried searching for a document with that
> field; that works, but only for stored fields.
>
> * I just realized that I am only returning stored fields from
> getDocumentFieldsInfo() (it uses Document.getFields()).  How can I
> find *all* Fieldables for a given document?  I have tried following the
> luke source, but get a bit lost ;)

LOL... if it's an inverted index, it's difficult and time-consuming to
try and reconstruct what a non-stored field value was.

In an inverted index, terms point to documents.   So you have to
traverse *all* of the terms of a field across all documents, and keep
track of when you run across the document you are interested in.  When
you do, then get the positions that the term appeared at, and keep
track of them.  After you have covered all the terms, you can put
everything in order.  There could be gaps (positionIncrement, stop
word removal, etc) and it's also possible for multiple tokens to
appear at the same position.

For a full-text field with many terms, and a large index, this could
take a *long* time.
It's probably very useful for debugging though.

> * Each field gets a boolean attribute "cacheableFaceting" -- this is true
> if the number of distinct terms is smaller than the filterCacheSize.  I
> get the filterCacheSize from: solrconfig.xml:"query/filterCache/@size"
> and get the distinct term count by counting up the termEnum.  Is this
> logic solid?  I know the cacheability changes if you are faceting
> multiple fields at once, but it's still nice to have a ballpark estimate
> without needing to know the internals.

It could get trickier... I'm about to hack up a quick patch now that
will reduce memory usage by only using the filterCache above a
certain df threshold.  It may increase or decrease the faceting
speed - TBD.

Also, other alternate faceting schemes are in the works (a month or two out).
I'd leave this attribute out and just report on the number of unique terms.
Some kind of histogram might be really nice though (how many terms
under varying df values):
  1=>412  (412 terms have a df of 1)
  2=>516  (516 terms have a df of 2 or less)
  4=>600
  8=>650
 16=>670
 32=>680
 64=>683
128=>685
256=>686
11325=>690  (the maxDf found)
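
One way to gather that (a sketch using the Lucene 2.x TermEnum; reader 
and fieldName are placeholders) is to bucket each term by the smallest 
power of two at or above its df; a running sum over the buckets then 
yields the cumulative counts shown above:

  int[] buckets = new int[31];  // buckets[i]: terms whose df rounds up to 2^i
  int maxDf = 0;
  TermEnum te = reader.terms(new Term(fieldName, ""));
  try {
    while (te.term() != null && fieldName.equals(te.term().field())) {
      int df = te.docFreq();
      if (df > maxDf) maxDf = df;
      int b = 0;
      while ((1 << b) < df) b++;  // smallest power of two >= df
      buckets[b]++;
      if (!te.next()) break;
    }
  } finally {
    te.close();
  }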

Remember that df is not updated when a document is marked for deletion
in Lucene.
So you can have a df of 2, do a search, and only come up with one document.
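
A toy illustration of the mismatch (Lucene 2.x; the term is a 
hypothetical uniqueKey that was overwritten, so the old copy is marked 
deleted):

  Term t = new Term("id", "doc42");   // hypothetical uniqueKey term
  int df = reader.docFreq(t);         // counts deleted docs too: e.g. 2
  int live = 0;
  TermDocs td = reader.termDocs(t);   // TermDocs *does* skip deletions
  while (td.next()) live++;
  td.close();
  // after an overwrite: df == 2 but live == 1, until the deletes are
  // physically merged away (e.g. by an optimize)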

-Yonik