Posted to solr-user@lucene.apache.org by "mike.schultz" <mi...@gmail.com> on 2009/09/04 18:34:14 UTC

capturing field length into a stored document field

For various statistics I collect from an index, it's important for me to know
the length (measured in tokens) of a document field.  I can get that
information to some degree from the "norms" for the field, but a) the
resolution isn't that great, and b) more importantly, if boosts are used
it's almost impossible to recover lengths from them.

Here are two ideas I was thinking about that maybe someone can comment on.

1) Use copyField to copy the field in question, fieldA, to an additional
field, fieldALength, which has an extra filter that just counts the tokens
and only outputs a token representing the length of the field.  This has the
disadvantage of retokenizing basically the whole document (because the field
in question is basically the body).  Plus I would think littering the term
space with these tokens might be bad for performance; I'm not sure.
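To make idea 1 concrete, here is a plain-Java sketch of the logic such a filter would implement (this is not the actual Lucene TokenFilter API, and the __fieldlength.<n>__ token format is invented for illustration):

```java
import java.util.List;

// Sketch of idea 1: consume all tokens of the copied field and emit a
// single synthetic token encoding the field length. A real version
// would be a Lucene TokenFilter on the fieldALength analyzer chain;
// the names here are made up for illustration.
public class LengthTokenSketch {
    // Given the token stream of fieldALength, produce the one token
    // the filter would output instead of the original terms.
    static String lengthToken(List<String> tokens) {
        return "__fieldlength." + tokens.size() + "__";
    }

    public static void main(String[] args) {
        List<String> body = List.of("the", "quick", "brown", "fox");
        System.out.println(lengthToken(body)); // __fieldlength.4__
    }
}
```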

2) Add a filter to the field in question which again counts the tokens.
This filter allows the regular tokens to be indexed as usual but somehow
manages to get the token count into a stored field of the document.  This
has the advantage of not having to retokenize the field, and instead of
littering the token space, the count becomes per-doc data.  Can this be
done?  Maybe using a ThreadLocal to temporarily store the count?
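The ThreadLocal handoff could look roughly like this; it's a plain-Java sketch with invented names, standing in for a real TokenFilter plus a later document-building step that runs on the same thread:

```java
// Sketch of idea 2: the token-counting filter publishes its count
// through a ThreadLocal so a later stage of document building (on the
// same thread) can copy it into a stored field. Class and method
// names are hypothetical.
public class TokenCountHandoff {
    private static final ThreadLocal<Integer> COUNT =
            ThreadLocal.withInitial(() -> 0);

    // Called by the hypothetical filter once per token it passes through.
    static void onToken() {
        COUNT.set(COUNT.get() + 1);
    }

    // Called after the field is analyzed; returns the count and resets
    // the counter for the next document on this thread.
    static int takeCount() {
        int n = COUNT.get();
        COUNT.set(0);
        return n;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) onToken(); // filter saw 5 tokens
        System.out.println(takeCount()); // 5, ready to store
        System.out.println(takeCount()); // 0, counter was reset
    }
}
```

The reset in takeCount matters because indexing threads are reused across documents.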

Thanks.

-- 
View this message in context: http://www.nabble.com/capturing-field-length-into-a-stored-document-field-tp25297690p25297690.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: capturing field length into a stored document field

Posted by Grant Ingersoll <gs...@apache.org>.
Similarity.lengthNorm() is a callback from Lucene that gives you the
information you seek.  Of course, the trick is still how to use it.
Perhaps you can describe a bit more about why you need that length.
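For reference, the default lengthNorm is 1/sqrt(numTokens), and what actually gets stored is boost * lengthNorm squeezed into a single byte, which is why boosts make the length unrecoverable from norms.  A small sketch of the formula (the byte encoding itself is omitted; this mirrors DefaultSimilarity but is not Lucene code):

```java
// Sketch of why norms lose the field length: different (boost, length)
// pairs collide before the one-byte encoding even adds its own
// resolution loss.
public class NormSketch {
    // Default lengthNorm, as in Lucene's DefaultSimilarity.
    static float lengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    public static void main(String[] args) {
        // A 4-token field with boost 1.0 and a 16-token field with
        // boost 2.0 yield the same pre-encoding norm value.
        System.out.println(lengthNorm(4) * 1.0f);  // 0.5
        System.out.println(lengthNorm(16) * 2.0f); // 0.5
    }
}
```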


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: capturing field length into a stored document field

Posted by "mike.schultz" <mi...@gmail.com>.
Here's a hybrid solution.  Add a filter to the field in question that counts
all the tokens and, at the end, outputs a token of the form
__numtokens.<numTokens>__.  This eliminates the need to retokenize the
field.  Also, bucket the numbers, either by some factor of ten or by base 2,
so that there aren't so many different token types produced.  This has a
space advantage over storing the count in a field, especially since the
information isn't needed at query time anyway.
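A sketch of the bucketing, assuming base-2 buckets (the count is rounded down to the nearest power of two before building the token):

```java
// Sketch of the hybrid idea: one synthetic token per document, with
// the token count bucketed to a power of two so only a few dozen
// distinct terms exist index-wide. The __numtokens.<n>__ format
// follows the convention proposed in the post.
public class BucketedLengthToken {
    // Largest power of two <= n, for n >= 1.
    static int bucket(int n) {
        return Integer.highestOneBit(n);
    }

    static String lengthToken(int numTokens) {
        return "__numtokens." + bucket(numTokens) + "__";
    }

    public static void main(String[] args) {
        System.out.println(lengthToken(150));  // __numtokens.128__
        System.out.println(lengthToken(1000)); // __numtokens.512__
    }
}
```

With base-2 buckets, a body field of up to a million tokens needs only about 20 distinct length terms, so the term-space pollution the first post worried about stays bounded.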



-- 
View this message in context: http://www.nabble.com/capturing-field-length-into-a-stored-document-field-tp25297690p25339584.html
Sent from the Solr - User mailing list archive at Nabble.com.