Posted to dev@lucene.apache.org by Maxim Patramanskij <ma...@osua.de> on 2003/04/22 12:04:37 UTC

Top n words

Hello developers.

I have the following question: is it possible to retrieve the 'n' most
frequently occurring words in the index? What steps should I follow to
do this?

Thanks in advance
Max


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

Re[2]: Top n words

Posted by Maxim Patramanskij <ma...@osua.de>.

Hello Doug,

Thanks a lot for your feedback, it is exactly what I was looking for.
:)

Max

DC> Maxim Patramanskij wrote:
>> I have the following question: is it possible to retrieve the 'n' most
>> frequently occurring words in the index? What steps should I follow to
>> do this?

DC> There is a class in the sandbox which does this.  Check out:

DC> http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/miscellaneous/src/java/org/apache/lucene/misc/

DC> Doug



Re: Term highlighting

Posted by Doug Cutting <cu...@lucene.com>.

Jonathan Baxter wrote:
> I have been looking at implementing highlighting of the terms in the 
> documents returned by Lucene. I'd rather not have to retokenize the 
> document on-the-fly in order to locate the terms, since this is slow 
> and wasteful

Have you actually implemented this and found it to be too slow in your 
application?  I suspect not.

Since most folks only display around 10 hits at a time, it is typically 
quite fast to re-tokenize these.  Keep in mind that, even if you knew 
the positions of the matching tokens, you'd still need to scan some of 
the document text to construct a context string.  And typically you'll 
not be interested in showing all of the matches in the document, but 
only a handful of the better matches.  The practical advantage of 
knowing character positions is thus usually quite small.
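
To make the re-tokenizing approach concrete, here is a minimal sketch 
in plain Java. It is not Lucene code: the class name SimpleHighlighter, 
the snippet() method, and the regex tokenizer are all assumptions made 
for illustration. A real application would probably re-run the same 
Analyzer it used at index time. Query terms are assumed to be already 
lower-cased.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: highlights query terms by re-tokenizing the text. */
final class SimpleHighlighter {
    // Stand-in tokenizer (an assumption): runs of letters and digits.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    private static boolean isHit(Matcher m, Set<String> queryTerms) {
        return queryTerms.contains(m.group().toLowerCase(Locale.ROOT));
    }

    /**
     * Finds the first token that matches a query term and returns a window
     * of up to `context` characters on each side of it, with every matching
     * token that lies fully inside the window wrapped in <b>...</b>.
     * Returns "" when no token matches.
     */
    static String snippet(String text, Set<String> queryTerms, int context) {
        Matcher m = WORD.matcher(text);
        int from = -1;
        int to = -1;
        while (m.find()) {
            if (isHit(m, queryTerms)) {
                from = Math.max(0, m.start() - context);
                to = Math.min(text.length(), m.end() + context);
                break;
            }
        }
        if (from < 0) {
            return "";
        }

        // Second pass: copy the window, marking the matching tokens.
        StringBuilder out = new StringBuilder();
        int last = from;
        m.reset();
        while (m.find() && m.start() < to) {
            boolean inside = m.start() >= from && m.end() <= to;
            if (inside && isHit(m, queryTerms)) {
                out.append(text, last, m.start())
                   .append("<b>").append(m.group()).append("</b>");
                last = m.end();
            }
        }
        out.append(text, last, to);
        return out.toString();
    }
}
```

Calling snippet() once per displayed hit keeps the cost proportional to 
the handful of documents on the result page, which is the point made 
above. The window is cut at character positions, so it may begin or end 
in the middle of a word.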

> - have I missed something obvious and in fact there is a simple way to 
> extract term-location information for a specific document from the 
> lucene index?

No, Lucene does not provide this.

> - if not, would it be horribly slow to try and do it post-facto after 
> hits have been found by scanning through the ".prx" file from the 
> start of the information for each term in the query?

Yes, this would be slow, about as slow as running the query again.  And 
it would only give you the ordinal position of the term, not its 
character position.

> - if the answer to the second question is "yes - horribly slow", would 
> it make sense then to add an extra field to each entry in the ".frq" 
> file indicating where the location information for the term and 
> document is in the ".prx" file (ie, the .frq file info for each term 
> would consist of a series of <doc_num, freq, prx_pointer_offset> 
> triples where prx_pointer_offset gives the number of bytes to skip in 
> the .prx file to get to the location information for the specified 
> document)? The prx_pointer_offset could then be used in a boolean 
> query to compute pointers for each hit indicating where in the .prx 
> file the location information for each term starts. 

This would nearly double the size of the .frq file, and thus make 
searches nearly twice as slow, as they'd have to process double the 
data.  (Frequency entries only require a couple of bits on average, so 
the majority of space in the .frq is document numbers.)  And still, 
you'd only have the ordinal position.
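
A rough way to see the size effect is the sketch below. It assumes the 
.frq layout as I understand it from the file format documentation: each 
entry is a variable-length integer holding the document delta shifted 
left one bit, with the low bit set when the frequency is one, and a 
separate frequency value only otherwise. The extra pointer field is the 
hypothetical one from the proposal, and the class and method names are 
made up for this illustration.

```java
/** Hypothetical size estimate for .frq entries, with and without a .prx pointer. */
final class FrqSizeSketch {

    /** Bytes needed for a variable-length int with 7 data bits per byte. */
    static int vIntSize(int value) {
        int bytes = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    /** Size of one <doc, freq> entry; freq == 1 is folded into the low bit. */
    static int entrySize(int docDelta, int freq) {
        if (freq == 1) {
            return vIntSize((docDelta << 1) | 1);
        }
        return vIntSize(docDelta << 1) + vIntSize(freq);
    }

    /** Same entry plus the proposed (hypothetical) prx_pointer_offset field. */
    static int entrySizeWithPrxPointer(int docDelta, int freq, int prxSkipBytes) {
        return entrySize(docDelta, freq) + vIntSize(prxSkipBytes);
    }
}
```

For a term that occurs once in a document five documents after the 
previous one, entrySize(5, 1) is one byte, while 
entrySizeWithPrxPointer(5, 1, 3) is two bytes. Under these assumptions 
the common small entries double in size.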

Also, the bookkeeping and memory required to track and store the 
positions of each match would make search a lot slower.

In short, re-tokenizing is the most efficient way to do term 
highlighting, especially when you consider the expense of the 
alternatives on the rest of the system.  There's no point in making 
highlighting fast if it makes searches slow.

Doug


Term highlighting

Posted by Jonathan Baxter <jb...@panscient.com>.

I have been looking at implementing highlighting of the terms in the 
documents returned by Lucene. I'd rather not have to retokenize the 
document on-the-fly in order to locate the terms, since this is slow 
and wasteful as lucene already has the term-location information (at 
least lucene stores the index of the term locations in the document, 
which can be turned into a character offset provided you store the 
mapping from token positions to character offsets somewhere else - eg 
as an unindexed field). 

Looking under the hood, it seems from the source that in order to 
extract the term location information for a specific document one 
would need to scan the ".prx" file sequentially starting at the 
offset in the file of the term, until the document number is found. 
This probably wouldn't be necessary for a phrase query, since in that 
case the .prx file is already being scanned, and so one could just 
save a pointer to the start of the location information for each term 
in the phrase for each hit. 

However, for boolean queries, it is the ".frq" file that is scanned 
not the ".prx" file, so there isn't anywhere to get the location 
information without rescanning the ".prx" file after finding all the 
hits. 

So, my question(s):

- have I missed something obvious and in fact there is a simple way to 
extract term-location information for a specific document from the 
lucene index?

- if not, would it be horribly slow to try and do it post-facto after 
hits have been found by scanning through the ".prx" file from the 
start of the information for each term in the query?

- if the answer to the second question is "yes - horribly slow", would 
it make sense then to add an extra field to each entry in the ".frq" 
file indicating where the location information for the term and 
document is in the ".prx" file (ie, the .frq file info for each term 
would consist of a series of <doc_num, freq, prx_pointer_offset> 
triples where prx_pointer_offset gives the number of bytes to skip in 
the .prx file to get to the location information for the specified 
document)? The prx_pointer_offset could then be used in a boolean 
query to compute pointers for each hit indicating where in the .prx 
file the location information for each term starts. 

Thanks,

Jonathan 

--
Jonathan Baxter
jbaxter@panscient.com



Re: Top n words

Posted by Doug Cutting <cu...@lucene.com>.

Maxim Patramanskij wrote:
> I have the following question: is it possible to retrieve the 'n' most
> frequently occurring words in the index? What steps should I follow to
> do this?

There is a class in the sandbox which does this.  Check out:

http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/miscellaneous/src/java/org/apache/lucene/misc/

Doug



Re: Top n words

Posted by Tobias Kroha <to...@pressline.de>.

Maxim Patramanskij wrote:
> Hello developers.
> 
> I have the following question: is it possible to retrieve the 'n' most
> frequently occurring words in the index? What steps should I follow to
> do this?

IndexReader.terms() returns a TermEnum, an enumeration of all terms in
the index. You can generate a sorted list using the enumeration's
docFreq() method.
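
The selection step could look something like the sketch below. To keep 
it self-contained it does not touch Lucene at all: TermStat is a 
hypothetical holder for one term's text and docFreq() value, and 
TopTerms.topN() is a made-up name. In a real program you would 
presumably fill the iterator by walking the TermEnum and reading 
term().text() and docFreq() for each entry.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch: pick the n terms with the highest document frequency. */
final class TopTerms {

    /** Hypothetical holder for one (term text, docFreq) pair from the index. */
    static final class TermStat {
        final String text;
        final int docFreq;

        TermStat(String text, int docFreq) {
            this.text = text;
            this.docFreq = docFreq;
        }
    }

    /**
     * Returns up to n terms, highest docFreq first. A min-heap bounded at
     * n entries keeps memory proportional to n rather than to the number
     * of distinct terms. The order among equal frequencies is unspecified.
     */
    static List<TermStat> topN(Iterator<TermStat> terms, int n) {
        List<TermStat> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        Comparator<TermStat> byFreq = Comparator.comparingInt((TermStat t) -> t.docFreq);
        PriorityQueue<TermStat> heap = new PriorityQueue<>(n, byFreq);
        while (terms.hasNext()) {
            TermStat t = terms.next();
            if (heap.size() < n) {
                heap.add(t);
            } else if (t.docFreq > heap.peek().docFreq) {
                heap.poll();   // drop the current weakest candidate
                heap.add(t);
            }
        }
        result.addAll(heap);
        result.sort(byFreq.reversed());
        return result;
    }
}
```

Note that docFreq() counts the documents that contain a term, not the 
total number of occurrences, so this ranks terms by how many documents 
they appear in.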

hope it helps,
Tobias


> 
> Thanks in advance
> Max