Posted to dev@lucene.apache.org by Maxim Patramanskij <ma...@osua.de> on 2003/04/22 12:04:37 UTC
Top n words
Hello developers.
I have the following question: is it possible to retrieve the 'n' most
frequently occurring words in the index? What steps should I follow to
accomplish this?
Thanks in advance
Max
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
Re[2]: Top n words
Posted by Maxim Patramanskij <ma...@osua.de>.
Hello Doug,
Thanks a lot for your feedback, it is exactly what I was looking for.
:)
Max
DC> Maxim Patramanskij wrote:
>> I have the following question: is it possible to retrieve the 'n' most
>> frequently occurring words in the index? What steps should I follow to
>> accomplish this?
DC> There is a class in the sandbox which does this. Check out:
DC> http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/miscellaneous/src/java/org/apache/lucene/misc/
DC> Doug
Re: Term highlighting
Posted by Doug Cutting <cu...@lucene.com>.
Jonathan Baxter wrote:
> I have been looking at implementing highlighting of the terms in the
> documents returned by Lucene. I'd rather not have to retokenize the
> document on-the-fly in order to locate the terms, since this is slow
> and wasteful
Have you actually implemented this and found it to be too slow in your
application? I suspect not.
Since most folks only display around 10 hits at a time, it is typically
quite fast to re-tokenize these. Keep in mind that, even if you knew
the positions of the matching tokens, you'd still need to scan some of
the document text to construct a context string. And typically you won't
be interested in showing all of the matches in the document, only a
handful of the better ones. The practical advantage of knowing
character positions is thus usually quite small.
> - have I missed something obvious and in fact there is a simple way to
> extract term-location information for a specific document from the
> lucene index?
No, Lucene does not provide this.
> - if not, would it be horribly slow to try and do it post-facto after
> hits have been found by scanning through the ".prx" file from the
> start of the information for each term in the query?
Yes, this would be slow, about as slow as running the query again. And
it would only give you the ordinal position of the term, not its
character position.
> - if the answer to the second question is "yes - horribly slow", would
> it make sense then to add an extra field to each entry in the ".frq"
> file indicating where the location information for the term and
> document is in the ".prx" file (ie, the .frq file info for each term
> would consist of a series of <doc_num, freq, prx_pointer_offset>
> triples where prx_pointer_offset gives the number of bytes to skip in
> the .prx file to get to the location information for the specified
> document)? The prx_pointer_offset could then be used in a boolean
> query to compute pointers for each hit indicating where in the .prx
> file the location information for each term starts.
This would nearly double the size of the .frq file, and thus make
searches nearly twice as slow, as they'd have to process double the
data. (Frequency entries only require a couple of bits on average, so
the majority of space in the .frq is document numbers.) And still,
you'd only have the ordinal position.
Also, the bookkeeping and memory required to track and store the
positions of each match would make search a lot slower.
In short, re-tokenizing is the most efficient way to do term
highlighting, especially when you consider the expense of the
alternatives on the rest of the system. There's no point in making
highlighting fast if it makes searches slow.
Doug
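
Editor's note: the re-tokenizing approach described above can be sketched roughly as follows. This is a minimal illustration, not Lucene code: the class SimpleHighlighter, the method fragments(), the regex-based tokenizer and the <b> markers are all assumptions made for this example. A real application would re-run the same analyzer that was used at index time. The sketch also simply takes the first few matches, whereas the message above suggests picking the better ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Lucene.
final class SimpleHighlighter {

    // Assumption: a token is a run of letters or digits, matched case-insensitively.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");

    /**
     * Re-tokenizes the stored text and returns up to maxFragments snippets,
     * each with the matching token wrapped in <b>...</b> and surrounded by
     * up to contextChars characters of context on either side.
     * queryTerms is expected to hold lower-cased terms.
     */
    static List<String> fragments(String text, Set<String> queryTerms,
                                  int contextChars, int maxFragments) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (out.size() < maxFragments && m.find()) {
            String token = m.group().toLowerCase(Locale.ROOT);
            if (!queryTerms.contains(token)) {
                continue; // not a query term, keep scanning
            }
            // The character offsets come straight from the tokenizer.
            int from = Math.max(0, m.start() - contextChars);
            int to = Math.min(text.length(), m.end() + contextChars);
            out.add(text.substring(from, m.start())
                    + "<b>" + m.group() + "</b>"
                    + text.substring(m.end(), to));
        }
        return out;
    }
}
```

Because only the handful of hits shown on a page are re-tokenized, the cost is proportional to the displayed text rather than to the index, which is the trade-off argued for above.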
Term highlighting
Posted by Jonathan Baxter <jb...@panscient.com>.
I have been looking at implementing highlighting of the terms in the
documents returned by Lucene. I'd rather not have to retokenize the
document on-the-fly in order to locate the terms, since this is slow
and wasteful, as Lucene already has the term-location information (at
least, Lucene stores the index of the term locations in the document,
which can be turned into a character offset provided you store the
mapping from token positions to character offsets somewhere else, e.g.
as an unindexed field).
Looking under the hood, it seems from the source that in order to
extract the term location information for a specific document one
would need to scan the ".prx" file sequentially starting at the
offset in the file of the term, until the document number is found.
This probably wouldn't be necessary for a phrase query, since in that
case the .prx file is already being scanned, and so one could just
save a pointer to the start of the location information for each term
in the phrase for each hit.
However, for boolean queries, it is the ".frq" file that is scanned,
not the ".prx" file, so there isn't anywhere to get the location
information without rescanning the ".prx" file after finding all the
hits.
So, my question(s):
- have I missed something obvious and in fact there is a simple way to
extract term-location information for a specific document from the
lucene index?
- if not, would it be horribly slow to try and do it post-facto after
hits have been found by scanning through the ".prx" file from the
start of the information for each term in the query?
- if the answer to the second question is "yes - horribly slow", would
it make sense then to add an extra field to each entry in the ".frq"
file indicating where the location information for the term and
document is in the ".prx" file (ie, the .frq file info for each term
would consist of a series of <doc_num, freq, prx_pointer_offset>
triples where prx_pointer_offset gives the number of bytes to skip in
the .prx file to get to the location information for the specified
document)? The prx_pointer_offset could then be used in a boolean
query to compute pointers for each hit indicating where in the .prx
file the location information for each term starts.
Thanks,
Jonathan
--
Jonathan Baxter
jbaxter@panscient.com
Re: Top n words
Posted by Doug Cutting <cu...@lucene.com>.
Maxim Patramanskij wrote:
> I have the following question: is it possible to retrieve the 'n' most
> frequently occurring words in the index? What steps should I follow to
> accomplish this?
There is a class in the sandbox which does this. Check out:
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/miscellaneous/src/java/org/apache/lucene/misc/
Doug
Re: Top n words
Posted by Tobias Kroha <to...@pressline.de>.
Maxim Patramanskij wrote:
> Hello developers.
>
> I have the following question: is it possible to retrieve the 'n' most
> frequently occurring words in the index? What steps should I follow to
> accomplish this?
IndexReader.terms() gives you a TermEnum, an enumeration of all terms in the index.
You can generate a sorted list using the docFreq() method of the enumeration.
hope it helps,
Tobias
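
Editor's note: the selection step suggested above might look something like the sketch below. To keep it self-contained it does not call Lucene at all: a plain Map of term to document frequency stands in for the values one would collect while walking the term enumeration and reading docFreq() for each term. The names TopTerms, TermCount and topN are made up for this example. A bounded heap is used so that only n candidates are held at any time, rather than sorting the whole term list.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper for illustration only; not a Lucene class.
final class TopTerms {

    /** A term paired with the number of documents that contain it. */
    static final class TermCount {
        final String term;
        final int docFreq;

        TermCount(String term, int docFreq) {
            this.term = term;
            this.docFreq = docFreq;
        }
    }

    // Highest docFreq first; ties broken alphabetically so results are stable.
    static final Comparator<TermCount> BEST_FIRST = (a, b) -> {
        int c = Integer.compare(b.docFreq, a.docFreq);
        return c != 0 ? c : a.term.compareTo(b.term);
    };

    /** Returns the n entries with the highest docFreq, best first. */
    static List<TermCount> topN(Map<String, Integer> docFreqs, int n) {
        List<TermCount> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Reversed ordering puts the weakest kept candidate at the heap root.
        PriorityQueue<TermCount> heap = new PriorityQueue<>(BEST_FIRST.reversed());
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            heap.add(new TermCount(e.getKey(), e.getValue()));
            if (heap.size() > n) {
                heap.poll(); // evict the current weakest candidate
            }
        }
        result.addAll(heap);
        result.sort(BEST_FIRST);
        return result;
    }
}
```

Note that this ranks terms by the number of documents containing them, which is what docFreq() reports, not by the total number of occurrences.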