Posted to dev@lucene.apache.org by Tobias Kroha <to...@pressline.de> on 2003/04/21 10:12:02 UTC

TermVector

Hi,

Is anyone still working on TermVector support? The last mail covering
this subject was from Dmitry, and he said he would upload the present
code to the sandbox, but I can't find anything there.

I'm willing to help implement this feature. Could someone give me
information about the current status?

bye,
Tobias



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re[2]: Top n words

Posted by Maxim Patramanskij <ma...@osua.de>.
Hello Doug,

Thanks a lot for your feedback, it is exactly what I was looking for.
:)

Max

DC> Maxim Patramanskij wrote:
>> I have the following question: is it possible to retrieve 'n' most
>> often appeared words in the index? What steps I should follow to
>> fulfill this?

DC> There is a class in the sandbox which does this.  Check out:

DC> http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/miscellaneous/src/java/org/apache/lucene/misc/

DC> Doug




Re: Term highlighting

Posted by Doug Cutting <cu...@lucene.com>.
Jonathan Baxter wrote:
> I have been looking at implementing highlighting of the terms in the 
> documents returned by Lucene. I'd rather not have to retokenize the 
> document on-the-fly in order to locate the terms, since this is slow 
> and wasteful

Have you actually implemented this and found it to be too slow in your 
application?  I suspect not.

Since most folks only display around 10 hits at a time, it is typically 
quite fast to re-tokenize these.  Keep in mind that, even if you knew 
the positions of the matching tokens, you would still need to scan some 
of the document text to construct a context string.  And typically 
you'll not be interested in showing all of the matches in the document, 
but only a handful of the better matches.  The practical advantage of 
knowing character positions is thus usually quite small.
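
To make the re-tokenizing approach concrete, here is a minimal sketch in
plain Java. It is not Lucene code: the class name, the regex tokenizer and
the <b> markup are all assumptions made for illustration, and a real
application would reuse the Analyzer it indexed with. It scans the text
once, stops after a handful of matches, and builds a short context string
around each one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RetokenizingHighlighter {
    // Assumption: a token is a run of letters or digits, compared in lower case.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");

    /**
     * Re-tokenizes text and returns up to maxFragments context strings, each
     * built around one matching token wrapped in <b>...</b>.
     */
    static List<String> highlight(String text, Set<String> queryTerms,
                                  int contextChars, int maxFragments) {
        List<String> fragments = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        int lastEnd = 0; // end of the previous fragment, so fragments never overlap
        while (fragments.size() < maxFragments && m.find()) {
            String token = m.group().toLowerCase(Locale.ROOT);
            if (!queryTerms.contains(token) || m.start() < lastEnd) {
                continue; // not a query term, or already inside the last fragment
            }
            int from = Math.max(lastEnd, m.start() - contextChars);
            int to = Math.min(text.length(), m.end() + contextChars);
            fragments.add(text.substring(from, m.start())
                    + "<b>" + m.group() + "</b>"
                    + text.substring(m.end(), to));
            lastEnd = to;
        }
        return fragments;
    }

    public static void main(String[] args) {
        String text = "Lucene is a search library. Lucene scores documents.";
        for (String fragment : highlight(text, Set.of("lucene"), 5, 2)) {
            System.out.println(fragment);
        }
    }
}
```

Only the first match in each fragment is marked, which keeps the sketch
short. The work is bounded by the length of the displayed documents, not by
the size of the index.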

> - have I missed something obvious and in fact there is a simple way to 
> extract term-location information for a specific document from the 
> lucene index?

No, Lucene does not provide this.

> - if not, would it be horribly slow to try and do it post-facto after 
> hits have been found by scanning through the ".prx" file from the 
> start of the information for each term in the query?

Yes, this would be slow, about as slow as running the query again.  And 
it would only give you the ordinal position of the term, not its 
character position.

> - if the answer to the second question is "yes - horribly slow", would 
> it make sense then to add an extra field to each entry in the ".frq" 
> file indicating where the location information for the term and 
> document is in the ".prx" file (ie, the .frq file info for each term 
> would consist of a series of <doc_num, freq, prx_pointer_offset> 
> triples where prx_pointer_offset gives the number of bytes to skip in 
> the .prx file to get to the location information for the specified 
> document)? The prx_pointer_offset could then be used in a boolean 
> query to compute pointers for each hit indicating where in the .prx 
> file the location information for each term starts. 

This would nearly double the size of the .frq file, and thus make 
searches nearly twice as slow, as they'd have to process double the 
data.  (Frequency entries only require a couple of bits on average, so 
the majority of space in the .frq is document numbers.)  And still, 
you'd only have the ordinal position.
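
The size argument can be checked with a little arithmetic. The sketch
below assumes the .frq layout of the time: each entry is a document delta
shifted left by one bit, with the low bit set when the frequency is 1, and
every number is written as a variable-length VInt of 7 bits per byte. The
class name, the posting data and the per-entry .prx skip values are made up
for illustration.

```java
class FrqSizeSketch {
    /** Number of bytes a non-negative int occupies as a VInt (7 payload bits per byte). */
    static int vIntBytes(int value) {
        int bytes = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    /** Bytes for one term's postings: (docDelta << 1 | freqIsOne), then freq if it is > 1. */
    static int frqBytes(int[] docs, int[] freqs) {
        int total = 0;
        int lastDoc = 0;
        for (int i = 0; i < docs.length; i++) {
            int delta = docs[i] - lastDoc;
            lastDoc = docs[i];
            if (freqs[i] == 1) {
                total += vIntBytes((delta << 1) | 1); // frequency folded into the low bit
            } else {
                total += vIntBytes(delta << 1) + vIntBytes(freqs[i]);
            }
        }
        return total;
    }

    /** Same postings plus a hypothetical VInt per entry: bytes to skip in the .prx file. */
    static int frqBytesWithPrxPointer(int[] docs, int[] freqs, int[] prxSkips) {
        int total = frqBytes(docs, freqs);
        for (int skip : prxSkips) {
            total += vIntBytes(skip);
        }
        return total;
    }

    public static void main(String[] args) {
        int[] docs = {3, 10, 11, 40};   // made-up document numbers
        int[] freqs = {1, 2, 1, 1};     // made-up term frequencies
        int[] prxSkips = {0, 1, 3, 4};  // made-up .prx skip values
        System.out.println("without pointer: " + frqBytes(docs, freqs));
        System.out.println("with pointer:    " + frqBytesWithPrxPointer(docs, freqs, prxSkips));
    }
}
```

For these four postings the entries take 5 bytes without the pointer and 9
bytes with it, since the pointer adds at least one byte to entries that
otherwise mostly fit in a single byte.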

Also, the bookkeeping and memory required to track and store the 
positions of each match would make search a lot slower.

In short, re-tokenizing is the most efficient way to do term 
highlighting, especially when you consider the expense of the 
alternatives on the rest of the system.  There's no point in making 
highlighting fast if it makes searches slow.

Doug




Term highlighting

Posted by Jonathan Baxter <jb...@panscient.com>.
I have been looking at implementing highlighting of the terms in the 
documents returned by Lucene. I'd rather not have to retokenize the 
document on the fly in order to locate the terms, since this is slow 
and wasteful: Lucene already has the term-location information. (At 
least, Lucene stores the ordinal positions of each term in the 
document, which can be turned into character offsets provided you 
store the mapping from token positions to character offsets somewhere 
else, e.g. as an unindexed field.)
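
The position-to-offset mapping mentioned above could look roughly like
the following plain-Java sketch. The class name, the regex tokenizer and
the space-separated encoding are assumptions for illustration only. It also
assumes every token advances the position by exactly one, and that the same
tokenizer is used for indexing and for building the map.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PositionOffsetMap {
    // Assumption: a token is a run of letters or digits.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");

    /** Encodes the start offset of every token, in token order, as "0 7 14 ...". */
    static String encode(String text) {
        StringBuilder sb = new StringBuilder();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(m.start());
        }
        return sb.toString();
    }

    /** Returns the character offset of the token at the given ordinal position. */
    static int offsetOf(String encoded, int position) {
        String[] offsets = encoded.split(" ");
        return Integer.parseInt(offsets[position]);
    }

    public static void main(String[] args) {
        String encoded = encode("Lucene stores term positions");
        System.out.println(encoded);
        System.out.println(offsetOf(encoded, 2));
    }
}
```

The encoded string is what would be stored alongside the document, so that
an ordinal position read from the index can be translated without
re-tokenizing the text.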

Looking under the hood, it seems from the source that in order to 
extract the term location information for a specific document one 
would need to scan the ".prx" file sequentially starting at the 
offset in the file of the term, until the document number is found. 
This probably wouldn't be necessary for a phrase query, since in that 
case the .prx file is already being scanned, and so one could just 
save a pointer to the start of the location information for each term 
in the phrase for each hit. 

However, for boolean queries, it is the ".frq" file that is scanned 
not the ".prx" file, so there isn't anywhere to get the location 
information without rescanning the ".prx" file after finding all the 
hits. 

So, my question(s):

- have I missed something obvious and in fact there is a simple way to 
extract term-location information for a specific document from the 
lucene index?

- if not, would it be horribly slow to try and do it post-facto after 
hits have been found by scanning through the ".prx" file from the 
start of the information for each term in the query?

- if the answer to the second question is "yes - horribly slow", would 
it make sense then to add an extra field to each entry in the ".frq" 
file indicating where the location information for the term and 
document is in the ".prx" file (ie, the .frq file info for each term 
would consist of a series of <doc_num, freq, prx_pointer_offset> 
triples where prx_pointer_offset gives the number of bytes to skip in 
the .prx file to get to the location information for the specified 
document)? The prx_pointer_offset could then be used in a boolean 
query to compute pointers for each hit indicating where in the .prx 
file the location information for each term starts. 

Thanks,

Jonathan 

--
Jonathan Baxter
jbaxter@panscient.com




Re: Top n words

Posted by Doug Cutting <cu...@lucene.com>.
Maxim Patramanskij wrote:
> I have the following question: is it possible to retrieve 'n' most
> often appeared words in the index? What steps I should follow to
> fulfill this?

There is a class in the sandbox which does this.  Check out:

http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/miscellaneous/src/java/org/apache/lucene/misc/

Doug




Re: Top n words

Posted by Tobias Kroha <to...@pressline.de>.
Maxim Patramanskij wrote:
> Hello developers.
> 
> I have the following question: is it possible to retrieve 'n' most
> often appeared words in the index? What steps I should follow to
> fulfill this?

IndexReader.terms() gives you a TermEnum, an enumeration of all terms in the index.
You can generate a sorted list using the docFreq() method of the enumeration.
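
As a rough illustration of the sorting step, here is a plain-Java sketch
that keeps only the n most frequent terms in a bounded min-heap. The class
name and the Map of term to document frequency are hypothetical stand-ins.
With Lucene, the pairs would come from walking the term enumeration and
reading the document frequency of each term.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopTermsSketch {
    /** Returns the n entries with the highest document frequency, highest first. */
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> docFreqs, int n) {
        Comparator<Map.Entry<String, Integer>> byFreq = Map.Entry.comparingByValue();
        // Min-heap: the least frequent of the current candidates sits on top.
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(byFreq);
        for (Map.Entry<String, Integer> entry : docFreqs.entrySet()) {
            heap.offer(entry);
            if (heap.size() > n) {
                heap.poll(); // drop the least frequent candidate
            }
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort(byFreq.reversed());
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreqs = new HashMap<>(); // made-up frequencies
        docFreqs.put("the", 90);
        docFreqs.put("lucene", 40);
        docFreqs.put("index", 25);
        docFreqs.put("vector", 3);
        for (Map.Entry<String, Integer> entry : topN(docFreqs, 2)) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

The heap never holds more than n + 1 entries, so memory stays small even
when the index contains a very large number of distinct terms.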

hope it helps,
Tobias


> 
> Thanks in advance
> Max





Top n words

Posted by Maxim Patramanskij <ma...@osua.de>.
Hello developers.

I have the following question: is it possible to retrieve the 'n' most
frequently occurring words in the index? What steps should I follow to
do this?

Thanks in advance
Max




RE: TermVector

Posted by Gregor Heinrich <he...@igd.fhg.de>.
Many thanks! Gregor

-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
Sent: Tuesday, April 22, 2003 3:25 PM
To: Lucene Developers List
Subject: RE: TermVector


Hello Gregor (the "other person" :)),

I've added a few relevant links to that bug report:
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=18927

I hope this helps.
In one of them Dmitry indicated that he changed his version of the code
some more, so I would also contact Dmitry and ask for the latest
version, if I were you.

Otis

--- Gregor Heinrich <he...@igd.fhg.de> wrote:
> Hello Otis, 
> 
> could you direct me to Dmitry's code? Can't seem to find it...
> 
> Thanks, Gregor
> 
> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> Sent: Tuesday, April 22, 2003 4:20 AM
> To: Lucene Developers List
> Subject: Re: TermVector
> 
> 
> Tobias (and the person who asked about TermVector earlier today and
> expressed interest in working on it),
> 
> As far as I know, what Dmitry sent was the last and most complete Term
> Vector support.  It consisted of some changes to Lucene classes and
> creation of additional files in the index.
> The changes looked relatively straightforward, as far as I can recall,
> but I never had time to actually patch Lucene and try building it with
> Dmitry's changes and new code.
> 
> That said, Eric Isakson recently captured Term Vector support related
> email in a Bugzilla entry:
> http://nagoya.apache.org/bugzilla/show_bug.cgi?id=18927
> 
> If you have time, interest, and knowledge, my suggestion is to just
> grab the latest code from Dmitry, read through it to gain some
> understanding about what Dmitry tried to achieve and how he went about
> it, plug it in, see where it breaks, and try fixing it.  If you can get
> it to fit into the latest version of Lucene then we can see what
> changes to the core are required, and if there are no drawbacks in
> making them, we'll make them.  If the changes have some ill
> side-effects, you could always host Lucene Term Vector support at
> SourceForge.
> 
> There were a few other people expressing interest in Term Vector
> support and work, including people from Jive Software and the CTO of
> the company I work for.
> 
> Otis
> 
> 
> --- Tobias Kroha <to...@pressline.de> wrote:
> > Hi,
> > 
> > is anyone still working on the TermVector support? The last mail
> > covering this subject was from Dmitry and he said he will upload the
> > present code to the sandbox, but I can't find anything there.
> >
> > I'm willing to help and implement this feature, could someone provide
> > me information about the current status?
> > 
> > bye,
> > Tobias
> > 
> > 
> > 




RE: TermVector

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello Gregor (the "other person" :)),

I've added a few relevant links to that bug report:
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=18927

I hope this helps.
In one of them Dmitry indicated that he changed his version of the code
some more, so I would also contact Dmitry and ask for the latest
version, if I were you.

Otis

--- Gregor Heinrich <he...@igd.fhg.de> wrote:
> Hello Otis, 
> 
> could you direct me to Dmitry's code? Can't seem to find it...
> 
> Thanks, Gregor
> 
> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> Sent: Tuesday, April 22, 2003 4:20 AM
> To: Lucene Developers List
> Subject: Re: TermVector
> 
> 
> Tobias (and the person who asked about TermVector earlier today and
> expressed interest in working on it),
> 
> As far as I know, what Dmitry sent was the last and most complete Term
> Vector support.  It consisted of some changes to Lucene classes and
> creation of additional files in the index.
> The changes looked relatively straightforward, as far as I can recall,
> but I never had time to actually patch Lucene and try building it with
> Dmitry's changes and new code.
> 
> That said, Eric Isakson recently captured Term Vector support related
> email in a Bugzilla entry:
> http://nagoya.apache.org/bugzilla/show_bug.cgi?id=18927
> 
> If you have time, interest, and knowledge, my suggestion is to just
> grab the latest code from Dmitry, read through it to gain some
> understanding about what Dmitry tried to achieve and how he went about
> it, plug it in, see where it breaks, and try fixing it.  If you can get
> it to fit into the latest version of Lucene then we can see what
> changes to the core are required, and if there are no drawbacks in
> making them, we'll make them.  If the changes have some ill
> side-effects, you could always host Lucene Term Vector support at
> SourceForge.
> 
> There were a few other people expressing interest in Term Vector
> support and work, including people from Jive Software and the CTO of
> the company I work for.
> 
> Otis
> 
> 
> --- Tobias Kroha <to...@pressline.de> wrote:
> > Hi,
> > 
> > is anyone still working on the TermVector support? The last mail
> > covering this subject was from Dmitry and he said he will upload the
> > present code to the sandbox, but I can't find anything there.
> >
> > I'm willing to help and implement this feature, could someone provide
> > me information about the current status?
> > 
> > bye,
> > Tobias
> > 
> > 
> > 




RE: TermVector

Posted by Gregor Heinrich <he...@igd.fhg.de>.
Hello Otis, 

could you direct me to Dmitry's code? Can't seem to find it...

Thanks, Gregor

-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
Sent: Tuesday, April 22, 2003 4:20 AM
To: Lucene Developers List
Subject: Re: TermVector


Tobias (and the person who asked about TermVector earlier today and
expressed interest in working on it),

As far as I know, what Dmitry sent was the last and most complete Term
Vector support.  It consisted of some changes to Lucene classes and
creation of additional files in the index.
The changes looked relatively straightforward, as far as I can recall,
but I never had time to actually patch Lucene and try building it with
Dmitry's changes and new code.

That said, Eric Isakson recently captured Term Vector support related
email in a Bugzilla entry:
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=18927

If you have time, interest, and knowledge, my suggestion is to just
grab the latest code from Dmitry, read through it to gain some
understanding about what Dmitry tried to achieve and how he went about
it, plug it in, see where it breaks, and try fixing it.  If you can get
it to fit into the latest version of Lucene then we can see what
changes to the core are required, and if there are no drawbacks in
making them, we'll make them.  If the changes have some ill
side-effects, you could always host Lucene Term Vector support at
SourceForge.

There were a few other people expressing interest in Term Vector
support and work, including people from Jive Software and the CTO of
the company I work for.

Otis


--- Tobias Kroha <to...@pressline.de> wrote:
> Hi,
> 
> is anyone still working on the TermVector support? The last mail
> covering this subject was from Dmitry and he said he will upload the
> present code to the sandbox, but I can't find anything there.
>
> I'm willing to help and implement this feature, could someone provide
> me information about the current status?
> 
> bye,
> Tobias
> 
> 
> 
> 




Re: TermVector

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Tobias (and the person who asked about TermVector earlier today and
expressed interest in working on it),

As far as I know, what Dmitry sent was the last and most complete Term
Vector support.  It consisted of some changes to Lucene classes and
creation of additional files in the index.
The changes looked relatively straightforward, as far as I can recall,
but I never had time to actually patch Lucene and try building it with
Dmitry's changes and new code.

That said, Eric Isakson recently captured Term Vector support related
email in a Bugzilla entry:
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=18927

If you have time, interest, and knowledge, my suggestion is to just
grab the latest code from Dmitry, read through it to gain some
understanding about what Dmitry tried to achieve and how he went about
it, plug it in, see where it breaks, and try fixing it.  If you can get
it to fit into the latest version of Lucene then we can see what
changes to the core are required, and if there are no drawbacks in
making them, we'll make them.  If the changes have some ill
side-effects, you could always host Lucene Term Vector support at
SourceForge.

There were a few other people expressing interest in Term Vector
support and work, including people from Jive Software and the CTO of
the company I work for.

Otis


--- Tobias Kroha <to...@pressline.de> wrote:
> Hi,
> 
> is anyone still working on the TermVector support? The last mail
> covering this subject was from Dmitry and he said he will upload the
> present code to the sandbox, but I can't find anything there.
>
> I'm willing to help and implement this feature, could someone provide
> me information about the current status?
> 
> bye,
> Tobias
> 
> 
> 
> 




RE: TermVector

Posted by Gregor Heinrich <he...@igd.fhg.de>.
Hi,

TermVector support and explicit access to a Term Doc Matrix would also be
interesting to me. Consequently, I would like to contribute if somebody
starts an effort.

Best regards, Gregor



-----Original Message-----
From: Tobias Kroha [mailto:tobias@pressline.de]
Sent: Monday, April 21, 2003 10:12 AM
To: Lucene Developers List
Subject: TermVector


Hi,

is anyone still working on the TermVector support? The last mail
covering this subject was from Dmitry and he said he will upload the
present code to the sandbox, but I can't find anything there.

I'm willing to help and implement this feature, could someone provide me
information about the current status?

bye,
Tobias


