
Posted to java-user@lucene.apache.org by hrishim <sm...@yahoo.co.in> on 2010/01/08 11:16:52 UTC

Term Frequency for phrases

Hi,
I have phrases like "brain natriuretic peptide" indexed as a single token
using Lucene.
When I calculate the term frequency for such a phrase, the count is 0, since
the tokens from the text are indexed separately, i.e. "brain", "natriuretic",
"peptide".
Is there a way to solve this problem and get the term frequency for the
entire phrase?

Regards,
Hrishi
-- 
View this message in context: http://old.nabble.com/Term-Frequency-for-phrases-tp27073866p27073866.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Term Frequency for phrases

Posted by Erick Erickson <er...@gmail.com>.

On a quick read, your statements are contradictory....

<<<I have phrases like brain natriuretic peptide indexed as a single
token>>>

<<<When I calculate the term frequency for the same  the count is 0 since the
tokens from the text are indexed separately i.e. brain , natriuretic ,
peptide.>>>

Either "brain natriuretic peptide" is a single token/term or it's not....

Are you sure you're not confusing indexing and storing? What
analyzer are you using at index time?

Erick


Re: Term Frequency for phrases

Posted by Jason Rutherglen <ja...@gmail.com>.

I'm not going to go into too much code-level detail; however, I'd index
the phrases using tri-gram shingles, as well as uni-grams.  I think
this will give you the results you're looking for.  You'll be able to
quickly recall the count of a given phrase, i.e. a tri-gram such as
"blue_shorts_burough".
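
To make the shingle idea concrete, here is a minimal sketch in plain Java.
It does not use Lucene at all; the class and method names (ShingleSketch,
shingles, termFrequencies) are made up for illustration, and the
whitespace splitting and lowercasing are assumptions standing in for
whatever analyzer is really used. It only shows why a tri-gram term such
as "brain_natriuretic_peptide" ends up with its own frequency once
tri-grams are emitted next to the uni-grams.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Lucene.
final class ShingleSketch {

    // Build n-word shingles from whitespace-separated text, joined with '_'.
    static List<String> shingles(String text, int n) {
        List<String> out = new ArrayList<String>();
        String trimmed = text.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i + n <= words.length; i++) {
            StringBuilder sb = new StringBuilder(words[i]);
            for (int j = 1; j < n; j++) {
                sb.append('_').append(words[i + j]);
            }
            out.add(sb.toString());
        }
        return out;
    }

    // Count uni-grams and tri-grams together, as if both were indexed terms.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<String, Integer>();
        for (int n : new int[] {1, 3}) {
            for (String term : shingles(text, n)) {
                Integer old = tf.get(term);
                tf.put(term, old == null ? 1 : old + 1);
            }
        }
        return tf;
    }

    public static void main(String[] args) {
        String abstractText =
            "brain natriuretic peptide is a marker and brain natriuretic peptide rises";
        Map<String, Integer> tf = termFrequencies(abstractText);
        System.out.println("tri-gram tf: " + tf.get("brain_natriuretic_peptide"));
        System.out.println("uni-gram tf: " + tf.get("brain"));
    }
}
```

The cost of this approach is a larger term dictionary, since every run of
three adjacent words becomes a term of its own.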


Re: Term Frequency for phrases

Posted by Erick Erickson <er...@gmail.com>.

What are the associated Analyzers for your Gene and Token fields?
If they're NOT something akin to KeywordAnalyzer, you have a problem.
Specifically, most of the "regular" tokenizers will break this stream
up into three separate terms, "brain", "natriuretic", and "peptide".
If that's the case, there is no single term "brain natriuretic peptide"
in your index.
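
A rough illustration of the difference, in plain Java rather than Lucene:
the two helper methods below (splitIntoWords and keepWhole, both
hypothetical names) only mimic a word-splitting analyzer and a
keyword-style analyzer. The trimming and lowercasing are assumptions of
this sketch, not a description of what any particular Lucene analyzer
does.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of Lucene.
final class AnalysisSketch {

    // Mimics a word-splitting analyzer: one term per whitespace-separated word.
    static List<String> splitIntoWords(String value) {
        String v = value.trim().toLowerCase();
        if (v.isEmpty()) {
            return Collections.<String>emptyList();
        }
        return Arrays.asList(v.split("\\s+"));
    }

    // Mimics keyword-style analysis: the whole field value becomes one term.
    // Trimming and lowercasing here are choices made for this sketch.
    static List<String> keepWhole(String value) {
        return Collections.singletonList(value.trim().toLowerCase());
    }

    public static void main(String[] args) {
        String gene = "brain natriuretic peptide ";
        String lookup = "brain natriuretic peptide";

        Set<String> wordTerms = new HashSet<String>(splitIntoWords(gene));
        Set<String> wholeTerms = new HashSet<String>(keepWhole(gene));

        // The phrase is not a term when the value was split into words...
        System.out.println("split into words: " + wordTerms.contains(lookup));
        // ...but it is a term when the value was kept whole.
        System.out.println("kept whole:       " + wholeTerms.contains(lookup));
    }
}
```

Looking up a term that was never written to the index finds nothing,
which is consistent with a term frequency of 0 for the whole phrase.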

I'm assuming that your high-level task is to answer "how many times
does the phrase 'brain natriuretic peptide' appear in the index (or maybe
in a doc)", right?

I really recommend that you get a copy of Luke and examine what's
actually in your index; it's invaluable.

See Jason's e-mail for another approach....

HTH
Erick



Re: Term Frequency for phrases

Posted by hrishim <sm...@yahoo.co.in>.

@All: Elaborating on the problem

The phrase is being indexed as a single token.
I have a Gene tag in the XML document, like this:
<Gene>brain natriuretic peptide </Gene>
This phrase is present in the abstract text for the given document.

The code is:

doc.add(new Field("Gene", geneName, Field.Store.YES,
Field.Index.ANALYZED,Field.TermVector.YES));

doc.add(new Field("Token", abstractText.toString().toLowerCase(),
Field.Store.YES, Field.Index.ANALYZED,Field.TermVector.YES));

When I retrieve all tokens as well as genes for a given doc and calculate
the tf for each of these, a null exception is thrown for
Term = brain natriuretic peptide:

TermDocs termDocs = indexReader.termDocs(term);
termDocs.next();
double tf = termDocs.freq();

Regards,
Hrishi



Re: Term Frequency for phrases

Posted by Grant Ingersoll <gs...@apache.org>.

When do you detect that they are phrases?  During indexing or during search?


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search



Re: Term Frequency for phrases

Posted by Michael McCandless <lu...@mikemccandless.com>.

Issue a PhraseQuery and count how many hits came back?  Is that too
slow?  If so, you could detect all phrases during indexing and add
them as tokens to the index?
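
As a plain-Java sketch of what "count the phrase" can mean (no Lucene
involved; PhraseCountSketch and its methods are hypothetical names, and
whitespace tokenization with lowercasing is an assumption), the code
below distinguishes the number of documents that contain the phrase,
which is what counting hits gives you, from the number of times the
phrase occurs inside each document.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
final class PhraseCountSketch {

    static String[] tokenize(String s) {
        String t = s.trim().toLowerCase();
        return t.isEmpty() ? new String[0] : t.split("\\s+");
    }

    // Number of positions at which the phrase's words occur consecutively.
    static int phraseFrequency(String text, String phrase) {
        String[] words = tokenize(text);
        String[] target = tokenize(phrase);
        if (target.length == 0) {
            return 0;
        }
        int count = 0;
        for (int i = 0; i + target.length <= words.length; i++) {
            boolean match = true;
            for (int j = 0; j < target.length; j++) {
                if (!words[i + j].equals(target[j])) {
                    match = false;
                    break;
                }
            }
            if (match) {
                count++;
            }
        }
        return count;
    }

    // Number of documents containing the phrase at least once (the "hits").
    static int hitCount(List<String> docs, String phrase) {
        int hits = 0;
        for (String doc : docs) {
            if (phraseFrequency(doc, phrase) > 0) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "brain natriuretic peptide is elevated and brain natriuretic peptide falls later",
            "the brain is complex",
            "natriuretic peptide from the brain natriuretic peptide family");
        String phrase = "brain natriuretic peptide";
        System.out.println("hits: " + hitCount(docs, phrase));
        for (String doc : docs) {
            System.out.println("tf:   " + phraseFrequency(doc, phrase));
        }
    }
}
```

If a per-document count is what is needed, the hit count alone will not
be enough, because a document that mentions the phrase twice still counts
as one hit.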

Mike
