You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by manjula wijewickrema <ma...@gmail.com> on 2010/05/06 12:39:58 UTC

Term/Phrase frequencies

Hi,

I am new to Lucene. If I want to know the term or phrase frequency of an
input document, will it be possible through Lucene?

Thanks,
Manjula

Re: Term/Phrase frequencies

Posted by Erick Erickson <er...@gmail.com>.
Well, counting frequency isn't the best approach. For instance, if a field
has 1,000 terms and 10 occurrences of your target, is that a better match
than a field with 10 terms and 5 occurrences of your target?

This kind of thing is already taken into account with Lucene scoring, you
might
want to look at this:
http://lucene.apache.org/java/2_4_0/scoring.html

<http://lucene.apache.org/java/2_4_0/scoring.html>Best
Erick

On Thu, May 6, 2010 at 11:48 PM, manjula wijewickrema
<ma...@gmail.com>wrote:

> Hi Erik,
>
> Thanks for the reply. What I want to do is, to identify key terms and key
> phrases of a document according to their number of occurences in the
> document. Output should be the highest freequency words and (two or three
> word) phrases. For this purpose can I use Lucene?
>
> Thanks
> Manjula
>
> On Thu, May 6, 2010 at 6:09 PM, Erick Erickson <erickerickson@gmail.com
> >wrote:
>
> > Terms are relatively easy, see TermFreqVector in the JavaDocs.
> >
> > Phrases aren't as easy, before you go there, though, what is the
> > high-level problem you're trying to solve? Possibly this is an XY problem
> > (see http://people.apache.org/~hossman/#xyproblem).
> >
> > Best
> > Erick
> >
> > On Thu, May 6, 2010 at 6:39 AM, manjula wijewickrema <
> manjula53@gmail.com
> > >wrote:
> >
> > > Hi,
> > >
> > > I am new to Lucene. If I want to know the term or phrase frequency of
> an
> > > input document, will it be possible through Lucene?
> > >
> > > Thanks,
> > > Manjula
> > >
> >
>

Re: Term/Phrase frequencies

Posted by manjula wijewickrema <ma...@gmail.com>.
Hi Erik,

Thanks for the reply. What I want to do is, to identify key terms and key
phrases of a document according to their number of occurences in the
document. Output should be the highest freequency words and (two or three
word) phrases. For this purpose can I use Lucene?

Thanks
Manjula

On Thu, May 6, 2010 at 6:09 PM, Erick Erickson <er...@gmail.com>wrote:

> Terms are relatively easy, see TermFreqVector in the JavaDocs.
>
> Phrases aren't as easy, before you go there, though, what is the
> high-level problem you're trying to solve? Possibly this is an XY problem
> (see http://people.apache.org/~hossman/#xyproblem).
>
> Best
> Erick
>
> On Thu, May 6, 2010 at 6:39 AM, manjula wijewickrema <manjula53@gmail.com
> >wrote:
>
> > Hi,
> >
> > I am new to Lucene. If I want to know the term or phrase frequency of an
> > input document, will it be possible through Lucene?
> >
> > Thanks,
> > Manjula
> >
>

Re: Term/Phrase frequencies

Posted by Erick Erickson <er...@gmail.com>.
Terms are relatively easy, see TermFreqVector in the JavaDocs.

Phrases aren't as easy, before you go there, though, what is the
high-level problem you're trying to solve? Possibly this is an XY problem
(see http://people.apache.org/~hossman/#xyproblem).

Best
Erick

On Thu, May 6, 2010 at 6:39 AM, manjula wijewickrema <ma...@gmail.com>wrote:

> Hi,
>
> I am new to Lucene. If I want to know the term or phrase frequency of an
> input document, will it be possible through Lucene?
>
> Thanks,
> Manjula
>