Posted to java-user@lucene.apache.org by Shouvik Bardhan <sb...@gisfederal.com> on 2015/02/15 17:58:30 UTC

High frequency terms in results document....

Apologies if I have missed this in earlier discussions, but I looked all
over. I looked at the Luke code, and it finds high-frequency terms across
the entire index. I am trying to get the top N high-frequency terms in the
documents returned by a search. I came across something called
FilterIndexReader, but I don't think it is part of the 4.x codebase. Any
pointers are appreciated.

Re: High frequency terms in results document....

Posted by Tomoko Uchida <to...@gmail.com>.

There seems to be a very similar discussion on this topic that I had
missed. It covers a number of approaches:
http://mail-archives.apache.org/mod_mbox/lucene-java-user/201502.mbox/%3CCAON7oqQh4aXoKfWyn=7oDzWC48h_VvJJaaBpfadmQeHsTzzfRw@mail.gmail.com%3E

> Looks like it goes thru every term and puts them in a priority queue and
> takes the top N.

Yes, Luke's top N terms feature (with Lucene's PriorityQueue under the hood)
is great, and its implementation is a very good reference.

Regards,
Tomoko
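
The "put every term in a priority queue and take the top N" idea discussed
above can be sketched with a bounded min-heap. This is only an illustration
in plain Java: it uses java.util.PriorityQueue rather than Lucene's own
PriorityQueue, the class and method names (TopTerms, topN) are made up for
this example, and the term counts are assumed to be available already as a
Map. It is not Luke's or Lucene's actual code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper for illustration; not part of Luke or Lucene.
final class TopTerms {

    /** Returns up to n entries with the highest counts, highest first. */
    static List<Map.Entry<String, Long>> topN(Map<String, Long> counts, int n) {
        List<Map.Entry<String, Long>> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        Comparator<Map.Entry<String, Long>> byCount = Map.Entry.comparingByValue();
        // Min-heap on count: the head is always the weakest of the current top n.
        PriorityQueue<Map.Entry<String, Long>> heap = new PriorityQueue<>(n, byCount);
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            if (heap.size() < n) {
                heap.offer(e);
            } else if (e.getValue() > heap.peek().getValue()) {
                // The new term beats the weakest kept term, so swap it in.
                heap.poll();
                heap.offer(e);
            }
        }
        result.addAll(heap);
        result.sort(byCount.reversed());
        return result;
    }
}
```

Because the heap never holds more than n entries, memory stays small no
matter how many terms are scanned, which is why a single pass over every
term can remain workable on a large index. The order of terms with equal
counts is not defined in this sketch.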



2015-02-19 22:44 GMT+09:00 Shouvik Bardhan <sb...@gisfederal.com>:

> Thanks for your input, Uchida. I will try that out. I wonder what the
> magic sauce is in Luke's set of calls that lets it produce, say, the top
> 100 terms even from an index with 100 million docs (small docs in my
> case). Looks like it goes thru every term and puts them in a priority
> queue and takes the top N.
>
> regards.

Re: High frequency terms in results document....

Posted by Shouvik Bardhan <sb...@gisfederal.com>.

Thanks for your input, Uchida. I will try that out. I wonder what the
magic sauce is in Luke's set of calls that lets it produce, say, the top
100 terms even from an index with 100 million docs (small docs in my
case). Looks like it goes thru every term and puts them in a priority
queue and takes the top N.

regards.

On Thu, Feb 19, 2015 at 2:10 AM, Tomoko Uchida <tomoko.uchida.1111@gmail.com> wrote:

> Hi,
>
> I'm afraid there is no easy or straightforward way to do what you need.
> I would try creating a tiny temporary index in memory from the search
> results on the fly, and then get the top N terms from it with
> HighFreqTerms.
>
> http://lucene.apache.org/core/4_10_3/misc/org/apache/lucene/misc/HighFreqTerms.html
> (The logic is almost the same as Luke's top N terms feature.)
>
> I have not tried this and am not sure whether it is a practical approach
> performance-wise; it is just an idea...
>
> Hope this helps,
> Tomoko

Re: High frequency terms in results document....

Posted by Tomoko Uchida <to...@gmail.com>.

Hi,

I'm afraid there is no easy or straightforward way to do what you need.
I would try creating a tiny temporary index in memory from the search
results on the fly, and then get the top N terms from it with
HighFreqTerms.
http://lucene.apache.org/core/4_10_3/misc/org/apache/lucene/misc/HighFreqTerms.html
(The logic is almost the same as Luke's top N terms feature.)

I have not tried this and am not sure whether it is a practical approach
performance-wise; it is just an idea...

Hope this helps,
Tomoko

2015-02-16 1:58 GMT+09:00 Shouvik Bardhan <sb...@gisfederal.com>:

> Apologies if I have missed this in earlier discussions, but I looked all
> over. I looked at the Luke code, and it finds high-frequency terms across
> the entire index. I am trying to get the top N high-frequency terms in the
> documents returned by a search. I came across something called
> FilterIndexReader, but I don't think it is part of the 4.x codebase. Any
> pointers are appreciated.
>
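
For readers who want to see the shape of the "count terms only in the
documents a search returned" idea without setting up Lucene, here is a
rough stand-in in plain Java. It does not build a temporary index or call
HighFreqTerms; it simply counts, over an in-memory list of already
tokenized result documents, how many documents contain each term. The
class and method names (ResultTermCounter, docFreq, topTerms) and the
assumption that each hit is available as a list of tokens are hypothetical
and not taken from the thread.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical stand-in for illustration; it does not use Lucene at all.
final class ResultTermCounter {

    /** Counts, for each term, how many of the given documents contain it. */
    static Map<String, Integer> docFreq(List<List<String>> docs) {
        Map<String, Integer> freq = new HashMap<>();
        for (List<String> tokens : docs) {
            // De-duplicate within a document so it counts once per term.
            for (String term : new HashSet<>(tokens)) {
                freq.merge(term, 1, Integer::sum);
            }
        }
        return freq;
    }

    /** Returns up to n terms (n >= 0), most frequent first; ties go alphabetically. */
    static List<String> topTerms(Map<String, Integer> freq, int n) {
        return freq.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
                        .thenComparing(Map.Entry.comparingByKey()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

This sketch sorts all distinct terms, which is fine for a small result set.
For a large one, the bounded priority queue approach mentioned in the
thread would avoid the full sort.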