Posted to java-user@lucene.apache.org by Adrian Dimulescu <ad...@gmail.com> on 2009/03/16 11:25:49 UTC

number of hits of pages containing two terms

Hello,

I need the number of pages that contain two given terms. I only need the 
hit count; I don't care about retrieving the pages themselves. Right now 
I am using the following code to get it:


import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// the two terms whose co-occurrence count I want
Term first, second;

TermQuery q1 = new TermQuery(first);
TermQuery q2 = new TermQuery(second);

// require both terms to match
BooleanQuery bQ = new BooleanQuery();
bQ.add(q1, Occur.MUST);
bQ.add(q2, Occur.MUST);

// ask for only one result; totalHits still counts every match
int hitsBoth = indexSearcher.search(bQ, null, 1).totalHits;

For my specific need of getting only the hit count, not the actual 
documents, this is very slow (half a second per query) on a recent PC. 
Is there a faster way?

In order to get the hits for one term, I use indexReader.docFreq(Term) 
-- no actual Query. I wonder if I could do something similar for two Terms.


Thank you
Adrian.




Re: number of hits of pages containing two terms

Posted by Paul Elschot <pa...@xs4all.nl>.
You may want to try Filters for this (starting from TermFilter), especially
ones based on the default OpenBitSet (see its intersection count method),
given your interest in stop words.
10k OpenBitSets for 39 M docs will probably not fit in memory in one go,
but that can be worked around by keeping fewer of them in memory at a time.

For non-stop words, you could also try using SortedVIntList instead
of OpenBitSet to reduce memory usage. In that case there is no direct
intersection count, but a counting iteration over the intersection can
still be done without actually materializing the resulting filter.
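
Roughly, and untested, something like the sketch below (TermBitSets is just
an illustrative helper name): build one OpenBitSet per term from its
TermDocs, cache as many as fit in memory, and take the intersection count
pairwise without ever materializing a Filter.

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.util.OpenBitSet;

public class TermBitSets {

    /** One bit per document containing the given term. */
    public static OpenBitSet bitsFor(IndexReader reader, Term term) throws IOException {
        OpenBitSet bits = new OpenBitSet(reader.maxDoc());
        TermDocs td = reader.termDocs(term);
        try {
            while (td.next()) {
                bits.fastSet(td.doc());
            }
        } finally {
            td.close();
        }
        return bits;
    }

    /** Number of documents containing both terms: a popcount of the AND, no Filter built. */
    public static long countBoth(OpenBitSet a, OpenBitSet b) {
        return OpenBitSet.intersectionCount(a, b);
    }
}

Building each bit set costs one sequential pass over that term's postings;
every pair count against two cached sets is then only a popcount over their
underlying long[] words.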

Regards,
Paul Elschot


On Tuesday 17 March 2009 12:35:19 Adrian Dimulescu wrote:
> Ian Lea wrote:
> > Adrian - have you looked any further into why your original two term
> > query was too slow?  My experience is that simple queries are usually
> > extremely fast.  
> Let me first point out that it is not "too slow" in absolute terms; it 
> is only slow for my particular need of computing the number of 
> co-occurrences between ideally all non-noise terms (I plan about 
> 10k x 10k = 100 million calculations).
> > How large is the index?
> I indexed Wikipedia (the 8 GB XML dump you can download). The index size 
> is 4.4 GB. I have 39 million documents. The particularity is that I cut 
> Wikipedia into paragraphs and consider each paragraph a Document (rather 
> than one page per Document, as usual), which makes for a lot of short 
> documents. Each document has a stored id and a non-stored analyzed body:
> 
>             doc.add(new Field("id", id, Store.YES, Index.NO));
>             doc.add(new Field("text", p, Store.NO, Index.ANALYZED));
> 
> > How many occurrences of your first or second
> > terms?  
> I do have in my index some words that are usually treated as "stop" 
> words. My first two terms are "and" (13M hits) and "s" (4M hits). I use 
> the SnowballAnalyzer to stem words.
> 
> My intuition is that the large number of short documents, and the fact 
> that I am interested in the "stop" words, do not help performance.
> 
> Thank you,
> Adrian.
> 

Re: number of hits of pages containing two terms

Posted by Chris Hostetter <ho...@fucit.org>.
: The final "production" computation is a one-time job; still, I have to keep
: coming back to correct some errors and then retry...

this doesn't really seem like a problem ideally suited for Lucene ... this 
seems like the type of problem sequential batch crunching could solve 
better...

first pass: tokenize each document into a bucket of words

second pass: count the occurrences of every word, and make a list of all 
words whose occurrence count is greater than N.

third pass: filter the word buckets from pass#1 so they only contain 
words in the list produced by pass#2

fourth pass: generate all pairs of words in every word bucket produced 
by pass#3

fifth pass: sort and count the unique pairs produced by pass#4


...i have a hard time thinking in terms of Map/Reduce steps, but i'm 
guessing a Hadoop-based app could do all this in a relatively 
straightforward manner.
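
For illustration, a rough single-machine Java sketch of passes #2 through
#5 (PairCounter and minDocFreq are placeholder names, the pass #1
tokenization is assumed to have happened elsewhere, and a real run over
39M paragraphs would want the external sort / Hadoop treatment rather than
in-memory maps):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PairCounter {

    /**
     * docTerms: one Set of tokens per paragraph (pass #1 output).
     * minDocFreq: the "greater than N" cutoff from pass #2.
     * Returns a map from "termA\ttermB" (sorted order) to co-occurrence count.
     */
    public static Map<String, Integer> countPairs(List<Set<String>> docTerms, int minDocFreq) {
        // pass #2: document frequency of every term
        Map<String, Integer> df = new HashMap<String, Integer>();
        for (Set<String> terms : docTerms) {
            for (String t : terms) {
                Integer c = df.get(t);
                df.put(t, c == null ? 1 : c + 1);
            }
        }
        // passes #3-#5: keep frequent terms, emit sorted pairs, count them
        Map<String, Integer> pairCounts = new HashMap<String, Integer>();
        for (Set<String> terms : docTerms) {
            List<String> kept = new ArrayList<String>();
            for (String t : terms) {
                if (df.get(t) > minDocFreq) {
                    kept.add(t);
                }
            }
            Collections.sort(kept);
            for (int i = 0; i < kept.size(); i++) {
                for (int j = i + 1; j < kept.size(); j++) {
                    String pair = kept.get(i) + "\t" + kept.get(j);
                    Integer c = pairCounts.get(pair);
                    pairCounts.put(pair, c == null ? 1 : c + 1);
                }
            }
        }
        return pairCounts;
    }
}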



-Hoss




Re: number of hits of pages containing two terms

Posted by Adrian Dimulescu <ad...@gmail.com>.
Michael McCandless wrote:
> Is this a one-time computation?  If so, couldn't you wait a long time
> for the machine to simply finish it?
The final "production" computation is a one-time job; still, I have to 
keep coming back to correct some errors and then retry...
>
> With the simple approach (doing 100 million 2-term AND queries), how
> long do you estimate it'd take?
As for the estimated time: my existing index is really problematic and I 
should look for ways to optimize it, but I really think analyzer-time 
frequencies should do the job.
> I think you could do this with your own analyzer (as you
> suggested)... it would run normal tokenization, gather all unique
> terms that occurred, discard the "noise" terms (odd to me that you
> don't consider stop words as noise -- or maybe you mean noise (non
> salient terms) at the bigram level?)
By noise I mainly meant terms with frequency 1 (misspelled words, 
garbage escaping my Analyzer). In my current attempts I am really 
interested in common words.

Thanks for the advice,
Adrian.



Re: number of hits of pages containing two terms

Posted by Michael McCandless <lu...@mikemccandless.com>.
Is this a one-time computation?  If so, couldn't you wait a long time
for the machine to simply finish it?

With the simple approach (doing 100 million 2-term AND queries), how
long do you estimate it'd take?

I think you could do this with your own analyzer (as you
suggested)... it would run normal tokenization, gather all unique
terms that occurred, discard the "noise" terms (odd to me that you
don't consider stop words as noise -- or maybe you mean noise (non
salient terms) at the bigram level?), but then have custom code that
emits "all pairs" of the unique tokens it encountered, which Lucene
then indexes.
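
Something along those lines, done outside the analyzer for simplicity and
untested (the PairDocumentBuilder class, the "pair" field name and the "|"
separator are only illustrative):

import java.util.Set;
import java.util.TreeSet;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class PairDocumentBuilder {

    /** Index every unordered pair of kept terms in a paragraph as one synthetic token. */
    public static Document build(String id, Set<String> keptTerms) {
        Document doc = new Document();
        doc.add(new Field("id", id, Field.Store.YES, Field.Index.NO));
        // sort so that "a|b" and "b|a" collapse onto the same token
        String[] sorted = new TreeSet<String>(keptTerms).toArray(new String[0]);
        for (int i = 0; i < sorted.length; i++) {
            for (int j = i + 1; j < sorted.length; j++) {
                doc.add(new Field("pair", sorted[i] + "|" + sorted[j],
                                  Field.Store.NO, Field.Index.NOT_ANALYZED));
            }
        }
        return doc;
    }
}

After indexing, reader.docFreq(new Term("pair", "a|b")) would directly give
the number of paragraphs containing both terms, with no query at search
time (subject to the same deletions caveat).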

The problem is this is an O(N^2) sort of thing, but maybe
co-occurrence constraints of natural language, plus the fact that
you're dealing w/ paragraphs, make this tenable / faster than the
simple approach above.

Mike

Adrian Dimulescu wrote:

> Ian Lea wrote:
>> Adrian - have you looked any further into why your original two term
>> query was too slow?  My experience is that simple queries are usually
>> extremely fast.
> Let me first point out that it is not "too slow" in absolute terms;  
> it is only slow for my particular need of computing the number of  
> co-occurrences between ideally all non-noise terms (I plan about  
> 10k x 10k = 100 million calculations).
>> How large is the index?
> I indexed Wikipedia (the 8 GB XML dump you can download). The index  
> size is 4.4 GB. I have 39 million documents. The particularity is  
> that I cut Wikipedia into paragraphs and consider each paragraph a  
> Document (rather than one page per Document, as usual), which makes  
> for a lot of short documents. Each document has a stored id and a  
> non-stored analyzed body:
>
>           doc.add(new Field("id", id, Store.YES, Index.NO));
>           doc.add(new Field("text", p, Store.NO, Index.ANALYZED));
>
>> How many occurrences of your first or second
>> terms?
> I do have in my index some words that are usually treated as  
> "stop" words. My first two terms are "and" (13M hits) and "s" (4M  
> hits). I use the SnowballAnalyzer to stem words.
>
> My intuition is that the large number of short documents, and the  
> fact that I am interested in the "stop" words, do not help performance.
>
> Thank you,
> Adrian.
>




Re: number of hits of pages containing two terms

Posted by Ian Lea <ia...@gmail.com>.
OK - thanks for the explanation.  So this is not just a simple search ...

I'll go away and leave you and Michael and the other experts to talk
about clever solutions.


--
Ian.


On Tue, Mar 17, 2009 at 11:35 AM, Adrian Dimulescu
<ad...@gmail.com> wrote:
> Ian Lea wrote:
>>
>> Adrian - have you looked any further into why your original two term
>> query was too slow?  My experience is that simple queries are usually
>> extremely fast.
>
> Let me first point out that it is not "too slow" in absolute terms; it is
> only slow for my particular need of computing the number of co-occurrences
> between ideally all non-noise terms (I plan about 10k x 10k = 100 million
> calculations).
>>
>> How large is the index?
>
> I indexed Wikipedia (the 8 GB XML dump you can download). The index size is
> 4.4 GB. I have 39 million documents. The particularity is that I cut
> Wikipedia into paragraphs and consider each paragraph a Document (rather than
> one page per Document, as usual), which makes for a lot of short documents.
> Each document has a stored id and a non-stored analyzed body:
>
>           doc.add(new Field("id", id, Store.YES, Index.NO));
>           doc.add(new Field("text", p, Store.NO, Index.ANALYZED));
>
>> How many occurrences of your first or second
>> terms?
>
> I do have in my index some words that are usually treated as "stop" words.
> My first two terms are "and" (13M hits) and "s" (4M hits). I use the
> SnowballAnalyzer to stem words.
>
> My intuition is that the large number of short documents, and the fact that
> I am interested in the "stop" words, do not help performance.
>
> Thank you,
> Adrian.
>



Re: number of hits of pages containing two terms

Posted by Adrian Dimulescu <ad...@gmail.com>.
Ian Lea wrote:
> Adrian - have you looked any further into why your original two term
> query was too slow?  My experience is that simple queries are usually
> extremely fast.  
Let me first point out that it is not "too slow" in absolute terms; it 
is only slow for my particular need of computing the number of 
co-occurrences between ideally all non-noise terms (I plan about 
10k x 10k = 100 million calculations).
> How large is the index?
I indexed Wikipedia (the 8 GB XML dump you can download). The index size 
is 4.4 GB. I have 39 million documents. The particularity is that I cut 
Wikipedia into paragraphs and consider each paragraph a Document (rather 
than one page per Document, as usual), which makes for a lot of short 
documents. Each document has a stored id and a non-stored analyzed body:

            doc.add(new Field("id", id, Store.YES, Index.NO));
            doc.add(new Field("text", p, Store.NO, Index.ANALYZED));

> How many occurrences of your first or second
> terms?  
I do have in my index some words that are usually treated as "stop" 
words. My first two terms are "and" (13M hits) and "s" (4M hits). I use 
the SnowballAnalyzer to stem words.

My intuition is that the large number of short documents, and the fact 
that I am interested in the "stop" words, do not help performance.

Thank you,
Adrian.



Re: number of hits of pages containing two terms

Posted by Ian Lea <ia...@gmail.com>.
This is all getting very complicated!

Adrian - have you looked any further into why your original two term
query was too slow?  My experience is that simple queries are usually
extremely fast.  Standard questions: have you warmed up the searcher?
How large is the index?  How many occurrences of your first or second
terms?  Anything odd about them?  See also
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed


--
Ian.


On Tue, Mar 17, 2009 at 11:00 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
>
> Adrian Dimulescu wrote:
>
>> Thank you.
>>
>> I suppose the solution for this is to not create an index but to store
>> co-occurrence frequencies at the Analyzer level.
>
> I don't understand how this would address the "docFreq does
> not reflect deletions" issue.
>
> You can use the shingles analyzer (under contrib/analyzers)
> to create and index bigrams.  (But the docFreq would still not
> reflect deletions).
>
>> Adrian.
>>
>> On Mon, Mar 16, 2009 at 11:37 AM, Michael McCandless <
>> lucene@mikemccandless.com> wrote:
>>
>>>
>>> Be careful: docFreq does not take deletions into account.
>>>
>
> Mike
>



Re: number of hits of pages containing two terms

Posted by Adrian Dimulescu <ad...@gmail.com>.
Michael McCandless wrote:
> I don't understand how this would address the "docFreq does
> not reflect deletions" issue.
>
Bad mail-quoting, sorry. I am not interested in document deletion; I 
just index Wikipedia once and want to compute a co-occurrence-based 
similarity distance between words called NGD (Normalized Google 
Distance). No documents get deleted. I wanted to do it the cheap way by 
indexing everything and then counting hits, but it is not fast enough (I 
want to calculate the distances between all pairs of words, that is 
approximately 10,000 x 10,000 comparisons).
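
For context, NGD is just a small formula over counts Lucene already
provides; a sketch in Java (the Ngd class name is mine; fx and fy are the
per-term document frequencies, fxy the co-occurrence count, n the total
number of paragraphs):

public class Ngd {

    // NGD(x,y) = (max(log fx, log fy) - log fxy) / (log n - min(log fx, log fy))
    // If fxy == 0 the terms never co-occur and the distance comes out infinite.
    public static double distance(long fx, long fy, long fxy, long n) {
        double lx = Math.log(fx);
        double ly = Math.log(fy);
        double lxy = Math.log(fxy);
        double ln = Math.log(n);
        return (Math.max(lx, ly) - lxy) / (ln - Math.min(lx, ly));
    }
}
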
> You can use the shingles analyzer (under contrib/analyzers)
> to create and index bigrams.  (But the docFreq would still not
> reflect deletions).
This similarity measure is based on co-occurrence within a text window. 
If I'm not mistaken, bigrams require the words to be exactly next to 
each other. But thanks for pointing out that analyzer, I may be 
interested in some n-gram processing later.


Adrian.



Re: number of hits of pages containing two terms

Posted by Michael McCandless <lu...@mikemccandless.com>.
Adrian Dimulescu wrote:

> Thank you.
>
> I suppose the solution for this is to not create an index but to store
> co-occurrence frequencies at the Analyzer level.

I don't understand how this would address the "docFreq does
not reflect deletions" issue.

You can use the shingles analyzer (under contrib/analyzers)
to create and index bigrams.  (But the docFreq would still not
reflect deletions).
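
If I remember the contrib API correctly, wiring that up looks roughly like
the sketch below (BigramAnalyzer is just an illustrative wrapper; substitute
your SnowballAnalyzer for StandardAnalyzer):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class BigramAnalyzer {

    /** Wraps a base analyzer so word bigrams are emitted as single index terms. */
    public static Analyzer create() {
        // maxShingleSize = 2: emit unigrams plus two-word shingles
        return new ShingleAnalyzerWrapper(new StandardAnalyzer(), 2);
    }
}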

> Adrian.
>
> On Mon, Mar 16, 2009 at 11:37 AM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
>>
>> Be careful: docFreq does not take deletions into account.
>>

Mike



Re: number of hits of pages containing two terms

Posted by Adrian Dimulescu <ad...@gmail.com>.
Thank you.

I suppose the solution for this is to not create an index but to store
co-occurrence frequencies at the Analyzer level.

Adrian.

On Mon, Mar 16, 2009 at 11:37 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

>
> Be careful: docFreq does not take deletions into account.
>

Re: number of hits of pages containing two terms

Posted by Michael McCandless <lu...@mikemccandless.com>.
Adrian Dimulescu wrote:

> Hello,
>
> I need the number of pages that contain two given terms. I only need  
> the hit count; I don't care about retrieving the pages themselves.  
> Right now I am using the following code to get it:
>
>
> Term first, second;
>
> TermQuery q1 = new TermQuery(first);
> TermQuery q2 = new TermQuery(second);
>
> BooleanQuery bQ = new BooleanQuery();
> bQ.add(q1, Occur.MUST);
> bQ.add(q2, Occur.MUST);
>
> int hitsBoth = indexSearcher.search(bQ, null, 1).totalHits;

You could make a custom HitCollector that simply counts.  But
likely this is not much faster than your call above.
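
Something like this, untested (CountingCollector is just an illustrative
name):

import org.apache.lucene.search.HitCollector;

public class CountingCollector extends HitCollector {

    private int count = 0;

    // called once per matching doc; no documents are loaded, we only count
    public void collect(int doc, float score) {
        count++;
    }

    public int getCount() {
        return count;
    }
}

Usage would be roughly: create the collector, call
indexSearcher.search(bQ, collector), then read collector.getCount()
instead of totalHits.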

> For my specific need of getting only the hit count, not the actual  
> documents, this is very slow (half a second per query) on a recent  
> PC. Is there a faster way?
>
> In order to get the hits for one term, I use  
> indexReader.docFreq(Term) -- no actual Query. I wonder if I could do  
> something similar for two Terms.

Be careful: docFreq does not take deletions into account.

Mike
