Posted to java-user@lucene.apache.org by An...@csiro.au on 2004/12/23 07:50:38 UTC

Word co-occurrences counts

Hi all,

I have a curious problem, and my initial poking around suggests that Lucene
may only be able to handle half of it.

The problem requires two abilities:

1.	To return the number of times a word appears across all
the documents (which it looks like Lucene can do through IndexReader)
2.	To return the number of word co-occurrences within
the document set (i.e., how many times does "computer" appear within 50
words of "dog"?)

Is the second point possible?

Thanks all, and happy holidays,

Andrew

Re: Word co-occurrences counts

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Dec 24, 2004, at 12:40 AM, Andrew Cunningham wrote:
> 3) and then:
>        word in document count = 
> hits.score(k)/Similarity.decodeNorm(reader.norms("contents")[k])

You should use hits.id(k), not k, as the index to 
reader.norms("contents").
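
To illustrate the difference between a hit's rank and its document id, here is a minimal sketch in plain Java. It is a toy model, not the Lucene API: the Hit class and both method names are made up for the example, and the norms array simply stands in for whatever reader.norms(...) returns, indexed by document id.

```java
// Toy model, not the Lucene API: norms are stored per document id,
// while hits are ordered by rank (best score first).
class RankVersusDocId {
    // Hypothetical stand-in for one search hit: the document id and its score.
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Divides the score by the norm of the SAME document the hit refers to.
    static float countFor(Hit hit, float[] normsByDocId) {
        return hit.score / normsByDocId[hit.docId];
    }

    // The mistaken variant: treats the rank k as if it were a document id.
    static float countByRank(Hit[] hits, int k, float[] normsByDocId) {
        return hits[k].score / normsByDocId[k];
    }
}
```

The two methods agree only when the rank order happens to match the document id order; as soon as the best-scoring hit is not document 0, countByRank divides by some other document's norm.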

	Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Word co-occurrences counts

Posted by Andrew Cunningham <cu...@csiro.au>.

Thanks Doug and all,

I intend to use Lucene to gather a lot of word co-occurrence
statistics from a large corpus, to perform word disambiguation.
Lucene looks like a great option, but I appear to have hit a snag.
Here's my understanding:

1) Create a Similarity implementation, where:
        tf() returns freq
        sloppyFreq, idf and coord return 1 (because we only need freq in the score)
2) Perform the query
3) and then:
        word in document count =
            hits.score(k)/Similarity.decodeNorm(reader.norms("contents")[k])
4) A query call such as
        "computer dog"~50
    will return a count of 2 (I assume because the match is found both
    backwards and forwards).

My problem occurs when I have the following in a text file:
    computer ...(some words)... dog ...(some words)... computer
and I duplicate the text file several times over. Performing the above
query then returns a different phrase count for each document, even
though the files are identical. Why would that be?

Note: I'm just working with some modified demo code at the moment.

Thanks again,
Andrew


Doug Cutting wrote:

> Andrew Cunningham wrote:
>
>> "computer dog"~50 looks like what I'm after - now is there someway I 
>> can call this and pull out the total number of occurrences, not just
>> the number of document hits? (say if computer and dog occur near each
>> other several times in the same document).
>
>
> You could use a custom Similarity implementation for this query, where 
> tf() is the identity function, idf() returns 1.0, etc., so that the 
> final score is the occurrence count.  You'll need to divide by 
> Similarity.decodeNorm(indexReader.norms("field")[doc]) at the end to 
> get rid of the lengthNorm() and field boost (if any).
>
> Doug

Re: Word co-occurrences counts

Posted by Andrew Cunningham <cu...@csiro.au>.

Thanks Doug,
This appears to work like a charm.

Doug Cutting wrote:

> Doug Cutting wrote:
>
>> You could use a custom Similarity implementation for this query, 
>> where tf() is the identity function, idf() returns 1.0, etc., so that 
>> the final score is the occurrence count.  You'll need to divide by 
>> Similarity.decodeNorm(indexReader.norms("field")[doc]) at the end to 
>> get rid of the lengthNorm() and field boost (if any).
>
>
> Much simpler would be to build a SpanNearQuery, call getSpans(), then 
> loop, counting how many times Spans.next() returns true.
>
> Doug

Re: Word co-occurrences counts

Posted by Doug Cutting <cu...@apache.org>.

Doug Cutting wrote:
> You could use a custom Similarity implementation for this query, where 
> tf() is the identity function, idf() returns 1.0, etc., so that the 
> final score is the occurrence count.  You'll need to divide by 
> Similarity.decodeNorm(indexReader.norms("field")[doc]) at the end to get 
> rid of the lengthNorm() and field boost (if any).

Much simpler would be to build a SpanNearQuery, call getSpans(), then 
loop, counting how many times Spans.next() returns true.
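
To make concrete what such a loop would be counting, here is a minimal sketch in plain Java that counts co-occurrences directly on a token array. It does not use Lucene at all: CoOccurrenceCounter, positions and countPairs are made-up names, and the pair-counting rule (every pair of positions within the window, in either order) is an assumption for illustration. The spans Lucene enumerates for a SpanNearQuery may group or overlap matches differently.

```java
import java.util.ArrayList;
import java.util.List;

// Toy co-occurrence counter over an already tokenized document (not Lucene code).
class CoOccurrenceCounter {
    // Positions at which `term` occurs in the token array.
    static List<Integer> positions(String[] tokens, String term) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                result.add(i);
            }
        }
        return result;
    }

    // Counts (a, b) position pairs at most `window` positions apart, in either order.
    static int countPairs(String[] tokens, String a, String b, int window) {
        int count = 0;
        for (int pa : positions(tokens, a)) {
            for (int pb : positions(tokens, b)) {
                if (pa != pb && Math.abs(pa - pb) <= window) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

Under this rule a document of the form "computer ... dog ... computer" yields two pairs when both computers fall inside the window, one on each side of dog.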

Doug

Re: Word co-occurrences counts

Posted by Doug Cutting <cu...@apache.org>.

Andrew Cunningham wrote:
> "computer dog"~50 looks like what I'm after - now is there someway I can 
> call this and pull out the total number of occurrences, not just the
> number of document hits? (say if computer and dog occur near each
> other several times in the same document).

You could use a custom Similarity implementation for this query, where 
tf() is the identity function, idf() returns 1.0, etc., so that the 
final score is the occurrence count.  You'll need to divide by 
Similarity.decodeNorm(indexReader.norms("field")[doc]) at the end to get 
rid of the lengthNorm() and field boost (if any).
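
The arithmetic behind that division can be sketched in plain Java. This is a simplified model, not Lucene's scorer: it assumes a single query clause, that every factor other than tf() and the norm is 1, and that the norm is boost / sqrt(number of terms in the field). CountFromScore and its method names are hypothetical.

```java
// Toy model of the scoring arithmetic (not Lucene's actual classes).
class CountFromScore {
    // Assumed length normalisation: boost divided by sqrt(number of terms in the field).
    static float norm(int numTerms, float boost) {
        return boost / (float) Math.sqrt(numTerms);
    }

    // With tf() as the identity and idf(), coord() etc. fixed at 1,
    // the score in this model reduces to freq * norm.
    static float score(float freq, float norm) {
        return freq * norm;
    }

    // Dividing the score by the same norm gives back the frequency.
    static float recoverCount(float score, float norm) {
        return score / norm;
    }
}
```

For a 16-term field with boost 1 the norm in this model is 0.25, so three matches score 0.75 and dividing by 0.25 returns 3. The recovery only works if the norm used for the division belongs to the same document as the score.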

Doug

Re: Word co-occurrences counts

Posted by Andrew Cunningham <cu...@csiro.au>.

"computer dog"~50 looks like what I'm after - now is there someway I can 
call this and pull out the total number of occurrences, not just the
number of document hits? (say if computer and dog occur near each
other several times in the same document).

Paul Elschot wrote:

>On Thursday 23 December 2004 07:50, Andrew.Cunningham@csiro.au wrote:
>
>>Hi all,
>>
>>I have a curious problem, and initial poking around with Lucene looks
>>like it may only be able to half-handle the problem.
>>
>>The problem requires two abilities:
>>
>>1.	To be able to return the number of times the word appears in all
>>the documents (which it looks like lucene can do through IndexReader)
>>2.	To be able to return the number of word co-occurrences within
>>the document set (ie. How many times does "computer" appear within 50
>>words of  "dog")
>>
>>Is the second point possible?
>
>You can use the standard query parser with a query like this:
>"dog computer"~50
>This query is not completely symmetric in the distance computation:
>when computer occurs before dog, the allowed distance is 49, iirc.
>
>There is also a SpanNearQuery for more generalized and flexible
>distance queries, but this is not supported by the query parser,
>so you'll have to construct these queries in your own program code.
>
>In case you have non-standard retrieval requirements, e.g. you only
>need the number of hits and no further information from the matching
>documents, you may consider using your own HitCollector on the
>lower level search methods.
>
>Regards,
>Paul Elschot

Re: Word co-occurrences counts

Posted by Paul Elschot <pa...@xs4all.nl>.

On Thursday 23 December 2004 07:50, Andrew.Cunningham@csiro.au wrote:
> Hi all,
> 
> I have a curious problem, and initial poking around with Lucene looks
> like it may only be able to half-handle the problem.
>
> The problem requires two abilities:
> 
> 1.	To be able to return the number of times the word appears in all
> the documents (which it looks like lucene can do through IndexReader) 
> 2.	To be able to return the number of word co-occurrences within
> the document set (ie. How many times does "computer" appear within 50
> words of  "dog") 
>
> Is the second point possible?

You can use the standard query parser with a query like this:
"dog computer"~50
This query is not completely symmetric in the distance computation:
when computer occurs before dog, the allowed distance is 49, iirc.

There is also a SpanNearQuery for more generalized and flexible
distance queries, but this is not supported by the query parser,
so you'll have to construct these queries in your own program code.

In case you have non-standard retrieval requirements, e.g. you only
need the number of hits and no further information from the matching
documents, you may consider using your own HitCollector on the
lower level search methods.

Regards,
Paul Elschot

Re: Word co-occurrences counts

Posted by Andrew Cunningham <cu...@csiro.au>.

Ah, so is it possible to return the number of times a term appears?

Daniel Naber wrote:

>On Thursday 23 December 2004 07:50, Andrew.Cunningham@csiro.au wrote:
>
>>1.      To be able to return the number of times the word appears in all
>>the documents (which it looks like lucene can do through IndexReader)
>
>If you're referring to docFreq(Term t), that will only return the number 
>of documents that contain the term, ignoring how often the term occurs in 
>these documents.
>
>Regards
> Daniel

Re: Word co-occurrences counts

Posted by Daniel Naber <da...@t-online.de>.

On Thursday 23 December 2004 07:50, Andrew.Cunningham@csiro.au wrote:

> 1.      To be able to return the number of times the word appears in all
> the documents (which it looks like lucene can do through IndexReader)

If you're referring to docFreq(Term t), that will only return the number 
of documents that contain the term, ignoring how often the term occurs in 
these documents.
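
The distinction can be shown on a tiny in-memory corpus. The sketch below is plain Java and does not touch a Lucene index: TermCounts and both method names are made up for the example, and each document is just an array of tokens. In Lucene itself, I believe the per-document count is available through TermDocs.freq() while iterating IndexReader.termDocs(term), but that is not shown here.

```java
// Toy illustration of document frequency versus total occurrences (not Lucene code).
class TermCounts {
    // Number of documents that contain the term at least once.
    static int docFreq(String[][] docs, String term) {
        int n = 0;
        for (String[] doc : docs) {
            for (String token : doc) {
                if (token.equals(term)) {
                    n++;
                    break; // count each document only once
                }
            }
        }
        return n;
    }

    // Total number of times the term occurs across all documents.
    static int totalOccurrences(String[][] docs, String term) {
        int n = 0;
        for (String[] doc : docs) {
            for (String token : doc) {
                if (token.equals(term)) {
                    n++;
                }
            }
        }
        return n;
    }
}
```

For the three documents "dog dog cat", "dog" and "cat", the document frequency of dog is 2 while its total occurrence count is 3.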

Regards
 Daniel

-- 
http://www.danielnaber.de

Re: Word co-occurrences counts

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Dec 23, 2004, at 1:50 AM, <An...@csiro.au> wrote:
> 2.	To be able to return the number of word co-occurrences within
> the document set (ie. How many times does "computer" appear within 50
> words of  "dog")
>
>
>
> Is the second point possible?

SpanNearQuery is your friend!  Like Paul said, this is not currently 
supported by QueryParser, however it is easy to do with the API.

Here's an example with a SpanOrQuery (a SpanNearQuery works 
identically) from the Lucene in Action code SpanQueryTest.java.  Two 
documents are indexed:

         "the quick brown fox jumps over the lazy dog"

         "the quick red fox jumps over the sleepy cat"

This SpanOrQuery is formed (omitting some code details):

     SpanOrQuery or = new SpanOrQuery(new SpanQuery[]{quick, fox});

And the spans are displayed:

spanOr([f:quick, f:fox]):
    the <quick> brown fox jumps over the lazy dog (0.37158427)
    the quick brown <fox> jumps over the lazy dog (0.37158427)
    the <quick> red fox jumps over the sleepy cat (0.37158427)
    the quick red <fox> jumps over the sleepy cat (0.37158427)
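
The bracketed display above can be reproduced from a token array plus a span's start and end positions. Here is a minimal sketch in plain Java, assuming an inclusive start and an exclusive end with start < end; SpanHighlighter and highlight are made-up names, not part of Lucene or of the SpanQueryTest code.

```java
// Toy helper that marks one span inside a tokenized sentence (not Lucene code).
class SpanHighlighter {
    // Wraps tokens[start..end) in angle brackets; `end` is exclusive.
    static String highlight(String[] tokens, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            if (i == start) {
                sb.append('<');
            }
            sb.append(tokens[i]);
            if (i + 1 == end) {
                sb.append('>');
            }
        }
        return sb.toString();
    }
}
```

Calling highlight on the first sentence with start 1 and end 2 marks quick, and with start 3 and end 4 marks fox, matching the first two lines of the output above (without the scores).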

Erik

