Posted to java-user@lucene.apache.org by Sven <sv...@gmail.com> on 2008/11/13 17:08:47 UTC

constructing a mini-index with just the number of hits for a term

First - I apologize for the double post of my earlier email.  The first 
time I sent it I received an error message from support@magentanews.com 
saying that I should instead send email to java-user@meltwater.com, so I 
thought it had not gone through. 

My question is this - is there a way to use the Lucene/Solr 
infrastructure to create a mini-index that simply contains a lookup 
table of terms and the number of times they have appeared? 

I do not need to record which documents have them nor do I need to know 
where in the documents they appear.  There could be (and probably will 
be) more than 2^32 terms, however.  I'm not sure if that makes a 
difference to the Lucene backend, but thought it might be relevant. 

This question coincides with my earlier question about counting the 
times a given term is associated with another term.  I figure that this 
would be more easily accomplished by making the mini-index described 
above alongside the normal index when a document is indexed.  For 
example, when scanning:

Bravely bold Sir Robin, brought forth from Camelot.  He was not afraid 
to die!  Oh, brave Sir Robin!

In addition to the normal indexing function of Lucene, I would like to 
write something on the backend to also index:

bravely|bold
bravely|sir
bravely|robin
bravely|brought
bravely|forth
bold|sir
bold|robin
bold|brought
bold|forth
bold|camelot  ("from" being a stop word)
...and so on
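A minimal sketch of the pair generation I have in mind (the window size, the tokenization, and the stop list here are illustrative only - a real analyzer would supply its own):

```java
import java.util.*;

public class PairCounter {
    // Illustrative stop list; a real analyzer would supply its own.
    static final Set<String> STOP = new HashSet<>(Arrays.asList(
        "from", "he", "was", "not", "to", "oh"));

    // Count co-occurring term pairs within a sliding window,
    // keeping only a running total per pair.
    static Map<String, Integer> countPairs(String text, int window) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.toLowerCase().split("[^a-z]+")) {
            if (!raw.isEmpty() && !STOP.contains(raw)) terms.add(raw);
        }
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < terms.size(); i++) {
            for (int j = i + 1; j < terms.size() && j <= i + window; j++) {
                counts.merge(terms.get(i) + "|" + terms.get(j), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = countPairs(
            "Bravely bold Sir Robin, brought forth from Camelot.", 5);
        System.out.println(c.get("bravely|bold"));  // 1
        System.out.println(c.get("bold|camelot"));  // 1 ("from" is dropped)
    }
}
```

With a window of 5 this reproduces exactly the pair list above.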

I only need to keep a running total of each "bravely|bold" term, 
however, since the number of terms will be quite large and keeping track 
of the document/termpositions would translate to a lot of wasted HD space. 

If such a thing is not already in place, could someone let me know if 
there are some tutorials, documentation, or presentations that describe 
the inner workings of Lucene and the theories/implementation at work for 
the actual file formats, structures, data manipulations, etc?  (The 
javadocs don't go into this kind of detail.)  I'm sure I can sift 
through the code and eventually make sense of it, but if there is 
documentation out there, I'd prefer to peruse that first.  My thought 
is that I could simply generate my own kind of hash for each combined 
term and write it out to a custom file structure similar to Lucene's - 
but the specifics of how to do so (optimally) are not yet plain to me. 
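For example, the fixed-width key per combined term could come from something like a 64-bit FNV-1a hash (just a sketch of the idea - collision handling and the on-disk record layout are left open):

```java
import java.nio.charset.StandardCharsets;

public class TermHash {
    // FNV-1a 64-bit hash: a simple fixed-width key for a custom on-disk
    // lookup table of (key, running count) records.
    static long fnv1a64(String s) {
        long h = 0xcbf29ce484222325L;               // FNV offset basis
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xffL);
            h *= 0x100000001b3L;                    // FNV prime
        }
        return h;
    }

    public static void main(String[] args) {
        // Each pair term maps to a fixed 8-byte key.
        System.out.printf("%016x%n", fnv1a64("bravely|bold"));
    }
}
```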

Thanks again!
-Sven


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: constructing a mini-index with just the number of hits for a term

Posted by Michael McCandless <lu...@mikemccandless.com>.
Flexible indexing (LUCENE-1458) should make this possible.

I.e., you could use your own codec that, during indexing (for this one 
field), discards the doc/freq/prox/payload data and simply stores the 
term frequency in the terms dict.  However, one problem will be 
deletions (in case that matters to your app): in order to properly 
update the terms dict counts, SegmentMerger walks through the docIDs 
for the term and skips the deleted ones.

But it will be some time before this is real, though there's an 
initial patch on LUCENE-1458.

Mike





Re: constructing a mini-index with just the number of hits for a term

Posted by Grant Ingersoll <gs...@apache.org>.
Can you share what the actual problem is that you are trying to  
solve?  It might help put things in context for me.  I'm guessing you  
are doing some type of co-occurrence analysis, but...

More below.

On Nov 13, 2008, at 11:08 AM, Sven wrote:

> First - I apologize for the double post on my earlier email.  The  
> first time I sent it I received an error message from support@magentanews.com 
>  saying that I should instead send email to java-user@meltwater.com  
> so I thought it did not go through.
> My question is this - is there a way to use the Lucene/Solr  
> infrastructure to create a mini-index that simply contains a lookup  
> table of terms and the number of times they have appeared?

This could be possible.  I think I would create documents with 
Index.ANALYZED and Store.NO.  Then you just need to use the TermEnum 
and TermDocs to access the information that you need.  In a sense, you 
are just creating the term dictionary.  You could also turn off 
storing of norms, which will save space too.
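One caveat worth noting: TermEnum.docFreq() counts documents containing the term, not total occurrences; for "the number of times they have appeared" you would sum TermDocs.freq() over the postings.  A pure-Java sketch of the distinction (the posting data below is made up):

```java
import java.util.*;

public class TermCounts {
    // docFreq: the number of documents containing the term,
    // i.e. what TermEnum.docFreq() reports.
    static int docFreq(Map<Integer, Integer> postings) {
        return postings.size();
    }

    // Total occurrences: the sum of within-document frequencies,
    // i.e. summing TermDocs.freq() while iterating the postings.
    static int totalOccurrences(Map<Integer, Integer> postings) {
        int total = 0;
        for (int f : postings.values()) total += f;
        return total;
    }

    public static void main(String[] args) {
        // Made-up postings for one term: docId -> within-doc frequency.
        Map<Integer, Integer> postings = new LinkedHashMap<>();
        postings.put(3, 2);   // term appears twice in doc 3
        postings.put(7, 1);
        postings.put(12, 4);
        System.out.println("docFreq=" + docFreq(postings)
            + " total=" + totalOccurrences(postings)); // docFreq=3 total=7
    }
}
```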

>
> I do not need to record which documents have them nor do I need to  
> know where in the documents they appear.  There could be (and  
> probably will be) more than 2^32 terms, however.

2^32 unique terms or 2^32 total terms?

>  I'm not sure if that makes a difference to the Lucene backend, but  
> thought it might be relevant.
> This question coincides with my earlier question about counting the  
> times a given term is associated with another term.  I figure that  
> this would be more easily accomplished by making the mini-index  
> described above alongside the normal index when a document is  
> indexed.  For example, when scanning:
>
> Bravely bold Sir Robin, brought forth from Camelot.  He was not  
> afraid to die!  Oh, brave Sir Robin!
>
> In addition to the normal indexing function of Lucene, I would like  
> to write something on the backend to also index:
>
> bravely|bold
> bravely|sir
> bravely|robin
> bravely|brought
> bravely|forth
> bold|sir
> bold|robin
> bold|brought
> bold|forth
> bold|camelot  ("from" being a stop word)
> ...and so on
>
> I only need to keep a running total of each "bravely|bold" term,  
> however, since the number of terms will be quite large and keeping  
> track of the document/termpositions would translate to a lot of  
> wasted HD space.

For this, I think you will have to hook into the Analyzer process.   
The other thing to do is just try keeping the document/term positions; 
it may not actually be as bad as you think in terms of space.
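As a rough back-of-envelope: Lucene VInt-encodes doc-ID deltas and frequencies, so small values cost one byte each (the real format is cleverer still when freq is 1).  A sketch of the per-posting cost, with illustrative numbers:

```java
public class PostingsSizeEstimate {
    // Bytes needed to VInt-encode a value: 7 bits of payload per byte,
    // as in Lucene's file format.
    static int vIntBytes(int v) {
        int n = 1;
        while ((v >>>= 7) != 0) n++;
        return n;
    }

    public static void main(String[] args) {
        // One posting is roughly a doc-ID delta plus a frequency.  With
        // small deltas and freqs, that's ~2 bytes, so a term appearing
        // once in each of a million docs is on the order of 2 MB, not
        // 8 MB of raw 32-bit ints.
        int perPosting = vIntBytes(100) + vIntBytes(1); // delta=100, freq=1
        System.out.println(perPosting + " bytes per posting"); // 2
    }
}
```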

>
> If such a thing is not already in place, could someone let me know  
> if there are some tutorials, documentation, or presentations that  
> describe the inner workings of Lucene and the theories/ 
> implementation at work for the actual file formats, structures, data  
> manipulations, etc?  (The javadocs don't go into this kind of  
> detail.)  I'm sure I can sift through the code and eventually make  
> sense of it, but if there is documentation out there, I'd prefer to  
> peruse that first.  My thought being that I can simply generate my  
> own kind of hash for each combined term and write it out to a custom  
> file structure similar to Lucene - but the specifics of how to  
> (optimally) do so are not plain to me yet.
> Thanks again!
> -Sven
>
>

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

