Posted to java-user@lucene.apache.org by "Burton-West, Tom" <tb...@umich.edu> on 2010/11/10 22:02:51 UTC

API access to in-memory tii file (3.x not flex).

Hello all,

We have an extremely large number of terms in our indexes.  I want to be able to extract a sample of the terms, say every 128th term.  If I use code based on org.apache.lucene.misc.HighFreqTerms or org.apache.lucene.index.CheckIndex, I would get a TermEnum, call termEnum.next() 128 times, grab the term, then call next() another 128 times, and so on:
TermEnum termEnum = reader.terms();
while (termEnum.next()) {
  // every 128th call, keep termEnum.term()
}

Since the tii file contains every 128th (or termIndexInterval-th) term and it is loaded into memory, is there some programmatic way (in the public API) to read that in-memory data structure, rather than having to force Lucene to read the entire tis file by stepping through it with termEnum.next()?
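
For concreteness, the brute-force version I have in mind is something like the sketch below (3.x TermEnum API; the hard-coded interval and the println are just placeholders for our real code):

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class TermSampler {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
    int interval = 128;  // matches our termIndexInterval
    try {
      TermEnum termEnum = reader.terms();
      int count = 0;
      try {
        while (termEnum.next()) {
          // keep only every 128th term
          if (count++ % interval == 0) {
            Term t = termEnum.term();
            System.out.println(t.field() + ":" + t.text());
          }
        }
      } finally {
        termEnum.close();
      }
    } finally {
      reader.close();
    }
  }
}

That works, but it walks the full tis file on disk, which is what I am hoping to avoid.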


Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search


Re: API access to in-memory tii file (3.x not flex).

Posted by Jason Rutherglen <ja...@gmail.com>.
In a word, no.  You'd need to customize the Lucene source to accomplish this.

On Wed, Nov 10, 2010 at 1:02 PM, Burton-West, Tom <tb...@umich.edu> wrote:
> Hello all,
>
> We have an extremely large number of terms in our indexes.  I want to be able to extract a sample of the terms, say every 128th term.  If I use code based on org.apache.lucene.misc.HighFreqTerms or org.apache.lucene.index.CheckIndex, I would get a TermEnum, call termEnum.next() 128 times, grab the term, then call next() another 128 times, and so on:
> TermEnum termEnum = reader.terms();
> while (termEnum.next()) {
>   // every 128th call, keep termEnum.term()
> }
>
> Since the tii file contains every 128th (or termIndexInterval-th) term and it is loaded into memory, is there some programmatic way (in the public API) to read that in-memory data structure, rather than having to force Lucene to read the entire tis file by stepping through it with termEnum.next()?
>
>
> Tom Burton-West
> http://www.hathitrust.org/blogs/large-scale-search
>
>


Re: API access to in-memory tii file (3.x not flex).

Posted by Jason Rutherglen <ja...@gmail.com>.
Yeah, that's customizing the Lucene source. :)  I should have gone into
more detail; I will next time.

On Wed, Nov 10, 2010 at 2:10 PM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> Actually, the .tii file pre-flex (3.x) is nearly identical to the .tis
> file, just that it only contains every 128th term.
>
> If you just make SegmentTermEnum public (or sneak your class into the
> oal.index package), then you can instantiate a SegmentTermEnum, passing
> it an IndexInput opened on the .tii file.
>
> Then you can enumerate the terms directly...
>
> Mike
>
> On Wed, Nov 10, 2010 at 4:02 PM, Burton-West, Tom <tb...@umich.edu> wrote:
>> Hello all,
>>
>> We have an extremely large number of terms in our indexes.  I want to be able to extract a sample of the terms, say every 128th term.  If I use code based on org.apache.lucene.misc.HighFreqTerms or org.apache.lucene.index.CheckIndex, I would get a TermEnum, call termEnum.next() 128 times, grab the term, then call next() another 128 times, and so on:
>> TermEnum termEnum = reader.terms();
>> while (termEnum.next()) {
>>   // every 128th call, keep termEnum.term()
>> }
>>
>> Since the tii file contains every 128th (or termIndexInterval-th) term and it is loaded into memory, is there some programmatic way (in the public API) to read that in-memory data structure, rather than having to force Lucene to read the entire tis file by stepping through it with termEnum.next()?
>>
>>
>> Tom Burton-West
>> http://www.hathitrust.org/blogs/large-scale-search
>>
>>
>


Re: API access to in-memory tii file (3.x not flex).

Posted by Michael McCandless <lu...@mikemccandless.com>.
Actually, the .tii file pre-flex (3.x) is nearly identical to the .tis
file, just that it only contains every 128th term.

If you just make SegmentTermEnum public (or sneak your class into the
oal.index package), then you can instantiate a SegmentTermEnum, passing
it an IndexInput opened on the .tii file.

Then you can enumerate the terms directly...
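
Roughly like this (an untested sketch; the class name and args are made up, it assumes 3.x and a segment that is not packed into a compound .cfs file, and you should double-check the SegmentTermEnum constructor against your exact release):

package org.apache.lucene.index;  // so the package-private SegmentTermEnum is visible

import java.io.File;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IndexInput;

public class TiiTermDumper {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(new File(args[0]));
    String segment = args[1];  // segment name, i.e. the prefix of the .tii file

    // Per-segment field metadata, which SegmentTermEnum needs to decode terms.
    FieldInfos fieldInfos = new FieldInfos(dir, segment + ".fnm");

    // Open the term index file directly; the trailing 'true' tells
    // SegmentTermEnum it is reading the index (.tii) variant.
    IndexInput input = dir.openInput(segment + ".tii");
    SegmentTermEnum termEnum = new SegmentTermEnum(input, fieldInfos, true);
    try {
      while (termEnum.next()) {
        Term t = termEnum.term();
        System.out.println(t.field() + ":" + t.text());
      }
    } finally {
      termEnum.close();  // closes the underlying IndexInput too
    }
  }
}

You'd run that once per segment, since each segment has its own .tii.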

Mike

On Wed, Nov 10, 2010 at 4:02 PM, Burton-West, Tom <tb...@umich.edu> wrote:
> Hello all,
>
> We have an extremely large number of terms in our indexes.  I want to be able to extract a sample of the terms, say every 128th term.  If I use code based on org.apache.lucene.misc.HighFreqTerms or org.apache.lucene.index.CheckIndex, I would get a TermEnum, call termEnum.next() 128 times, grab the term, then call next() another 128 times, and so on:
> TermEnum termEnum = reader.terms();
> while (termEnum.next()) {
>   // every 128th call, keep termEnum.term()
> }
>
> Since the tii file contains every 128th (or termIndexInterval-th) term and it is loaded into memory, is there some programmatic way (in the public API) to read that in-memory data structure, rather than having to force Lucene to read the entire tis file by stepping through it with termEnum.next()?
>
>
> Tom Burton-West
> http://www.hathitrust.org/blogs/large-scale-search
>
>
