You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Sahin Buyrukbilen <sa...@gmail.com> on 2010/09/21 18:11:50 UTC

How to export lucene index to a simple text file?

Hi,

I am currently working on a project about private information retrieval and
I need to have an inverted index file in txt format as follows:

Term t    freq t      Inverted list for t
-------------------------------------------------------------------------
and          1          <6, 0.159>
big           2          <2, 0.148> <3, 0.088>
dark         1          <6, 0.079>
.
.
.
.

here the <number1, number2> pairs are indicating: number1: doc ID, where
term t exist with a rank of number2.

I have created an index from 5492 txt files, however the index is composed
of different files and most of the data is not in the text format.

could somebody guide me to achieve this?

Thank you

Sahin.

Re: How to export lucene index to a simple text file?

Posted by Adriano Crestani <ad...@gmail.com>.
> Saving the index in text format would also be a fun codec (in 4.0) to create :)

A codec like that would be welcome :)

On Wed, Sep 22, 2010 at 5:31 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> Saving the index in text format would also be a fun codec (in 4.0) to create :)
>
> Ie, the codec would be read/write.  The performance wouldn't be great,
> but it'd be neat for debugging, teaching, transparency purposes...
>
> Mike
>
> On Tue, Sep 21, 2010 at 9:26 PM, Lance Norskog <go...@gmail.com> wrote:
>> The Lucene CheckIndex program opens an index and walks all of the data
>> structures. It is a good start for you.
>>
>> Sahin Buyrukbilen wrote:
>>>
>>> Thank you Uwe, I will read the docs and try to do it, however do you have
>>> an
>>> example code? I need because I am not very familiar with Java.
>>>
>>> Thank you.
>>>
>>> Sahin
>>>
>>> On Tue, Sep 21, 2010 at 12:29 PM, Uwe Schindler<uw...@thetaphi.de>  wrote:
>>>
>>>
>>>>
>>>> Hi,
>>>>
>>>> Retrieve a TermEnum and iterate it. By that you get all terms and can
>>>> retrieve the docFreq, which is the second column in your table. Finally
>>>> for
>>>> each term you position the TermDocs enum on this term to get all document
>>>> ids. Read docs of IndexReader/TermEnum/TermDocs about this.
>>>>
>>>> Uwe
>>>>
>>>> -----
>>>> Uwe Schindler
>>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>>> http://www.thetaphi.de
>>>> eMail: uwe@thetaphi.de
>>>>
>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Sahin Buyrukbilen [mailto:sahin.buyrukbilen@gmail.com]
>>>>> Sent: Tuesday, September 21, 2010 9:12 AM
>>>>> To: java-user@lucene.apache.org
>>>>> Subject: How to export lucene index to a simple text file?
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am currently working on a project about private information retrieval
>>>>>
>>>>
>>>> and I
>>>>
>>>>>
>>>>> need to have an inverted index file in txt format as follows:
>>>>>
>>>>> Term t    freq t      Inverted list for t
>>>>>
>>>>> -------------------------------------------------------------------------
>>>>> and          1<6, 0.159>
>>>>> big           2<2, 0.148>  <3, 0.088>
>>>>> dark         1<6, 0.079>
>>>>> .
>>>>> .
>>>>> .
>>>>> .
>>>>>
>>>>> here the<number1, number2>  pairs are indicating: number1: doc ID, where
>>>>> term t exist with a rank of number2.
>>>>>
>>>>> I have created an index from 5492 txt files, however the index is
>>>>>
>>>>
>>>> composed
>>>> of
>>>>
>>>>>
>>>>> different files and most of the data is not in the text format.
>>>>>
>>>>> could somebody guide me to achieve this?
>>>>>
>>>>> Thank you
>>>>>
>>>>> Sahin.
>>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>>
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: How to export lucene index to a simple text file?

Posted by Michael McCandless <lu...@mikemccandless.com>.
Saving the index in text format would also be a fun codec (in 4.0) to create :)

Ie, the codec would be read/write.  The performance wouldn't be great,
but it'd be neat for debugging, teaching, transparency purposes...

Mike

On Tue, Sep 21, 2010 at 9:26 PM, Lance Norskog <go...@gmail.com> wrote:
> The Lucene CheckIndex program opens an index and walks all of the data
> structures. It is a good start for you.
>
> Sahin Buyrukbilen wrote:
>>
>> Thank you Uwe, I will read the docs and try to do it, however do you have
>> an
>> example code? I need because I am not very familiar with Java.
>>
>> Thank you.
>>
>> Sahin
>>
>> On Tue, Sep 21, 2010 at 12:29 PM, Uwe Schindler<uw...@thetaphi.de>  wrote:
>>
>>
>>>
>>> Hi,
>>>
>>> Retrieve a TermEnum and iterate it. By that you get all terms and can
>>> retrieve the docFreq, which is the second column in your table. Finally
>>> for
>>> each term you position the TermDocs enum on this term to get all document
>>> ids. Read docs of IndexReader/TermEnum/TermDocs about this.
>>>
>>> Uwe
>>>
>>> -----
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: uwe@thetaphi.de
>>>
>>>
>>>>
>>>> -----Original Message-----
>>>> From: Sahin Buyrukbilen [mailto:sahin.buyrukbilen@gmail.com]
>>>> Sent: Tuesday, September 21, 2010 9:12 AM
>>>> To: java-user@lucene.apache.org
>>>> Subject: How to export lucene index to a simple text file?
>>>>
>>>> Hi,
>>>>
>>>> I am currently working on a project about private information retrieval
>>>>
>>>
>>> and I
>>>
>>>>
>>>> need to have an inverted index file in txt format as follows:
>>>>
>>>> Term t    freq t      Inverted list for t
>>>>
>>>> -------------------------------------------------------------------------
>>>> and          1<6, 0.159>
>>>> big           2<2, 0.148>  <3, 0.088>
>>>> dark         1<6, 0.079>
>>>> .
>>>> .
>>>> .
>>>> .
>>>>
>>>> here the<number1, number2>  pairs are indicating: number1: doc ID, where
>>>> term t exist with a rank of number2.
>>>>
>>>> I have created an index from 5492 txt files, however the index is
>>>>
>>>
>>> composed
>>> of
>>>
>>>>
>>>> different files and most of the data is not in the text format.
>>>>
>>>> could somebody guide me to achieve this?
>>>>
>>>> Thank you
>>>>
>>>> Sahin.
>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: How to export lucene index to a simple text file?

Posted by Lance Norskog <go...@gmail.com>.
The Lucene CheckIndex program opens an index and walks all of the data 
structures. It is a good start for you.

Sahin Buyrukbilen wrote:
> Thank you Uwe, I will read the docs and try to do it, however do you have an
> example code? I need because I am not very familiar with Java.
>
> Thank you.
>
> Sahin
>
> On Tue, Sep 21, 2010 at 12:29 PM, Uwe Schindler<uw...@thetaphi.de>  wrote:
>
>    
>> Hi,
>>
>> Retrieve a TermEnum and iterate it. By that you get all terms and can
>> retrieve the docFreq, which is the second column in your table. Finally for
>> each term you position the TermDocs enum on this term to get all document
>> ids. Read docs of IndexReader/TermEnum/TermDocs about this.
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe@thetaphi.de
>>
>>      
>>> -----Original Message-----
>>> From: Sahin Buyrukbilen [mailto:sahin.buyrukbilen@gmail.com]
>>> Sent: Tuesday, September 21, 2010 9:12 AM
>>> To: java-user@lucene.apache.org
>>> Subject: How to export lucene index to a simple text file?
>>>
>>> Hi,
>>>
>>> I am currently working on a project about private information retrieval
>>>        
>> and I
>>      
>>> need to have an inverted index file in txt format as follows:
>>>
>>> Term t    freq t      Inverted list for t
>>> -------------------------------------------------------------------------
>>> and          1<6, 0.159>
>>> big           2<2, 0.148>  <3, 0.088>
>>> dark         1<6, 0.079>
>>> .
>>> .
>>> .
>>> .
>>>
>>> here the<number1, number2>  pairs are indicating: number1: doc ID, where
>>> term t exist with a rank of number2.
>>>
>>> I have created an index from 5492 txt files, however the index is
>>>        
>> composed
>> of
>>      
>>> different files and most of the data is not in the text format.
>>>
>>> could somebody guide me to achieve this?
>>>
>>> Thank you
>>>
>>> Sahin.
>>>        
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>      
>    

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: How to export lucene index to a simple text file?

Posted by Sahin Buyrukbilen <sa...@gmail.com>.
Thank you Uwe, I will read the docs and try to do it, however do you have an
example code? I need because I am not very familiar with Java.

Thank you.

Sahin

On Tue, Sep 21, 2010 at 12:29 PM, Uwe Schindler <uw...@thetaphi.de> wrote:

> Hi,
>
> Retrieve a TermEnum and iterate it. By that you get all terms and can
> retrieve the docFreq, which is the second column in your table. Finally for
> each term you position the TermDocs enum on this term to get all document
> ids. Read docs of IndexReader/TermEnum/TermDocs about this.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
> > -----Original Message-----
> > From: Sahin Buyrukbilen [mailto:sahin.buyrukbilen@gmail.com]
> > Sent: Tuesday, September 21, 2010 9:12 AM
> > To: java-user@lucene.apache.org
> > Subject: How to export lucene index to a simple text file?
> >
> > Hi,
> >
> > I am currently working on a project about private information retrieval
> and I
> > need to have an inverted index file in txt format as follows:
> >
> > Term t    freq t      Inverted list for t
> > -------------------------------------------------------------------------
> > and          1          <6, 0.159>
> > big           2          <2, 0.148> <3, 0.088>
> > dark         1          <6, 0.079>
> > .
> > .
> > .
> > .
> >
> > here the <number1, number2> pairs are indicating: number1: doc ID, where
> > term t exist with a rank of number2.
> >
> > I have created an index from 5492 txt files, however the index is
> composed
> of
> > different files and most of the data is not in the text format.
> >
> > could somebody guide me to achieve this?
> >
> > Thank you
> >
> > Sahin.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

RE: How to export lucene index to a simple text file?

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

Retrieve a TermEnum and iterate it. By that you get all terms and can
retrieve the docFreq, which is the second column in your table. Finally for
each term you position the TermDocs enum on this term to get all document
ids. Read docs of IndexReader/TermEnum/TermDocs about this.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Sahin Buyrukbilen [mailto:sahin.buyrukbilen@gmail.com]
> Sent: Tuesday, September 21, 2010 9:12 AM
> To: java-user@lucene.apache.org
> Subject: How to export lucene index to a simple text file?
> 
> Hi,
> 
> I am currently working on a project about private information retrieval
and I
> need to have an inverted index file in txt format as follows:
> 
> Term t    freq t      Inverted list for t
> -------------------------------------------------------------------------
> and          1          <6, 0.159>
> big           2          <2, 0.148> <3, 0.088>
> dark         1          <6, 0.079>
> .
> .
> .
> .
> 
> here the <number1, number2> pairs are indicating: number1: doc ID, where
> term t exist with a rank of number2.
> 
> I have created an index from 5492 txt files, however the index is composed
of
> different files and most of the data is not in the text format.
> 
> could somebody guide me to achieve this?
> 
> Thank you
> 
> Sahin.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org