Posted to java-user@lucene.apache.org by starz10de <fa...@yahoo.com> on 2008/07/22 20:53:09 UTC

storing the contents of a document in the lucene index

Could anyone please tell me how to print the content of a document after
reading the index?
For example, to print the index terms I do:

IndexReader ir = IndexReader.open(index);
TermEnum termEnum = ir.terms();
while (termEnum.next()) {
    TermDocs dok = ir.termDocs();
    dok.seek(termEnum);
    // prints the term once for every document that contains it
    while (dok.next()) {
        System.out.println(termEnum.term().text().trim());
    }
    dok.close();
}
termEnum.close();
ir.close();

I can print the text files before indexing them, but because of encoding
issues I would like to print them from the index.
As far as I know, the content of the document (the whole text) is also stored
in the index; my question is how to print this content.

In the end I want to print the path of the current document, its index terms,
and the content of the document.


Thanks in advance
-- 
View this message in context: http://www.nabble.com/storing-the-contents-of-a-document-in-the--lucene-index-tp18595855p18595855.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: storing the contents of a document in the lucene index

Posted by Erick Erickson <er...@gmail.com>.

<<<As i know the content of the document(whole text) is also stored in the
index, my question how to print this content.>>>

This is not strictly true. For instance, stop words removed by the
analyzer aren't even indexed. Reconstructing a document from the index
is very expensive (see Luke for examples of how this is done).

You can get the text back verbatim if you store it in your index. See
Field.Store.YES (or Field.Store.COMPRESS). Storage is orthogonal
to indexing, so you can index the tokens in a field but not store them,
store them but not index them, or do both. Not storing and not indexing
is, I guess, theoretically possible but I sure can't see why you'd try it
<G>.

But if you store the field, you can get it back very easily with
Document.get("field").
Storing the fields will make your index larger, but I don't think it
should have much effect on your search times.

Best
Erick


Re: storing the contents of a document in the lucene index

Posted by Erick Erickson <er...@gmail.com>.

I thought of one more thing you should be aware of. The
default maximum field length for any field (no matter which of the
two forms you use) is 10,000 tokens; tokens past that limit are
not indexed.

This can be easily changed, see
IndexWriter.setMaxFieldLength().
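
Whether a given file is likely to hit that limit can be checked roughly
before indexing. The sketch below is not Lucene code: it assumes tokens
are separated by whitespace (the real count depends on the analyzer),
and the class and method names are made up for illustration. The 10,000
figure is the default quoted above.

```java
final class TokenEstimate {
    // Default limit quoted above; hard-coded here, not read from Lucene.
    static final int DEFAULT_MAX_FIELD_LENGTH = 10000;

    /** Counts maximal runs of non-whitespace characters in the text. */
    static int countWhitespaceTokens(String text) {
        int count = 0;
        boolean inToken = false;
        for (int i = 0; i < text.length(); i++) {
            boolean ws = Character.isWhitespace(text.charAt(i));
            if (!ws && !inToken) {
                count++; // first character of a new token
            }
            inToken = !ws;
        }
        return count;
    }

    /** True if the rough token count exceeds the default limit. */
    static boolean mayBeTruncated(String text) {
        return countWhitespaceTokens(text) > DEFAULT_MAX_FIELD_LENGTH;
    }
}
```

If mayBeTruncated returns true for some of the files, raising the limit
with IndexWriter.setMaxFieldLength() before adding documents is worth
considering.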

Best
Erick

On Thu, Jul 24, 2008 at 9:25 AM, starz10de <fa...@yahoo.com> wrote:

>
> Dear Erick,
>
> Thanks for your answer. I tried another way, where I read the text files
> before I index them. I will also try your solution here.
>
> Best regards
>

Re: storing the contents of a document in the lucene index

Posted by starz10de <fa...@yahoo.com>.

Dear Erick,

Thanks for your answer. I tried another way, where I read the text files
before I index them. I will also try your solution here.

Best regards



Re: storing the contents of a document in the lucene index

Posted by Erick Erickson <er...@gmail.com>.

OK, I'm finally catching on. You have to change the demo code to
get the contents into something besides an input stream, so you
can use one of the alternate forms of the Field constructor. For
instance, you could read it all into a string and use the form:

// fileContents is a String holding all of the file's text
doc.add(new Field("content", fileContents,
               Field.Store.YES, Field.Index.TOKENIZED));


Or, you can do something like this, which indexes the same tokens
as the above:

// reader is a BufferedReader opened on the file
String line;
while ((line = reader.readLine()) != null) {
     doc.add(new Field("content", line, Field.Store.YES,
                       Field.Index.TOKENIZED));
}

You can add to the same field as often as you want; for searching, the
tokens from calls 2 to N are simply appended to the same field. Each call
is stored as a separate value, though, so Document.get("content") returns
only the first one; use Document.getValues("content") to get them all back.
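
For the read-it-all-into-a-string option, here is a minimal sketch using
only the JDK. It assumes the encoding problems come from decoding the
files with the platform default charset, so the charset is passed in
explicitly. FileText and readAll are hypothetical names, not part of
Lucene or its demo code.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

final class FileText {
    /** Reads the whole file into one String, decoding with the named charset. */
    static String readAll(File f, String charsetName) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(f), charsetName));
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            // read() returns -1 at end of stream
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } finally {
            in.close();
        }
    }
}
```

The result, for example FileText.readAll(f, "UTF-8") with whatever
charset the corpus files really use, is the String that would be passed
to the Field constructor.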


Best
Erick


On Wed, Jul 23, 2008 at 3:42 AM, starz10de <fa...@yahoo.com> wrote:

>
> Hi Erick,
>
> I don't remove the stop words, as I index parallel corpora that are used
> for learning translations between pairs of languages, so every word is
> important. I even developed my own analyzer for Arabic, which just removes
> punctuation and special symbols and returns only Arabic text.
>
> I guess that in FileDocument.java the whole text is already stored:
>
> doc.add(Field.Text("contents", IN));
>
> where IN is
>
> IN = new BufferedReader(new InputStreamReader(new FileInputStream(f)));
>
> If this is not the case, could you please tell me how to store the whole
> text inside the index?
>
> I am new to Lucene and I don't know how to use this "Field.Store.YES" to
> store the whole text.
>
>
>
> Best regards
> Farag

Re: storing the contents of a document in the lucene index

Posted by starz10de <fa...@yahoo.com>.

Hi Erick,

I don't remove the stop words, as I index parallel corpora that are used
for learning translations between pairs of languages, so every word is
important. I even developed my own analyzer for Arabic, which just removes
punctuation and special symbols and returns only Arabic text.

I guess that in FileDocument.java the whole text is already stored:

doc.add(Field.Text("contents", IN));

where IN is

IN = new BufferedReader(new InputStreamReader(new FileInputStream(f)));

If this is not the case, could you please tell me how to store the whole
text inside the index?

I am new to Lucene and I don't know how to use this "Field.Store.YES" to
store the whole text.

 

Best regards
Farag


