Posted to java-user@lucene.apache.org by Chris Sibert <ch...@attbi.com> on 2002/06/12 08:26:58 UTC

Creating indexes

I have a big (40 MB or so) file to index. The file contains a whole bunch
of documents, each pretty small, a few typewritten pages long. Each
document has a title, date, and author in addition to its actual text.

I'm not quite sure how to index this in Lucene. For each document in the
original file, I assume I create a separate Lucene Document object in
the index with author, date, title, and text fields. If so, my question is:
when I'm reading the original file for indexing, does Lucene know
where each document begins and ends in that file? Or do I have to
write a parser or filter of some kind for the InputStream that reads the
file?

Chris Sibert



--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


RE: Creating indexes

Posted by "Nader S. Henein" <ns...@bayt.net>.
That depends on how the file is structured, but I would guess not.
I had to write my own XML parser; you get better results when
you customize something like that to your needs.
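
To make the "write your own parser" step concrete, here is a minimal sketch
of splitting one big file into per-document records before indexing. The
thread never describes the file's layout, so the format is purely an
assumption: "Name: value" header lines, a blank line, the body text, and a
"%%" line between records. The class name RecordSplitter and the "%%"
separator are hypothetical. Each resulting map would then become one Lucene
Document with one field per key; the exact Field calls depend on the Lucene
version, so they are left out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical record layout (an assumption, not taken from the thread):
//   Title: ...
//   Author: ...
//   Date: ...
//   <blank line>
//   body text, any number of lines
//   %%          <- separator line between records
class RecordSplitter {
    static final String SEPARATOR = "%%";

    // Reads the whole stream and returns one map per record, with the
    // lower-cased header names plus a "text" key holding the body.
    static List<Map<String, String>> parse(BufferedReader in) throws IOException {
        List<Map<String, String>> records = new ArrayList<>();
        Map<String, String> current = new LinkedHashMap<>();
        StringBuilder body = new StringBuilder();
        boolean inBody = false;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.equals(SEPARATOR)) {
                // End of one record: store it and start a fresh one.
                flush(records, current, body);
                current = new LinkedHashMap<>();
                body = new StringBuilder();
                inBody = false;
            } else if (inBody) {
                body.append(line).append('\n');
            } else if (line.isEmpty()) {
                // The first blank line after at least one header starts the body.
                if (!current.isEmpty()) {
                    inBody = true;
                }
            } else {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    String name = line.substring(0, colon).trim().toLowerCase();
                    current.put(name, line.substring(colon + 1).trim());
                }
            }
        }
        flush(records, current, body); // last record has no trailing separator
        return records;
    }

    private static void flush(List<Map<String, String>> records,
                              Map<String, String> current, StringBuilder body) {
        if (current.isEmpty() && body.length() == 0) {
            return; // nothing collected, e.g. a separator at end of file
        }
        current.put("text", body.toString().trim());
        records.add(current);
    }

    public static void main(String[] args) throws IOException {
        String sample = "Title: First\nAuthor: A. Writer\nDate: 2002-06-12\n\n"
                + "Body of the first document.\n%%\n"
                + "Title: Second\nAuthor: B. Writer\nDate: 2002-06-13\n\n"
                + "Body of the second document.\n";
        for (Map<String, String> rec : parse(new BufferedReader(new StringReader(sample)))) {
            // This is where one Lucene Document per record would be built
            // and handed to an IndexWriter.
            System.out.println(rec.get("title") + " by " + rec.get("author"));
        }
    }
}
```

For a real 40 MB file you would wrap a FileReader in the BufferedReader and
probably add each record to the index as soon as it is complete, rather than
collecting them all in a list first.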



