Posted to java-user@lucene.apache.org by Liaqat Ali <li...@gmail.com> on 2007/10/24 15:02:13 UTC

Corpus interpretation

I want to index an Urdu language corpus (200 documents in CES XML DTD
format). Is it necessary to break the XML file into 200 separate files,
or can it be indexed in its original form using Lucene? Kindly guide me
in this regard.



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Corpus interpretation

Posted by Steven Rowe <sa...@syr.edu>.

Hi Liaqat,

Liaqat Ali wrote:
> I want to index an Urdu language corpus (200 documents in CES XML DTD
> format). Is it necessary to break the XML file into 200 separate files,
> or can it be indexed in its original form using Lucene? Kindly guide me
> in this regard.

A Lucene document is composed of one or more fields.  You will choose
which fields each document will have.  In your initial implementation,
you may choose to extract all text from each document and place it in a
single indexed text field.

It is your responsibility to locate and open your input sources and
break them up or combine them to produce the document field data -
Lucene does not provide this functionality for you.

Whether you split the input file before indexing or as part of the
indexing process is your choice; either way, it is your responsibility,
not Lucene's.  The choice depends on the parsing library you use, the
size of the corpus, and the amount of memory available on the machine
that performs the indexing.  If the corpus is small, or you process the
source XML file with a parser that does not hold the entire contents in
memory (e.g. SAX), or the machine has plenty of memory, it should be
fine to construct document fields on the fly instead of first splitting
the original file.
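
As a rough illustration of the on-the-fly approach, here is a minimal
sketch that uses only the JDK's built-in SAX parser.  It streams one
corpus file and hands the concatenated text of each document element to
a callback, which is where you would build a Lucene document with a
single text field.  The class name CorpusSplitter, the method
forEachDocument, and the element name "cesDoc" are assumptions made for
illustration; check them against your actual CES markup.  The sketch
also assumes document elements are not nested inside one another.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: not part of Lucene.
public class CorpusSplitter {

    /**
     * Streams the XML in `in` and passes the text content of every
     * `docElement` element to `sink`, one document at a time.
     */
    public static void forEachDocument(InputStream in, String docElement,
                                       Consumer<String> sink) throws Exception {
        DefaultHandler handler = new DefaultHandler() {
            // Non-null only while the parser is inside a document element.
            private StringBuilder text;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (qName.equals(docElement)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals(docElement) && text != null) {
                    // One document is complete: hand it over and release it.
                    sink.accept(text.toString().trim());
                    text = null;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
    }

    public static void main(String[] args) throws Exception {
        // Tiny stand-in corpus; the element names are assumed, not taken from a real file.
        String xml = "<cesCorpus><cesDoc><text>first</text></cesDoc>"
                   + "<cesDoc><text>second</text></cesDoc></cesCorpus>";
        List<String> docs = new ArrayList<>();
        forEachDocument(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
            "cesDoc", docs::add);
        System.out.println(docs.size() + " documents extracted");
    }
}
```

Only one document's text is held in memory at a time, so the size of
the corpus file matters much less than it would with a DOM parser.  If
the file's DOCTYPE points to a DTD that is not available locally, the
parser will also need an entity resolver that can supply it.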

Steve

-- 
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp
