You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Liaqat Ali <li...@gmail.com> on 2007/12/25 22:02:14 UTC
problem in indexing documents
hello,
I am try to make an index of 191 documents stored in 191 text files. I
developed a program, which works well with files containing single line,
but files with multiple lines posing a problem.So i added while loop to
completely extract data from each document. But it has some logical
error. Well the given code is an right approach to my problem? Kindly
give some guidelines.
StringBuffer sb = new StringBuffer();
Analyzer analyzer = new StandardAnalyzer();
boolean createFlag = true;
IndexWriter writer =
new IndexWriter(indexDir, analyzer, createFlag);
for (int i=1;i<=191;i++) {
Reader file = new InputStreamReader(new
FileInputStream("corpus\\doc" + i + ".txt"), "UTF-8");
BufferedReader buff = new BufferedReader(file);
while( (line = buff.readLine()) != null) {
sb.append(line);
}
Document document = new Document();
document.add(new Field("contents",sb.toString(),
Field.Store.NO, Field.Index.TOKENIZED));
writer.addDocument(document);
buff.close();
}
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: problem in indexing documents
Posted by Doron Cohen <cd...@gmail.com>.
>
> > document.add(new Field("contents",sb.toString(),
> > Field.Store.NO, Field.Index.TOKENIZED));
>
In addition, for tokenized but not stored like here, the Field()
constructor that takes a Reader param can be handy here.
Regards, Doron
Re: problem in indexing documents
Posted by Erick Erickson <er...@gmail.com>.
It's more helpful to indicate what error you're receiving or
what your perceived problem is. Without that, we can only
guess...
But one thing wrong is that you keep appending to the
same StringBuffer forever, so your first writer.AddDocument
adds document 1. Your second adds the text of BOTH doc
1 and doc 2. Your third AddDocument adds the contents of
docs 1, 2, 3. Etc.....
Just move the declaration for sb inside your for loop....
Note two things:
1> you can get the same effect by doing a
document.add() for *each* line. That is,
instead of the sb.append line, just do a
document.add() for line. Then do your
writer.addDocument outside the while loop as
you do now.
2> there is a built-in default maximum of 10,000 terms that get
indexed. You can change this, see SetMaxFieldLength on
IndexWriter.
Best
Erick
On Dec 25, 2007 4:02 PM, Liaqat Ali <li...@gmail.com> wrote:
> hello,
>
> I am try to make an index of 191 documents stored in 191 text files. I
> developed a program, which works well with files containing single line,
> but files with multiple lines posing a problem.So i added while loop to
> completely extract data from each document. But it has some logical
> error. Well the given code is an right approach to my problem? Kindly
> give some guidelines.
>
>
>
>
> StringBuffer sb = new StringBuffer();
>
> Analyzer analyzer = new StandardAnalyzer();
> boolean createFlag = true;
> IndexWriter writer =
> new IndexWriter(indexDir, analyzer, createFlag);
>
> for (int i=1;i<=191;i++) {
>
> Reader file = new InputStreamReader(new
> FileInputStream("corpus\\doc" + i + ".txt"), "UTF-8");
>
>
> BufferedReader buff = new BufferedReader(file);
>
> while( (line = buff.readLine()) != null) {
> sb.append(line);
> }
>
> Document document = new Document();
> document.add(new Field("contents",sb.toString(),
> Field.Store.NO, Field.Index.TOKENIZED));
> writer.addDocument(document);
>
> buff.close();
>
> }
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>