
Posted to java-user@lucene.apache.org by Liaqat Ali <li...@gmail.com> on 2007/12/25 22:02:14 UTC

problem in indexing documents

Hello,

I am trying to build an index of 191 documents stored in 191 text files. I
developed a program that works well with files containing a single line,
but files with multiple lines pose a problem. So I added a while loop to
read all the data from each document, but it has some logical
error. Is the given code the right approach to my problem? Kindly
give some guidelines.




StringBuffer sb = new StringBuffer();

Analyzer analyzer = new StandardAnalyzer();
boolean createFlag = true;
IndexWriter writer =
        new IndexWriter(indexDir, analyzer, createFlag);

for (int i = 1; i <= 191; i++) {

    Reader file = new InputStreamReader(
            new FileInputStream("corpus\\doc" + i + ".txt"), "UTF-8");

    BufferedReader buff = new BufferedReader(file);

    while ((line = buff.readLine()) != null) {
        sb.append(line);
    }

    Document document = new Document();
    document.add(new Field("contents", sb.toString(),
            Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(document);

    buff.close();

}


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: problem in indexing documents

Posted by Doron Cohen <cd...@gmail.com>.

>
> >            document.add(new Field("contents",sb.toString(),
> > Field.Store.NO, Field.Index.TOKENIZED));
>

In addition, for a field that is tokenized but not stored, as here, the
Field constructor that takes a Reader parameter can be handy.

Regards, Doron

Re: problem in indexing documents

Posted by Erick Erickson <er...@gmail.com>.

It's more helpful to indicate what error you're receiving or
what your perceived problem is. Without that, we can only
guess...

But one thing wrong is that you keep appending to the
same StringBuffer forever, so your first writer.addDocument
adds document 1. Your second adds the text of BOTH doc
1 and doc 2. Your third addDocument adds the contents of
docs 1, 2, and 3, and so on.

Just move the declaration for sb inside your for loop....
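
A minimal sketch of that fix, using only the JDK so it runs without
Lucene: the buffer is created fresh for each document, so no text carries
over from one iteration to the next. The class name PerDocumentBuffer and
the helper readAll are hypothetical, and the in-memory strings stand in
for the corpus files. The sketch also appends a newline after each line,
since readLine() strips line terminators and the last word of one line
would otherwise run into the first word of the next.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class PerDocumentBuffer {

    // Hypothetical helper: reads one document into a buffer that is
    // created per call, so nothing accumulates across documents.
    static String readAll(Reader in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader buff = new BufferedReader(in)) {
            String line;
            while ((line = buff.readLine()) != null) {
                // readLine() drops the line terminator; put a separator back
                // so words on adjacent lines are not glued together.
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-ins for the corpus files (assumption for illustration).
        String[] corpus = { "first doc\nline two", "second doc" };

        List<String> contents = new ArrayList<>();
        for (String raw : corpus) {
            contents.add(readAll(new StringReader(raw)));
        }

        // Each entry holds the text of exactly one document.
        for (String text : contents) {
            System.out.print("[" + text + "]");
        }
    }
}
```

Each string returned this way would then go into its own Document, as in
the original loop.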

Note two things:
1> you can get the same effect by doing a
    document.add() for *each* line. That is,
    instead of the sb.append line, just do a
    document.add() for line. Then do your
    writer.addDocument outside the while loop as
    you do now.
2> there is a built-in default maximum of 10,000 terms per field
    that get indexed. You can change this, see setMaxFieldLength on
    IndexWriter.

Best
Erick
