You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by John Paul Sondag <js...@uiuc.edu> on 2007/08/11 19:05:46 UTC

Indexing correctly?

Hi,

I was hoping that maybe you guys could see if I'm somehow indexing
inefficiently.  I'm putting relevant parts of my code below.  I've looked at
the "benchmarks" page on Lucene and my indexing time is taking a substantial
amount of time more than what I see posted.  I'm not sure when I should call
flush() ( I saw that I should be doing that on the ImproveIndexingSpeed
page).  I'd really appreciate any advice.

Here's my code:

      File directory = new File( "/mounts/falcon5/disks/0/tcheng3/Dataset");
      File[] theFiles = directory.listFiles();

          //go through each file inside the directory and index it
          for(int curFile = 0; curFile < theFiles.length; curFile++)
          {
              File fin=theFiles[curFile];

              //open up the file
              FileInputStream inf = new FileInputStream(fin);
              InputStreamReader isr = new InputStreamReader(inf,
"US-ASCII");
              BufferedReader in = new BufferedReader(isr);
              String text="";
              String docid="";

            while (true) {

            //read in the file one line at a time, and act accordingly
                String line = in.readLine();
                if (line == null) { break;}

                 if (line.startsWith("<DOC>") ) {
                    //get docID
                    line = in.readLine();
                    String tempStr = line.substring(8,line.length());
                    int pos = tempStr.indexOf(' ');
                    docid = tempStr.substring(0,pos);
                    }else if (line.startsWith("</DOC>")) {

                    Document doc = new Document();

                      doc.add(new Field("contents",text, Field.Store.NO,
Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS ));
                    doc.add(new Field("DocID",docid, Field.Store.YES,
Field.Index.NO));
                      writer.addDocument(doc);
                    text="";
                } else {
                    text = text + "\n" + line;
                }
            }

        }


          int numIndexed = writer.docCount();

          writer.optimize();
          writer.close();


Thanks,

--JP

Re: Indexing correctly?

Posted by Erick Erickson <er...@gmail.com>.

Where are your source files and index? If they're somewhere
out there on the network, you may be having some slowdown
because of network latency (the part about "/mount/....." leads
me to ask this one).

If this is the case, you might get an improvement if all the files are
local...

Best
Erick

On 8/11/07, John Paul Sondag <js...@uiuc.edu> wrote:
>
> It takes roughly 6 hours for me to index a Gig of data.  The benchmarks
> take
> quite a bit less if I'm reading it correctly.  I'll try out the
> StringBuffer/Builder and let you know.  Thanks for the quick response and
> if
> you have any more suggestions please let me know.
>
> --JP
>
> On 8/11/07, karl wettin <ka...@gmail.com> wrote:
> >
> > How much slower than anticipated is it?
> >
> > I would start by using a StringBuffer/Builder rather than appending
> > (immutable) strings to each other.
> >
> >
> > 11 aug 2007 kl. 19.05 skrev John Paul Sondag:
> >
> > > Hi,
> > >
> > > I was hoping that maybe you guys could see if I'm somehow indexing
> > > inefficiently.  I'm putting relevant parts of my code below.  I've
> > > looked at
> > > the "benchmarks" page on Lucene and my indexing time is taking a
> > > substantial
> > > amount of time more than what I see posted.  I'm not sure when I
> > > should call
> > > flush() ( I saw that I should be doing that on the
> > > ImproveIndexingSpeed
> > > page).  I'd really appreciate any advice.
> > >
> > > Here's my code:
> > >
> > >       File directory = new File( "/mounts/falcon5/disks/0/tcheng3/
> > > Dataset");
> > >       File[] theFiles = directory.listFiles();
> > >
> > >           //go through each file inside the directory and index it
> > >           for(int curFile = 0; curFile < theFiles.length; curFile++)
> > >           {
> > >               File fin=theFiles[curFile];
> > >
> > >               //open up the file
> > >               FileInputStream inf = new FileInputStream(fin);
> > >               InputStreamReader isr = new InputStreamReader(inf,
> > > "US-ASCII");
> > >               BufferedReader in = new BufferedReader(isr);
> > >               String text="";
> > >               String docid="";
> > >
> > >             while (true) {
> > >
> > >             //read in the file one line at a time, and act accordingly
> > >                 String line = in.readLine();
> > >                 if (line == null) { break;}
> > >
> > >                  if (line.startsWith("<DOC>") ) {
> > >                     //get docID
> > >                     line = in.readLine();
> > >                     String tempStr = line.substring(8,line.length());
> > >                     int pos = tempStr.indexOf(' ');
> > >                     docid = tempStr.substring(0,pos);
> > >                     }else if (line.startsWith("</DOC>")) {
> > >
> > >                     Document doc = new Document();
> > >
> > >                       doc.add(new Field("contents",text,
> > > Field.Store.NO,
> > > Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS ));
> > >                     doc.add(new Field("DocID",docid, Field.Store.YES,
> > > Field.Index.NO));
> > >                       writer.addDocument(doc);
> > >                     text="";
> > >                 } else {
> > >                     text = text + "\n" + line;
> > >                 }
> > >             }
> > >
> > >         }
> > >
> > >
> > >           int numIndexed = writer.docCount();
> > >
> > >           writer.optimize();
> > >           writer.close();
> > >
> > >
> > > Thanks,
> > >
> > > --JP
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>

Re: Indexing correctly?

Posted by John Paul Sondag <js...@uiuc.edu>.

It takes roughly 6 hours for me to index a Gig of data.  The benchmarks take
quite a bit less if I'm reading it correctly.  I'll try out the
StringBuffer/Builder and let you know.  Thanks for the quick response and if
you have any more suggestions please let me know.

--JP

On 8/11/07, karl wettin <ka...@gmail.com> wrote:
>
> How much slower than anticipated is it?
>
> I would start by using a StringBuffer/Builder rather than appending
> (immutable) strings to each other.
>
>
> 11 aug 2007 kl. 19.05 skrev John Paul Sondag:
>
> > Hi,
> >
> > I was hoping that maybe you guys could see if I'm somehow indexing
> > inefficiently.  I'm putting relevant parts of my code below.  I've
> > looked at
> > the "benchmarks" page on Lucene and my indexing time is taking a
> > substantial
> > amount of time more than what I see posted.  I'm not sure when I
> > should call
> > flush() ( I saw that I should be doing that on the
> > ImproveIndexingSpeed
> > page).  I'd really appreciate any advice.
> >
> > Here's my code:
> >
> >       File directory = new File( "/mounts/falcon5/disks/0/tcheng3/
> > Dataset");
> >       File[] theFiles = directory.listFiles();
> >
> >           //go through each file inside the directory and index it
> >           for(int curFile = 0; curFile < theFiles.length; curFile++)
> >           {
> >               File fin=theFiles[curFile];
> >
> >               //open up the file
> >               FileInputStream inf = new FileInputStream(fin);
> >               InputStreamReader isr = new InputStreamReader(inf,
> > "US-ASCII");
> >               BufferedReader in = new BufferedReader(isr);
> >               String text="";
> >               String docid="";
> >
> >             while (true) {
> >
> >             //read in the file one line at a time, and act accordingly
> >                 String line = in.readLine();
> >                 if (line == null) { break;}
> >
> >                  if (line.startsWith("<DOC>") ) {
> >                     //get docID
> >                     line = in.readLine();
> >                     String tempStr = line.substring(8,line.length());
> >                     int pos = tempStr.indexOf(' ');
> >                     docid = tempStr.substring(0,pos);
> >                     }else if (line.startsWith("</DOC>")) {
> >
> >                     Document doc = new Document();
> >
> >                       doc.add(new Field("contents",text,
> > Field.Store.NO,
> > Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS ));
> >                     doc.add(new Field("DocID",docid, Field.Store.YES,
> > Field.Index.NO));
> >                       writer.addDocument(doc);
> >                     text="";
> >                 } else {
> >                     text = text + "\n" + line;
> >                 }
> >             }
> >
> >         }
> >
> >
> >           int numIndexed = writer.docCount();
> >
> >           writer.optimize();
> >           writer.close();
> >
> >
> > Thanks,
> >
> > --JP
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Indexing correctly?

Posted by karl wettin <ka...@gmail.com>.

How much slower than anticipated is it?

I would start by using a StringBuffer/Builder rather than appending
(immutable) strings to each other.


11 aug 2007 kl. 19.05 skrev John Paul Sondag:

> Hi,
>
> I was hoping that maybe you guys could see if I'm somehow indexing
> inefficiently.  I'm putting relevant parts of my code below.  I've  
> looked at
> the "benchmarks" page on Lucene and my indexing time is taking a  
> substantial
> amount of time more than what I see posted.  I'm not sure when I  
> should call
> flush() ( I saw that I should be doing that on the  
> ImproveIndexingSpeed
> page).  I'd really appreciate any advice.
>
> Here's my code:
>
>       File directory = new File( "/mounts/falcon5/disks/0/tcheng3/ 
> Dataset");
>       File[] theFiles = directory.listFiles();
>
>           //go through each file inside the directory and index it
>           for(int curFile = 0; curFile < theFiles.length; curFile++)
>           {
>               File fin=theFiles[curFile];
>
>               //open up the file
>               FileInputStream inf = new FileInputStream(fin);
>               InputStreamReader isr = new InputStreamReader(inf,
> "US-ASCII");
>               BufferedReader in = new BufferedReader(isr);
>               String text="";
>               String docid="";
>
>             while (true) {
>
>             //read in the file one line at a time, and act accordingly
>                 String line = in.readLine();
>                 if (line == null) { break;}
>
>                  if (line.startsWith("<DOC>") ) {
>                     //get docID
>                     line = in.readLine();
>                     String tempStr = line.substring(8,line.length());
>                     int pos = tempStr.indexOf(' ');
>                     docid = tempStr.substring(0,pos);
>                     }else if (line.startsWith("</DOC>")) {
>
>                     Document doc = new Document();
>
>                       doc.add(new Field("contents",text,  
> Field.Store.NO,
> Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS ));
>                     doc.add(new Field("DocID",docid, Field.Store.YES,
> Field.Index.NO));
>                       writer.addDocument(doc);
>                     text="";
>                 } else {
>                     text = text + "\n" + line;
>                 }
>             }
>
>         }
>
>
>           int numIndexed = writer.docCount();
>
>           writer.optimize();
>           writer.close();
>
>
> Thanks,
>
> --JP


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org