Posted to java-user@lucene.apache.org by "Jayakumar.V" <ja...@uaeexchange.com> on 2005/04/23 23:44:03 UTC

lucene indexing performance

Hi,

Maybe this query has been answered before. My first email to this user group
did not generate any response. I had forwarded it to the following email ids:

java-user-info@lucene.apache.org

java-user@lucene.apache.org

 

This is my second email to this mail id. Hope I've reached the right place.

 

We are indexing documents on a scheduled basis. A document that was indexed at
time T1 will come up for indexing again at time T2 with certain additional
fields. I need to ensure that only the document received at time T2 is present
in the index, so I first have to check whether the record is already in the
index and delete it before re-indexing it. I've taken the cue from a code
snippet in the TSS case study in the book Lucene in Action.

 

The steps I've followed are as below:

- prepare the Document for indexing
- close the existing IndexWriter instance
- get an IndexReader instance on the index
- check if the record about to be indexed is already present in the index
- if YES, delete it & close the IndexReader instance
- open the IndexWriter instance again
- add the Document to the index

 

Now, this is repeated for every record being indexed. Is this the right way to
go about it? It took nearly 3 hours to index 250,000 records.

 

I'm attaching the code snippet used in my app for deleting & adding a record:

 

    private void addIndex(Document doc, Map dataMap) {
        IndexReader indexReader = null;

        // remove any previously indexed copy of this document before adding the new one
        try {
            // first, close the underlying IndexWriter instance;
            // we can't have two index-modifying instances open at the same time
            closeWriterIndex();

            // get an IndexReader instance
            indexReader = IndexReader.open(fsDir);
            // build a Term identifying the document to delete
            Term term = new Term("xpin", (String) dataMap.get("xpin"));
            // remove the already indexed document (a no-op if it isn't there)
            indexReader.delete(term);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                // close the reader instance after deleting the document
                if (indexReader != null) {
                    indexReader.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        try {
            // now, reopen the index writer
            openWriterIndex();
            // index the document
            fsWriter.addDocument(doc);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void closeWriterIndex() {
        try {
            fsWriter.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void openWriterIndex() {
        try {
            fsWriter = new IndexWriter(fsDir, new StandardAnalyzer(), false);
            fsWriter.mergeFactor = 100;
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

 

I'm at the final stages of deploying this module. Any suggestions or ideas
would be helpful in completing it quickly.

 

 

TIA 

Jayakumar.V

 


RE: lucene indexing performance

Posted by "Jayakumar.V" <ja...@uaeexchange.com>.
Hi Chuck,
Sorry for the delayed response & thanks for the input. As suggested, I've
batched my addition / deletion process & there is a noticeable performance
gain.

Jayakumar.V


Re: lucene indexing performance

Posted by Chuck Williams <ch...@allthingslocal.com>.
One immediate optimization would be to only close the writer and open 
the reader if the document is present.  You can have a reader open and 
do searches while indexing (and optimization) are underway.  It's just 
the delete operation that requires you to close the writer (so you don't
have two different objects trying to update the same index).
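
A rough sketch of that first optimization, reusing fsDir, fsWriter,
closeWriterIndex() and openWriterIndex() from the snippet in the original post
(the addOrReplace name and the long-lived searchReader field are assumptions
for illustration, not something from this thread):

    // Only pay for the close/reopen cycle when the document is actually in the index.
    // searchReader is an assumed long-lived IndexReader kept open for searching; it only
    // sees what was in the index when it was opened, not documents added after that.
    private void addOrReplace(Document doc, String xpin) throws IOException {
        Term key = new Term("xpin", xpin);
        if (searchReader.docFreq(key) > 0) {   // already indexed?
            closeWriterIndex();                // give up the writer only when needed
            IndexReader reader = IndexReader.open(fsDir);
            try {
                reader.delete(key);            // drop the stale copy
            } finally {
                reader.close();
            }
            openWriterIndex();                 // reopen the writer
        }
        fsWriter.addDocument(doc);             // add the new version
    }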

However, that is rendered moot by the much bigger optimization.  Open 
the reader once, do all your deletes, close the reader, and then do all 
your adds.  I.e., batch your updates as much as possible.
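
A rough sketch of the batched approach, again with the fsDir field and the
"xpin" key field from the original snippet; the batchUpdate name, the raw
java.util.List / Iterator parameters, and the mergeFactor setting carried over
from openWriterIndex() are illustrative assumptions:

    // One reader open/close for all the deletes, then one writer open/close for
    // all the adds. docs is a List of Document objects, xpins the matching List
    // of key strings for the documents being replaced.
    private void batchUpdate(List docs, List xpins) throws IOException {
        IndexReader reader = IndexReader.open(fsDir);
        try {
            for (Iterator it = xpins.iterator(); it.hasNext();) {
                reader.delete(new Term("xpin", (String) it.next()));
            }
        } finally {
            reader.close();
        }

        IndexWriter writer = new IndexWriter(fsDir, new StandardAnalyzer(), false);
        writer.mergeFactor = 100;
        try {
            for (Iterator it = docs.iterator(); it.hasNext();) {
                writer.addDocument((Document) it.next());
            }
        } finally {
            writer.close();
        }
    }

Batched this way, the reader and the writer are each opened and closed once per
batch instead of once per record, which is presumably where most of the time in
the per-record loop was going.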

If you have to do them one at a time, then the first optimization should 
help some.

Chuck



-- 
*Chuck Williams*
All Things Local
Founder and CEO
V: (415)464-1889
C: (415)846-9018
chuck@AllThingsLocal.com
AIM: hawimanawiz
