Posted to java-user@lucene.apache.org by peter velthuis <pe...@gmail.com> on 2006/07/03 10:52:15 UTC

Indexing very slow.

When I start the program it is fast, about 10 docs per second, but
after about 15,000 it slows down a lot. Now it does 1 doc per second,
and it is only at #40,000 after a whole night of indexing. These are
VERY small docs with very little information. This is what and how I
index it:

      Document doc = new Document();
      doc.add(new Field("field1", field1, Field.Store.YES, Field.Index.TOKENIZED));
      doc.add(new Field("field2", field2, Field.Store.YES, Field.Index.TOKENIZED));
      doc.add(new Field("field3", field3, Field.Store.YES, Field.Index.TOKENIZED));
      doc.add(new Field("field4", field4, Field.Store.YES, Field.Index.TOKENIZED));
      doc.add(new Field("field5", field5, Field.Store.YES, Field.Index.TOKENIZED));
      doc.add(new Field("field6", field6, Field.Store.YES, Field.Index.TOKENIZED));
      doc.add(new Field("contents", contents, Field.Store.NO, Field.Index.TOKENIZED));



and this:


    String indexDirectory = "lucdex2";

    private void indexDocument(Document document) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriter writer = new IndexWriter(indexDirectory, analyzer, false);
        // writer.setUseCompoundFile(true);
        writer.addDocument(document);
        writer.optimize();
        writer.close();
    }



I read the data out of a MySQL database, but that can't be the
problem, since the data is in memory.

Also, I use Cygwin; when I try indexing on Windows in a program like
NetBeans or BlueJ, it crashes Windows after about 5,000 docs. It says
"beep" and does a complete shutdown...

Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Indexing very slow.

Posted by peter velthuis <pe...@gmail.com>.
hehe that works.. it's now racing through 10,000 docs in a couple of seconds :)

2006/7/3, Aleksander M. Stensby <al...@integrasco.no>:
> Ah, didn't see that. Yeah, you should have something like
>
> new IndexWriter..
>
>         for each document, writer.add
>
> writer.optimize()
> writer.close()
>
> Batching it up will make it faster, yes.
>
> [snip]


Re: Indexing very slow.

Posted by "Aleksander M. Stensby" <al...@integrasco.no>.
Ah, didn't see that. Yeah, you should have something like

new IndexWriter..

	for each document, writer.add

writer.optimize()
writer.close()

Batching it up will make it faster, yes.

On Mon, 03 Jul 2006 11:43:03 +0200, Volodymyr Bychkoviak  
<vb...@i-hypergrid.com> wrote:

> [snip]



-- 
Aleksander M. Stensby
Software Developer
Integrasco A/S
aleksander.stensby@integrasco.no
Tlf.: +47 41 22 82 72



Re: Indexing very slow.

Posted by Volodymyr Bychkoviak <vb...@i-hypergrid.com>.
Problem is hidden in these lines:
 >       writer.optimize();
 >       writer.close();

You should keep one IndexWriter open for all document additions and
close it only after adding the last document.

optimize() merges all index segments into a single segment, and as the
index grows this takes longer and longer. Optimizing does not speed up
indexing, so it should be performed once, at the end of indexing, to
optimize the index for searching.
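
In the 1.x/2.0-era Lucene API used in this thread, the batched version looks roughly like the minimal sketch below. It is an untested outline: the directory name comes from the original post, and fetchDocuments() is a hypothetical stand-in for the MySQL read loop.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import java.util.Collections;
import java.util.List;

public class BatchIndexer {

    // Hypothetical stand-in for reading the rows out of MySQL.
    static List<Document> fetchDocuments() {
        return Collections.emptyList();
    }

    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        // Open the writer ONCE (the final 'false' means append to an existing index).
        IndexWriter writer = new IndexWriter("lucdex2", analyzer, false);
        try {
            for (Document doc : fetchDocuments()) {
                writer.addDocument(doc);   // cheap: buffers and occasionally merges
            }
            writer.optimize();             // expensive: do it once, after the last add
        } finally {
            writer.close();                // close only after all additions
        }
    }
}
```

Opening, optimizing, and closing the writer per document made every addition pay for a full segment merge; amortized over a single writer, that work happens once.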

peter velthuis wrote:
> [snip]

-- 
regards,
Volodymyr Bychkoviak


Re: Indexing very slow.

Posted by peter velthuis <pe...@gmail.com>.
I select it in parts, chunks of 5,000 records with the LIMIT keyword.
The thing is, it starts very fast but then slows down, so I doubt it
has to do with tokenizing.


2006/7/3, Aleksander M. Stensby <al...@integrasco.no>:
> [snip]


Re: Indexing very slow.

Posted by "Aleksander M. Stensby" <al...@integrasco.no>.
My guess is that if you actually do a complete SELECT * from your db
and manage all the objects at once, this will be a problem for your
JVM; maybe running out of memory is the problem you encounter. Strings
tend to be a bit of a memory issue in Java :(

My suggestion is that you do pagination and offsetting while reading
from the db (and index the results as you go).

So you could keep track of your "last indexed" doc, predefine a "step"
variable, then SELECT with a limit of lastindexed, step, and update
your last-indexed variable on each iteration. Of course, the step
variable should not be too low either, because that would put too much
load on the database connection.

Also, you might reconsider tokenizing all fields... is it necessary?
Do you have to store all of them? (I don't know the usage of your
index, so it's a bit hard to know for sure.)

- Aleksander
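
The lastindexed/step idea above amounts to generating MySQL "LIMIT offset, row_count" clauses. A small self-contained sketch of just that bookkeeping (the table name in the usage is made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class LimitPager {

    // Build the "LIMIT offset, step" clauses needed to walk `total` rows
    // in chunks of `step`, mirroring the lastindexed/step bookkeeping.
    public static List<String> limitClauses(int total, int step) {
        List<String> clauses = new ArrayList<>();
        for (int lastIndexed = 0; lastIndexed < total; lastIndexed += step) {
            clauses.add("LIMIT " + lastIndexed + ", " + step);
        }
        return clauses;
    }

    public static void main(String[] args) {
        // 12,000 rows in chunks of 5,000 -> three queries
        for (String clause : limitClauses(12000, 5000)) {
            System.out.println("SELECT * FROM docs " + clause);
        }
    }
}
```

Each clause is appended to the real query; the indexing loop adds the documents from one chunk before fetching the next, so only one chunk of rows is in memory at a time.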

On Mon, 03 Jul 2006 10:52:15 +0200, peter velthuis <pe...@gmail.com>  
wrote:

> [snip]



-- 
Aleksander M. Stensby
Software Developer
Integrasco A/S
aleksander.stensby@integrasco.no
Tlf.: +47 41 22 82 72
