Posted to java-user@lucene.apache.org by Paul Taylor <pa...@fastmail.fm> on 2009/10/22 14:45:43 UTC

Performance tips when creating a large index from database.

I'm building a Lucene index from a database, creating about 1 million
documents; unsurprisingly this takes quite a long time.
I do this by sending a query to the db over a range of ids (10,000
records),
adding these results to Lucene,
then getting the next 10,000, and so on.
When indexing is complete I then call optimize().
I also set indexWriter.setMaxBufferedDocs(1000) and
indexWriter.setMergeFactor(3000), but don't fully understand these values.
Each document contains about 10 small fields.
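In code, the loop looks roughly like this (fetchBatch and indexBatch are hypothetical stand-ins for the actual JDBC query and the IndexWriter.addDocument() calls):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchIndexer {

    static final int BATCH_SIZE = 10_000;
    static final long MAX_ID = 1_000_000;

    // Stand-in for: SELECT ... WHERE id >= fromId AND id < toId
    static List<String> fetchBatch(long fromId, long toId) {
        List<String> rows = new ArrayList<>();
        for (long id = fromId; id < toId; id++) {
            rows.add("row-" + id);
        }
        return rows;
    }

    // Stand-in for building Documents and calling indexWriter.addDocument()
    static void indexBatch(List<String> rows) {
        // no-op in this sketch
    }

    public static void main(String[] args) {
        int batches = 0;
        for (long from = 0; from < MAX_ID; from += BATCH_SIZE) {
            indexBatch(fetchBatch(from, Math.min(from + BATCH_SIZE, MAX_ID)));
            batches++;
        }
        // indexWriter.optimize() and close() would go here, once, at the end
        System.out.println("batches=" + batches);
    }
}
```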

I'm looking for some ways to improve performance.

This index writing is single threaded; is there a way I can multi-thread
writing to the index?
I only call optimize() once at the end; is that the best way to do it?
I'm going to run a profiler over the code, but are there any rules of
thumb on the best values to set for maxBufferedDocs and mergeFactor()?

thanks Paul

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Performance tips when creating a large index from database.

Posted by Paul Taylor <pa...@fastmail.fm>.
Glen Newton wrote:
> You might want to consider using LuSql, which is a high performance,
> multithreaded, well documented tool designed specifically for moving
> data from a JDBC database into Lucene (you didn't say if it was a
> JDBC-accessible db...)
>  http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
>
> Disclosure: I am the author of LuSql.
>
> -Glen Newton
>  http://zzzoot.blogspot.com/
>  http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/Glen_Newton
>   

Thanks Glen

I have already written all the code for sending queries to the database,
deciding what fields I want to use, et cetera, so I wouldn't want to redo
that to use another tool, especially when you offer only a command-line
interface, not an API. But I would be interested in how you use
multithreading to best advantage for this; would you care to share the
basics?

Paul



Re: Performance tips when creating a large index from database.

Posted by Ian Lea <ia...@gmail.com>.
See also http://wiki.apache.org/lucene-java/ImproveIndexingSpeed.
That includes some info on merge and buffer factors, and recommends
multiple threads.  When I've done this sort of thing in the past it
has tended to be the database that is the problem, but maybe your
database is faster than mine.  Only calling optimize at the end is
correct.  You don't need to call it at all.


--
Ian.


On Thu, Oct 22, 2009 at 1:52 PM, Glen Newton <gl...@gmail.com> wrote:
> You might want to consider using LuSql, which is a high performance,
> multithreaded, well documented tool designed specifically for moving
> data from a JDBC database into Lucene (you didn't say if it was a
> JDBC-accessible db...)
>  http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
>
> Disclosure: I am the author of LuSql.
>
> -Glen Newton
>  http://zzzoot.blogspot.com/
>  http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/Glen_Newton
>
>
> 2009/10/22 Paul Taylor <pa...@fastmail.fm>:
>> I'm building a Lucene index from a database, creating about 1 million
>> documents; unsurprisingly this takes quite a long time.
>> I do this by sending a query to the db over a range of ids (10,000
>> records),
>> adding these results to Lucene,
>> then getting the next 10,000, and so on.
>> When indexing is complete I then call optimize().
>> I also set indexWriter.setMaxBufferedDocs(1000) and
>> indexWriter.setMergeFactor(3000), but don't fully understand these values.
>> Each document contains about 10 small fields.
>>
>> I'm looking for some ways to improve performance.
>>
>> This index writing is single threaded; is there a way I can multi-thread
>> writing to the index?
>> I only call optimize() once at the end; is that the best way to do it?
>> I'm going to run a profiler over the code, but are there any rules of
>> thumb on the best values to set for maxBufferedDocs and mergeFactor()?
>>
>> thanks Paul


Re: Performance tips when creating a large index from database.

Posted by Glen Newton <gl...@gmail.com>.
You might want to consider using LuSql, which is a high performance,
multithreaded, well documented tool designed specifically for moving
data from a JDBC database into Lucene (you didn't say if it was a
JDBC-accessible db...)
 http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql

Disclosure: I am the author of LuSql.

-Glen Newton
 http://zzzoot.blogspot.com/
 http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/Glen_Newton


2009/10/22 Paul Taylor <pa...@fastmail.fm>:
> I'm building a Lucene index from a database, creating about 1 million
> documents; unsurprisingly this takes quite a long time.
> I do this by sending a query to the db over a range of ids (10,000
> records),
> adding these results to Lucene,
> then getting the next 10,000, and so on.
> When indexing is complete I then call optimize().
> I also set indexWriter.setMaxBufferedDocs(1000) and
> indexWriter.setMergeFactor(3000), but don't fully understand these values.
> Each document contains about 10 small fields.
>
> I'm looking for some ways to improve performance.
>
> This index writing is single threaded; is there a way I can multi-thread
> writing to the index?
> I only call optimize() once at the end; is that the best way to do it?
> I'm going to run a profiler over the code, but are there any rules of
> thumb on the best values to set for maxBufferedDocs and mergeFactor()?
>
> thanks Paul


Re: Performance tips when creating a large index from database.

Posted by Marcelo Ochoa <ma...@gmail.com>.
Hi Paul:
   Most of the time indexing big tables is spent on the table full
scan and network data transfer.
   Please take a quick look at my OOW08 presentation about Oracle
Lucene integration:
          http://docs.google.com/present/view?id=ddgw7sjp_156gf9hczxv
    especially slides 13 and 14, which show the time involved during a
Wikipedia dump indexing inside an Oracle database.
    Best regards, Marcelo.
On Thu, Oct 22, 2009 at 9:45 AM, Paul Taylor <pa...@fastmail.fm> wrote:
> I'm building a Lucene index from a database, creating about 1 million
> documents; unsurprisingly this takes quite a long time.
> I do this by sending a query to the db over a range of ids (10,000
> records),
> adding these results to Lucene,
> then getting the next 10,000, and so on.
> When indexing is complete I then call optimize().
> I also set indexWriter.setMaxBufferedDocs(1000) and
> indexWriter.setMergeFactor(3000), but don't fully understand these values.
> Each document contains about 10 small fields.
>
> I'm looking for some ways to improve performance.
>
> This index writing is single threaded; is there a way I can multi-thread
> writing to the index?
> I only call optimize() once at the end; is that the best way to do it?
> I'm going to run a profiler over the code, but are there any rules of
> thumb on the best values to set for maxBufferedDocs and mergeFactor()?
>
> thanks Paul



-- 
Marcelo F. Ochoa
http://marceloochoa.blogspot.com/
http://marcelo.ochoa.googlepages.com/home
______________
Want to integrate Lucene and Oracle?
http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html
Is Oracle 11g REST ready?
http://marceloochoa.blogspot.com/2008/02/is-oracle-11g-rest-ready.html



Re: Performance tips when creating a large index from database.

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Thu, 2009-10-22 at 15:14 +0200, Erick Erickson wrote:
> Besides the other suggestions, I'd really, really, really put
> some instrumentationin the code and see where you're spending your time. For
> a fast hint, put
> a cumulative timer around your indexing part only. This will indicate
> whether
> the time is consumed in querying your database or indexing......

I'll second that. We use this in most of our outer methods and log the
time used on DEBUG level (and INFO in a few select cases).

For pinpointing bottlenecks I'll also recommend VisualVM. It ships with
Java, but the newest version can be found at
https://visualvm.dev.java.net/ 

The beauty of this tool is that it requires no preparation. Just start it
and connect to a running Java program. It provides detailed CPU-usage
and RAM-allocation statistics.

- Toke




Re: Performance tips when creating a large index from database.

Posted by Erick Erickson <er...@gmail.com>.
Besides the other suggestions, I'd really, really, really put some
instrumentation in the code and see where you're spending your time.
For a fast hint, put a cumulative timer around your indexing part only.
This will indicate whether the time is consumed in querying your
database or in indexing......

I'd also just use the default merge factor etc. until you answer
this question. They weren't just chosen at random <G>.

Something like this:

long elapsed = 0;
while (moreDatabaseRecords()) {
    Record rec = getDatabaseRecord();        // DB time stays outside the timer
    long start = System.currentTimeMillis();
    indexLuceneDoc(rec);                     // only the indexing is timed
    elapsed += System.currentTimeMillis() - start;
}
System.out.println("indexing took " + elapsed + " ms");


It's unclear from your message whether you're calling optimize after
every 10,000 docs or not, but don't.

But I really suspect you're spending time interacting with the DB.

If these responses aren't all that helpful, some code samples
would help...

Best
Erick


On Thu, Oct 22, 2009 at 8:45 AM, Paul Taylor <pa...@fastmail.fm> wrote:

> I'm building a Lucene index from a database, creating about 1 million
> documents; unsurprisingly this takes quite a long time.
> I do this by sending a query to the db over a range of ids (10,000
> records),
> adding these results to Lucene,
> then getting the next 10,000, and so on.
> When indexing is complete I then call optimize().
> I also set indexWriter.setMaxBufferedDocs(1000) and
> indexWriter.setMergeFactor(3000), but don't fully understand these values.
> Each document contains about 10 small fields.
>
> I'm looking for some ways to improve performance.
>
> This index writing is single threaded; is there a way I can multi-thread
> writing to the index?
> I only call optimize() once at the end; is that the best way to do it?
> I'm going to run a profiler over the code, but are there any rules of
> thumb on the best values to set for maxBufferedDocs and mergeFactor()?
>
> thanks Paul

Re: Performance tips when creating a large index from database.

Posted by Chris Lu <ch...@gmail.com>.
All previous suggestions are very good.

It's usually just the database; Lucene itself is fast enough.
Years ago, on a Pentium III, indexing speed mattered, but after
upgrading the CPU to a Xeon etc., the indexing bottleneck is on the
database side.

Basically, use the simplest SQL you can, and maybe do some caching to
save a trip to the database.
Sometimes you can batch a set of ids the way you do. But if you set
the batch size to a big number, like your 10,000 ids, it could cause
database memory page swapping and slow down your database. The other
extreme is to use just 1 id, but many threads.
I think you can find something in the middle: set the batch size to a
lower number and use more threads to pull data; that may speed things up.

You can also try this with the community version of DBSight. You can
tell how much time is spent on querying and indexing respectively, and
you can adjust the number of threads for database queries and for
indexing to find your optimal data-pulling configuration.
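The middle ground might be sketched like this: a few worker threads each claim the next small id range and process it (fetchAndIndex is a hypothetical stand-in for the real JDBC query plus the IndexWriter calls):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ParallelPuller {

    static final long MAX_ID = 100_000;
    static final long BATCH = 1_000;             // smaller than 10,000
    static final AtomicLong nextId = new AtomicLong(0);
    static final AtomicLong indexed = new AtomicLong(0);

    // Stand-in for fetching rows [from, to) and indexing them
    static void fetchAndIndex(long from, long to) {
        indexed.addAndGet(to - from);
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int t = 0; t < 4; t++) {
            pool.submit(() -> {
                long from;
                // getAndAdd hands each worker a disjoint id range,
                // so no two threads fetch the same rows
                while ((from = nextId.getAndAdd(BATCH)) < MAX_ID) {
                    fetchAndIndex(from, Math.min(from + BATCH, MAX_ID));
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("indexed=" + indexed.get());
    }
}
```

Tuning the thread count and batch size against your database's comfort zone is the experiment to run.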

--

Chris Lu

-------------------------

Instant Scalable Full-Text Search On Any Database/Application

site: http://www.dbsight.net

demo: http://search.dbsight.com

Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

DBSight customer, a shopping comparison site, (anonymous per request) got 2.6 Million Euro funding!



Glen Newton wrote:
> This is basically what LuSql does. The speed-ups ("8h to 30 min") are
> similar, usually on the order of an order of magnitude.
>
> Oh, the comments suggesting most of the interaction is with the
> database? The answer is: it depends.
> With large Lucene documents: Lucene is the limiting factor (worsened
> by going single threaded).
> With small documents: it can be the DB.
>
> Other issues include waiting for complex queries on the DB to be ready
> (avoid sorting in the SQL!!).
> LuSql supports out-of-band joins (don't do the join in the SQL, but do
> the join from the client (with an additional- but low cost as it is
> usually on the primary key - query for each record); sometimes this is
> better; sometimes this is worse, depending on your DB design, queries,
> etc.)
>
> -Glen
>
> 2009/10/22 Thomas Becker <th...@net-m.de>:
>   
>> Profile your application first and find out where the bottlenecks really
>> are during indexing.
>>
>> For me it was clearly the database calls which took most of the time, due to a
>> very complex SQL query.
>> I applied the producer-consumer pattern and put a blocking queue in between. I
>> have a threadpool running x producers which send SQL queries to the
>> database. Each returned row is put into the BlockingQueue, and another threadpool
>> running x (currently only 1) consumers takes objects from the queue, converts
>> them to Lucene documents and adds them to the index.
>> When the last row is put into the queue I add a poison pill to tell the consumer
>> to break.
>> Using a BlockingQueue limited to 10,000 entries together with the JDBC fetchSize
>> avoids high memory consumption if too many producer threads return from the db.
>>
>> This way I could reduce indexing time from around 8h to 30 min (really). But be
>> careful: load on the DB server will surely increase.
>>
>> Hope that helps.
>>
>> Cheers,
>> Thomas
>>
>> Paul Taylor wrote:
>>     
>>> I'm building a Lucene index from a database, creating about 1 million
>>> documents; unsurprisingly this takes quite a long time.
>>> I do this by sending a query to the db over a range of ids (10,000
>>> records),
>>> adding these results to Lucene,
>>> then getting the next 10,000, and so on.
>>> When indexing is complete I then call optimize().
>>> I also set indexWriter.setMaxBufferedDocs(1000) and
>>> indexWriter.setMergeFactor(3000), but don't fully understand these values.
>>> Each document contains about 10 small fields.
>>>
>>> I'm looking for some ways to improve performance.
>>>
>>> This index writing is single threaded; is there a way I can multi-thread
>>> writing to the index?
>>> I only call optimize() once at the end; is that the best way to do it?
>>> I'm going to run a profiler over the code, but are there any rules of
>>> thumb on the best values to set for maxBufferedDocs and mergeFactor()?
>>>
>>> thanks Paul

Re: Performance tips when creating a large index from database.

Posted by Glen Newton <gl...@gmail.com>.
This is basically what LuSql does. The speed-ups ("8h to 30 min") are
similar, usually on the order of an order of magnitude.

Oh, the comments suggesting most of the interaction is with the
database? The answer is: it depends.
With large Lucene documents: Lucene is the limiting factor (worsened
by going single threaded).
With small documents: it can be the DB.

Other issues include waiting for complex queries on the DB to be ready
(avoid sorting in the SQL!).
LuSql supports out-of-band joins: don't do the join in the SQL, but do
the join from the client, with an additional - but low-cost, since it is
usually on the primary key - query for each record. Sometimes this is
better, sometimes worse, depending on your DB design, queries, etc.

-Glen

2009/10/22 Thomas Becker <th...@net-m.de>:
> Profile your application first and find out where the bottlenecks really
> are during indexing.
>
> For me it was clearly the database calls which took most of the time, due to a
> very complex SQL query.
> I applied the producer-consumer pattern and put a blocking queue in between. I
> have a threadpool running x producers which send SQL queries to the
> database. Each returned row is put into the BlockingQueue, and another threadpool
> running x (currently only 1) consumers takes objects from the queue, converts
> them to Lucene documents and adds them to the index.
> When the last row is put into the queue I add a poison pill to tell the consumer
> to break.
> Using a BlockingQueue limited to 10,000 entries together with the JDBC fetchSize
> avoids high memory consumption if too many producer threads return from the db.
>
> This way I could reduce indexing time from around 8h to 30 min (really). But be
> careful: load on the DB server will surely increase.
>
> Hope that helps.
>
> Cheers,
> Thomas
>
> Paul Taylor wrote:
>> I'm building a Lucene index from a database, creating about 1 million
>> documents; unsurprisingly this takes quite a long time.
>> I do this by sending a query to the db over a range of ids (10,000
>> records),
>> adding these results to Lucene,
>> then getting the next 10,000, and so on.
>> When indexing is complete I then call optimize().
>> I also set indexWriter.setMaxBufferedDocs(1000) and
>> indexWriter.setMergeFactor(3000), but don't fully understand these values.
>> Each document contains about 10 small fields.
>>
>> I'm looking for some ways to improve performance.
>>
>> This index writing is single threaded; is there a way I can multi-thread
>> writing to the index?
>> I only call optimize() once at the end; is that the best way to do it?
>> I'm going to run a profiler over the code, but are there any rules of
>> thumb on the best values to set for maxBufferedDocs and mergeFactor()?
>>
>> thanks Paul


Re: Performance tips when creating a large index from database.

Posted by Thomas Becker <th...@net-m.de>.
Profile your application first and find out where the bottlenecks really
are during indexing.

For me it was clearly the database calls which took most of the time, due to a
very complex SQL query.
I applied the producer-consumer pattern and put a blocking queue in between. I
have a threadpool running x producers which send SQL queries to the
database. Each returned row is put into the BlockingQueue, and another threadpool
running x (currently only 1) consumers takes objects from the queue, converts
them to Lucene documents and adds them to the index.
When the last row is put into the queue I add a poison pill to tell the consumer
to break.
Using a BlockingQueue limited to 10,000 entries together with the JDBC fetchSize
avoids high memory consumption if too many producer threads return from the db.

This way I could reduce indexing time from around 8h to 30 min (really). But be
careful: load on the DB server will surely increase.
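Stripped of the SQL and Lucene specifics, the queue wiring looks like this (the row strings stand in for real rows, and the counter stands in for the addDocument() calls):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueIndexer {

    static final String POISON_PILL = "__END__";

    public static void main(String[] args) throws InterruptedException {
        // bounded queue caps memory use when producers outrun the consumer
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 5_000; i++) {
                    queue.put("row-" + i);   // blocks while the queue is full
                }
                queue.put(POISON_PILL);      // tell the consumer to stop
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        final int[] consumed = {0};
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String row = queue.take();
                    if (row.equals(POISON_PILL)) break;
                    consumed[0]++;           // stand-in for addDocument(doc)
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
        System.out.println("consumed=" + consumed[0]);
    }
}
```

With more than one producer you would put one pill per consumer (or count finished producers) so that every consumer is guaranteed to see a pill.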

Hope that helps.

Cheers,
Thomas

Paul Taylor wrote:
> I'm building a Lucene index from a database, creating about 1 million
> documents; unsurprisingly this takes quite a long time.
> I do this by sending a query to the db over a range of ids (10,000
> records),
> adding these results to Lucene,
> then getting the next 10,000, and so on.
> When indexing is complete I then call optimize().
> I also set indexWriter.setMaxBufferedDocs(1000) and
> indexWriter.setMergeFactor(3000), but don't fully understand these values.
> Each document contains about 10 small fields.
>
> I'm looking for some ways to improve performance.
>
> This index writing is single threaded; is there a way I can multi-thread
> writing to the index?
> I only call optimize() once at the end; is that the best way to do it?
> I'm going to run a profiler over the code, but are there any rules of
> thumb on the best values to set for maxBufferedDocs and mergeFactor()?
>
> thanks Paul

-- 
Thomas Becker
Senior JEE Developer

net mobile AG
Zollhof 17
40221 Düsseldorf
GERMANY

Phone:    +49 211 97020-195
Fax:      +49 211 97020-949
Mobile:   +49 173 5146567 (private)
E-Mail:   mailto:thomas.becker@net-m.de
Internet: http://www.net-m.de

Registergericht:  Amtsgericht Düsseldorf, HRB 48022
Vorstand:         Theodor Niehues (Vorsitzender), Frank Hartmann,
                 Kai Markus Kulas, Dieter Plassmann
Vorsitzender des
Aufsichtsrates:   Dr. Michael Briem
