Posted to solr-user@lucene.apache.org by "Thompson,Roger" <th...@oclc.org> on 2007/09/14 14:19:43 UTC

Batch indexing a large number of records

Hi there!
 
I am embarking on re-engineering an application using Solr/Lucene (if
you'd like to see the current manifestation, go to:
fictionfinder.oclc.org).  The database for this application consists of
approximately 1.4 million "work" records of varying size, plus another
database of 1.9 million bibliographic records.  I fear that loading
this through HTTP will take several days, perhaps a week.  Do any of
you have a way to do a large batch load of the DB?
 
Roger Thompson
 

Re: Batch indexing a large number of records

Posted by Mike Klaas <mi...@gmail.com>.
On 14-Sep-07, at 5:19 AM, Thompson,Roger wrote:

> Hi there!
>
> I am embarking on re-engineering an application using Solr/Lucene (if
> you'd like to see the current manifestation, go to:
> fictionfinder.oclc.org).  The database for this application consists
> of approximately 1.4 million "work" records of varying size, plus
> another database of 1.9 million bibliographic records.  I fear that
> loading this through HTTP will take several days, perhaps a week.  Do
> any of you have a way to do a large batch load of the DB?

I can index 2 million web documents in 7 hours over HTTP.  Just batch
a few (say 10) docs per HTTP POST, and use around N+1 threads (N = #
of processors).
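
A rough sketch of that pattern in Python follows.  The Solr update
URL, the "id"/"title" field names, and fetch_records() are
placeholders assumed for illustration, not details from this thread;
adjust them to your schema and database.

import queue
import threading
import urllib.request
from xml.sax.saxutils import escape

SOLR_UPDATE = "http://localhost:8983/solr/update"  # assumed default endpoint
BATCH_SIZE = 10                                    # docs per HTTP POST
NUM_THREADS = 5                                    # N+1, N = # of processors

def fetch_records():
    # Stand-in for your real DB cursor; yields dicts of field -> value.
    for i in range(25):
        yield {"id": str(i), "title": "record %d" % i}

def post_batch(docs):
    # One <add> block per POST, holding the whole batch of <doc> elements.
    body = "<add>"
    for doc in docs:
        body += "<doc>"
        for name, value in doc.items():
            body += '<field name="%s">%s</field>' % (name, escape(str(value)))
        body += "</doc>"
    body += "</add>"
    req = urllib.request.Request(
        SOLR_UPDATE, body.encode("utf-8"),
        {"Content-Type": "text/xml; charset=utf-8"})
    urllib.request.urlopen(req).read()

def worker(q):
    while True:
        batch = q.get()
        if batch is None:          # poison pill: shut this worker down
            return
        post_batch(batch)

q = queue.Queue(maxsize=NUM_THREADS * 2)   # bounded, so the feeder blocks
threads = [threading.Thread(target=worker, args=(q,))
           for _ in range(NUM_THREADS)]
for t in threads:
    t.start()

batch = []
for record in fetch_records():
    batch.append(record)
    if len(batch) == BATCH_SIZE:
        q.put(batch)
        batch = []
if batch:
    q.put(batch)
for _ in threads:
    q.put(None)                    # one poison pill per worker
for t in threads:
    t.join()

The bounded queue keeps the database reader from racing ahead of the
posting threads, which is what makes the N+1 sizing effective.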

-Mike

Re: Batch indexing a large number of records

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Sep 14, 2007, at 8:19 AM, Thompson,Roger wrote:
> I am embarking on re-engineering an application using Solr/Lucene (if
> you'd like to see the current manifestation, go to:
> fictionfinder.oclc.org).  The database for this application consists
> of approximately 1.4 million "work" records of varying size, plus
> another database of 1.9 million bibliographic records.  I fear that
> loading this through HTTP will take several days, perhaps a week.  Do
> any of you have a way to do a large batch load of the DB?

It won't take that long.  Send multiple documents per POST, and
perhaps commit after every big bunch.  I ingested 3.8M binary MARC
records in a pretty crude way in less than a day.
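
For the per-bunch commit, Solr's XML update handler accepts an
explicit commit message.  A minimal sketch, assuming the same default
local endpoint as above:

import urllib.request

def commit(solr_update="http://localhost:8983/solr/update"):
    # POST Solr's XML <commit/> message after every big bunch of adds.
    req = urllib.request.Request(
        solr_update, b"<commit/>",
        {"Content-Type": "text/xml; charset=utf-8"})
    urllib.request.urlopen(req).read()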

But the fastest way to ingest data into Solr out of the box, I think,
is to use the CSV import capability.  I've indexed 1.8M
bibliographic-sized records in 18 minutes with the CSV uploader,
pointing it at a local file.
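
A minimal sketch of driving that loader, assuming the stock
/update/csv handler on a default install.  The file path is
hypothetical, and stream.file only works if remote streaming is
enabled in solrconfig.xml:

import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "stream.file": "/data/bib_records.csv",            # hypothetical path
    "stream.contentType": "text/plain;charset=utf-8",
    "commit": "true",                                  # commit when done
})
url = "http://localhost:8983/solr/update/csv?" + params
print(urllib.request.urlopen(url).read())

Pointing it at a file local to the Solr server means the records never
have to cross the wire one POST at a time, which is where much of the
speedup comes from.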

	Erik