Posted to solr-user@lucene.apache.org by abhishes <ab...@gmail.com> on 2010/02/11 11:33:35 UTC

Posting Concurrently to Solr

Hello Everyone,

If I have a large data set that needs to be indexed, what strategy can I take
to build the index quickly?

1. Split the input into multiple XML files, then open different shells and
post each split file. Will this work, and will it build the index faster than
posting 1 large XML file?

2. What if I don't want to build the XML files at all? I would write the
extraction logic in an ETL tool and let the ETL tool send the commands to
Solr, running the ETL tool in a multi-threaded manner where each thread
extracts data from the backend and sends it to Solr for indexing (a rough
sketch of this approach follows below).

3. Use the multi-core feature, populate each core separately, then merge the
cores.

Any other approach?
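
A minimal sketch of option 2, assuming SolrJ 1.4 on the client; the URL, the
field names, and the slice-by-modulo loop are hypothetical stand-ins for the
real backend extraction logic:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        // one shared instance; CommonsHttpSolrServer is thread-safe
        final SolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        final int threads = 4;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int slice = t;
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        // hypothetical: each thread would extract its own
                        // slice of the backend instead of this modulo loop
                        for (int i = slice; i < 1000; i += threads) {
                            SolrInputDocument doc = new SolrInputDocument();
                            doc.addField("id", Integer.toString(i));
                            doc.addField("name", "row " + i);
                            server.add(doc);
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        server.commit(); // commit once, after all threads finish
    }
}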





Re: Posting Concurrently to Solr

Posted by Vijayant Kumar <vi...@websitetoolbox.com>.
Why don't you take the DataImportHandler (DIH) approach?

http://wiki.apache.org/solr/DataImportHandler
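
Assuming the handler is registered at /dataimport in solrconfig.xml (as in
the wiki examples), a minimal sketch of kicking off a full import from SolrJ
might look like this; the URL is an example value:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.params.ModifiableSolrParams;

public class DihFullImport {
    public static void main(String[] args) throws Exception {
        SolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("qt", "/dataimport");      // route to the DataImportHandler
        params.set("command", "full-import"); // start a full import
        server.query(params); // returns at once; DIH imports in the background
    }
}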


Thank you,
Vijayant Kumar
Software Engineer
Website Toolbox Inc.
http://www.websitetoolbox.com
1-800-921-7803 x211






Re: Posting Concurrently to Solr

Posted by Jan Høydahl / Cominvent <ja...@cominvent.com>.
You did not say how frequently you need to update the index, whether this is a batch-type operation, or whether you also have real-time requirements after the initial load.

Your ETL could use SolrJ and the StreamingUpdateSolrServer for high throughput.
You could try multiple threads pushing in parallel if your bottleneck is on the client side.
If that's not enough, you can split your index into multiple cores/shards to get more parallel indexing power.
You don't need to merge them at the end; you can query across them using the shards parameter.
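
For example, a minimal sketch along those lines, assuming Solr 1.4-era SolrJ;
the URL, queue size, thread count, and field names are example values:

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class StreamingIndexer {
    public static void main(String[] args) throws Exception {
        // buffer up to 100 docs, drained by 4 background threads
        StreamingUpdateSolrServer server =
            new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);
        for (int i = 0; i < 1000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("name", "row " + i);
            server.add(doc); // queued; sent by the background threads
        }
        server.commit(); // one commit at the end of the batch
    }
}

For the sharded query, you would set the shards parameter on a SolrQuery,
e.g. q.set("shards", "localhost:8983/solr/core0,localhost:8983/solr/core1")
(the core names are examples).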

For extreme power for batch indexing, you can look at a map-reduce strategy: http://wiki.apache.org/solr/HadoopIndexing

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com
