Posted to solr-dev@lucene.apache.org by Marc Sturlese <ma...@gmail.com> on 2010/01/15 17:34:20 UTC

Re: [jira] Created: (SOLR-1301) Solr + Hadoop

Hey there,
I have just started using Hadoop to create Lucene/Solr indexes and have a
couple of questions.
I have seen there is a Hadoop contrib for building a Lucene index
(org.apache.hadoop.contrib.index). That contrib has a Partitioner that
decides, for every map output, which reducer it goes to; it uses
key.hashCode() % numShards.
Can something similar be done with this patch, or is the implementation
philosophy different?
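
For reference, that contrib's routing boils down to a plain Hadoop
Partitioner along these lines (a minimal sketch against the old 0.19-era
mapred API, not the actual contrib class; the class name is made up):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Minimal sketch: route each record to a shard/reducer by the key's hash.
public class ShardPartitioner implements Partitioner<Text, Text> {

  public void configure(JobConf job) {
    // nothing to configure in this sketch
  }

  public int getPartition(Text key, Text value, int numShards) {
    // mask the sign bit so a negative hashCode() cannot yield a negative index
    return (key.hashCode() & Integer.MAX_VALUE) % numShards;
  }
}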

I am not sure whether this second question should go to the Hadoop mailing
lists instead. The patch builds shards (or a single index) by reading data
from CSV files. If you were to pull the data from a database instead of a
CSV file, would having many mappers querying the database at once be a
strain on it?
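
For what it's worth, one way to keep the load on the database bounded is to
read the input through Hadoop's DBInputFormat and cap the number of map
tasks. A rough sketch against the old mapred API (the JDBC driver, URL,
table and column names are all made up):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;

public class DbIndexingJobSketch {

  // Hypothetical record mapping one row of a "documents" table.
  public static class DocRecord implements Writable, DBWritable {
    long id;
    String title;
    String body;

    public void readFields(ResultSet rs) throws SQLException {
      id = rs.getLong("id");
      title = rs.getString("title");
      body = rs.getString("body");
    }
    public void write(PreparedStatement ps) throws SQLException {
      ps.setLong(1, id);
      ps.setString(2, title);
      ps.setString(3, body);
    }
    public void readFields(DataInput in) throws IOException {
      id = in.readLong();
      title = in.readUTF();
      body = in.readUTF();
    }
    public void write(DataOutput out) throws IOException {
      out.writeLong(id);
      out.writeUTF(title);
      out.writeUTF(body);
    }
  }

  public static void configure(JobConf job) {
    job.setInputFormat(DBInputFormat.class);
    DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost/mydb", "user", "password");
    DBInputFormat.setInput(job, DocRecord.class, "documents",
        null /* conditions */, "id" /* orderBy */,
        "id", "title", "body");
    // keep the number of concurrent queries against the database modest
    job.setNumMapTasks(4);
  }
}

Whether that is gentle enough on the database is still a fair question for
the Hadoop lists.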

Thanks in advance


JIRA jira@apache.org wrote:
> 
> Solr + Hadoop
> -------------
> 
>                  Key: SOLR-1301
>                  URL: https://issues.apache.org/jira/browse/SOLR-1301
>              Project: Solr
>           Issue Type: Improvement
>     Affects Versions: 1.4
>             Reporter: Andrzej Bialecki 
> 
> 
> This patch contains a contrib module that provides distributed indexing
> (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is
> twofold:
> 
> * provide an API that is familiar to Hadoop developers, i.e. that of
> OutputFormat
> * avoid unnecessary export and (de)serialization of data maintained on
> HDFS. SolrOutputFormat consumes data produced by reduce tasks directly,
> without storing it in intermediate files. Furthermore, by using an
> EmbeddedSolrServer, the indexing task is split into as many parts as there
> are reducers, and the data to be indexed is not sent over the network.
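> 
> For illustration only, a driver using this OutputFormat might be wired up
> roughly as below. Aside from SolrOutputFormat itself, which the patch
> provides, everything here is plain Hadoop 0.19 API; the Solr-specific
> setup (solr.home, the converter class) is left out because its exact
> configuration methods are not shown in this issue, and the import of
> SolrOutputFormat is omitted for the same reason:
> 
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapred.FileInputFormat;
> import org.apache.hadoop.mapred.FileOutputFormat;
> import org.apache.hadoop.mapred.JobClient;
> import org.apache.hadoop.mapred.JobConf;
> 
> public class CsvIndexingDriver {
>   public static void main(String[] args) throws Exception {
>     JobConf job = new JobConf(CsvIndexingDriver.class);
>     job.setJobName("csv-to-solr-shards");
>     FileInputFormat.setInputPaths(job, new Path(args[0]));
>     // the shards (part-NNNNN directories) end up under this output path
>     FileOutputFormat.setOutputPath(job, new Path(args[1]));
>     // reduce output is consumed directly by SolrOutputFormat, with no
>     // intermediate files; mapper and reducer classes are omitted here
>     job.setOutputFormat(SolrOutputFormat.class);
>     job.setOutputKeyClass(Text.class);
>     job.setOutputValueClass(Text.class);
>     // one shard per reduce task; 1 reduce task gives a single shard
>     job.setNumReduceTasks(4);
>     JobClient.runJob(job);
>   }
> }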
> 
> Design
> ----------
> 
> Key/value pairs produced by reduce tasks are passed to SolrOutputFormat,
> which in turn uses SolrRecordWriter to write this data. SolrRecordWriter
> instantiates an EmbeddedSolrServer, and it also instantiates an
> implementation of SolrDocumentConverter, which is responsible for turning
> a Hadoop (key, value) pair into a SolrInputDocument. This data is then
> added to a batch, which is periodically submitted to the EmbeddedSolrServer.
> When the reduce task completes and the OutputFormat is closed,
> SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer.
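> 
> As a sketch of what a converter might look like (the base class generics
> and the convert() signature below are assumptions for illustration, not
> the patch's actual code; the field names are made up):
> 
> import java.util.ArrayList;
> import java.util.Collection;
> 
> import org.apache.hadoop.io.Text;
> import org.apache.solr.common.SolrInputDocument;
> 
> // Hypothetical converter: turns a (key, value) pair whose value is a
> // tab-separated "title<TAB>body" line into one SolrInputDocument.
> public class TabSeparatedDocumentConverter
>     extends SolrDocumentConverter<Text, Text> {
> 
>   public Collection<SolrInputDocument> convert(Text key, Text value) {
>     SolrInputDocument doc = new SolrInputDocument();
>     doc.addField("id", key.toString());
>     String[] cols = value.toString().split("\t", 2);
>     doc.addField("title", cols[0]);
>     if (cols.length > 1) {
>       doc.addField("body", cols[1]);
>     }
>     Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
>     docs.add(doc);
>     return docs;
>   }
> }
> 
> In this sketch one input pair yields one document, but returning a
> collection leaves room for a converter to emit several documents per pair.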
> 
> The API provides facilities to specify an arbitrary existing solr.home
> directory, from which the conf/ and lib/ files will be taken.
> 
> This process results in the creation of as many partial Solr home
> directories as there were reduce tasks. The output shards are placed in
> the output directory on the default filesystem (e.g. HDFS). Such
> part-NNNNN directories can be used to run N shard servers. Additionally,
> users can specify the number of reduce tasks, in particular 1 reduce task,
> in which case the output will consist of a single shard.
> 
> An example application is provided that processes large CSV files and uses
> this API. It uses custom CSV processing to avoid (de)serialization
> overhead.
> 
> This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this
> issue; you should put it in contrib/hadoop/lib.
> 
> Note: the development of this patch was sponsored by an anonymous
> contributor and approved for release under Apache License.
> 
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
> 
> 
> 

-- 
View this message in context: http://old.nabble.com/-jira--Created%3A-%28SOLR-1301%29-Solr-%2B-Hadoop-tp24604553p27179769.html
Sent from the Solr - Dev mailing list archive at Nabble.com.