Posted to dev@lucene.apache.org by "Mark Miller (JIRA)" <ji...@apache.org> on 2013/08/31 23:32:56 UTC

[jira] [Updated] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

     [ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated SOLR-1301:
------------------------------

    Attachment: SOLR-1301.patch

Here is a patch with my current progress.

This is a Solr contrib module that can build Solr indexes in HDFS via MapReduce. It builds upon the Solr support for reading and writing to HDFS.

It supports a GoLive feature that allows merging the newly built index shards into a running cluster as the final step of the MapReduce job.

There is fairly comprehensive help documentation as part of the MapReduceIndexerTool.
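
To give a feel for how the tool is driven, here is a hedged sketch of launching it programmatically through Hadoop's ToolRunner (it can equally be run via 'hadoop jar', and --help prints the documentation mentioned above). Treat the flag names as my reading of the current argument parser rather than a stable contract, and the host names, paths and collection name as made-up examples:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.solr.hadoop.MapReduceIndexerTool;

    public class IndexerJobSketch {
      public static void main(String[] args) throws Exception {
        // Build index shards in HDFS with MapReduce, then merge them into the live
        // cluster as the final step (the GoLive feature described above).
        int rc = ToolRunner.run(new Configuration(), new MapReduceIndexerTool(), new String[] {
            "--morphline-file", "morphline.conf",           // ETL pipeline definition
            "--output-dir", "hdfs://nn:8020/user/solr/out", // where the built shards land
            "--zk-host", "zk1:2181/solr",                   // SolrCloud cluster to merge into
            "--collection", "collection1",
            "--go-live",                                    // merge shards into the running cluster
            "hdfs://nn:8020/user/solr/in"                   // input files to index
        });
        System.exit(rc);
      }
    }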

For ETL, Morphlines from the open source Cloudera CDK is used (https://github.com/cloudera/cdk/tree/master/cdk-morphlines). This is the same ETL library that the Solr integration with Apache Flume uses.
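
For context, a minimal hedged sketch of how a morphline is driven from Java, per the CDK library linked above - the package and class names follow that library but exact locations and signatures may differ by version, and "morphline.conf" / "morphline1" are hypothetical example names:

    import java.io.File;
    import java.io.InputStream;
    import com.cloudera.cdk.morphline.api.Command;
    import com.cloudera.cdk.morphline.api.MorphlineContext;
    import com.cloudera.cdk.morphline.api.Record;
    import com.cloudera.cdk.morphline.base.Compiler;
    import com.cloudera.cdk.morphline.base.Fields;
    import com.cloudera.cdk.morphline.base.Notifications;

    public class MorphlineEtlSketch {
      public static void etl(InputStream oneInputFile) {
        MorphlineContext context = new MorphlineContext.Builder().build();
        // Compile the named morphline from the config file into a command chain.
        Command morphline = new Compiler().compile(
            new File("morphline.conf"), "morphline1", context, null);
        Record record = new Record();
        record.put(Fields.ATTACHMENT_BODY, oneInputFile); // hand raw input to the pipeline
        Notifications.notifyStartSession(morphline);
        morphline.process(record); // runs the configured commands, e.g. readCSV -> loadSolr
      }
    }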

What I have recently done: updated to the latest code, adjusted for the fact that 5x now requires solr.xml, converted the maven build to ivy+ant, updated license files, fixed validation errors, integrated the tests fully into the test framework, and got the tests passing.

All tests are passing with this patch for me, but there are still a variety of issues to address:

* Run the unit tests against both yarn *and* mr1 - the maven build would run them against yarn or mr1 depending on the profile chosen on the command line - this patch currently runs against yarn only.

* The MiniYarnCluster used for unit tests is hard-coded to use the 'current-working-dir'/target path. This is a bad and illegal location for test output. For the moment, I've relaxed the Lucene tests policy file to allow reads/writes anywhere - this needs to be addressed before committing (a partial-workaround sketch follows after this list).

* We depend on some Morphline commands that themselves depend on Solr - this could cause us problems in the future, and I think we want to own the code for these commands in Solr.

* There are thread leaks in the tests that should be looked into - some may not be avoidable, as in other Hadoop tests (while we wait for fixes from the Hadoop project).

* We need to sync up with the latest code from the maven version - there have been some changes since this code was extracted.
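
Regarding the MiniYarnCluster working-directory item above, here is a partial-workaround sketch. Assumptions: it only redirects the directories that are configurable, and it does not fix MiniYarnCluster's own hard-coded 'target' path, which is why the policy file is still relaxed for now.

    import java.io.File;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;
    import org.apache.hadoop.yarn.server.MiniYARNCluster;
    import org.apache.lucene.util._TestUtil;

    public class MiniYarnClusterSetupSketch {
      public static MiniYARNCluster startCluster() {
        // Point the YARN/Hadoop dirs we *can* control at the Lucene test sandbox dir.
        File sandbox = _TestUtil.getTempDir("yarn");
        Configuration conf = new YarnConfiguration();
        conf.set("hadoop.tmp.dir", new File(sandbox, "hadoop-tmp").getAbsolutePath());
        conf.set(YarnConfiguration.NM_LOCAL_DIRS, new File(sandbox, "nm-local").getAbsolutePath());
        conf.set(YarnConfiguration.NM_LOG_DIRS, new File(sandbox, "nm-logs").getAbsolutePath());
        MiniYARNCluster cluster = new MiniYARNCluster("SolrMRTest", 1, 1, 1);
        cluster.init(conf);
        cluster.start();
        return cluster;
      }
    }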

There are a number of new contributors to this issue that I will be sure to enumerate in CHANGES.

I'll add whatever I'm forgetting in a later comment.
                
> Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
> ---------------------------------------------------------------------------------
>
>                 Key: SOLR-1301
>                 URL: https://issues.apache.org/jira/browse/SOLR-1301
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Andrzej Bialecki 
>            Assignee: Mark Miller
>             Fix For: 4.5, 5.0
>
>         Attachments: commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, log4j-1.2.15.jar, README.txt, SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java
>
>
> This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr via EmbeddedSolrServer. The idea behind this module is twofold:
> * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat
> * avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network.
> Design
> ----------
> Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning a Hadoop (key, value) pair into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to the EmbeddedSolrServer. When a reduce task completes and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer.
> The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken.
> This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-NNNNN directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard.
> An example application is provided that processes large CSV files and uses this API. It uses custom CSV processing to avoid (de)serialization overhead.
> This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue, you should put it in contrib/hadoop/lib.
> Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License.
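
To make the quoted design concrete, here is a hedged illustration of the SolrDocumentConverter role - turning a Hadoop (key, value) pair into SolrInputDocuments for the EmbeddedSolrServer. The base-class package, generics and exact method signature in the attached patch may differ, and the field names are illustrative only.

    import java.util.Collection;
    import java.util.Collections;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.hadoop.SolrDocumentConverter; // package name assumed

    public class CsvDocumentConverter extends SolrDocumentConverter<LongWritable, Text> {
      @Override
      public Collection<SolrInputDocument> convert(LongWritable offset, Text line) {
        // One CSV line in, one SolrInputDocument out; the reducer's SolrRecordWriter
        // batches these and feeds them to its EmbeddedSolrServer as described above.
        String[] cols = line.toString().split(",");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", cols[0]);
        doc.addField("text_t", cols.length > 1 ? cols[1] : "");
        return Collections.singletonList(doc);
      }
    }

Setting the number of reduce tasks on the job then controls how many part-NNNNN shard directories end up in the output, as noted in the description above.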

