Posted to solr-user@lucene.apache.org by Darshan Pandya <da...@gmail.com> on 2016/09/08 15:09:37 UTC

MapReduceIndexerTool erroring with max_array_length

Hello,

While this may be a question for Cloudera, I wanted to tap the brains of
this very active community as well.

I am trying to use the MapReduceIndexerTool to index data from a Hive table
into SolrCloud / Cloudera Search.

The tool fails the job with the following error:



1799 [main] INFO  org.apache.solr.hadoop.MapReduceIndexerTool  - Indexing 1 files using 1 real mappers into 10 reducers

Error: MAX_ARRAY_LENGTH

Error: MAX_ARRAY_LENGTH

Error: MAX_ARRAY_LENGTH

36962 [main] ERROR org.apache.solr.hadoop.MapReduceIndexerTool  - Job failed! jobName: org.apache.solr.hadoop.MapReduceIndexerTool/MorphlineMapper, jobId: job_1473161870114_0339



The error stack trace is:

2016-09-08 10:39:20,128 ERROR [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.NoSuchFieldError: MAX_ARRAY_LENGTH
	at org.apache.lucene.codecs.memory.DirectDocValuesFormat.<clinit>(DirectDocValuesFormat.java:58)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at java.lang.Class.newInstance(Class.java:374)
	at org.apache.lucene.util.NamedSPILoader.reload(NamedSPILoader.java:67)
	at org.apache.lucene.util.NamedSPILoader.<init>(NamedSPILoader.java:47)
	at org.apache.lucene.util.NamedSPILoader.<init>(NamedSPILoader.java:37)
	at org.apache.lucene.codecs.DocValuesFormat.<clinit>(DocValuesFormat.java:43)
	at org.apache.solr.core.SolrResourceLoader.reloadLuceneSPI(SolrResourceLoader.java:205)
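
To me the NoSuchFieldError looks like a Lucene version mismatch on the
MapReduce task classpath: DirectDocValuesFormat references a constant that
the lucene-core jar actually loaded at runtime does not define. A quick
check I can run on the gateway host (just a sketch, assuming the standard
CDH parcel layout under /opt/cloudera/parcels and Hadoop 2.6+ for the
--glob option):

# List every Lucene jar shipped with the parcel; more than one
# lucene-core version here would explain the missing field.
find /opt/cloudera/parcels/CDH/lib -name 'lucene-*.jar' 2>/dev/null \
  | xargs -n1 basename | sort | uniq -c

# Show which Lucene jars the MapReduce client itself puts on the classpath.
hadoop classpath --glob | tr ':' '\n' | grep -i lucene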





My schema.xml looks like this:



<fields>

   <field name="dataset_id" type="string" indexed="true" stored="true" required="true" multiValued="false" docValues="true" />

   <field name="search_string" type="string" indexed="true" stored="true" docValues="true"/>

   <field name="_version_" type="long" indexed="true" stored="true"/>

</fields>





<!-- Field to use to determine and enforce document uniqueness.
      Unless this field is marked with required="false", it will be a required field.
   -->

<uniqueKey>dataset_id</uniqueKey>





I am otherwise able to post documents using the Solr APIs / upload methods;
only the MapReduceIndexerTool job is failing.
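
For example, a direct update along these lines works (SOLRHOST below is a
placeholder for one of my Solr nodes):

curl 'http://SOLRHOST:8983/solr/dataCatalog_search_index/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"dataset_id":"d1","search_string":"some text"}]'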



The command I am using is:

hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --log4j /opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/share/doc/search-1.0.0+cdh5.7.0+0/examples/solr-nrt/log4j.properties \
  --morphline-file /home/$USER/morphline2.conf \
  --output-dir hdfs://NNHOST:8020/user/$USER/outdir \
  --verbose \
  --zk-host ZKHOST:2181/solr1 \
  --collection dataCatalog_search_index \
  hdfs://NNHOST:8020/user/hive/warehouse/name.db/concatenated_index4/
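
(Side note: since this runs on MRv2, mapred.child.java.opts is deprecated;
if the 500m heap matters, the per-task equivalents would be something like:

  -D 'mapreduce.map.java.opts=-Xmx500m' \
  -D 'mapreduce.reduce.java.opts=-Xmx500m' \

in place of the single -D above.)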



My morphline config looks like this:



SOLR_LOCATOR : {

  # Name of solr collection

  collection : search_index



  # ZooKeeper ensemble

  zkHost : "ZKHOST:2181/solr1"

}



# Specify an array of one or more morphlines, each of which defines an ETL
# transformation chain. A morphline consists of one or more (potentially
# nested) commands. A morphline is a way to consume records (e.g. Flume events,
# HDFS files or blocks), turn them into a stream of records, and pipe the stream
# of records through a set of easily configurable transformations on the way to
# a target application such as Solr.

morphlines : [

  {

    id : search_index

    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]

    commands : [

      {

        readCSV {

          separator : ","

          columns : [dataset_id,search_string]

          ignoreFirstLine : true

          charset : UTF-8

        }

      }





      # Consume the output record of the previous command and pipe another
      # record downstream.
      #
      # Command that deletes record fields that are unknown to Solr
      # schema.xml.
      #
      # Recall that Solr throws an exception on any attempt to load a document
      # that contains a field that isn't specified in schema.xml.

      {

        sanitizeUnknownSolrFields {

          # Location from which to fetch Solr schema

          solrLocator : ${SOLR_LOCATOR}

        }

      }



      # log the record at DEBUG level to SLF4J

      { logDebug { format : "output record: {}", args : ["@{}"] } }



      # load the record into a Solr server or MapReduce Reducer

      {

        loadSolr {

          solrLocator : ${SOLR_LOCATOR}

        }

      }

    ]

  }

]
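
To rule out the morphline itself, I can also run the tool with --dry-run,
which runs the morphline locally and prints documents to stdout instead of
loading them into Solr. A sketch reusing the same placeholders as the
command above:

hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  --morphline-file /home/$USER/morphline2.conf \
  --output-dir hdfs://NNHOST:8020/user/$USER/outdir \
  --zk-host ZKHOST:2181/solr1 \
  --collection dataCatalog_search_index \
  --dry-run \
  hdfs://NNHOST:8020/user/hive/warehouse/name.db/concatenated_index4/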





Please let me know if I am doing anything wrong.

-- 
Sincerely,
Darshan