Posted to solr-user@lucene.apache.org by Darshan Pandya <da...@gmail.com> on 2016/09/08 15:09:37 UTC
MapReduceIndexerTool erroring with max_array_length
Hello,
While this may be a question for cloudera, I wanted to tap the brains of
this very active community as well.
I am trying to use the MapReduceIndexerTool to index data in a hive table
to Solr Cloud / Cloudera Search.
The tool is failing the job with the following error
1799 [main] INFO org.apache.solr.hadoop.MapReduceIndexerTool - Indexing 1 files using 1 real mappers into 10 reducers
Error: MAX_ARRAY_LENGTH
Error: MAX_ARRAY_LENGTH
Error: MAX_ARRAY_LENGTH
36962 [main] ERROR org.apache.solr.hadoop.MapReduceIndexerTool - Job failed! jobName: org.apache.solr.hadoop.MapReduceIndexerTool/MorphlineMapper, jobId: job_1473161870114_0339
The error stack trace is
2016-09-08 10:39:20,128 ERROR [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.NoSuchFieldError: MAX_ARRAY_LENGTH
at org.apache.lucene.codecs.memory.DirectDocValuesFormat.<clinit>(DirectDocValuesFormat.java:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at java.lang.Class.newInstance(Class.java:374)
at org.apache.lucene.util.NamedSPILoader.reload(NamedSPILoader.java:67)
at org.apache.lucene.util.NamedSPILoader.<init>(NamedSPILoader.java:47)
at org.apache.lucene.util.NamedSPILoader.<init>(NamedSPILoader.java:37)
at org.apache.lucene.codecs.DocValuesFormat.<clinit>(DocValuesFormat.java:43)
at org.apache.solr.core.SolrResourceLoader.reloadLuceneSPI(SolrResourceLoader.java:205)
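From what I can tell, a NoSuchFieldError like this usually points at two different Lucene versions colliding on the task classpath (I believe DirectDocValuesFormat reads ArrayUtil.MAX_ARRAY_LENGTH, which older lucene-core jars do not have). As a sanity check, assuming the CDH parcel layout from my command below, I can look for conflicting Lucene jars like this:

# List every lucene-core jar a task could pick up (paths are my assumption):
find /opt/cloudera/parcels -name 'lucene-core-*.jar' 2>/dev/null | sort
# The search-mr job jar ships its own Lucene artifacts; check what it carries:
unzip -l /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar | grep -i lucene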
My schema.xml looks like:
<fields>
  <field name="dataset_id" type="string" indexed="true" stored="true"
         required="true" multiValued="false" docValues="true" />
  <field name="search_string" type="string" indexed="true" stored="true"
         docValues="true" />
  <field name="_version_" type="long" indexed="true" stored="true" />
</fields>
<!-- Field to use to determine and enforce document uniqueness.
     Unless this field is marked with required="false", it will be a required field.
-->
<uniqueKey>dataset_id</uniqueKey>
I am otherwise able to post documents using the Solr APIs / upload methods; only the MapReduceIndexerTool is failing.
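For example, a plain update along these lines works (SOLRHOST is a placeholder for one of my Solr nodes):

curl 'http://SOLRHOST:8983/solr/dataCatalog_search_index/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"dataset_id":"d1","search_string":"test row"}]'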
The command I am using is:
hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --log4j /opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/share/doc/search-1.0.0+cdh5.7.0+0/examples/solr-nrt/log4j.properties \
  --morphline-file /home/$USER/morphline2.conf \
  --output-dir hdfs://NNHOST:8020/user/$USER/outdir \
  --verbose \
  --zk-host ZKHOST:2181/solr1 \
  --collection dataCatalog_search_index \
  hdfs://NNHOST:8020/user/hive/warehouse/name.db/concatenated_index4/
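In case this is a jar conflict, one workaround I am considering (my own assumption, not something from the Cloudera docs) is asking MapReduce to prefer the classes bundled in the job jar over the ones on the cluster classpath:

hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  -D 'mapreduce.job.user.classpath.first=true' \
  -D 'mapred.child.java.opts=-Xmx500m' \
  ... (remaining options unchanged)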
My morphline config looks like
SOLR_LOCATOR : {
  # Name of solr collection
  collection : search_index

  # ZooKeeper ensemble
  zkHost : "ZKHOST:2181/solr1"
}
# Specify an array of one or more morphlines, each of which defines an ETL
# transformation chain. A morphline consists of one or more (potentially
# nested) commands. A morphline is a way to consume records (e.g. Flume events,
# HDFS files or blocks), turn them into a stream of records, and pipe the stream
# of records through a set of easily configurable transformations on the way to
# a target application such as Solr.
morphlines : [
  {
    id : search_index
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]

    commands : [
      {
        readCSV {
          separator : ","
          columns : [dataset_id, search_string]
          ignoreFirstLine : true
          charset : UTF-8
        }
      }

      # Consume the output record of the previous command and pipe another
      # record downstream.
      #
      # Command that deletes record fields that are unknown to the Solr
      # schema.xml.
      #
      # Recall that Solr throws an exception on any attempt to load a document
      # that contains a field that isn't specified in schema.xml.
      {
        sanitizeUnknownSolrFields {
          # Location from which to fetch Solr schema
          solrLocator : ${SOLR_LOCATOR}
        }
      }

      # log the record at DEBUG level to SLF4J
      { logDebug { format : "output record: {}", args : ["@{}"] } }

      # load the record into a Solr server or MapReduce Reducer
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]
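If I read my own readCSV settings correctly, an input file like this made-up sample (header row skipped by ignoreFirstLine) should map each line onto the two schema fields above:

dataset_id,search_string
101,customer churn model
102,daily sales aggregate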
Please let me know if I am doing anything wrong.
--
Sincerely,
Darshan