Posted to solr-user@lucene.apache.org by KNitin <ni...@gmail.com> on 2015/11/19 20:17:38 UTC

Generating Index offline and loading into solrcloud

Hi,

I was wondering whether there are existing tools that can generate a Solr
index offline (in SolrCloud mode) that can later be loaded into SolrCloud,
before I decide to implement my own. I found some tools that only do
Solr-based index loading (non-ZK mode). Is there one with ZK mode enabled?


Thanks in advance!
Nitin

Re: Generating Index offline and loading into solrcloud

Posted by Erick Erickson <er...@gmail.com>.
Apples/Oranges question:

They're different beasts. The NRT stuff (spark-solr, for example,
Cloudera's Flume sink, custom SolrJ clients, whatever) is
constrained by the number of Solr servers you have running, more
specifically the number of shards. When you're feeding docs fast
enough to max out those CPUs, that's it; you're going flat out
and nothing you do can drive indexing any faster.

With MRIT, you have the entire capacity of your Hadoop cluster at your
disposal. If you have 1,000 nodes you can be driving all of them as
fast as you can make them go, even if you only have 10 shards. Of
course in this case there'll be some copying time to deal with, but
you get the idea.

In terms of the end result, it's just a Lucene index; it doesn't
matter what process generates it.

Best,
Erick
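
Erick's capacity argument can be sketched with a quick back-of-the-envelope model (all numbers below are made up for illustration; per-document costs vary wildly by schema and hardware):

```python
# Back-of-the-envelope throughput comparison, illustrative numbers only.
# NRT indexing is capped by the shard CPUs; offline MapReduce indexing
# is capped by the Hadoop cluster, plus a copy/merge step at the end.

def nrt_hours(total_docs, shards, docs_per_sec_per_shard):
    """Hours to index when throughput is bounded by the shard count."""
    return total_docs / (shards * docs_per_sec_per_shard) / 3600

def offline_hours(total_docs, hadoop_nodes, docs_per_sec_per_node,
                  copy_merge_hours):
    """Hours to build the index across a Hadoop cluster, plus copy time."""
    return (total_docs / (hadoop_nodes * docs_per_sec_per_node) / 3600
            + copy_merge_hours)

docs = 1_000_000_000
print(f"NRT, 10 shards:   {nrt_hours(docs, 10, 5000):.1f} h")
print(f"MRIT, 1000 nodes: {offline_hours(docs, 1000, 5000, 2.0):.1f} h")
```

With these hypothetical rates the Hadoop route wins even after paying the copy cost, which is the point of the comparison.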



Re: Generating Index offline and loading into solrcloud

Posted by KNitin <ni...@gmail.com>.
Ah, got it. Another generic question: is there much of a difference
between generating index files in MapReduce and loading them into SolrCloud
versus using the Solr NRT API? Has anyone run any tests of that sort?

Thanks a ton,
Nitin


Re: Generating Index offline and loading into solrcloud

Posted by Erick Erickson <er...@gmail.com>.
Sure, you can use Lucene to create indexes for shards
if (and only if) you deal with the routing issues....

About updates: I'm not talking about atomic updates at all.
The usual model for Solr is that if you have a uniqueKey
defined, new versions of documents replace old versions
of documents based on that key. MRIT just doesn't
guarantee that process.

Best,
Erick
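
The uniqueKey point can be shown with a toy Python model (this is not Solr code, just a sketch of the two behaviors: normal Solr updates replace documents that share a uniqueKey, while a bulk tool that only appends new segments keeps both copies):

```python
# Toy model of the two indexing behaviors Erick describes.

def solr_style_add(index, doc):
    """Normal Solr semantics: last write wins per uniqueKey ('id')."""
    index[doc["id"]] = doc

def append_only_add(index, doc):
    """Append-only bulk load: duplicate ids simply pile up."""
    index.append(doc)

docs = [{"id": "a", "v": 1}, {"id": "b", "v": 1}, {"id": "a", "v": 2}]

solr_index = {}
bulk_index = []
for d in docs:
    solr_style_add(solr_index, d)
    append_only_add(bulk_index, d)

print(len(solr_index))  # 2: doc 'a' was replaced by its newer version
print(len(bulk_index))  # 3: two copies of doc 'a' remain
```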


Re: Generating Index offline and loading into solrcloud

Posted by KNitin <ni...@gmail.com>.
Thanks, Erick. Looks like MRIT uses embedded Solr running per
mapper/reducer and uses that to index documents. Is that the recommended
model? Can we use raw Lucene libraries to generate the index and then load
it into SolrCloud (barring the complexities of indexing into the right
shard and merging)?

I am thinking of using this for regular offline indexing, which needs to be
idempotent. When you say update, do you mean partial updates using _set?
If we add and delete every time for a document, that should work, right
(since all docs are indexed by a doc id which contains all operational
history)? Let me know if I am missing something.


Re: Generating Index offline and loading into solrcloud

Posted by Erick Erickson <er...@gmail.com>.
Note two things:

1> this runs on Hadoop
2> it is part of the standard Solr release as MapReduceIndexerTool;
look in the contribs...

If you're trying to do this yourself, you must be very careful to index
docs to the correct shard and then merge the correct shards. MRIT does
all this automatically.

Additionally, it has the cool feature that if (and only if) your Solr
index is running over HDFS, the --go-live option will automatically
merge the indexes into the appropriate running Solr instances.

One caveat: this tool doesn't handle _updating_ documents. So if you
run it twice on the same data set, you'll have two copies of every doc.
It's designed as a bulk initial-load tool.

Best,
Erick
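
The "index to the correct shard" requirement can be sketched as below. This is a deliberate simplification: SolrCloud's compositeId router actually maps a MurmurHash3 of the uniqueKey into per-shard hash ranges, whereas this sketch uses CRC32 modulo the shard count. The invariant it illustrates is the real one, though: a do-it-yourself offline indexer must route every document with a given id to the same shard, every time.

```python
import zlib

NUM_SHARDS = 4

def shard_for(doc_id: str) -> int:
    """Deterministically pick a shard for a document id (simplified;
    real SolrCloud uses MurmurHash3 over hash ranges, not CRC32)."""
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS

# Partition docs into per-shard buckets before building per-shard indexes.
shard_buckets = {s: [] for s in range(NUM_SHARDS)}
for doc_id in ["doc1", "doc2", "doc3", "doc4", "doc5"]:
    shard_buckets[shard_for(doc_id)].append(doc_id)

# Re-indexing the same id later must hit the same shard, or uniqueKey
# replacement can never happen and merges will mix duplicates.
assert shard_for("doc1") == shard_for("doc1")
```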




Re: Generating Index offline and loading into solrcloud

Posted by KNitin <ni...@gmail.com>.
Great. Thanks!


Re: Generating Index offline and loading into solrcloud

Posted by vchauras <vi...@gmail.com>.
Hey Sameer,

I tried running the tool on the Hadoop master node (AWS EMR) like this:

hadoop jar cloudera-search-1.0.0-cdh5.2.0-jar-with-dependencies.jar \
org.apache.solr.hadoop.MapReduceIndexerTool \
-D 'mapred.child.java.opts=-Xmx500m' \
--log4j ~/log4j.properties \
--morphline-file /home/hadoop/morphlines1.conf \
--output-dir hdfs://172.31.77.5:8020/tmp/outdir \
--input-list hdfs://172.31.77.5:8020/index_files/index_1.txt \
--verbose \
--solr-home-dir /home/hadoop/friending \
--shards 1 \
--dry-run

I have a few doubts:
1) Does the tool require Solr to be installed on the machine? I see this
error:

Exception in thread "main" org.apache.solr.common.SolrException: Error loading class 'solr.NoOpRegenerator'
	at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:449)
	at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:471)
	at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:467)
	at org.apache.solr.search.CacheConfig.getConfig(CacheConfig.java:100)
	at org.apache.solr.search.CacheConfig.getMultipleConfigs(CacheConfig.java:73)
	at org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:198)
	at org.kitesdk.morphline.solr.SolrLocator.getIndexSchema(SolrLocator.java:168)
	at org.apache.solr.hadoop.morphline.MorphlineMapRunner.<init>(MorphlineMapRunner.java:134)
	at org.apache.solr.hadoop.MapReduceIndexerTool.setupMorphline(MapReduceIndexerTool.java:1154)
	at org.apache.solr.hadoop.MapReduceIndexerTool.run(MapReduceIndexerTool.java:794)

The contents of my friending/conf look like:

hadoop@ip-172-31-77-5 22:42:18 ~ ls -ltr /home/hadoop/friending/conf/
total 80
-rw-r--r-- 1 hadoop hadoop  1119 Jan 29 20:34 synonyms.txt
-rw-r--r-- 1 hadoop hadoop   781 Jan 29 20:34 stopwords.txt
-rw-r--r-- 1 hadoop hadoop   873 Jan 29 20:34 protwords.txt
-rw-r--r-- 1 hadoop hadoop  1348 Jan 29 20:34 elevate.xml
-rw-r--r-- 1 hadoop hadoop  3974 Jan 29 20:34 currency.xml
-rw-r--r-- 1 hadoop hadoop  2572 Jan 29 20:39 schema.xml
drwxr-xr-x 2 hadoop hadoop  4096 Jan 29 20:42 lang
-rw-r--r-- 1 hadoop hadoop 49806 Jan 29 22:22 solrconfig.xml

I have been struggling to get this tool working. Any help would be
appreciated. 
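
A note on the error above: one possible cause (an assumption, not verified against this setup) is that solr.NoOpRegenerator only ships with newer Solr releases, while the embedded Solr inside the cdh5.2.0 jar predates it. The stack trace fails inside CacheConfig.getConfig, i.e. while parsing cache definitions in solrconfig.xml, so one workaround would be to drop the regenerator attribute from whichever cache entry references the class. The entry typically looks something like this (names and sizes here are illustrative, taken from a stock solrconfig.xml, not from this user's config):

```xml
<!-- Hypothetical solrconfig.xml fragment. If the embedded Solr cannot
     load solr.NoOpRegenerator, removing the regenerator attribute from
     this entry avoids the class lookup entirely. -->
<cache name="perSegFilter"
       class="solr.search.LRUCache"
       size="10"
       initialSize="0"
       autowarmCount="10"
       regenerator="solr.NoOpRegenerator" />
```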




Re: Generating Index offline and loading into solrcloud

Posted by Sameer Maggon <sa...@measuredsearch.com>.
If you are trying to create a large index and want speedups there, you
could use the MapReduceIndexerTool -
https://github.com/cloudera/search/tree/cdh5-1.0.0_5.2.1/search-mr. At a
high level, it takes your files (CSV, JSON, etc.) as input and can create
either a single or a sharded index that you can then copy to your Solr
servers. I've used this to create indexes of hundreds of millions of
documents in a fairly decent amount of time.

Thanks,
-- 
*Sameer Maggon*
Measured Search
www.measuredsearch.com <http://measuredsearch.com/>
