Posted to solr-user@lucene.apache.org by KNitin <ni...@gmail.com> on 2014/09/17 22:41:53 UTC

Loading an index (generated by map reduce) in SolrCloud

Hello

 I have generated a Lucene index (with 6 shards) using MapReduce. I want
to load this into a SolrCloud cluster inside a collection.

Is there any out-of-the-box way of doing this?  Any ideas are much
appreciated.

Thanks
Nitin

Re: Loading an index (generated by map reduce) in SolrCloud

Posted by rulinma <ru...@gmail.com>.
Copying is not a good choice; transfer to HDFS and merge instead.
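One way to do the merge step is the CoreAdmin MERGEINDEXES action. The sketch below only builds the request URL; the host, target core name, and index directory are placeholders to adapt to your own installation:

```shell
# Placeholders throughout -- substitute your own host, core, and index path.
SOLR_HOST=localhost
PORT=8983
TARGET_CORE=targetcore
INDEX_DIR=/path/to/shard0/index

# MERGEINDEXES folds an on-disk index directory into an existing core.
MERGE_URL="http://${SOLR_HOST}:${PORT}/solr/admin/cores?action=mergeindexes&core=${TARGET_CORE}&indexDir=${INDEX_DIR}"
echo "$MERGE_URL"
# run with: curl "$MERGE_URL"
```

Note that the target core must not receive updates while the merge runs, and the source index must be compatible with the core's schema.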



--
View this message in context: http://lucene.472066.n3.nabble.com/Loading-an-index-generated-by-map-reduce-in-SolrCloud-tp4159530p4160855.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Loading an index (generated by map reduce) in SolrCloud

Posted by KNitin <ni...@gmail.com>.
Thanks for all the responses. I will try copying each set of segments
to its corresponding shard.

On Wed, Sep 17, 2014 at 8:26 PM, ralph tice <ra...@gmail.com> wrote:

> If you are updating or deleting from your indexes I don't believe it is
> possible to get a consistent copy of the index from the file system
> directly without monkeying with hard links.  The safest thing is to use the
> ADDREPLICA command in the Collections API and then an UNLOAD from the CORE
> API if you want to take the data offline.  If you don't care to use
> additional servers/JVMs, you can use the replication handler to make a
> backup instead.
>
> This older discussion covers most any backup strategy I can think of:
> http://grokbase.com/t/lucene/solr-user/12c37h0g18/backing-up-solr-4-0
>
> On Wed, Sep 17, 2014 at 9:01 PM, shushuai zhu <ss...@yahoo.com.invalid>
> wrote:
>
> > Hi, my case is a little simpler. For example, I have 100 collections now
> > in my solr cloud, and I want to backup 20 of them so I can restore them
> > later. I think I can just copy the index and log for each shard/core to
> > another location, then delete the collections. Later, I can create new
> > collections (likely with different names), then copy the index and log
> back
> > to the right directory structure on the node. After that, I can either
> > reload the collection or core.
> >
> > However, some testing shows these do not work. I could not reload the
> > collection or core. Have not tried re-starting the solr cloud. Can
> someone
> > point out the best way to achieve the goal? I prefer not to re-start solr
> > cloud.
> >
> > Shushuai
> >
> >
> > ________________________________
> >  From: ralph tice <ra...@gmail.com>
> > To: solr-user@lucene.apache.org
> > Sent: Wednesday, September 17, 2014 6:53 PM
> > Subject: Re: Loading an index (generated by map reduce) in SolrCloud
> >
> >
> > FWIW, I do a lot of moving Lucene indexes around and as long as the core
> is
> > unloaded it's never been an issue for Solr to be running at the same
> time.
> >
> > If you move a core into the correct hierarchy for a replica, you can call
> > the Collections API's CREATESHARD action with the appropriate params
> (make
> > sure you use createNodeSet to point to the right server) and Solr will
> load
> > the index appropriately.  It's easier to create a dummy shard and see
> > where data lands on your installation than to try to guess.
> >
> > Ex:
> > PORT=8983
> > SHARD=myshard
> > COLLECTION=mycollection
> > SOLR_HOST=box1.mysolr.corp
> > curl "http://${SOLR_HOST}:${PORT}/solr/admin/collections?action=CREATESHARD&shard=${SHARD}&collection=${COLLECTION}&createNodeSet=${SOLR_HOST}:${PORT}_solr"
> >
> > One file to watch out for if you are moving cores across machines/JVMs is
> > the core.properties file, which you don't want to duplicate to another
> > server/location when moving a data directory.  I don't recommend trying
> to
> > move transaction logs around either.
> >
> >
> >
> >
> >
> > On Wed, Sep 17, 2014 at 5:22 PM, Erick Erickson <erickerickson@gmail.com
> >
> > wrote:
> >
> > > Details please. You say MapReduce. Is this the
> > > MapReduceIndexerTool? If so, you can use
> > > the --go-live option to auto-merge them. Your
> > > Solr instances need to be running over HDFS
> > > though.
> > >
> > > If you don't have Solr running over HDFS, you can
> > > just copy the results for each shard "to the right place".
> > > What that means is that you must ensure that the
> > > shards produced via MRIT get copied to the corresponding
> > > Solr local directory for each shard. If you put the wrong
> > > one in the wrong place you'll have trouble with multiple
> > > copies of documents showing up when you re-add any
> > > doc that already exists in your Solr installation.
> > >
> > > BTW, I'd surely stop all my Solr instances while copying
> > > all this around.
> > >
> > > Best,
> > > Erick
> > >
> > > On Wed, Sep 17, 2014 at 1:41 PM, KNitin <ni...@gmail.com> wrote:
> > > > Hello
> > > >
> > > >  I have generated a lucene index (with 6 shards) using Map Reduce. I
> > want
> > > > to load this into a SolrCloud Cluster inside a collection.
> > > >
> > > > Is there any out of the box way of doing this?  Any ideas are much
> > > > appreciated
> > > >
> > > > Thanks
> > > > Nitin
> > >
> >
>

Re: Loading an index (generated by map reduce) in SolrCloud

Posted by ralph tice <ra...@gmail.com>.
If you are updating or deleting from your indexes I don't believe it is
possible to get a consistent copy of the index from the file system
directly without monkeying with hard links.  The safest thing is to use the
ADDREPLICA command in the Collections API and then an UNLOAD from the CORE
API if you want to take the data offline.  If you don't care to use
additional servers/JVMs, you can use the replication handler to make a
backup instead.

This older discussion covers most any backup strategy I can think of:
http://grokbase.com/t/lucene/solr-user/12c37h0g18/backing-up-solr-4-0
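The ADDREPLICA-then-UNLOAD approach above can be sketched in shell. This only builds the two request URLs; the host, collection, and shard names are placeholders, and the replica core name in particular is hypothetical (check the Core Admin UI for the real one on your cluster):

```shell
# Placeholder names -- substitute your own host, collection, and shard.
SOLR_HOST=box1.mysolr.corp
PORT=8983
COLLECTION=mycollection
SHARD=shard1

# 1) Let SolrCloud build a consistent copy via normal replication.
ADDREPLICA_URL="http://${SOLR_HOST}:${PORT}/solr/admin/collections?action=ADDREPLICA&collection=${COLLECTION}&shard=${SHARD}"
echo "$ADDREPLICA_URL"    # run with: curl "$ADDREPLICA_URL"

# 2) Once the new replica is active, take its data offline with UNLOAD.
#    The core name below is an assumption -- verify it first.
UNLOAD_URL="http://${SOLR_HOST}:${PORT}/solr/admin/cores?action=UNLOAD&core=${COLLECTION}_${SHARD}_replica2"
echo "$UNLOAD_URL"        # run with: curl "$UNLOAD_URL"
```

After the UNLOAD, the replica's data directory can be archived or moved without racing against live writers.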

On Wed, Sep 17, 2014 at 9:01 PM, shushuai zhu <ss...@yahoo.com.invalid>
wrote:

> Hi, my case is a little simpler. For example, I have 100 collections now
> in my solr cloud, and I want to backup 20 of them so I can restore them
> later. I think I can just copy the index and log for each shard/core to
> another location, then delete the collections. Later, I can create new
> collections (likely with different names), then copy the index and log back
> to the right directory structure on the node. After that, I can either
> reload the collection or core.
>
> However, some testing shows these do not work. I could not reload the
> collection or core. Have not tried re-starting the solr cloud. Can someone
> point out the best way to achieve the goal? I prefer not to re-start solr
> cloud.
>
> Shushuai
>
>
> ________________________________
>  From: ralph tice <ra...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Wednesday, September 17, 2014 6:53 PM
> Subject: Re: Loading an index (generated by map reduce) in SolrCloud
>
>
> FWIW, I do a lot of moving Lucene indexes around and as long as the core is
> unloaded it's never been an issue for Solr to be running at the same time.
>
> If you move a core into the correct hierarchy for a replica, you can call
> the Collections API's CREATESHARD action with the appropriate params (make
> sure you use createNodeSet to point to the right server) and Solr will load
> the index appropriately.  It's easier to create a dummy shard and see
> where data lands on your installation than to try to guess.
>
> Ex:
> PORT=8983
> SHARD=myshard
> COLLECTION=mycollection
> SOLR_HOST=box1.mysolr.corp
> curl "http://${SOLR_HOST}:${PORT}/solr/admin/collections?action=CREATESHARD&shard=${SHARD}&collection=${COLLECTION}&createNodeSet=${SOLR_HOST}:${PORT}_solr"
>
> One file to watch out for if you are moving cores across machines/JVMs is
> the core.properties file, which you don't want to duplicate to another
> server/location when moving a data directory.  I don't recommend trying to
> move transaction logs around either.
>
>
>
>
>
> On Wed, Sep 17, 2014 at 5:22 PM, Erick Erickson <er...@gmail.com>
> wrote:
>
> > Details please. You say MapReduce. Is this the
> > MapReduceIndexerTool? If so, you can use
> > the --go-live option to auto-merge them. Your
> > Solr instances need to be running over HDFS
> > though.
> >
> > If you don't have Solr running over HDFS, you can
> > just copy the results for each shard "to the right place".
> > What that means is that you must ensure that the
> > shards produced via MRIT get copied to the corresponding
> > Solr local directory for each shard. If you put the wrong
> > one in the wrong place you'll have trouble with multiple
> > copies of documents showing up when you re-add any
> > doc that already exists in your Solr installation.
> >
> > BTW, I'd surely stop all my Solr instances while copying
> > all this around.
> >
> > Best,
> > Erick
> >
> > On Wed, Sep 17, 2014 at 1:41 PM, KNitin <ni...@gmail.com> wrote:
> > > Hello
> > >
> > >  I have generated a lucene index (with 6 shards) using Map Reduce. I
> want
> > > to load this into a SolrCloud Cluster inside a collection.
> > >
> > > Is there any out of the box way of doing this?  Any ideas are much
> > > appreciated
> > >
> > > Thanks
> > > Nitin
> >
>

Re: Loading an index (generated by map reduce) in SolrCloud

Posted by shushuai zhu <ss...@yahoo.com.INVALID>.
Hi, my case is a little simpler. For example, I have 100 collections in my SolrCloud cluster, and I want to back up 20 of them so I can restore them later. I think I can just copy the index and log for each shard/core to another location, then delete the collections. Later, I can create new collections (likely with different names), then copy the index and log back to the right directory structure on the node. After that, I can either reload the collection or core.

However, some testing shows this does not work: I could not reload the collection or core. I have not tried restarting SolrCloud. Can someone point out the best way to achieve this goal? I prefer not to restart SolrCloud.

Shushuai
 

________________________________
 From: ralph tice <ra...@gmail.com>
To: solr-user@lucene.apache.org 
Sent: Wednesday, September 17, 2014 6:53 PM
Subject: Re: Loading an index (generated by map reduce) in SolrCloud
  

FWIW, I do a lot of moving Lucene indexes around and as long as the core is
unloaded it's never been an issue for Solr to be running at the same time.

If you move a core into the correct hierarchy for a replica, you can call
the Collections API's CREATESHARD action with the appropriate params (make
sure you use createNodeSet to point to the right server) and Solr will load
the index appropriately.  It's easier to create a dummy shard and see
where data lands on your installation than to try to guess.

Ex:
PORT=8983
SHARD=myshard
COLLECTION=mycollection
SOLR_HOST=box1.mysolr.corp
curl "http://${SOLR_HOST}:${PORT}/solr/admin/collections?action=CREATESHARD&shard=${SHARD}&collection=${COLLECTION}&createNodeSet=${SOLR_HOST}:${PORT}_solr"

One file to watch out for if you are moving cores across machines/JVMs is
the core.properties file, which you don't want to duplicate to another
server/location when moving a data directory.  I don't recommend trying to
move transaction logs around either.





On Wed, Sep 17, 2014 at 5:22 PM, Erick Erickson <er...@gmail.com>
wrote:

> Details please. You say MapReduce. Is this the
> MapReduceIndexerTool? If so, you can use
> the --go-live option to auto-merge them. Your
> Solr instances need to be running over HDFS
> though.
>
> If you don't have Solr running over HDFS, you can
> just copy the results for each shard "to the right place".
> What that means is that you must ensure that the
> shards produced via MRIT get copied to the corresponding
> Solr local directory for each shard. If you put the wrong
> one in the wrong place you'll have trouble with multiple
> copies of documents showing up when you re-add any
> doc that already exists in your Solr installation.
>
> BTW, I'd surely stop all my Solr instances while copying
> all this around.
>
> Best,
> Erick
>
> On Wed, Sep 17, 2014 at 1:41 PM, KNitin <ni...@gmail.com> wrote:
> > Hello
> >
> >  I have generated a lucene index (with 6 shards) using Map Reduce. I want
> > to load this into a SolrCloud Cluster inside a collection.
> >
> > Is there any out of the box way of doing this?  Any ideas are much
> > appreciated
> >
> > Thanks
> > Nitin
>

Re: Loading an index (generated by map reduce) in SolrCloud

Posted by ralph tice <ra...@gmail.com>.
FWIW, I do a lot of moving Lucene indexes around and as long as the core is
unloaded it's never been an issue for Solr to be running at the same time.

If you move a core into the correct hierarchy for a replica, you can call
the Collections API's CREATESHARD action with the appropriate params (make
sure you use createNodeSet to point to the right server) and Solr will load
the index appropriately.  It's easier to create a dummy shard and see
where data lands on your installation than to try to guess.

Ex:
PORT=8983
SHARD=myshard
COLLECTION=mycollection
SOLR_HOST=box1.mysolr.corp
curl "http://${SOLR_HOST}:${PORT}/solr/admin/collections?action=CREATESHARD&shard=${SHARD}&collection=${COLLECTION}&createNodeSet=${SOLR_HOST}:${PORT}_solr"

One file to watch out for if you are moving cores across machines/JVMs is
the core.properties file, which you don't want to duplicate to another
server/location when moving a data directory.  I don't recommend trying to
move transaction logs around either.


On Wed, Sep 17, 2014 at 5:22 PM, Erick Erickson <er...@gmail.com>
wrote:

> Details please. You say MapReduce. Is this the
> MapReduceIndexerTool? If so, you can use
> the --go-live option to auto-merge them. Your
> Solr instances need to be running over HDFS
> though.
>
> If you don't have Solr running over HDFS, you can
> just copy the results for each shard "to the right place".
> What that means is that you must ensure that the
> shards produced via MRIT get copied to the corresponding
> Solr local directory for each shard. If you put the wrong
> one in the wrong place you'll have trouble with multiple
> copies of documents showing up when you re-add any
> doc that already exists in your Solr installation.
>
> BTW, I'd surely stop all my Solr instances while copying
> all this around.
>
> Best,
> Erick
>
> On Wed, Sep 17, 2014 at 1:41 PM, KNitin <ni...@gmail.com> wrote:
> > Hello
> >
> >  I have generated a lucene index (with 6 shards) using Map Reduce. I want
> > to load this into a SolrCloud Cluster inside a collection.
> >
> > Is there any out of the box way of doing this?  Any ideas are much
> > appreciated
> >
> > Thanks
> > Nitin
>

Re: Loading an index (generated by map reduce) in SolrCloud

Posted by Erick Erickson <er...@gmail.com>.
Details please. You say MapReduce. Is this the
MapReduceIndexerTool? If so, you can use
the --go-live option to auto-merge them. Your
Solr instances need to be running over HDFS
though.
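For reference, a --go-live invocation looks roughly like the sketch below. It only assembles the command line; the jar path, ZooKeeper address, HDFS paths, and morphline file are all placeholders, and the exact jar name and options vary by Solr version and distribution:

```shell
# Sketch only: every path and address here is a placeholder.
ZK_HOST=zk1:2181,zk2:2181/solr
MRIT_CMD="hadoop jar /opt/solr/contrib/map-reduce/solr-map-reduce-4.10.0.jar \
  --zk-host $ZK_HOST \
  --collection mycollection \
  --output-dir hdfs://namenode:8020/user/solr/outdir \
  --morphline-file morphline.conf \
  --go-live \
  hdfs://namenode:8020/user/solr/input"
echo "$MRIT_CMD"    # inspect, then run without the echo
```

With --go-live, the tool merges the freshly built shard indexes into the live SolrCloud collection itself, which is why the Solr instances must be running over HDFS.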

If you don't have Solr running over HDFS, you can
just copy the results for each shard "to the right place".
What that means is that you must ensure that the
shards produced via MRIT get copied to the corresponding
Solr local directory for each shard. If you put the wrong
one in the wrong place you'll have trouble with multiple
copies of documents showing up when you re-add any
doc that already exists in your Solr installation.
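Copying "to the right place" by hand might look like the dry run below. The MRIT output layout (one results/part-NNNNN directory per shard) and the Solr core directory names are assumptions; verify both on your own installation before copying anything:

```shell
# All paths below are hypothetical; check them before copying for real.
MRIT_OUT=/tmp/mrit-output     # MRIT writes results/part-00000, part-00001, ...
SOLR_HOME=/var/solr
COLLECTION=mycollection

for i in 0 1 2 3 4 5; do      # six shards, as in the original question
  src="$MRIT_OUT/results/part-0000$i/data/index"
  dst="$SOLR_HOME/${COLLECTION}_shard$((i + 1))_replica1/data/index"
  echo "cp -r $src $dst"      # dry run: inspect the mapping, then drop echo
done
```

The critical point is the shard-to-directory mapping: part-00000 must land in shard1's index directory, and so on, or documents will be routed to the wrong shards and reappear as duplicates on re-add.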

BTW, I'd surely stop all my Solr instances while copying
all this around.

Best,
Erick

On Wed, Sep 17, 2014 at 1:41 PM, KNitin <ni...@gmail.com> wrote:
> Hello
>
>  I have generated a lucene index (with 6 shards) using Map Reduce. I want
> to load this into a SolrCloud Cluster inside a collection.
>
> Is there any out of the box way of doing this?  Any ideas are much
> appreciated
>
> Thanks
> Nitin