Posted to solr-user@lucene.apache.org by David Parks <da...@yahoo.com> on 2013/05/06 10:38:57 UTC

Indexing off of the production servers

I've had trouble figuring out what options exist if I want to perform all
indexing off of the production servers (I'd like to keep them only for user
queries).

 

We index data in batches roughly daily. Ideally I'd index all SolrCloud
shards offline, move the final index files to the SolrCloud instance
that needs them, then flip a switch so it uses the new index.

 

Is this possible via either:

1. Doing the indexing in Hadoop? (This would be ideal, as we already have a
significant investment in a Hadoop cluster.) Or

2. Maintaining a separate "master" server that handles indexing, with the
nodes that receive user queries updating their index from it? (I seem to
recall reading about this configuration in 3.x, but now we're using
SolrCloud.)

 

Is there some ideal solution I can use to "protect" the production solr
instances from degraded performance during large index processing periods?

 

Thanks!

David


RE: Indexing off of the production servers

Posted by David Parks <da...@yahoo.com>.
So, am I following this correctly: the proposed solution would give us a
way to index a collection on an offline/dev SolrCloud instance and *move*
that pre-built index to the production servers using an alias/rename trick?

That seems like a reasonably doable solution. I also wonder how much work it
would be to build the shards programmatically (e.g. directly in a Hadoop/Java
environment), cutting out the extra step of running another Solr instance
in a staging environment somewhere, and then using this technique to
swap in the shards.

I might do something like this first and then look into simplifying, and
further automating, later on. And if it is indeed possible to build a hadoop
driver for indexing, I think that would be a useful tool for the community
at large. So I'm still curious about it, at least as a thought exercise, if
nothing else.

Thanks,
Dave


-----Original Message-----
From: Furkan KAMACI [mailto:furkankamaci@gmail.com] 
Sent: Monday, May 06, 2013 9:44 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing off of the production servers

Hi Erick;

Thanks for your answer. I read this somewhere:

I believe "redirect" from replica to leader would happen only at index time,
so a doc first gets indexed to leader and from there it's replicated to
non-leader shards.

Is that true? I want to get this clear in my mind; otherwise I will ask a
separate question about what happens during indexing and querying in
SolrCloud.

2013/5/6 Shawn Heisey <so...@elyograg.org>

> On 5/6/2013 7:55 AM, Andre Bois-Crettez wrote:
> > Excellent idea !
> > And it is possible to use collection aliasing with the CREATEALIAS
> > action to make this transparent for the query side.
> >
> > ex. with 2 collections named :
> > collection_1
> > collection_2
> >
> >
> /collections?action=CREATEALIAS&name=collectionalias&collections=collection_1
> >
> > "collectionalias" is now a virtual collection pointing to collection_1.
> >
> > Index on collection_2, then :
> >
> /collections?action=CREATEALIAS&name=collectionalias&collections=collection_2
> >
> > "collectionalias" now is an alias to collection_2.
> >
> >
> http://wiki.apache.org/solr/SolrCloud#Managing_collections_via_the_Collections_API
>
> Awesome idea, Andre! I was wondering whether you might have to delete 
> the original alias before creating the new one, but a quick look at 
> the issue for collection aliasing shows that this isn't the case.
>
> https://issues.apache.org/jira/browse/SOLR-4497
>
> The wiki doesn't mention the DELETEALIAS action.  I won't have time 
> right now to update the wiki.
>
> Thanks,
> Shawn
>
>


Re: Indexing off of the production servers

Posted by Erick Erickson <er...@gmail.com>.
Nope. There is no replication, as in replication of the indexed
document in the normal flow. The _raw_ document is forwarded to all
replicas and upon return from the replicas, the raw document has been
written to each individual transaction log on each replica.
"replication" implies the _indexed_ form of the document is what's
forwarded to the replicas, and that's not the case.

It's somewhat confusing, but _if_ a replica goes down, when it comes
back up if it's "too far" out of date then an old-style replication of
the whole index is performed. But absent that it's all raw documents
forwarded to replicas from the leader.

Otherwise, how could you hope that a replica could take over without
loss of data? The leader could have gone down before it forwarded the
docs but after it responded to the client.

Best
Erick

On Mon, May 6, 2013 at 10:43 AM, Furkan KAMACI <fu...@gmail.com> wrote:
> Hi Erick;
>
> Thanks for your answer. I read this somewhere:
>
> I believe "redirect" from replica to leader would happen only at
> index time, so a doc first gets indexed to leader and from there it's
> replicated to non-leader shards.
>
> Is that true? I want to get this clear in my mind; otherwise I will ask
> a separate question about what happens during indexing and querying in
> SolrCloud.
>

Re: Indexing off of the production servers

Posted by Furkan KAMACI <fu...@gmail.com>.
Hi Erick;

Thanks for your answer. I read this somewhere:

I believe "redirect" from replica to leader would happen only at
index time, so a doc first gets indexed to leader and from there it's
replicated to non-leader shards.

Is that true? I want to get this clear in my mind; otherwise I will ask
a separate question about what happens during indexing and querying in
SolrCloud.


Re: Indexing off of the production servers

Posted by Shawn Heisey <so...@elyograg.org>.
On 5/6/2013 7:55 AM, Andre Bois-Crettez wrote:
> Excellent idea !
> And it is possible to use collection aliasing with the CREATEALIAS
> action to make this transparent for the query side.
> 
> ex. with 2 collections named :
> collection_1
> collection_2
> 
> /collections?action=CREATEALIAS&name=collectionalias&collections=collection_1
> 
> "collectionalias" is now a virtual collection pointing to collection_1.
> 
> Index on collection_2, then :
> /collections?action=CREATEALIAS&name=collectionalias&collections=collection_2
> 
> "collectionalias" now is an alias to collection_2.
> 
> http://wiki.apache.org/solr/SolrCloud#Managing_collections_via_the_Collections_API

Awesome idea, Andre! I was wondering whether you might have to delete
the original alias before creating the new one, but a quick look at the
issue for collection aliasing shows that this isn't the case.

https://issues.apache.org/jira/browse/SOLR-4497

The wiki doesn't mention the DELETEALIAS action.  I won't have time
right now to update the wiki.

Thanks,
Shawn


Re: Indexing off of the production servers

Posted by Andre Bois-Crettez <an...@kelkoo.com>.
Excellent idea !
And it is possible to use collection aliasing with the CREATEALIAS
action to make this transparent for the query side.

ex. with 2 collections named :
collection_1
collection_2

/collections?action=CREATEALIAS&name=collectionalias&collections=collection_1
"collectionalias" is now a virtual collection pointing to collection_1.

Index on collection_2, then :
/collections?action=CREATEALIAS&name=collectionalias&collections=collection_2
"collectionalias" now is an alias to collection_2.

http://wiki.apache.org/solr/SolrCloud#Managing_collections_via_the_Collections_API
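
The alias flip described above could be scripted roughly as follows. This is only a sketch: the base URL is a placeholder, the helper names are made up for illustration, and it assumes the Collections API is served under /admin/collections on your Solr install. It only builds the URLs; actually sending them (for example with urllib.request.urlopen) is left to the caller.

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust host, port, and context path for your cluster.
SOLR_BASE = "http://localhost:8983/solr"


def create_alias_url(alias, collection, base=SOLR_BASE):
    """Build the CREATEALIAS URL that points `alias` at `collection`.

    Re-issuing CREATEALIAS for an existing alias repoints it, so no
    delete step is needed before the flip.
    """
    params = {"action": "CREATEALIAS", "name": alias, "collections": collection}
    return base + "/admin/collections?" + urlencode(params)


def idle_collection(live, pair=("collection_1", "collection_2")):
    """Return the collection of the pair that is not live (the one to index into)."""
    first, second = pair
    return second if live == first else first


# Example cycle: queries go to collection_1 via the alias, so index into
# collection_2 and then repoint the alias at it.
target = idle_collection("collection_1")
flip_url = create_alias_url("collectionalias", target)
print(flip_url)
```

The query side keeps using "collectionalias" throughout, so only the indexing job needs to know which of the two real collections is currently idle.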


André

On 05/06/2013 03:05 PM, Upayavira wrote:
> In non-SolrCloud mode, you can index to another core, and then swap
> cores. You could index on another box, ship the index files to your
> production server, create a core pointing at these files, then swap this
> core with the original one.
>
> If you can tell your search app to switch to using a different
> collection, you could achieve what you want with solrcloud.
>
> You index to a different collection, which is running on a different set
> of SolrCloud nodes from your production search. Once indexing is
> complete, you create cores on your production boxes for this new
> collection. Once indexes have synced, you can switch your app to use
> this new collection, thus publishing your new index. You can then delete
> the cores on the boxes you were using for indexing.
>
> Now, that's not transparent, but would be do-able.
>
> Upayavira
>
> On Mon, May 6, 2013, at 01:37 PM, David Parks wrote:
>> I'm less concerned with fully utilizing a hadoop cluster (due to having
>> fewer shards than I have hadoop reduce slots) than I am with just
>> off-loading the whole indexing process. We may just want to re-index the
>> whole thing to add some index time boosts or whatever else we conjure up
>> to make queries faster and better quality. We're doing a lot of work on
>> optimization right now.
>>
>> To re-index the whole thing is a 5-10 hour process for us, so when we
>> move some update to production that requires full re-indexing (every week
>> or so), right now we're just re-building new instances of solr to handle
>> the re-indexing and then copying the final VMs to the production
>> environment (slow process). I'm leery of letting a heavy duty full
>> re-index process loose for 10 hours on production on a regular basis.
>>
>> It doesn't sound like there are any pre-built processes for doing this
>> now though. I thought I had heard of master/slave hierarchy in 3.x that
>> would allow us to designate a master to do indexing and let the slaves
>> pull finished indexes from the master, so I thought maybe something like
>> that followed into solr cloud. Erick might be right in that it's not worth
>> the effort if there isn't some existing strategy.
>>
>> Dave
>>
>>
>> -----Original Message-----
>> From: Furkan KAMACI [mailto:furkankamaci@gmail.com]
>> Sent: Monday, May 06, 2013 7:06 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Indexing off of the production servers
>>
>> Hi Erick;
>>
>> I think that even if you use Map/Reduce you will not parallelize your
>> indexing, because indexing is only as parallel as the number of leaders
>> you have in your SolrCloud, right?
>>
>> 2013/5/6 Erick Erickson<er...@gmail.com>
>>
>>> The only problem with using Hadoop (or whatever) is that you need to
>>> be sure that documents end up on the same shard, which means that you
>>> have to use the same routing mechanism that SolrCloud uses. The custom
>>> doc routing may help here....
>>>
>>> My very first question, though, would be whether this is necessary.
>>> It might be sufficient to just throttle the rate of indexing, or just
>>> do the indexing during off hours or.... Have you measured an indexing
>>> degradation during your heavy indexing? Indexing has costs, no
>>> question, but it's worth asking whether the costs are heavy enough to
>>> be worth the bother..
>>>
>>> Best
>>> Erick
>>>
>>> On Mon, May 6, 2013 at 5:04 AM, Furkan KAMACI<fu...@gmail.com>
>>> wrote:
>>>> 1-2) Your aim in using Hadoop is probably Map/Reduce jobs. With
>>>> Map/Reduce you split your workload, process the pieces, and then the
>>>> reduce step combines the results. Let me explain the new SolrCloud
>>>> architecture. You start your SolrCloud with a numShards parameter.
>>>> Let's assume that you have 5 shards. Then you will have 5 leaders in
>>>> your SolrCloud. These leaders will be responsible for indexing your
>>>> data. This means that your indexing workload will be divided into 5,
>>>> so you have parallelized your indexing much like Map/Reduce jobs.
>>>>
>>>> Let's assume that you have added 10 new Solr nodes to your SolrCloud.
>>>> They will be added as replicas of the shards. Then you will have 5
>>>> shards, each with a leader, and every shard has 2 replicas. When you
>>>> send a query to SolrCloud every replica helps with searching, and if
>>>> you add more replicas to your SolrCloud your search performance will
>>>> improve.
>>>>
>
> --
> André Bois-Crettez
>
> Search technology, Kelkoo
> http://www.kelkoo.com/


Re: Indexing off of the production servers

Posted by Upayavira <uv...@odoko.co.uk>.
In non-SolrCloud mode, you can index to another core, and then swap
cores. You could index on another box, ship the index files to your
production server, create a core pointing at these files, then swap this
core with the original one.
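
A rough sketch of the non-SolrCloud create-then-swap step, for illustration only. The base URL, the core names "live" and "staging", and the data directory are placeholders, and the helper is hypothetical; it assumes the CoreAdmin API is available at /admin/cores with its CREATE and SWAP actions. The code only builds the URLs and leaves issuing the HTTP requests to the caller.

```python
from urllib.parse import urlencode

# Hypothetical base URL of the production Solr instance.
SOLR_BASE = "http://localhost:8983/solr"


def core_admin_url(action, base=SOLR_BASE, **params):
    """Build a CoreAdmin URL for the given action and parameters."""
    query = {"action": action}
    query.update(params)
    return base + "/admin/cores?" + urlencode(query)


# 1. Create a core on top of the index files shipped from the indexing box.
#    The core name and directories are placeholders for illustration.
create_url = core_admin_url(
    "CREATE", name="staging", instanceDir="staging", dataDir="/data/shipped_index"
)

# 2. Swap it with the core that currently serves queries.
swap_url = core_admin_url("SWAP", core="live", other="staging")

print(create_url)
print(swap_url)
```

After the swap, the old index sits under the "staging" name and can be kept as a rollback copy or overwritten by the next batch.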

If you can tell your search app to switch to using a different
collection, you could achieve what you want with solrcloud.

You index to a different collection, which is running on a different set
of SolrCloud nodes from your production search. Once indexing is
complete, you create cores on your production boxes for this new
collection. Once the indexes have synced, you can switch your app to use
this new collection, thus publishing your new index. You can then delete
the cores on the boxes you were using for indexing.

Now, that's not transparent, but would be do-able.

Upayavira

On Mon, May 6, 2013, at 01:37 PM, David Parks wrote:
> I'm less concerned with fully utilizing a hadoop cluster (due to having
> fewer shards than I have hadoop reduce slots) than I am with just
> off-loading the whole indexing process. We may just want to re-index the
> whole thing to add some index time boosts or whatever else we conjure up
> to make queries faster and better quality. We're doing a lot of work on
> optimization right now.
> 
> To re-index the whole thing is a 5-10 hour process for us, so when we
> move some update to production that requires full re-indexing (every week
> or so), right now we're just re-building new instances of solr to handle
> the re-indexing and then copying the final VMs to the production
> environment (slow process). I'm leery of letting a heavy duty full
> re-index process loose for 10 hours on production on a regular basis.
> 
> It doesn't sound like there are any pre-built processes for doing this
> now though. I thought I had heard of master/slave hierarchy in 3.x that
> would allow us to designate a master to do indexing and let the slaves
> pull finished indexes from the master, so I thought maybe something like
> that followed into solr cloud. Erick might be right in that it's not worth
> the effort if there isn't some existing strategy.
> 
> Dave
> 
> 
> -----Original Message-----
> From: Furkan KAMACI [mailto:furkankamaci@gmail.com] 
> Sent: Monday, May 06, 2013 7:06 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing off of the production servers
> 
> Hi Erick;
> 
> I think that even if you use Map/Reduce you will not parallelize your
> indexing, because indexing is only as parallel as the number of leaders
> you have in your SolrCloud, right?
> 
> 2013/5/6 Erick Erickson <er...@gmail.com>
> 
> > The only problem with using Hadoop (or whatever) is that you need to 
> > be sure that documents end up on the same shard, which means that you 
> > have to use the same routing mechanism that SolrCloud uses. The custom 
> > doc routing may help here....
> >
> > My very first question, though, would be whether this is necessary.
> > It might be sufficient to just throttle the rate of indexing, or just 
> > do the indexing during off hours or.... Have you measured an indexing 
> > degradation during your heavy indexing? Indexing has costs, no 
> > question, but it's worth asking whether the costs are heavy enough to 
> > be worth the bother..
> >
> > Best
> > Erick
> >
> > On Mon, May 6, 2013 at 5:04 AM, Furkan KAMACI <fu...@gmail.com>
> > wrote:
> > > 1-2) Your aim in using Hadoop is probably Map/Reduce jobs. With
> > > Map/Reduce you split your workload, process the pieces, and then the
> > > reduce step combines the results. Let me explain the new SolrCloud
> > > architecture. You start your SolrCloud with a numShards parameter.
> > > Let's assume that you have 5 shards. Then you will have 5 leaders in
> > > your SolrCloud. These leaders will be responsible for indexing your
> > > data. This means that your indexing workload will be divided into 5,
> > > so you have parallelized your indexing much like Map/Reduce jobs.
> > >
> > > Let's assume that you have added 10 new Solr nodes to your SolrCloud.
> > > They will be added as replicas of the shards. Then you will have 5
> > > shards, each with a leader, and every shard has 2 replicas. When you
> > > send a query to SolrCloud every replica helps with searching, and if
> > > you add more replicas to your SolrCloud your search performance will
> > > improve.
> > >
> > >
> >
> 

Re: Indexing off of the production servers

Posted by Erick Erickson <er...@gmail.com>.
bq:  I thought I had heard of master/slave hierarchy in 3.x that would
allow us to designate a master to do indexing and let the slaves pull
finished indexes from the master, so I thought maybe something like that
followed into solr cloud.

You can still do this in Solr4 if you choose, but not in cloud mode. The
tradeoff is that you sacrifice the automatic fail-over etc if you use Solr4
in non-cloud mode. But in non-cloud mode it's just like 3.x in this
respect.

You could, in fact, take total control of this via HTTP commands, see:
http://wiki.apache.org/solr/SolrReplication#HTTP_API
So you can just turn replication completely off on your master, do your
indexing, then turn replication back on via HTTP commands. You lose
the automatic sharding (i.e., you have to take care to send the docs to
the right shards) and you lose the automatic fail-over etc. from SolrCloud.
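
As a sketch of that sequence, assuming the replication handler is reachable at <base>/replication and accepts the disablereplication and enablereplication commands listed on the wiki page above. The function names, the base URL, and the send/run_indexing callables are placeholders for illustration; pass your own HTTP call (for example urllib.request.urlopen) as `send`.

```python
from urllib.parse import urlencode


def replication_url(base, command):
    """Build a replication handler URL, e.g. base='http://master:8983/solr'."""
    return base.rstrip("/") + "/replication?" + urlencode({"command": command})


def reindex_with_replication_paused(master_base, run_indexing, send):
    """Pause replication on the master, run the indexing job, then resume.

    `run_indexing` and `send` are caller-supplied callables. If the indexing
    job raises, replication deliberately stays disabled, so the slaves keep
    serving the previous index instead of pulling a half-built one.
    """
    send(replication_url(master_base, "disablereplication"))
    run_indexing()
    send(replication_url(master_base, "enablereplication"))


# Dry run: record the calls instead of issuing real HTTP requests.
calls = []
reindex_with_replication_paused(
    "http://master:8983/solr",
    run_indexing=lambda: calls.append("<indexing job runs here>"),
    send=calls.append,
)
for step in calls:
    print(step)
```

Leaving replication off after a failed run is a design choice in this sketch, not something the thread prescribes; re-enabling it unconditionally would be the alternative if a partial index is acceptable.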

Otherwise, Upayavira's comments might be where you want to go....

FWIW,
Erick

On Mon, May 6, 2013 at 8:37 AM, David Parks <da...@yahoo.com> wrote:
> I'm less concerned with fully utilizing a hadoop cluster (due to having
> fewer shards than I have hadoop reduce slots) than I am with just off-loading
> the whole indexing process. We may just want to re-index the whole thing to
> add some index time boosts or whatever else we conjure up to make queries
> faster and better quality. We're doing a lot of work on optimization right
> now.
>
> To re-index the whole thing is a 5-10 hour process for us, so when we move
> some update to production that requires full re-indexing (every week or so),
> right now we're just re-building new instances of solr to handle the
> re-indexing and then copying the final VMs to the production environment
> (slow process). I'm leery of letting a heavy duty full re-index process
> loose for 10 hours on production on a regular basis.
>
> It doesn't sound like there are any pre-built processes for doing this now
> though. I thought I had heard of master/slave hierarchy in 3.x that would
> allow us to designate a master to do indexing and let the slaves pull
> finished indexes from the master, so I thought maybe something like that
> followed into solr cloud. Erick might be right in that it's not worth the
> effort if there isn't some existing strategy.
>
> Dave
>
>
> -----Original Message-----
> From: Furkan KAMACI [mailto:furkankamaci@gmail.com]
> Sent: Monday, May 06, 2013 7:06 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing off of the production servers
>
> Hi Erick;
>
> I think that even if you use Map/Reduce you will not parallelize your
> indexing, because indexing is only as parallel as the number of leaders
> you have in your SolrCloud, right?
>
> 2013/5/6 Erick Erickson <er...@gmail.com>
>
>> The only problem with using Hadoop (or whatever) is that you need to
>> be sure that documents end up on the same shard, which means that you
>> have to use the same routing mechanism that SolrCloud uses. The custom
>> doc routing may help here....
>>
>> My very first question, though, would be whether this is necessary.
>> It might be sufficient to just throttle the rate of indexing, or just
>> do the indexing during off hours or.... Have you measured an indexing
>> degradation during your heavy indexing? Indexing has costs, no
>> question, but it's worth asking whether the costs are heavy enough to
>> be worth the bother..
>>
>> Best
>> Erick
>>
>> On Mon, May 6, 2013 at 5:04 AM, Furkan KAMACI <fu...@gmail.com>
>> wrote:
>> > 1-2) Your aim in using Hadoop is probably Map/Reduce jobs. With
>> > Map/Reduce you split your workload, process the pieces, and then the
>> > reduce step combines the results. Let me explain the new SolrCloud
>> > architecture. You start your SolrCloud with a numShards parameter.
>> > Let's assume that you have 5 shards. Then you will have 5 leaders in
>> > your SolrCloud. These leaders will be responsible for indexing your
>> > data. This means that your indexing workload will be divided into 5,
>> > so you have parallelized your indexing much like Map/Reduce jobs.
>> >
>> > Let's assume that you have added 10 new Solr nodes to your SolrCloud.
>> > They will be added as replicas of the shards. Then you will have 5
>> > shards, each with a leader, and every shard has 2 replicas. When you
>> > send a query to SolrCloud every replica helps with searching, and if
>> > you add more replicas to your SolrCloud your search performance will
>> > improve.
>> >
>> >
>>
>

Re: Indexing off of the production servers

Posted by Erick Erickson <er...@gmail.com>.
bq:  Your data will be indexed by shard leaders while your replicas
are responsible for querying.

This is not true in SolrCloud mode. When you send a document
to Solr, upon return that document has been sent to every replica
for the appropriate shard and entered in the transaction log. It is
indexed on every node for a given shard.

In SolrCloud, there isn't much distinction between leaders and
replicas. A leader is just a replica with a few additional responsibilities.
One of those responsibilities is ensuring that docs with the same
ID sent to several nodes at once are resolved appropriately, which is
why the leader gets the updates forwarded to it. But from that point,
the doc is sent to every replica associated with that leader (shard)
and indexed there.

The bits about SolrJ being "leader aware" are partly in place, but
currently the docs are sent to _a_ leader, not necessarily the
leader of the shard they will eventually end up on. That's on the
roadmap, but not there yet.

FWIW,
Erick

On Mon, May 6, 2013 at 9:03 AM, Furkan KAMACI <fu...@gmail.com> wrote:
> Hi Dave;
>
> I think that when you index you can use CloudSolrServer, which learns from
> ZooKeeper where your data should go and sends it there. This will speed up
> your indexing and give you the benefit of Map/Reduce. Your data will be
> indexed by shard leaders while your replicas are responsible for querying.
> Also, if you are not satisfied with your query performance you can add more
> replicas. If you want to improve your indexing you can define more shards
> in your system (beginning with Solr 4.3, shard splitting will be a new
> feature of Solr).
>
> 2013/5/6 David Parks <da...@yahoo.com>
>
>> I'm less concerned with fully utilizing a hadoop cluster (due to having
>> fewer shards than I have hadoop reduce slots) than I am with just off-loading
>> the whole indexing process. We may just want to re-index the whole thing to
>> add some index time boosts or whatever else we conjure up to make queries
>> faster and better quality. We're doing a lot of work on optimization right
>> now.
>>
>> To re-index the whole thing is a 5-10 hour process for us, so when we move
>> some update to production that requires full re-indexing (every week or
>> so),
>> right now we're just re-building new instances of solr to handle the
>> re-indexing and then copying the final VMs to the production environment
>> (slow process). I'm leery of letting a heavy duty full re-index process
>> loose for 10 hours on production on a regular basis.
>>
>> It doesn't sound like there are any pre-built processes for doing this now
>> though. I thought I had heard of master/slave hierarchy in 3.x that would
>> allow us to designate a master to do indexing and let the slaves pull
>> finished indexes from the master, so I thought maybe something like that
>> followed into solr cloud. Erick might be right in that it's not worth the
>> effort if there isn't some existing strategy.
>>
>> Dave
>>
>>
>> -----Original Message-----
>> From: Furkan KAMACI [mailto:furkankamaci@gmail.com]
>> Sent: Monday, May 06, 2013 7:06 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Indexing off of the production servers
>>
>> Hi Erick;
>>
>> I think that even if you use Map/Reduce you will not parallelize your
>> indexing, because indexing is only as parallel as the number of leaders
>> you have in your SolrCloud, right?
>>
>> 2013/5/6 Erick Erickson <er...@gmail.com>
>>
>> > The only problem with using Hadoop (or whatever) is that you need to
>> > be sure that documents end up on the same shard, which means that you
>> > have to use the same routing mechanism that SolrCloud uses. The custom
>> > doc routing may help here....
>> >
>> > My very first question, though, would be whether this is necessary.
>> > It might be sufficient to just throttle the rate of indexing, or just
>> > do the indexing during off hours or.... Have you measured an indexing
>> > degradation during your heavy indexing? Indexing has costs, no
>> > question, but it's worth asking whether the costs are heavy enough to
>> > be worth the bother..
>> >
>> > Best
>> > Erick
>> >
>> > On Mon, May 6, 2013 at 5:04 AM, Furkan KAMACI <fu...@gmail.com>
>> > wrote:
>> > > 1-2) Your aim for using Hadoop is probably Map/Reduce jobs. When you
>> > > use Map/Reduce jobs you split your workload, process it, and then
>> > > reduce step takes into account. Let me explain you new SolrCloud
>> > > architecture. You start your SolrCluoud with a numShards parameter.
>> > > Let's assume that you have 5 shards. Then you will have 5 leader at
>> > > your SolrCloud. These
>> > leaders
>> > > will be responsible for indexing your data. It means that your
>> > > indexing workload will divided into 5 so it means that you have
>> > > parallelized your data as like Map/Reduce jobs.
>> > >
>> > > Let's assume that you have added 10 new Solr nodes into your SolrCloud.
>> > > They will be added as a replica for each shard. Then you will have 5
>> > > shards, 5 leaders of them and every shard has 2 replica. When you
>> > > send a query into a SolrCloud every replica will help you for
>> > > searching and if
>> > you
>> > > add more replicas to your SolrCloud your search performance will
>> improve.
>> > >
>> > >
>> > > 2013/5/6 David Parks <da...@yahoo.com>
>> > >
>> > >> I've had trouble figuring out what options exist if I want to
>> > >> perform
>> > all
>> > >> indexing off of the production servers (I'd like to keep them only
>> > >> for
>> > user
>> > >> queries).
>> > >>
>> > >>
>> > >>
>> > >> We index data in batches roughly daily, ideally I'd index all solr
>> > >> cloud shards offline, then move the final index files to the solr
>> > >> cloud
>> > instance
>> > >> that needs it and flip a switch and have it use the new index.
>> > >>
>> > >>
>> > >>
>> > >> Is this possible via either:
>> > >>
>> > >> 1.       Doing the indexing in Hadoop?? (this would be ideal as we
>> have
>> > a
>> > >> significant investment in a hadoop cluster already), or
>> > >>
>> > >> 2.       Maintaining a separate "master" server that handles indexing
>> > and
>> > >> the nodes that receive user queries update their index from there
>> > >> (I
>> > seem
>> > >> to
>> > >> recall reading about this configuration in 3.x, but now we're using
>> > >> solr
>> > >> cloud)
>> > >>
>> > >>
>> > >>
>> > >> Is there some ideal solution I can use to "protect" the production
>> > >> solr instances from degraded performance during large index
>> > >> processing
>> > periods?
>> > >>
>> > >>
>> > >>
>> > >> Thanks!
>> > >>
>> > >> David
>> > >>
>> > >>
>> >
>>
>>

Re: Indexing off of the production servers

Posted by Furkan KAMACI <fu...@gmail.com>.
Hi Dave;

I think that when you index you can use CloudSolrServer, which learns from
ZooKeeper where your data should go and sends it there. This will speed up
your indexing and give you some of the benefit of Map/Reduce. Your data will
be indexed by the shard leaders while your replicas are responsible for
querying. Also, if you are not satisfied with your query performance you can
add more replicas. If you want to improve your indexing throughput you can
define more shards in your system (beginning with Solr 4.3, shard splitting
will be a new feature in Solr).
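
To make the idea concrete, here is a minimal sketch of what a cloud-aware
client does conceptually: look up the leader of each document's shard and
batch the documents per leader. This is not SolrJ code. The LEADERS
dictionary, the URLs and the pick_shard function are hypothetical stand-ins
for what a real client would read from ZooKeeper.

```python
from collections import defaultdict

# Hypothetical snapshot of cluster state: shard name -> leader URL.
# A real client would read this from ZooKeeper instead of hard-coding it.
LEADERS = {
    "shard1": "http://solr-a:8983/solr/collection1",
    "shard2": "http://solr-b:8983/solr/collection1",
}


def pick_shard(doc_id, shard_names):
    """Toy router (an assumption, not Solr's hash): byte sum modulo shard count."""
    ordered = sorted(shard_names)
    return ordered[sum(doc_id.encode("utf-8")) % len(ordered)]


def group_by_leader(docs, leaders, router=pick_shard):
    """Return {leader_url: [docs]} so each batch can go straight to its leader."""
    batches = defaultdict(list)
    for doc in docs:
        shard = router(doc["id"], leaders.keys())
        batches[leaders[shard]].append(doc)
    return dict(batches)


docs = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
batches = group_by_leader(docs, LEADERS)
for url, batch in sorted(batches.items()):
    print(url, [d["id"] for d in batch])
```

Each batch could then be posted to its leader directly, which avoids an
extra hop through a node that does not own the document.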


RE: Indexing off of the production servers

Posted by David Parks <da...@yahoo.com>.
I'm less concerned with fully utilizing a hadoop cluster (due to having
fewer shards than I have hadoop reduce slots) than I am with just off-loading
the whole indexing process. We may just want to re-index the whole thing to
add some index-time boosts or whatever else we conjure up to make queries
faster and better quality. We're doing a lot of work on optimization right
now.

To re-index the whole thing is a 5-10 hour process for us, so when we move
some update to production that requires full re-indexing (every week or so),
right now we're just re-building new instances of solr to handle the
re-indexing and then copying the final VMs to the production environment
(slow process). I'm leery of letting a heavy-duty full re-index process
loose for 10 hours on production on a regular basis.

It doesn't sound like there are any pre-built processes for doing this now
though. I thought I had heard of master/slave hierarchy in 3.x that would
allow us to designate a master to do indexing and let the slaves pull
finished indexes from the master, so I thought maybe something like that
followed into solr cloud. Erick might be right in that it's not worth the
effort if there isn't some existing strategy.

Dave


Re: Indexing off of the production servers

Posted by Furkan KAMACI <fu...@gmail.com>.
Hi Erick;

I think that even if you use Map/Reduce you will not parallelize your
indexing any further, because indexing parallelizes only as far as the
number of leaders you have in your SolrCloud, doesn't it?


Re: Indexing off of the production servers

Posted by Erick Erickson <er...@gmail.com>.
The only problem with using Hadoop (or whatever) is that you
need to be sure that documents end up on the same shard that
SolrCloud would send them to, which means that you have to use the
same routing mechanism that SolrCloud uses. The custom doc routing
may help here...
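
A minimal sketch of why the routing has to match: if the offline job and the
cluster both map a document id to a shard with the same hash function and
the same hash ranges, they always agree. Everything here is illustrative.
zlib.crc32 is only a placeholder hash, not the one SolrCloud uses, and
shard_ranges is a hypothetical helper; a real job would have to reproduce
the cluster's own hash and ranges exactly.

```python
import zlib


def shard_ranges(num_shards, space=2**32):
    """Split the 32-bit hash space into num_shards contiguous [lo, hi) ranges."""
    step = space // num_shards
    ranges = []
    for i in range(num_shards):
        lo = i * step
        # The last range absorbs the remainder so the whole space is covered.
        hi = space if i == num_shards - 1 else (i + 1) * step
        ranges.append((lo, hi))
    return ranges


def route(doc_id, ranges):
    """Return the index of the shard whose range contains the hash of doc_id."""
    h = zlib.crc32(doc_id.encode("utf-8"))  # placeholder hash, NOT Solr's
    for shard, (lo, hi) in enumerate(ranges):
        if lo <= h < hi:
            return shard
    raise ValueError("hash outside every range")


ranges = shard_ranges(5)
ids = [f"doc-{i}" for i in range(1000)]
offline = [route(doc_id, ranges) for doc_id in ids]  # e.g. inside a Hadoop task
online = [route(doc_id, ranges) for doc_id in ids]   # e.g. what the cluster computes
print(offline == online)
```

If either side used a different hash or different ranges, a document built
into one shard's index offline could be looked for on another shard later.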

My very first question, though, would be whether this is necessary.
It might be sufficient to just throttle the rate of indexing, or just do the
indexing during off hours, or... Have you measured a degradation in query
performance during your heavy indexing? Indexing has costs, no
question, but it's worth asking whether the costs are heavy enough
to be worth the bother.
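
As a rough illustration of throttling, the sketch below yields documents in
batches and pauses between batches so that the average rate stays under a
cap. The function name and its parameters are made up for this example, and
the actual send to Solr is left to the caller.

```python
import time


def throttled_batches(docs, batch_size, max_docs_per_sec,
                      clock=time.monotonic, sleep=time.sleep):
    """Yield lists of up to batch_size docs, pausing to stay under the rate cap."""
    start = clock()
    sent = 0
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            sent += len(batch)
            batch = []
            # Earliest moment at which `sent` docs are allowed to have gone out.
            due = start + sent / max_docs_per_sec
            delay = due - clock()
            if delay > 0:
                sleep(delay)
    if batch:
        # Final partial batch, if the input did not divide evenly.
        yield batch
```

A caller would loop over throttled_batches(docs, 500, 1000.0) and post each
batch. The clock and sleep arguments exist only so the pacing can be tested
without waiting in real time.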

Best
Erick


Re: Indexing off of the production servers

Posted by Furkan KAMACI <fu...@gmail.com>.
1-2) Your aim in using Hadoop is probably Map/Reduce jobs. When you use
Map/Reduce jobs you split your workload, process it, and then the reduce step
combines the results. Let me explain the new SolrCloud architecture. You
start your SolrCloud with a numShards parameter. Let's assume that you
have 5 shards. Then you will have 5 leaders in your SolrCloud. These leaders
will be responsible for indexing your data. It means that your indexing
workload will be divided into 5, so you have parallelized your indexing
much like Map/Reduce jobs.

Let's assume that you have added 10 new Solr nodes to your SolrCloud.
They will be added as replicas for each shard. Then you will have 5
shards, 5 leaders for them, and every shard has 2 replicas. When you send a
query to SolrCloud every replica will help with searching, and if you
add more replicas to your SolrCloud your search performance will improve.
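
The arithmetic above can be checked with a small sketch. It assumes one
core per node and counts replicas the way this message does, that is,
excluding the leader. The cluster_layout function is hypothetical and only
does the division.

```python
def cluster_layout(num_shards, total_nodes):
    """Return (leaders, replicas_per_shard, spare_nodes), one core per node."""
    if num_shards <= 0 or total_nodes < num_shards:
        raise ValueError("need at least one node per shard")
    # Every shard gets the same number of copies; leftovers are spare nodes.
    copies_per_shard, spare = divmod(total_nodes, num_shards)
    # One copy per shard is the leader; the rest are replicas.
    return num_shards, copies_per_shard - 1, spare


# 5 nodes started with numShards=5, then 10 more nodes added.
leaders, replicas, spare = cluster_layout(num_shards=5, total_nodes=5 + 10)
print(leaders, replicas, spare)  # 5 2 0
```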

