Posted to solr-user@lucene.apache.org by Jamie Johnson <je...@gmail.com> on 2011/12/01 16:08:53 UTC

Configuring the Distributed

I am currently looking at the latest solrcloud branch and was
wondering if there was any documentation on configuring the
DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
to be added/modified to make distributed indexing work?

Re: Configuring the Distributed

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <ma...@gmail.com> wrote:
> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <je...@gmail.com> wrote:
>
>> I am currently looking at the latest solrcloud branch and was
>> wondering if there was any documentation on configuring the
>> DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
>> to be added/modified to make distributed indexing work?
>>
>
>
> Hi Jaime - take a look at solrconfig-distrib-update.xml in
> solr/core/src/test-files
>
> You need to enable the update log, add an empty replication handler def,
> and an update chain with solr.DistributedUpdateProcessFactory in it.

One also needs an indexed _version_ field defined in schema.xml for
versioning to work.

-Yonik
http://www.lucidimagination.com
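
For readers configuring this by hand, here is a minimal sketch of those three
pieces plus the schema field, assuming element and class names close to the
test config; treat solrconfig-distrib-update.xml in solr/core/src/test-files
as the authoritative version. The chain name and the _version_ field line are
taken from later messages in this thread.

    <!-- solrconfig.xml (sketch): 1) enable the update log -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <updateLog>
        <str name="dir">${solr.ulog.dir:}</str>
      </updateLog>
    </updateHandler>

    <!-- 2) an "empty" replication handler definition (used for recovery) -->
    <requestHandler name="/replication" class="solr.ReplicationHandler" />

    <!-- 3) an update chain containing the distributed update processor -->
    <updateRequestProcessorChain name="distrib-update-chain">
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.DistributedUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

    <!-- schema.xml: the indexed _version_ field -->
    <field name="_version_" type="long" indexed="true" stored="true"/>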

Re: Configuring the Distributed

Posted by Jamie Johnson <je...@gmail.com>.
Yes completely agree, just wanted to make sure I wasn't missing the obvious :)

On Mon, Dec 5, 2011 at 1:39 PM, Yonik Seeley <yo...@lucidimagination.com> wrote:
> On Mon, Dec 5, 2011 at 1:29 PM, Jamie Johnson <je...@gmail.com> wrote:
> >> In this
> >> situation I don't think splitting one shard would help us; we'd need to
> >> split every shard to reduce the load on the burdened systems, right?
>
> Sure... but if you can split one, you can split them all :-)
>
> -Yonik
> http://www.lucidimagination.com

Re: Configuring the Distributed

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Mon, Dec 5, 2011 at 1:29 PM, Jamie Johnson <je...@gmail.com> wrote:
> In this
> situation I don't think splitting one shard would help us; we'd need to
> split every shard to reduce the load on the burdened systems, right?

Sure... but if you can split one, you can split them all :-)

-Yonik
http://www.lucidimagination.com

Re: Configuring the Distributed

Posted by Jamie Johnson <je...@gmail.com>.
Thanks Yonik, must have just missed it.

A question about adding a new shard to the index.  I am definitely not
a hashing expert, but the goal is to have a uniform distribution of
buckets based on what we're hashing.  If that happens then our shards
would reach capacity at approximately the same time.  In this
situation I don't think splitting one shard would help us; we'd need to
split every shard to reduce the load on the burdened systems, right?
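
For intuition only, a toy sketch (not the hashing code on the branch; the
class and method names are made up): documents hash uniformly onto equal
shard ranges, so every shard fills at about the same rate, and growth is
handled by splitting a shard's hash range, as discussed elsewhere in this
thread.

    // HashSketch.java - toy illustration of uniform hash-range assignment.
    public class HashSketch {

        // Map a document id onto one of numShards equal hash ranges.
        static int shardFor(String docId, int numShards) {
            long h = docId.hashCode() & 0xffffffffL; // treat the 32-bit hash as unsigned
            return (int) (h * numShards >>> 32);     // scale onto [0, numShards)
        }

        public static void main(String[] args) {
            for (String id : new String[] {"1", "doc-42", "8060a9eb-9546-43ee-95bb-d18ea26a6285"}) {
                System.out.println(id + " -> shard " + shardFor(id, 2));
            }
            // Splitting a shard later means dividing its hash range in half and
            // rehashing its documents into the two halves; the other shards keep
            // their full ranges, which is why relieving uniformly loaded shards
            // means splitting each of them (or accepting an uneven layout).
        }
    }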

On Mon, Dec 5, 2011 at 9:45 AM, Yonik Seeley <yo...@lucidimagination.com> wrote:
> On Mon, Dec 5, 2011 at 9:21 AM, Jamie Johnson <je...@gmail.com> wrote:
>> What does the version field need to look like?
>
> It's in the example schema:
>   <field name="_version_" type="long" indexed="true" stored="true"/>
>
> -Yonik
> http://www.lucidimagination.com

Re: Configuring the Distributed

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Mon, Dec 5, 2011 at 9:21 AM, Jamie Johnson <je...@gmail.com> wrote:
> What does the version field need to look like?

It's in the example schema:
   <field name="_version_" type="long" indexed="true" stored="true"/>

-Yonik
http://www.lucidimagination.com

Re: Configuring the Distributed

Posted by Jamie Johnson <je...@gmail.com>.
What does the version field need to look like?  Something like this?

   <field name="_version_" type="string" indexed="true" stored="true" required="true" />

On Sun, Dec 4, 2011 at 2:00 PM, Yonik Seeley <yo...@lucidimagination.com> wrote:
> On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller <ma...@gmail.com> wrote:
>> You always want to use the distrib-update-chain. Eventually it will
>> probably be part of the default chain and auto turn in zk mode.
>
> I'm working on this now...
>
> -Yonik
> http://www.lucidimagination.com

Re: Configuring the Distributed

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller <ma...@gmail.com> wrote:
> You always want to use the distrib-update-chain. Eventually it will
> probably be part of the default chain and auto turn in zk mode.

I'm working on this now...

-Yonik
http://www.lucidimagination.com

Re: Configuring the Distributed

Posted by Mark Miller <ma...@gmail.com>.
On Sat, Dec 3, 2011 at 1:31 PM, Jamie Johnson <je...@gmail.com> wrote:

> Again great stuff.  Once distributed update/delete works (sounds like
> it's not far off)


Yeah, I only realized it was not working with the Version code on Friday as
I started adding tests for it - the work to fix it is not too difficult.


> I'll have to reevaluate our current stack.
>
> You had mentioned storing the shard hash assignments in ZK; is there a
> JIRA around this?
>

I don't think there is specifically one for that yet - though it could just
be part of the index splitting JIRA issue I suppose.


>
> I'll keep my eyes on the JIRA tickets.  Right now distributed
> update/delete are the big ones for me; the rebalancing of the cluster is
> a nice-to-have, but hopefully I won't need that capability anytime
> in the near future.
>
> One other question: my current setup does replication via polling.  I
> didn't notice that in the updated solrconfig; how does this work now?
>

Replication is only used for recovery, because it doesn't work with Near
Realtime. We want SolrCloud to work with NRT, so currently the leader
versions documents and forwards them to the replicas (the versions will let
us do optimistic locking as well). If you send a doc to a replica instead,
it's first forwarded to the leader to get versioned. The SolrCloud solrj
client will likely be smart enough to just send to the leader first.

You need the replication handler defined for recovery - when a replica goes
down and then comes back up, it starts buffering updates and replicates from
the leader - then it applies the buffered updates and ends up current with
the leader.

- Mark
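
For reference, a minimal SolrJ sketch of the pattern described above and in
the test code quoted further down: push every update through the
distrib-update-chain and, because commits are not distributed yet at this
point, commit each node explicitly. The CommonsHttpSolrServer class, the
URLs, and the field names are assumptions borrowed from this thread, not
code taken from the branch.

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest.ACTION;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;

    public class DistribUpdateSketch {
        public static void main(String[] args) throws Exception {
            // Any node can receive the update; the distrib-update-chain forwards it
            // to the leader, which versions it and forwards it on to the replicas.
            SolrServer node1 = new CommonsHttpSolrServer("http://localhost:8983/solr/collection1");
            SolrServer node2 = new CommonsHttpSolrServer("http://localhost:7574/solr/collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.setField("key", "1");
            doc.addField("content_mvtxt", "initial value");

            UpdateRequest ureq = new UpdateRequest();
            ureq.setParam("update.chain", "distrib-update-chain"); // always use the distrib chain
            ureq.add(doc);
            ureq.setAction(ACTION.COMMIT, true, true); // commits the node we send to
            ureq.process(node2);

            node1.commit(); // commits are not distributed yet, so commit the other node too
        }
    }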


>
> On Sat, Dec 3, 2011 at 9:00 AM, Mark Miller <ma...@gmail.com> wrote:
> > bq. A few questions if a master goes down does a replica get
> > promoted?
> >
> > Right - if the leader goes down there is a leader election and one of the
> > replicas takes over.
> >
> > bq.  If a new shard needs to be added is it just a matter of
> > starting a new solr instance with a higher numShards?
> >
> > Eventually, that's the plan.
> >
> > The idea is, you say something like, I want 3 shards. Now if you start
> up 9
> > instances, the first 3 end up as shard leaders - the next 6 evenly come
> up
> > as replicas for each shard.
> >
> > To change the numShards, we will need some kind of micro shards /
> splitting
> > / rebalancing.
> >
> > bq. Last question, how do you change numShards?
> >
> > I think this is somewhat a work in progress, but I think Sami just made
> it
> > so that numShards is stored on the collection node in zk (along with
> which
> > config set to use). So you would change it there presumably. Or perhaps
> > just start up a new server with an update numShards property and then it
> > would realize that needs to be a new leader - of course then you'd want
> to
> > rebalance probably - unless you fired up enough servers to add replicas
> > too...
> >
> > bq. Is that right?
> >
> > Yup, sounds about right.
> >
> >
> > On Fri, Dec 2, 2011 at 10:59 PM, Jamie Johnson <je...@gmail.com>
> wrote:
> >
> >> So I just tried this out, seems like it does the things I asked about.
> >>
> >> Really really cool stuff, it's progressed quite a bit in the time
> >> since I took a snapshot of the branch.
> >>
> >> Last question, how do you change numShards?  Is there a command you
> >> can use to do this now? I understand there will be implications for
> >> the hashing algorithm, but once the hash ranges are stored in ZK (is
> >> there a separate JIRA for this or does this fall under 2358) I assume
> >> that it would be a relatively simple index split (JIRA 2595?) and
> >> updating the hash ranges in solr, essentially splitting the range
> >> between the new and existing shard.  Is that right?
> >>
> >> On Fri, Dec 2, 2011 at 10:08 PM, Jamie Johnson <je...@gmail.com>
> wrote:
> >> > I think I see it.....so if I understand this correctly you specify
> >> > numShards as a system property, as new nodes come up they check ZK to
> >> > see if they should be a new shard or a replica based on if numShards
> >> > is met.  A few questions if a master goes down does a replica get
> >> > promoted?  If a new shard needs to be added is it just a matter of
> >> > starting a new solr instance with a higher numShards?  (understanding
> >> > that index rebalancing does not happen automatically now, but
> >> > presumably it could).
> >> >
> >> > On Fri, Dec 2, 2011 at 9:56 PM, Jamie Johnson <je...@gmail.com>
> wrote:
> >> >> How does it determine the number of shards to create?  How many
> >> >> replicas to create?
> >> >>
> >> >> On Fri, Dec 2, 2011 at 4:30 PM, Mark Miller <ma...@gmail.com>
> >> wrote:
> >> >>> Ah, okay - you are setting the shards in solr.xml - thats still an
> >> option
> >> >>> to force a node to a particular shard - but if you take that out,
> >> shards
> >> >>> will be auto assigned.
> >> >>>
> >> >>> By the way, because of the version code, distrib deletes don't work
> at
> >> the
> >> >>> moment - will get to that next week.
> >> >>>
> >> >>> - Mark
> >> >>>
> >> >>> On Fri, Dec 2, 2011 at 1:16 PM, Jamie Johnson <je...@gmail.com>
> >> wrote:
> >> >>>
> >> >>>> So I'm a fool.  I did set the numShards, the issue was so trivial
> it's
> >> >>>> embarrassing.  I did indeed have it setup as a replica, the shard
> >> >>>> names in solr.xml were both shard1.  This worked as I expected now.
> >> >>>>
> >> >>>> On Fri, Dec 2, 2011 at 1:02 PM, Mark Miller <markrmiller@gmail.com
> >
> >> wrote:
> >> >>>> >
> >> >>>> > They are unused params, so removing them wouldn't help anything.
> >> >>>> >
> >> >>>> > You might just want to wait till we are further along before
> playing
> >> >>>> with it.
> >> >>>> >
> >> >>>> > Or if you submit your full self contained test, I can see what's
> >> going
> >> >>>> on (eg its still unclear if you have started setting numShards?).
> >> >>>> >
> >> >>>> > I can do a similar set of actions in my tests and it works fine.
> The
> >> >>>> only reason I could see things working like this is if it thinks
> you
> >> have
> >> >>>> one shard - a leader and a replica.
> >> >>>> >
> >> >>>> > - Mark
> >> >>>> >
> >> >>>> > On Dec 2, 2011, at 12:41 PM, Jamie Johnson wrote:
> >> >>>> >
> >> >>>> >> Glad to hear I don't need to set shards/self, but removing them
> >> didn't
> >> >>>> >> seem to change what I'm seeing.  Doing this still results in 2
> >> >>>> >> documents 1 on 8983 and 1 on 7574.
> >> >>>> >>
> >> >>>> >> String key = "1";
> >> >>>> >>
> >> >>>> >>               SolrInputDocument solrDoc = new
> SolrInputDocument();
> >> >>>> >>               solrDoc.setField("key", key);
> >> >>>> >>
> >> >>>> >>               solrDoc.addField("content_mvtxt", "initial
> value");
> >> >>>> >>
> >> >>>> >>               SolrServer server = servers.get("
> >> >>>> http://localhost:8983/solr/collection1");
> >> >>>> >>
> >> >>>> >>               UpdateRequest ureq = new UpdateRequest();
> >> >>>> >>               ureq.setParam("update.chain",
> >> "distrib-update-chain");
> >> >>>> >>               ureq.add(solrDoc);
> >> >>>> >>               ureq.setAction(ACTION.COMMIT, true, true);
> >> >>>> >>               server.request(ureq);
> >> >>>> >>               server.commit();
> >> >>>> >>
> >> >>>> >>               solrDoc = new SolrInputDocument();
> >> >>>> >>               solrDoc.addField("key", key);
> >> >>>> >>               solrDoc.addField("content_mvtxt", "updated
> value");
> >> >>>> >>
> >> >>>> >>               server = servers.get("
> >> >>>> http://localhost:7574/solr/collection1");
> >> >>>> >>
> >> >>>> >>               ureq = new UpdateRequest();
> >> >>>> >>               ureq.setParam("update.chain",
> >> "distrib-update-chain");
> >> >>>> >>               ureq.add(solrDoc);
> >> >>>> >>               ureq.setAction(ACTION.COMMIT, true, true);
> >> >>>> >>               server.request(ureq);
> >> >>>> >>               server.commit();
> >> >>>> >>
> >> >>>> >>               server = servers.get("
> >> >>>> http://localhost:8983/solr/collection1");
> >> >>>> >>
> >> >>>> >>
> >> >>>> >>               server.commit();
> >> >>>> >>               System.out.println("done");
> >> >>>> >>
> >> >>>> >> On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller <
> >> markrmiller@gmail.com>
> >> >>>> wrote:
> >> >>>> >>> So I dunno. You are running a zk server and running in zk mode
> >> right?
> >> >>>> >>>
> >> >>>> >>> You don't need to / shouldn't set a shards or self param. The
> >> shards
> >> >>>> are
> >> >>>> >>> figured out from Zookeeper.
> >> >>>> >>>
> >> >>>> >>> You always want to use the distrib-update-chain. Eventually it
> >> will
> >> >>>> >>> probably be part of the default chain and auto turn in zk mode.
> >> >>>> >>>
> >> >>>> >>> If you are running in zk mode attached to a zk server, this
> should
> >> >>>> work no
> >> >>>> >>> problem. You can add docs to any server and they will be
> >> forwarded to
> >> >>>> the
> >> >>>> >>> correct shard leader and then versioned and forwarded to
> replicas.
> >> >>>> >>>
> >> >>>> >>> You can also use the CloudSolrServer solrj client - that way
> you
> >> don't
> >> >>>> even
> >> >>>> >>> have to choose a server to send docs too - in which case if it
> >> went
> >> >>>> down
> >> >>>> >>> you would have to choose another manually - CloudSolrServer
> >> >>>> automatically
> >> >>>> >>> finds one that is up through ZooKeeper. Eventually it will
> also be
> >> >>>> smart
> >> >>>> >>> and do the hashing itself so that it can send directly to the
> >> shard
> >> >>>> leader
> >> >>>> >>> that the doc would be forwarded to anyway.
> >> >>>> >>>
> >> >>>> >>> - Mark
> >> >>>> >>>
> >> >>>> >>> On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson <
> jej2003@gmail.com
> >> >
> >> >>>> wrote:
> >> >>>> >>>
> >> >>>> >>>> Really just trying to do a simple add and update test, the
> chain
> >> >>>> >>>> missing is just proof of my not understanding exactly how
> this is
> >> >>>> >>>> supposed to work.  I modified the code to this
> >> >>>> >>>>
> >> >>>> >>>>                String key = "1";
> >> >>>> >>>>
> >> >>>> >>>>                SolrInputDocument solrDoc = new
> >> SolrInputDocument();
> >> >>>> >>>>                solrDoc.setField("key", key);
> >> >>>> >>>>
> >> >>>> >>>>                 solrDoc.addField("content_mvtxt", "initial
> >> value");
> >> >>>> >>>>
> >> >>>> >>>>                SolrServer server = servers
> >> >>>> >>>>                                .get("
> >> >>>> >>>> http://localhost:8983/solr/collection1");
> >> >>>> >>>>
> >> >>>> >>>>                 UpdateRequest ureq = new UpdateRequest();
> >> >>>> >>>>                ureq.setParam("update.chain",
> >> "distrib-update-chain");
> >> >>>> >>>>                ureq.add(solrDoc);
> >> >>>> >>>>                ureq.setParam("shards",
> >> >>>> >>>>
> >> >>>> >>>>
> >>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
> >> >>>> >>>>                ureq.setParam("self", "foo");
> >> >>>> >>>>                ureq.setAction(ACTION.COMMIT, true, true);
> >> >>>> >>>>                server.request(ureq);
> >> >>>> >>>>                 server.commit();
> >> >>>> >>>>
> >> >>>> >>>>                solrDoc = new SolrInputDocument();
> >> >>>> >>>>                solrDoc.addField("key", key);
> >> >>>> >>>>                 solrDoc.addField("content_mvtxt", "updated
> >> value");
> >> >>>> >>>>
> >> >>>> >>>>                server = servers.get("
> >> >>>> >>>> http://localhost:7574/solr/collection1");
> >> >>>> >>>>
> >> >>>> >>>>                 ureq = new UpdateRequest();
> >> >>>> >>>>                ureq.setParam("update.chain",
> >> "distrib-update-chain");
> >> >>>> >>>>                 //
> >> >>>> ureq.deleteById("8060a9eb-9546-43ee-95bb-d18ea26a6285");
> >> >>>> >>>>                 ureq.add(solrDoc);
> >> >>>> >>>>                ureq.setParam("shards",
> >> >>>> >>>>
> >> >>>> >>>>
> >>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
> >> >>>> >>>>                ureq.setParam("self", "foo");
> >> >>>> >>>>                ureq.setAction(ACTION.COMMIT, true, true);
> >> >>>> >>>>                server.request(ureq);
> >> >>>> >>>>                 // server.add(solrDoc);
> >> >>>> >>>>                server.commit();
> >> >>>> >>>>                server = servers.get("
> >> >>>> >>>> http://localhost:8983/solr/collection1");
> >> >>>> >>>>
> >> >>>> >>>>
> >> >>>> >>>>                server.commit();
> >> >>>> >>>>                System.out.println("done");
> >> >>>> >>>>
> >> >>>> >>>> but I'm still seeing the doc appear on both shards.    After
> the
> >> first
> >> >>>> >>>> commit I see the doc on 8983 with "initial value".  after the
> >> second
> >> >>>> >>>> commit I see the updated value on 7574 and the old on 8983.
> >>  After the
> >> >>>> >>>> final commit the doc on 8983 gets updated.
> >> >>>> >>>>
> >> >>>> >>>> Is there something wrong with my test?
> >> >>>> >>>>
> >> >>>> >>>> On Thu, Dec 1, 2011 at 11:17 PM, Mark Miller <
> >> markrmiller@gmail.com>
> >> >>>> >>>> wrote:
> >> >>>> >>>>> Getting late - didn't really pay attention to your code I
> guess
> >> - why
> >> >>>> >>>> are you adding the first doc without specifying the distrib
> >> update
> >> >>>> chain?
> >> >>>> >>>> This is not really supported. It's going to just go to the
> >> server you
> >> >>>> >>>> specified - even with everything setup right, the update might
> >> then
> >> >>>> go to
> >> >>>> >>>> that same server or the other one depending on how it hashes.
> You
> >> >>>> really
> >> >>>> >>>> want to just always use the distrib update chain.  I guess I
> >> don't yet
> >> >>>> >>>> understand what you are trying to test.
> >> >>>> >>>>>
> >> >>>> >>>>> Sent from my iPad
> >> >>>> >>>>>
> >> >>>> >>>>> On Dec 1, 2011, at 10:57 PM, Mark Miller <
> markrmiller@gmail.com
> >> >
> >> >>>> wrote:
> >> >>>> >>>>>
> >> >>>> >>>>>> Not sure offhand - but things will be funky if you don't
> >> specify the
> >> >>>> >>>> correct numShards.
> >> >>>> >>>>>>
> >> >>>> >>>>>> The instance to shard assignment should be using numShards
> to
> >> >>>> assign.
> >> >>>> >>>> But then the hash to shard mapping actually goes on the
> number of
> >> >>>> shards it
> >> >>>> >>>> finds registered in ZK (it doesn't have to, but really these
> >> should be
> >> >>>> >>>> equal).
> >> >>>> >>>>>>
> >> >>>> >>>>>> So basically you are saying, I want 3 partitions, but you
> are
> >> only
> >> >>>> >>>> starting up 2 nodes, and the code is just not happy about that
> >> I'd
> >> >>>> guess.
> >> >>>> >>>> For the system to work properly, you have to fire up at least
> as
> >> many
> >> >>>> >>>> servers as numShards.
> >> >>>> >>>>>>
> >> >>>> >>>>>> What are you trying to do? 2 partitions with no replicas, or
> >> one
> >> >>>> >>>> partition with one replica?
> >> >>>> >>>>>>
> >> >>>> >>>>>> In either case, I think you will have better luck if you
> fire
> >> up at
> >> >>>> >>>> least as many servers as the numShards setting. Or lower the
> >> numShards
> >> >>>> >>>> setting.
> >> >>>> >>>>>>
> >> >>>> >>>>>> This is all a work in progress by the way - what you are
> >> trying to
> >> >>>> test
> >> >>>> >>>> should work if things are setup right though.
> >> >>>> >>>>>>
> >> >>>> >>>>>> - Mark
> >> >>>> >>>>>>
> >> >>>> >>>>>>
> >> >>>> >>>>>> On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote:
> >> >>>> >>>>>>
> >> >>>> >>>>>>> Thanks for the quick response.  With that change (have not
> >> done
> >> >>>> >>>>>>> numShards yet) shard1 got updated.  But now when executing
> the
> >> >>>> >>>>>>> following queries I get information back from both, which
> >> doesn't
> >> >>>> seem
> >> >>>> >>>>>>> right
> >> >>>> >>>>>>>
> >> >>>> >>>>>>> http://localhost:7574/solr/select/?q=*:*
> >> >>>> >>>>>>> <doc><str name="key">1</str><str
> name="content_mvtxt">updated
> >> >>>> >>>> value</str></doc>
> >> >>>> >>>>>>>
> >> >>>> >>>>>>> http://localhost:8983/solr/select?q=*:*
> >> >>>> >>>>>>> <doc><str name="key">1</str><str
> name="content_mvtxt">updated
> >> >>>> >>>> value</str></doc>
> >> >>>> >>>>>>>
> >> >>>> >>>>>>>
> >> >>>> >>>>>>>
> >> >>>> >>>>>>> On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller <
> >> >>>> markrmiller@gmail.com>
> >> >>>> >>>> wrote:
> >> >>>> >>>>>>>> Hmm...sorry bout that - so my first guess is that right
> now
> >> we are
> >> >>>> >>>> not distributing a commit (easy to add, just have not done
> it).
> >> >>>> >>>>>>>>
> >> >>>> >>>>>>>> Right now I explicitly commit on each server for tests.
> >> >>>> >>>>>>>>
> >> >>>> >>>>>>>> Can you try explicitly committing on server1 after
> updating
> >> the
> >> >>>> doc
> >> >>>> >>>> on server 2?
> >> >>>> >>>>>>>>
> >> >>>> >>>>>>>> I can start distributing commits tomorrow - been meaning
> to
> >> do it
> >> >>>> for
> >> >>>> >>>> my own convenience anyhow.
> >> >>>> >>>>>>>>
> >> >>>> >>>>>>>> Also, you want to pass the sys property numShards=1 on
> >> startup. I
> >> >>>> >>>> think it defaults to 3. That will give you one leader and one
> >> replica.
> >> >>>> >>>>>>>>
> >> >>>> >>>>>>>> - Mark
> >> >>>> >>>>>>>>
> >> >>>> >>>>>>>> On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:
> >> >>>> >>>>>>>>
> >> >>>> >>>>>>>>> So I couldn't resist, I attempted to do this tonight, I
> >> used the
> >> >>>> >>>>>>>>> solrconfig you mentioned (as is, no modifications), I
> setup
> >> a 2
> >> >>>> shard
> >> >>>> >>>>>>>>> cluster in collection1, I sent 1 doc to 1 of the shards,
> >> updated
> >> >>>> it
> >> >>>> >>>>>>>>> and sent the update to the other.  I don't see the
> >> modifications
> >> >>>> >>>>>>>>> though I only see the original document.  The following
> is
> >> the
> >> >>>> test
> >> >>>> >>>>>>>>>
> >> >>>> >>>>>>>>> public void update() throws Exception {
> >> >>>> >>>>>>>>>
> >> >>>> >>>>>>>>>              String key = "1";
> >> >>>> >>>>>>>>>
> >> >>>> >>>>>>>>>              SolrInputDocument solrDoc = new
> >> SolrInputDocument();
> >> >>>> >>>>>>>>>              solrDoc.setField("key", key);
> >> >>>> >>>>>>>>>
> >> >>>> >>>>>>>>>              solrDoc.addField("content", "initial
> value");
> >> >>>> >>>>>>>>>
> >> >>>> >>>>>>>>>              SolrServer server = servers
> >> >>>> >>>>>>>>>                              .get("
> >> >>>> >>>> http://localhost:8983/solr/collection1");
> >> >>>> >>>>>>>>>              server.add(solrDoc);
> >> >>>> >>>>>>>>>
> >> >>>> >>>>>>>>>              server.commit();
> >> >>>> >>>>>>>>>
> >> >>>> >>>>>>>>>              solrDoc = new SolrInputDocument();
> >> >>>> >>>>>>>>>              solrDoc.addField("key", key);
> >> >>>> >>>>>>>>>              solrDoc.addField("content", "updated
> value");
> >> >>>> >>>>>>>>>
> >> >>>> >>>>>>>>>              server = servers.get("
> >> >>>> >>>> http://localhost:7574/solr/collection1");
> >> >>>> >>>>>>>>>
> >> >>>> >>>>>>>>>              UpdateRequest ureq = new UpdateRequest();
> >> >>>> >>>>>>>>>              ureq.setParam("update.chain",
> >> >>>> "distrib-update-chain");
> >> >>>> >>>>>>>>>              ureq.add(solrDoc);
> >> >>>> >>>>>>>>>              ureq.setParam("shards",
> >> >>>> >>>>>>>>>
> >> >>>> >>>>
> >>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
> >> >>>> >>>>>>>>>              ureq.setParam("self", "foo");
> >> >>>> >>>>>>>>>              ureq.setAction(ACTION.COMMIT, true, true);
> >> >>>> >>>>>>>>>              server.request(ureq);
> >> >>>> >>>>>>>>>              System.out.println("done");
> >> >>>> >>>>>>>>>      }
> >> >>>> >>>>>>>>>
> >> >>>> >>>>>>>>> key is my unique field in schema.xml
> >> >>>> >>>>>>>>>
> >> >>>> >>>>>>>>> What am I doing wrong?
> >> >>>> >>>>>>>>>
> >> >>>> >>>>>>>>> On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson <
> >> jej2003@gmail.com
> >> >>>> >
> >> >>>> >>>> wrote:
> >> >>>> >>>>>>>>>> Yes, the ZK method seems much more flexible.  Adding a
> new
> >> shard
> >> >>>> >>>> would
> >> >>>> >>>>>>>>>> be simply updating the range assignments in ZK.  Where
> is
> >> this
> >> >>>> >>>>>>>>>> currently on the list of things to accomplish?  I don't
> >> have
> >> >>>> time to
> >> >>>> >>>>>>>>>> work on this now, but if you (or anyone) could provide
> >> >>>> direction I'd
> >> >>>> >>>>>>>>>> be willing to work on this when I had spare time.  I
> guess
> >> a
> >> >>>> JIRA
> >> >>>> >>>>>>>>>> detailing where/how to do this could help.  Not sure if
> the
> >> >>>> design
> >> >>>> >>>> has
> >> >>>> >>>>>>>>>> been thought out that far though.
> >> >>>> >>>>>>>>>>
> >> >>>> >>>>>>>>>> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <
> >> >>>> markrmiller@gmail.com>
> >> >>>> >>>> wrote:
> >> >>>> >>>>>>>>>>> Right now lets say you have one shard - everything
> there
> >> >>>> hashes to
> >> >>>> >>>> range X.
> >> >>>> >>>>>>>>>>>
> >> >>>> >>>>>>>>>>> Now you want to split that shard with an Index
> Splitter.
> >> >>>> >>>>>>>>>>>
> >> >>>> >>>>>>>>>>> You divide range X in two - giving you two ranges -
> then
> >> you
> >> >>>> start
> >> >>>> >>>> splitting. This is where the current Splitter needs a little
> >> >>>> modification.
> >> >>>> >>>> You decide which doc should go into which new index by
> rehashing
> >> each
> >> >>>> doc
> >> >>>> >>>> id in the index you are splitting - if its hash is greater
> than
> >> X/2,
> >> >>>> it
> >> >>>> >>>> goes into index1 - if its less, index2. I think there are a
> >> couple
> >> >>>> current
> >> >>>> >>>> Splitter impls, but one of them does something like, give me
> an
> >> id -
> >> >>>> now if
> >> >>>> >>>> the id's in the index are above that id, goto index1, if
> below,
> >> >>>> index2. We
> >> >>>> >>>> need to instead do a quick hash rather than simple id compare.
> >> >>>> >>>>>>>>>>>
> >> >>>> >>>>>>>>>>> Why do you need to do this on every shard?
> >> >>>> >>>>>>>>>>>
> >> >>>> >>>>>>>>>>> The other part we need that we dont have is to store
> hash
> >> range
> >> >>>> >>>> assignments in zookeeper - we don't do that yet because it's
> not
> >> >>>> needed
> >> >>>> >>>> yet. Instead we currently just simply calculate that on the
> fly
> >> (too
> >> >>>> often
> >> >>>> >>>> at the moment - on every request :) I intend to fix that of
> >> course).
> >> >>>> >>>>>>>>>>>
> >> >>>> >>>>>>>>>>> At the start, zk would say, for range X, goto this
> shard.
> >> After
> >> >>>> >>>> the split, it would say, for range less than X/2 goto the old
> >> node,
> >> >>>> for
> >> >>>> >>>> range greater than X/2 goto the new node.
> >> >>>> >>>>>>>>>>>
> >> >>>> >>>>>>>>>>> - Mark
> >> >>>> >>>>>>>>>>>
> >> >>>> >>>>>>>>>>> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
> >> >>>> >>>>>>>>>>>
> >> >>>> >>>>>>>>>>>> hmmm.....This doesn't sound like the hashing algorithm
> >> that's
> >> >>>> on
> >> >>>> >>>> the
> >> >>>> >>>>>>>>>>>> branch, right?  The algorithm you're mentioning sounds
> >> like
> >> >>>> there
> >> >>>> >>>> is
> >> >>>> >>>>>>>>>>>> some logic which is able to tell that a particular
> range
> >> >>>> should be
> >> >>>> >>>>>>>>>>>> distributed between 2 shards instead of 1.  So seems
> >> like a
> >> >>>> trade
> >> >>>> >>>> off
> >> >>>> >>>>>>>>>>>> between repartitioning the entire index (on every
> shard)
> >> and
> >> >>>> >>>> having a
> >> >>>> >>>>>>>>>>>> custom hashing algorithm which is able to handle the
> >> situation
> >> >>>> >>>> where 2
> >> >>>> >>>>>>>>>>>> or more shards map to a particular range.
> >> >>>> >>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <
> >> >>>> >>>> markrmiller@gmail.com> wrote:
> >> >>>> >>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
> >> >>>> >>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>> I am not familiar with the index splitter that is in
> >> >>>> contrib,
> >> >>>> >>>> but I'll
> >> >>>> >>>>>>>>>>>>>> take a look at it soon.  So the process sounds like
> it
> >> >>>> would be
> >> >>>> >>>> to run
> >> >>>> >>>>>>>>>>>>>> this on all of the current shards indexes based on
> the
> >> hash
> >> >>>> >>>> algorithm.
> >> >>>> >>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>> Not something I've thought deeply about myself yet,
> but
> >> I
> >> >>>> think
> >> >>>> >>>> the idea would be to split as many as you felt you needed to.
> >> >>>> >>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>> If you wanted to keep the full balance always, this
> >> would
> >> >>>> mean
> >> >>>> >>>> splitting every shard at once, yes. But this depends on how
> many
> >> boxes
> >> >>>> >>>> (partitions) you are willing/able to add at a time.
> >> >>>> >>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>> You might just split one index to start - now it's
> hash
> >> range
> >> >>>> >>>> would be handled by two shards instead of one (if you have 3
> >> replicas
> >> >>>> per
> >> >>>> >>>> shard, this would mean adding 3 more boxes). When you needed
> to
> >> expand
> >> >>>> >>>> again, you would split another index that was still handling
> its
> >> full
> >> >>>> >>>> starting range. As you grow, once you split every original
> index,
> >> >>>> you'd
> >> >>>> >>>> start again, splitting one of the now half ranges.
> >> >>>> >>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>> Is there also an index merger in contrib which
> could be
> >> >>>> used to
> >> >>>> >>>> merge
> >> >>>> >>>>>>>>>>>>>> indexes?  I'm assuming this would be the process?
> >> >>>> >>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>> You can merge with IndexWriter.addIndexes (Solr also
> >> has an
> >> >>>> >>>> admin command that can do this). But I'm not sure where this
> >> fits in?
> >> >>>> >>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>> - Mark
> >> >>>> >>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <
> >> >>>> >>>> markrmiller@gmail.com> wrote:
> >> >>>> >>>>>>>>>>>>>>> Not yet - we don't plan on working on this until a
> >> lot of
> >> >>>> >>>> other stuff is
> >> >>>> >>>>>>>>>>>>>>> working solid at this point. But someone else could
> >> jump
> >> >>>> in!
> >> >>>> >>>>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>>> There are a couple ways to go about it that I know
> of:
> >> >>>> >>>>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>>> A more long term solution may be to start using
> micro
> >> >>>> shards -
> >> >>>> >>>> each index
> >> >>>> >>>>>>>>>>>>>>> starts as multiple indexes. This makes it pretty
> fast
> >> to
> >> >>>> move
> >> >>>> >>>> mirco shards
> >> >>>> >>>>>>>>>>>>>>> around as you decide to change partitions. It's
> also
> >> less
> >> >>>> >>>> flexible as you
> >> >>>> >>>>>>>>>>>>>>> are limited by the number of micro shards you start
> >> with.
> >> >>>> >>>>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>>> A more simple and likely first step is to use an
> index
> >> >>>> >>>> splitter . We
> >> >>>> >>>>>>>>>>>>>>> already have one in lucene contrib - we would just
> >> need to
> >> >>>> >>>> modify it so
> >> >>>> >>>>>>>>>>>>>>> that it splits based on the hash of the document
> id.
> >> This
> >> >>>> is
> >> >>>> >>>> super
> >> >>>> >>>>>>>>>>>>>>> flexible, but splitting will obviously take a
> little
> >> while
> >> >>>> on
> >> >>>> >>>> a huge index.
> >> >>>> >>>>>>>>>>>>>>> The current index splitter is a multi pass
> splitter -
> >> good
> >> >>>> >>>> enough to start
> >> >>>> >>>>>>>>>>>>>>> with, but most files under codec control these
> days,
> >> we
> >> >>>> may be
> >> >>>> >>>> able to make
> >> >>>> >>>>>>>>>>>>>>> a single pass splitter soon as well.
> >> >>>> >>>>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>>> Eventually you could imagine using both options -
> >> micro
> >> >>>> shards
> >> >>>> >>>> that could
> >> >>>> >>>>>>>>>>>>>>> also be split as needed. Though I still wonder if
> >> micro
> >> >>>> shards
> >> >>>> >>>> will be
> >> >>>> >>>>>>>>>>>>>>> worth the extra complications myself...
> >> >>>> >>>>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>>> Right now though, the idea is that you should pick
> a
> >> good
> >> >>>> >>>> number of
> >> >>>> >>>>>>>>>>>>>>> partitions to start given your expected data ;)
> >> Adding more
> >> >>>> >>>> replicas is
> >> >>>> >>>>>>>>>>>>>>> trivial though.
> >> >>>> >>>>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>>> - Mark
> >> >>>> >>>>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <
> >> >>>> >>>> jej2003@gmail.com> wrote:
> >> >>>> >>>>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>>>> Another question, is there any support for
> >> repartitioning
> >> >>>> of
> >> >>>> >>>> the index
> >> >>>> >>>>>>>>>>>>>>>> if a new shard is added?  What is the recommended
> >> >>>> approach for
> >> >>>> >>>>>>>>>>>>>>>> handling this?  It seemed that the hashing
> algorithm
> >> (and
> >> >>>> >>>> probably
> >> >>>> >>>>>>>>>>>>>>>> any) would require the index to be repartitioned
> >> should a
> >> >>>> new
> >> >>>> >>>> shard be
> >> >>>> >>>>>>>>>>>>>>>> added.
> >> >>>> >>>>>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <
> >> >>>> >>>> jej2003@gmail.com> wrote:
> >> >>>> >>>>>>>>>>>>>>>>> Thanks I will try this first thing in the
> morning.
> >> >>>> >>>>>>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <
> >> >>>> >>>> markrmiller@gmail.com>
> >> >>>> >>>>>>>>>>>>>>>> wrote:
> >> >>>> >>>>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <
> >> >>>> >>>> jej2003@gmail.com>
> >> >>>> >>>>>>>>>>>>>>>> wrote:
> >> >>>> >>>>>>>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>>>>>>> I am currently looking at the latest solrcloud
> >> branch
> >> >>>> and
> >> >>>> >>>> was
> >> >>>> >>>>>>>>>>>>>>>>>>> wondering if there was any documentation on
> >> >>>> configuring the
> >> >>>> >>>>>>>>>>>>>>>>>>> DistributedUpdateProcessor?  What specifically
> in
> >> >>>> >>>> solrconfig.xml needs
> >> >>>> >>>>>>>>>>>>>>>>>>> to be added/modified to make distributed
> indexing
> >> work?
> >> >>>> >>>>>>>>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>>>>>> Hi Jaime - take a look at
> >> solrconfig-distrib-update.xml
> >> >>>> in
> >> >>>> >>>>>>>>>>>>>>>>>> solr/core/src/test-files
> >> >>>> >>>>>>>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>>>>>> You need to enable the update log, add an empty
> >> >>>> replication
> >> >>>> >>>> handler def,
> >> >>>> >>>>>>>>>>>>>>>>>> and an update chain with
> >> >>>> >>>> solr.DistributedUpdateProcessFactory in it.
> >> >>>> >>>>>>>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>>>>>> --
> >> >>>> >>>>>>>>>>>>>>>>>> - Mark
> >> >>>> >>>>>>>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>>>>>> http://www.lucidimagination.com
> >> >>>> >>>>>>>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>>> --
> >> >>>> >>>>>>>>>>>>>>> - Mark
> >> >>>> >>>>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>>> http://www.lucidimagination.com
> >> >>>> >>>>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>> - Mark Miller
> >> >>>> >>>>>>>>>>>>> lucidimagination.com
> >> >>>> >>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>>>
> >> >>>> >>>>>>>>>>>
> >> >>>> >>>>>>>>>>> - Mark Miller
> >> >>>> >>>>>>>>>>> lucidimagination.com
> >> >>>> >>>>>>>>>>>
> >> >>>> >>>>>>>>>>>
> >> >>>> >>>>>>>>>>>
> >> >>>> >>>>>>>>>>>
> >> >>>> >>>>>>>>>>>
> >> >>>> >>>>>>>>>>>
> >> >>>> >>>>>>>>>>>
> >> >>>> >>>>>>>>>>>
> >> >>>> >>>>>>>>>>>
> >> >>>> >>>>>>>>>>>
> >> >>>> >>>>>>>>>>>
> >> >>>> >>>>>>>>>>>
> >> >>>> >>>>>>>>>>
> >> >>>> >>>>>>>>
> >> >>>> >>>>>>>> - Mark Miller
> >> >>>> >>>>>>>> lucidimagination.com
> >> >>>> >>>>>>>>
> >> >>>> >>>>>>>>
> >> >>>> >>>>>>>>
> >> >>>> >>>>>>>>
> >> >>>> >>>>>>>>
> >> >>>> >>>>>>>>
> >> >>>> >>>>>>>>
> >> >>>> >>>>>>>>
> >> >>>> >>>>>>>>
> >> >>>> >>>>>>>>
> >> >>>> >>>>>>>>
> >> >>>> >>>>>>>>
> >> >>>> >>>>>>
> >> >>>> >>>>>> - Mark Miller
> >> >>>> >>>>>> lucidimagination.com
> >> >>>> >>>>>>
> >> >>>> >>>>>>
> >> >>>> >>>>>>
> >> >>>> >>>>>>
> >> >>>> >>>>>>
> >> >>>> >>>>>>
> >> >>>> >>>>>>
> >> >>>> >>>>>>
> >> >>>> >>>>>>
> >> >>>> >>>>>>
> >> >>>> >>>>>>
> >> >>>> >>>>>
> >> >>>> >>>>
> >> >>>> >>>
> >> >>>> >>>
> >> >>>> >>>
> >> >>>> >>> --
> >> >>>> >>> - Mark
> >> >>>> >>>
> >> >>>> >>> http://www.lucidimagination.com
> >> >>>> >>>
> >> >>>> >
> >> >>>> > - Mark Miller
> >> >>>> > lucidimagination.com
> >> >>>> >
> >> >>>> >
> >> >>>> >
> >> >>>> >
> >> >>>> >
> >> >>>> >
> >> >>>> >
> >> >>>> >
> >> >>>> >
> >> >>>> >
> >> >>>> >
> >> >>>> >
> >> >>>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> --
> >> >>> - Mark
> >> >>>
> >> >>> http://www.lucidimagination.com
> >>
> >
> >
> >
> > --
> > - Mark
> >
> > http://www.lucidimagination.com
>



-- 
- Mark

http://www.lucidimagination.com

Re: Configuring the Distributed

Posted by Jamie Johnson <je...@gmail.com>.
Again great stuff.  Once distributed update/delete works (sounds like
it's not far off) I'll have to reevaluate our current stack.

You had mentioned storing the shard hash assignments in ZK; is there a
JIRA around this?

I'll keep my eyes on the JIRA tickets.  Right now distributed
update/delete are the big ones for me; the rebalancing of the cluster is
a nice-to-have, but hopefully I won't need that capability anytime
in the near future.

One other question: my current setup does replication via polling.  I
didn't notice that in the updated solrconfig; how does this work now?
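
(Mark's reply above covers this: replication is now only used when a node
recovers, so there is no polling configuration. As a hedged sketch only,
with placeholder host and interval values, the change amounts to dropping
the old slave block and keeping a bare handler definition:)

    <!-- old style polling setup (what the question describes) -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://master-host:8983/solr/replication</str>
        <str name="pollInterval">00:00:60</str>
      </lst>
    </requestHandler>

    <!-- SolrCloud branch: just a bare handler; recovery replicates from the leader -->
    <requestHandler name="/replication" class="solr.ReplicationHandler" />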

On Sat, Dec 3, 2011 at 9:00 AM, Mark Miller <ma...@gmail.com> wrote:
> bq. A few questions if a master goes down does a replica get
> promoted?
>
> Right - if the leader goes down there is a leader election and one of the
> replicas takes over.
>
> bq.  If a new shard needs to be added is it just a matter of
> starting a new solr instance with a higher numShards?
>
> Eventually, that's the plan.
>
> The idea is, you say something like, I want 3 shards. Now if you start up 9
> instances, the first 3 end up as shard leaders - the next 6 evenly come up
> as replicas for each shard.
>
> To change the numShards, we will need some kind of micro shards / splitting
> / rebalancing.
>
> bq. Last question, how do you change numShards?
>
> I think this is somewhat a work in progress, but I think Sami just made it
> so that numShards is stored on the collection node in zk (along with which
> config set to use). So you would change it there presumably. Or perhaps
> just start up a new server with an update numShards property and then it
> would realize that needs to be a new leader - of course then you'd want to
> rebalance probably - unless you fired up enough servers to add replicas
> too...
>
> bq. Is that right?
>
> Yup, sounds about right.
>
>
> On Fri, Dec 2, 2011 at 10:59 PM, Jamie Johnson <je...@gmail.com> wrote:
>
>> So I just tried this out, seems like it does the things I asked about.
>>
>> Really really cool stuff, it's progressed quite a bit in the time
>> since I took a snapshot of the branch.
>>
>> Last question, how do you change numShards?  Is there a command you
>> can use to do this now? I understand there will be implications for
>> the hashing algorithm, but once the hash ranges are stored in ZK (is
>> there a separate JIRA for this or does this fall under 2358) I assume
>> that it would be a relatively simple index split (JIRA 2595?) and
>> updating the hash ranges in solr, essentially splitting the range
>> between the new and existing shard.  Is that right?
>>
>> On Fri, Dec 2, 2011 at 10:08 PM, Jamie Johnson <je...@gmail.com> wrote:
>> > I think I see it.....so if I understand this correctly you specify
>> > numShards as a system property, as new nodes come up they check ZK to
>> > see if they should be a new shard or a replica based on if numShards
>> > is met.  A few questions if a master goes down does a replica get
>> > promoted?  If a new shard needs to be added is it just a matter of
>> > starting a new solr instance with a higher numShards?  (understanding
>> > that index rebalancing does not happen automatically now, but
>> > presumably it could).
>> >
>> > On Fri, Dec 2, 2011 at 9:56 PM, Jamie Johnson <je...@gmail.com> wrote:
>> >> How does it determine the number of shards to create?  How many
>> >> replicas to create?
>> >>
>> >> On Fri, Dec 2, 2011 at 4:30 PM, Mark Miller <ma...@gmail.com>
>> wrote:
>> >>> Ah, okay - you are setting the shards in solr.xml - thats still an
>> option
>> >>> to force a node to a particular shard - but if you take that out,
>> shards
>> >>> will be auto assigned.
>> >>>
>> >>> By the way, because of the version code, distrib deletes don't work at
>> the
>> >>> moment - will get to that next week.
>> >>>
>> >>> - Mark
>> >>>
>> >>> On Fri, Dec 2, 2011 at 1:16 PM, Jamie Johnson <je...@gmail.com>
>> wrote:
>> >>>
>> >>>> So I'm a fool.  I did set the numShards, the issue was so trivial it's
>> >>>> embarrassing.  I did indeed have it setup as a replica, the shard
>> >>>> names in solr.xml were both shard1.  This worked as I expected now.
>> >>>>
>> >>>> On Fri, Dec 2, 2011 at 1:02 PM, Mark Miller <ma...@gmail.com>
>> wrote:
>> >>>> >
>> >>>> > They are unused params, so removing them wouldn't help anything.
>> >>>> >
>> >>>> > You might just want to wait till we are further along before playing
>> >>>> with it.
>> >>>> >
>> >>>> > Or if you submit your full self contained test, I can see what's
>> going
>> >>>> on (eg its still unclear if you have started setting numShards?).
>> >>>> >
>> >>>> > I can do a similar set of actions in my tests and it works fine. The
>> >>>> only reason I could see things working like this is if it thinks you
>> have
>> >>>> one shard - a leader and a replica.
>> >>>> >
>> >>>> > - Mark
>> >>>> >
>> >>>> > On Dec 2, 2011, at 12:41 PM, Jamie Johnson wrote:
>> >>>> >
>> >>>> >> Glad to hear I don't need to set shards/self, but removing them
>> didn't
>> >>>> >> seem to change what I'm seeing.  Doing this still results in 2
>> >>>> >> documents 1 on 8983 and 1 on 7574.
>> >>>> >>
>> >>>> >> String key = "1";
>> >>>> >>
>> >>>> >>               SolrInputDocument solrDoc = new SolrInputDocument();
>> >>>> >>               solrDoc.setField("key", key);
>> >>>> >>
>> >>>> >>               solrDoc.addField("content_mvtxt", "initial value");
>> >>>> >>
>> >>>> >>               SolrServer server = servers.get("
>> >>>> http://localhost:8983/solr/collection1");
>> >>>> >>
>> >>>> >>               UpdateRequest ureq = new UpdateRequest();
>> >>>> >>               ureq.setParam("update.chain",
>> "distrib-update-chain");
>> >>>> >>               ureq.add(solrDoc);
>> >>>> >>               ureq.setAction(ACTION.COMMIT, true, true);
>> >>>> >>               server.request(ureq);
>> >>>> >>               server.commit();
>> >>>> >>
>> >>>> >>               solrDoc = new SolrInputDocument();
>> >>>> >>               solrDoc.addField("key", key);
>> >>>> >>               solrDoc.addField("content_mvtxt", "updated value");
>> >>>> >>
>> >>>> >>               server = servers.get("
>> >>>> http://localhost:7574/solr/collection1");
>> >>>> >>
>> >>>> >>               ureq = new UpdateRequest();
>> >>>> >>               ureq.setParam("update.chain",
>> "distrib-update-chain");
>> >>>> >>               ureq.add(solrDoc);
>> >>>> >>               ureq.setAction(ACTION.COMMIT, true, true);
>> >>>> >>               server.request(ureq);
>> >>>> >>               server.commit();
>> >>>> >>
>> >>>> >>               server = servers.get("
>> >>>> http://localhost:8983/solr/collection1");
>> >>>> >>
>> >>>> >>
>> >>>> >>               server.commit();
>> >>>> >>               System.out.println("done");
>> >>>> >>
>> >>>> >> On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller <
>> markrmiller@gmail.com>
>> >>>> wrote:
>> >>>> >>> So I dunno. You are running a zk server and running in zk mode
>> right?
>> >>>> >>>
>> >>>> >>> You don't need to / shouldn't set a shards or self param. The
>> shards
>> >>>> are
>> >>>> >>> figured out from Zookeeper.
>> >>>> >>>
>> >>>> >>> You always want to use the distrib-update-chain. Eventually it
>> will
>> >>>> >>> probably be part of the default chain and auto turn in zk mode.
>> >>>> >>>
>> >>>> >>> If you are running in zk mode attached to a zk server, this should
>> >>>> work no
>> >>>> >>> problem. You can add docs to any server and they will be
>> forwarded to
>> >>>> the
>> >>>> >>> correct shard leader and then versioned and forwarded to replicas.
>> >>>> >>>
>> >>>> >>> You can also use the CloudSolrServer solrj client - that way you
>> don't
>> >>>> even
>> >>>> >>> have to choose a server to send docs too - in which case if it
>> went
>> >>>> down
>> >>>> >>> you would have to choose another manually - CloudSolrServer
>> >>>> automatically
>> >>>> >>> finds one that is up through ZooKeeper. Eventually it will also be
>> >>>> smart
>> >>>> >>> and do the hashing itself so that it can send directly to the
>> shard
>> >>>> leader
>> >>>> >>> that the doc would be forwarded to anyway.
>> >>>> >>>
>> >>>> >>> - Mark
>> >>>> >>>
>> >>>> >>> On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson <jej2003@gmail.com
>> >
>> >>>> wrote:
>> >>>> >>>
>> >>>> >>>> Really just trying to do a simple add and update test, the chain
>> >>>> >>>> missing is just proof of my not understanding exactly how this is
>> >>>> >>>> supposed to work.  I modified the code to this
>> >>>> >>>>
>> >>>> >>>>                String key = "1";
>> >>>> >>>>
>> >>>> >>>>                SolrInputDocument solrDoc = new
>> SolrInputDocument();
>> >>>> >>>>                solrDoc.setField("key", key);
>> >>>> >>>>
>> >>>> >>>>                 solrDoc.addField("content_mvtxt", "initial
>> value");
>> >>>> >>>>
>> >>>> >>>>                SolrServer server = servers
>> >>>> >>>>                                .get("
>> >>>> >>>> http://localhost:8983/solr/collection1");
>> >>>> >>>>
>> >>>> >>>>                 UpdateRequest ureq = new UpdateRequest();
>> >>>> >>>>                ureq.setParam("update.chain",
>> "distrib-update-chain");
>> >>>> >>>>                ureq.add(solrDoc);
>> >>>> >>>>                ureq.setParam("shards",
>> >>>> >>>>
>> >>>> >>>>
>>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>> >>>> >>>>                ureq.setParam("self", "foo");
>> >>>> >>>>                ureq.setAction(ACTION.COMMIT, true, true);
>> >>>> >>>>                server.request(ureq);
>> >>>> >>>>                 server.commit();
>> >>>> >>>>
>> >>>> >>>>                solrDoc = new SolrInputDocument();
>> >>>> >>>>                solrDoc.addField("key", key);
>> >>>> >>>>                 solrDoc.addField("content_mvtxt", "updated
>> value");
>> >>>> >>>>
>> >>>> >>>>                server = servers.get("
>> >>>> >>>> http://localhost:7574/solr/collection1");
>> >>>> >>>>
>> >>>> >>>>                 ureq = new UpdateRequest();
>> >>>> >>>>                ureq.setParam("update.chain",
>> "distrib-update-chain");
>> >>>> >>>>                 //
>> >>>> ureq.deleteById("8060a9eb-9546-43ee-95bb-d18ea26a6285");
>> >>>> >>>>                 ureq.add(solrDoc);
>> >>>> >>>>                ureq.setParam("shards",
>> >>>> >>>>
>> >>>> >>>>
>>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>> >>>> >>>>                ureq.setParam("self", "foo");
>> >>>> >>>>                ureq.setAction(ACTION.COMMIT, true, true);
>> >>>> >>>>                server.request(ureq);
>> >>>> >>>>                 // server.add(solrDoc);
>> >>>> >>>>                server.commit();
>> >>>> >>>>                server = servers.get("
>> >>>> >>>> http://localhost:8983/solr/collection1");
>> >>>> >>>>
>> >>>> >>>>
>> >>>> >>>>                server.commit();
>> >>>> >>>>                System.out.println("done");
>> >>>> >>>>
>> >>>> >>>> but I'm still seeing the doc appear on both shards.    After the
>> first
>> >>>> >>>> commit I see the doc on 8983 with "initial value".  after the
>> second
>> >>>> >>>> commit I see the updated value on 7574 and the old on 8983.
>>  After the
>> >>>> >>>> final commit the doc on 8983 gets updated.
>> >>>> >>>>
>> >>>> >>>> Is there something wrong with my test?
>> >>>> >>>>
>> >>>> >>>> On Thu, Dec 1, 2011 at 11:17 PM, Mark Miller <
>> markrmiller@gmail.com>
>> >>>> >>>> wrote:
>> >>>> >>>>> Getting late - didn't really pay attention to your code I guess
>> - why
>> >>>> >>>> are you adding the first doc without specifying the distrib
>> update
>> >>>> chain?
>> >>>> >>>> This is not really supported. It's going to just go to the
>> server you
>> >>>> >>>> specified - even with everything setup right, the update might
>> then
>> >>>> go to
>> >>>> >>>> that same server or the other one depending on how it hashes. You
>> >>>> really
>> >>>> >>>> want to just always use the distrib update chain.  I guess I
>> don't yet
>> >>>> >>>> understand what you are trying to test.
>> >>>> >>>>>
>> >>>> >>>>> Sent from my iPad
>> >>>> >>>>>
>> >>>> >>>>> On Dec 1, 2011, at 10:57 PM, Mark Miller <markrmiller@gmail.com
>> >
>> >>>> wrote:
>> >>>> >>>>>
>> >>>> >>>>>> Not sure offhand - but things will be funky if you don't
>> specify the
>> >>>> >>>> correct numShards.
>> >>>> >>>>>>
>> >>>> >>>>>> The instance to shard assignment should be using numShards to
>> >>>> assign.
>> >>>> >>>> But then the hash to shard mapping actually goes on the number of
>> >>>> shards it
>> >>>> >>>> finds registered in ZK (it doesn't have to, but really these
>> should be
>> >>>> >>>> equal).
>> >>>> >>>>>>
>> >>>> >>>>>> So basically you are saying, I want 3 partitions, but you are
>> only
>> >>>> >>>> starting up 2 nodes, and the code is just not happy about that
>> I'd
>> >>>> guess.
>> >>>> >>>> For the system to work properly, you have to fire up at least as
>> many
>> >>>> >>>> servers as numShards.
>> >>>> >>>>>>
>> >>>> >>>>>> What are you trying to do? 2 partitions with no replicas, or
>> one
>> >>>> >>>> partition with one replica?
>> >>>> >>>>>>
>> >>>> >>>>>> In either case, I think you will have better luck if you fire
>> up at
>> >>>> >>>> least as many servers as the numShards setting. Or lower the
>> numShards
>> >>>> >>>> setting.
>> >>>> >>>>>>
>> >>>> >>>>>> This is all a work in progress by the way - what you are
>> trying to
>> >>>> test
>> >>>> >>>> should work if things are setup right though.
>> >>>> >>>>>>
>> >>>> >>>>>> - Mark
>> >>>> >>>>>>
>> >>>> >>>>>>
>> >>>> >>>>>> On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote:
>> >>>> >>>>>>
>> >>>> >>>>>>> Thanks for the quick response.  With that change (have not
>> done
>> >>>> >>>>>>> numShards yet) shard1 got updated.  But now when executing the
>> >>>> >>>>>>> following queries I get information back from both, which
>> doesn't
>> >>>> seem
>> >>>> >>>>>>> right
>> >>>> >>>>>>>
>> >>>> >>>>>>> http://localhost:7574/solr/select/?q=*:*
>> >>>> >>>>>>> <doc><str name="key">1</str><str name="content_mvtxt">updated
>> >>>> >>>> value</str></doc>
>> >>>> >>>>>>>
>> >>>> >>>>>>> http://localhost:8983/solr/select?q=*:*
>> >>>> >>>>>>> <doc><str name="key">1</str><str name="content_mvtxt">updated
>> >>>> >>>> value</str></doc>
>> >>>> >>>>>>>
>> >>>> >>>>>>>
>> >>>> >>>>>>>
>> >>>> >>>>>>> On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller <
>> >>>> markrmiller@gmail.com>
>> >>>> >>>> wrote:
>> >>>> >>>>>>>> Hmm...sorry bout that - so my first guess is that right now
>> we are
>> >>>> >>>> not distributing a commit (easy to add, just have not done it).
>> >>>> >>>>>>>>
>> >>>> >>>>>>>> Right now I explicitly commit on each server for tests.
>> >>>> >>>>>>>>
>> >>>> >>>>>>>> Can you try explicitly committing on server1 after updating
>> the
>> >>>> doc
>> >>>> >>>> on server 2?
>> >>>> >>>>>>>>
>> >>>> >>>>>>>> I can start distributing commits tomorrow - been meaning to
>> do it
>> >>>> for
>> >>>> >>>> my own convenience anyhow.
>> >>>> >>>>>>>>
>> >>>> >>>>>>>> Also, you want to pass the sys property numShards=1 on
>> startup. I
>> >>>> >>>> think it defaults to 3. That will give you one leader and one
>> replica.
>> >>>> >>>>>>>>
>> >>>> >>>>>>>> - Mark
>> >>>> >>>>>>>>
>> >>>> >>>>>>>> On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:
>> >>>> >>>>>>>>
>> >>>> >>>>>>>>> So I couldn't resist, I attempted to do this tonight, I
>> used the
>> >>>> >>>>>>>>> solrconfig you mentioned (as is, no modifications), I setup
>> a 2
>> >>>> shard
>> >>>> >>>>>>>>> cluster in collection1, I sent 1 doc to 1 of the shards,
>> updated
>> >>>> it
>> >>>> >>>>>>>>> and sent the update to the other.  I don't see the
>> modifications
>> >>>> >>>>>>>>> though I only see the original document.  The following is
>> the
>> >>>> test
>> >>>> >>>>>>>>>
>> >>>> >>>>>>>>> public void update() throws Exception {
>> >>>> >>>>>>>>>
>> >>>> >>>>>>>>>              String key = "1";
>> >>>> >>>>>>>>>
>> >>>> >>>>>>>>>              SolrInputDocument solrDoc = new
>> SolrInputDocument();
>> >>>> >>>>>>>>>              solrDoc.setField("key", key);
>> >>>> >>>>>>>>>
>> >>>> >>>>>>>>>              solrDoc.addField("content", "initial value");
>> >>>> >>>>>>>>>
>> >>>> >>>>>>>>>              SolrServer server = servers
>> >>>> >>>>>>>>>                              .get("
>> >>>> >>>> http://localhost:8983/solr/collection1");
>> >>>> >>>>>>>>>              server.add(solrDoc);
>> >>>> >>>>>>>>>
>> >>>> >>>>>>>>>              server.commit();
>> >>>> >>>>>>>>>
>> >>>> >>>>>>>>>              solrDoc = new SolrInputDocument();
>> >>>> >>>>>>>>>              solrDoc.addField("key", key);
>> >>>> >>>>>>>>>              solrDoc.addField("content", "updated value");
>> >>>> >>>>>>>>>
>> >>>> >>>>>>>>>              server = servers.get("
>> >>>> >>>> http://localhost:7574/solr/collection1");
>> >>>> >>>>>>>>>
>> >>>> >>>>>>>>>              UpdateRequest ureq = new UpdateRequest();
>> >>>> >>>>>>>>>              ureq.setParam("update.chain",
>> >>>> "distrib-update-chain");
>> >>>> >>>>>>>>>              ureq.add(solrDoc);
>> >>>> >>>>>>>>>              ureq.setParam("shards",
>> >>>> >>>>>>>>>
>> >>>> >>>>
>>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>> >>>> >>>>>>>>>              ureq.setParam("self", "foo");
>> >>>> >>>>>>>>>              ureq.setAction(ACTION.COMMIT, true, true);
>> >>>> >>>>>>>>>              server.request(ureq);
>> >>>> >>>>>>>>>              System.out.println("done");
>> >>>> >>>>>>>>>      }
>> >>>> >>>>>>>>>
>> >>>> >>>>>>>>> key is my unique field in schema.xml
>> >>>> >>>>>>>>>
>> >>>> >>>>>>>>> What am I doing wrong?
>> >>>> >>>>>>>>>
>> >>>> >>>>>>>>> On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson <
>> jej2003@gmail.com
>> >>>> >
>> >>>> >>>> wrote:
>> >>>> >>>>>>>>>> Yes, the ZK method seems much more flexible.  Adding a new
>> shard
>> >>>> >>>> would
>> >>>> >>>>>>>>>> be simply updating the range assignments in ZK.  Where is
>> this
>> >>>> >>>>>>>>>> currently on the list of things to accomplish?  I don't
>> have
>> >>>> time to
>> >>>> >>>>>>>>>> work on this now, but if you (or anyone) could provide
>> >>>> direction I'd
>> >>>> >>>>>>>>>> be willing to work on this when I had spare time.  I guess
>> a
>> >>>> JIRA
>> >>>> >>>>>>>>>> detailing where/how to do this could help.  Not sure if the
>> >>>> design
>> >>>> >>>> has
>> >>>> >>>>>>>>>> been thought out that far though.
>> >>>> >>>>>>>>>>
>> >>>> >>>>>>>>>> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <
>> >>>> markrmiller@gmail.com>
>> >>>> >>>> wrote:
>> >>>> >>>>>>>>>>> Right now lets say you have one shard - everything there
>> >>>> hashes to
>> >>>> >>>> range X.
>> >>>> >>>>>>>>>>>
>> >>>> >>>>>>>>>>> Now you want to split that shard with an Index Splitter.
>> >>>> >>>>>>>>>>>
>> >>>> >>>>>>>>>>> You divide range X in two - giving you two ranges - then
>> you
>> >>>> start
>> >>>> >>>> splitting. This is where the current Splitter needs a little
>> >>>> modification.
>> >>>> >>>> You decide which doc should go into which new index by rehashing
>> each
>> >>>> doc
>> >>>> >>>> id in the index you are splitting - if its hash is greater than
>> X/2,
>> >>>> it
>> >>>> >>>> goes into index1 - if its less, index2. I think there are a
>> couple
>> >>>> current
>> >>>> >>>> Splitter impls, but one of them does something like, give me an
>> id -
>> >>>> now if
>> >>>> >>>> the id's in the index are above that id, goto index1, if below,
>> >>>> index2. We
>> >>>> >>>> need to instead do a quick hash rather than simple id compare.
>> >>>> >>>>>>>>>>>
>> >>>> >>>>>>>>>>> Why do you need to do this on every shard?
>> >>>> >>>>>>>>>>>
>> >>>> >>>>>>>>>>> The other part we need that we dont have is to store hash
>> range
>> >>>> >>>> assignments in zookeeper - we don't do that yet because it's not
>> >>>> needed
>> >>>> >>>> yet. Instead we currently just simply calculate that on the fly
>> (too
>> >>>> often
>> >>>> >>>> at the moment - on every request :) I intend to fix that of
>> course).
>> >>>> >>>>>>>>>>>
>> >>>> >>>>>>>>>>> At the start, zk would say, for range X, goto this shard.
>> After
>> >>>> >>>> the split, it would say, for range less than X/2 goto the old
>> node,
>> >>>> for
>> >>>> >>>> range greater than X/2 goto the new node.
>> >>>> >>>>>>>>>>>
>> >>>> >>>>>>>>>>> - Mark
>> >>>> >>>>>>>>>>>
>> >>>> >>>>>>>>>>> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
>> >>>> >>>>>>>>>>>
>> >>>> >>>>>>>>>>>> hmmm.....This doesn't sound like the hashing algorithm
>> that's
>> >>>> on
>> >>>> >>>> the
>> >>>> >>>>>>>>>>>> branch, right?  The algorithm you're mentioning sounds
>> like
>> >>>> there
>> >>>> >>>> is
>> >>>> >>>>>>>>>>>> some logic which is able to tell that a particular range
>> >>>> should be
>> >>>> >>>>>>>>>>>> distributed between 2 shards instead of 1.  So seems
>> like a
>> >>>> trade
>> >>>> >>>> off
>> >>>> >>>>>>>>>>>> between repartitioning the entire index (on every shard)
>> and
>> >>>> >>>> having a
>> >>>> >>>>>>>>>>>> custom hashing algorithm which is able to handle the
>> situation
>> >>>> >>>> where 2
>> >>>> >>>>>>>>>>>> or more shards map to a particular range.
>> >>>> >>>>>>>>>>>>
>> >>>> >>>>>>>>>>>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <
>> >>>> >>>> markrmiller@gmail.com> wrote:
>> >>>> >>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>> >>>> >>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>> I am not familiar with the index splitter that is in
>> >>>> contrib,
>> >>>> >>>> but I'll
>> >>>> >>>>>>>>>>>>>> take a look at it soon.  So the process sounds like it
>> >>>> would be
>> >>>> >>>> to run
>> >>>> >>>>>>>>>>>>>> this on all of the current shards indexes based on the
>> hash
>> >>>> >>>> algorithm.
>> >>>> >>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>> Not something I've thought deeply about myself yet, but
>> I
>> >>>> think
>> >>>> >>>> the idea would be to split as many as you felt you needed to.
>> >>>> >>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>> If you wanted to keep the full balance always, this
>> would
>> >>>> mean
>> >>>> >>>> splitting every shard at once, yes. But this depends on how many
>> boxes
>> >>>> >>>> (partitions) you are willing/able to add at a time.
>> >>>> >>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>> You might just split one index to start - now it's hash
>> range
>> >>>> >>>> would be handled by two shards instead of one (if you have 3
>> replicas
>> >>>> per
>> >>>> >>>> shard, this would mean adding 3 more boxes). When you needed to
>> expand
>> >>>> >>>> again, you would split another index that was still handling its
>> full
>> >>>> >>>> starting range. As you grow, once you split every original index,
>> >>>> you'd
>> >>>> >>>> start again, splitting one of the now half ranges.
>> >>>> >>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>> Is there also an index merger in contrib which could be
>> >>>> used to
>> >>>> >>>> merge
>> >>>> >>>>>>>>>>>>>> indexes?  I'm assuming this would be the process?
>> >>>> >>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>> You can merge with IndexWriter.addIndexes (Solr also
>> has an
>> >>>> >>>> admin command that can do this). But I'm not sure where this
>> fits in?
>> >>>> >>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>> - Mark
>> >>>> >>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <
>> >>>> >>>> markrmiller@gmail.com> wrote:
>> >>>> >>>>>>>>>>>>>>> Not yet - we don't plan on working on this until a
>> lot of
>> >>>> >>>> other stuff is
>> >>>> >>>>>>>>>>>>>>> working solid at this point. But someone else could
>> jump
>> >>>> in!
>> >>>> >>>>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>>> There are a couple ways to go about it that I know of:
>> >>>> >>>>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>>> A more long term solution may be to start using micro
>> >>>> shards -
>> >>>> >>>> each index
>> >>>> >>>>>>>>>>>>>>> starts as multiple indexes. This makes it pretty fast
>> to
>> >>>> move
>> >>>> >>>> mirco shards
>> >>>> >>>>>>>>>>>>>>> around as you decide to change partitions. It's also
>> less
>> >>>> >>>> flexible as you
>> >>>> >>>>>>>>>>>>>>> are limited by the number of micro shards you start
>> with.
>> >>>> >>>>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>>> A more simple and likely first step is to use an index
>> >>>> >>>> splitter . We
>> >>>> >>>>>>>>>>>>>>> already have one in lucene contrib - we would just
>> need to
>> >>>> >>>> modify it so
>> >>>> >>>>>>>>>>>>>>> that it splits based on the hash of the document id.
>> This
>> >>>> is
>> >>>> >>>> super
>> >>>> >>>>>>>>>>>>>>> flexible, but splitting will obviously take a little
>> while
>> >>>> on
>> >>>> >>>> a huge index.
>> >>>> >>>>>>>>>>>>>>> The current index splitter is a multi pass splitter -
>> good
>> >>>> >>>> enough to start
>> >>>> >>>>>>>>>>>>>>> with, but most files under codec control these days,
>> we
>> >>>> may be
>> >>>> >>>> able to make
>> >>>> >>>>>>>>>>>>>>> a single pass splitter soon as well.
>> >>>> >>>>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>>> Eventually you could imagine using both options -
>> micro
>> >>>> shards
>> >>>> >>>> that could
>> >>>> >>>>>>>>>>>>>>> also be split as needed. Though I still wonder if
>> micro
>> >>>> shards
>> >>>> >>>> will be
>> >>>> >>>>>>>>>>>>>>> worth the extra complications myself...
>> >>>> >>>>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>>> Right now though, the idea is that you should pick a
>> good
>> >>>> >>>> number of
>> >>>> >>>>>>>>>>>>>>> partitions to start given your expected data ;)
>> Adding more
>> >>>> >>>> replicas is
>> >>>> >>>>>>>>>>>>>>> trivial though.
>> >>>> >>>>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>>> - Mark
>> >>>> >>>>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <
>> >>>> >>>> jej2003@gmail.com> wrote:
>> >>>> >>>>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>>>> Another question, is there any support for
>> repartitioning
>> >>>> of
>> >>>> >>>> the index
>> >>>> >>>>>>>>>>>>>>>> if a new shard is added?  What is the recommended
>> >>>> approach for
>> >>>> >>>>>>>>>>>>>>>> handling this?  It seemed that the hashing algorithm
>> (and
>> >>>> >>>> probably
>> >>>> >>>>>>>>>>>>>>>> any) would require the index to be repartitioned
>> should a
>> >>>> new
>> >>>> >>>> shard be
>> >>>> >>>>>>>>>>>>>>>> added.
>> >>>> >>>>>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <
>> >>>> >>>> jej2003@gmail.com> wrote:
>> >>>> >>>>>>>>>>>>>>>>> Thanks I will try this first thing in the morning.
>> >>>> >>>>>>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <
>> >>>> >>>> markrmiller@gmail.com>
>> >>>> >>>>>>>>>>>>>>>> wrote:
>> >>>> >>>>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <
>> >>>> >>>> jej2003@gmail.com>
>> >>>> >>>>>>>>>>>>>>>> wrote:
>> >>>> >>>>>>>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>>>>>>> I am currently looking at the latest solrcloud
>> branch
>> >>>> and
>> >>>> >>>> was
>> >>>> >>>>>>>>>>>>>>>>>>> wondering if there was any documentation on
>> >>>> configuring the
>> >>>> >>>>>>>>>>>>>>>>>>> DistributedUpdateProcessor?  What specifically in
>> >>>> >>>> solrconfig.xml needs
>> >>>> >>>>>>>>>>>>>>>>>>> to be added/modified to make distributed indexing
>> work?
>> >>>> >>>>>>>>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>>>>>> Hi Jaime - take a look at
>> solrconfig-distrib-update.xml
>> >>>> in
>> >>>> >>>>>>>>>>>>>>>>>> solr/core/src/test-files
>> >>>> >>>>>>>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>>>>>> You need to enable the update log, add an empty
>> >>>> replication
>> >>>> >>>> handler def,
>> >>>> >>>>>>>>>>>>>>>>>> and an update chain with
>> >>>> >>>> solr.DistributedUpdateProcessFactory in it.
>> >>>> >>>>>>>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>>>>>> --
>> >>>> >>>>>>>>>>>>>>>>>> - Mark
>> >>>> >>>>>>>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>>>>>> http://www.lucidimagination.com
>> >>>> >>>>>>>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>>> --
>> >>>> >>>>>>>>>>>>>>> - Mark
>> >>>> >>>>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>>> http://www.lucidimagination.com
>> >>>> >>>>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>> - Mark Miller
>> >>>> >>>>>>>>>>>>> lucidimagination.com
>> >>>> >>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>>>
>> >>>> >>>>>>>>>>>
>> >>>> >>>>>>>>>>> - Mark Miller
>> >>>> >>>>>>>>>>> lucidimagination.com
>> >>>> >>>>>>>>>>>
>> >>>> >>>>>>>>>>>
>> >>>> >>>>>>>>>>>
>> >>>> >>>>>>>>>>>
>> >>>> >>>>>>>>>>>
>> >>>> >>>>>>>>>>>
>> >>>> >>>>>>>>>>>
>> >>>> >>>>>>>>>>>
>> >>>> >>>>>>>>>>>
>> >>>> >>>>>>>>>>>
>> >>>> >>>>>>>>>>>
>> >>>> >>>>>>>>>>>
>> >>>> >>>>>>>>>>
>> >>>> >>>>>>>>
>> >>>> >>>>>>>> - Mark Miller
>> >>>> >>>>>>>> lucidimagination.com
>> >>>> >>>>>>>>
>> >>>> >>>>>>>>
>> >>>> >>>>>>>>
>> >>>> >>>>>>>>
>> >>>> >>>>>>>>
>> >>>> >>>>>>>>
>> >>>> >>>>>>>>
>> >>>> >>>>>>>>
>> >>>> >>>>>>>>
>> >>>> >>>>>>>>
>> >>>> >>>>>>>>
>> >>>> >>>>>>>>
>> >>>> >>>>>>
>> >>>> >>>>>> - Mark Miller
>> >>>> >>>>>> lucidimagination.com
>> >>>> >>>>>>
>> >>>> >>>>>>
>> >>>> >>>>>>
>> >>>> >>>>>>
>> >>>> >>>>>>
>> >>>> >>>>>>
>> >>>> >>>>>>
>> >>>> >>>>>>
>> >>>> >>>>>>
>> >>>> >>>>>>
>> >>>> >>>>>>
>> >>>> >>>>>
>> >>>> >>>>
>> >>>> >>>
>> >>>> >>>
>> >>>> >>>
>> >>>> >>> --
>> >>>> >>> - Mark
>> >>>> >>>
>> >>>> >>> http://www.lucidimagination.com
>> >>>> >>>
>> >>>> >
>> >>>> > - Mark Miller
>> >>>> > lucidimagination.com
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> - Mark
>> >>>
>> >>> http://www.lucidimagination.com
>>
>
>
>
> --
> - Mark
>
> http://www.lucidimagination.com

Re: Configuring the Distributed

Posted by Mark Miller <ma...@gmail.com>.
bq. A few questions if a master goes down does a replica get
promoted?

Right - if the leader goes down, there is a leader election and one of the
replicas takes over.

bq.  If a new shard needs to be added is it just a matter of
starting a new solr instance with a higher numShards?

Eventually, that's the plan.

The idea is that you say something like "I want 3 shards". If you then start up
9 instances, the first 3 end up as shard leaders and the next 6 come up evenly
as replicas across those shards.
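
For illustration only - this isn't the branch code, and the znode layout below
is made up for the example - the decision a starting node makes could look
roughly like this:

import java.util.List;

import org.apache.zookeeper.ZooKeeper;

// Hypothetical sketch: decide whether a starting node becomes the leader of a
// new shard or a replica of an existing one, based on the numShards property
// and the shards already registered in ZooKeeper. Paths are illustrative only.
public class ShardAssignmentSketch {
    public static String assignShard(ZooKeeper zk, String collection, int numShards)
            throws Exception {
        String shardsPath = "/collections/" + collection + "/shards"; // assumed layout
        List<String> existing = zk.getChildren(shardsPath, false);
        if (existing.size() < numShards) {
            // numShards not yet met: this node starts (and leads) a new shard.
            return "shard" + (existing.size() + 1);
        }
        // numShards already met: become a replica. Real code would pick the shard
        // with the fewest replicas; this sketch just picks the first for brevity.
        return existing.get(0);
    }
}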

To change numShards after the fact, we will need some kind of micro-shards,
index splitting, or rebalancing.

bq. Last question, how do you change numShards?

I think this is somewhat a work in progress, but I believe Sami just made it
so that numShards is stored on the collection node in ZK (along with which
config set to use), so presumably you would change it there. Or perhaps you
could just start up a new server with an updated numShards property and it
would realize it needs to become a new leader - of course you'd then probably
want to rebalance, unless you fired up enough servers to add replicas too...
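
If it does end up on the collection node, reading and changing it might look
something like the following - a minimal sketch, assuming a plain-text payload
on a /collections/<name> znode, which may not match what the branch actually
does:

import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Hypothetical sketch of reading and updating a numShards value stored on a
// collection znode. The path and payload format are assumptions for
// illustration, not the branch's actual storage scheme.
public class CollectionNumShardsSketch {
    public static int readNumShards(ZooKeeper zk, String collection) throws Exception {
        byte[] data = zk.getData("/collections/" + collection, false, new Stat());
        return Integer.parseInt(new String(data, StandardCharsets.UTF_8).trim());
    }

    public static void setNumShards(ZooKeeper zk, String collection, int numShards)
            throws Exception {
        byte[] data = Integer.toString(numShards).getBytes(StandardCharsets.UTF_8);
        zk.setData("/collections/" + collection, data, -1); // -1 = ignore version
    }
}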

bq. Is that right?

Yup, sounds about right.


On Fri, Dec 2, 2011 at 10:59 PM, Jamie Johnson <je...@gmail.com> wrote:

> So I just tried this out, seems like it does the things I asked about.
>
> Really really cool stuff, it's progressed quite a bit in the time
> since I took a snapshot of the branch.
>
> Last question, how do you change numShards?  Is there a command you
> can use to do this now? I understand there will be implications for
> the hashing algorithm, but once the hash ranges are stored in ZK (is
> there a separate JIRA for this or does this fall under 2358) I assume
> that it would be a relatively simple index split (JIRA 2595?) and
> updating the hash ranges in solr, essentially splitting the range
> between the new and existing shard.  Is that right?
>
> On Fri, Dec 2, 2011 at 10:08 PM, Jamie Johnson <je...@gmail.com> wrote:
> > I think I see it.....so if I understand this correctly you specify
> > numShards as a system property, as new nodes come up they check ZK to
> > see if they should be a new shard or a replica based on if numShards
> > is met.  A few questions if a master goes down does a replica get
> > promoted?  If a new shard needs to be added is it just a matter of
> > starting a new solr instance with a higher numShards?  (understanding
> > that index rebalancing does not happen automatically now, but
> > presumably it could).
> >
> > On Fri, Dec 2, 2011 at 9:56 PM, Jamie Johnson <je...@gmail.com> wrote:
> >> How does it determine the number of shards to create?  How many
> >> replicas to create?
> >>
> >> On Fri, Dec 2, 2011 at 4:30 PM, Mark Miller <ma...@gmail.com>
> wrote:
> >>> Ah, okay - you are setting the shards in solr.xml - thats still an
> option
> >>> to force a node to a particular shard - but if you take that out,
> shards
> >>> will be auto assigned.
> >>>
> >>> By the way, because of the version code, distrib deletes don't work at
> the
> >>> moment - will get to that next week.
> >>>
> >>> - Mark
> >>>
> >>> On Fri, Dec 2, 2011 at 1:16 PM, Jamie Johnson <je...@gmail.com>
> wrote:
> >>>
> >>>> So I'm a fool.  I did set the numShards, the issue was so trivial it's
> >>>> embarrassing.  I did indeed have it setup as a replica, the shard
> >>>> names in solr.xml were both shard1.  This worked as I expected now.
> >>>>
> >>>> On Fri, Dec 2, 2011 at 1:02 PM, Mark Miller <ma...@gmail.com>
> wrote:
> >>>> >
> >>>> > They are unused params, so removing them wouldn't help anything.
> >>>> >
> >>>> > You might just want to wait till we are further along before playing
> >>>> with it.
> >>>> >
> >>>> > Or if you submit your full self contained test, I can see what's
> going
> >>>> on (eg its still unclear if you have started setting numShards?).
> >>>> >
> >>>> > I can do a similar set of actions in my tests and it works fine. The
> >>>> only reason I could see things working like this is if it thinks you
> have
> >>>> one shard - a leader and a replica.
> >>>> >
> >>>> > - Mark
> >>>> >
> >>>> > On Dec 2, 2011, at 12:41 PM, Jamie Johnson wrote:
> >>>> >
> >>>> >> Glad to hear I don't need to set shards/self, but removing them
> didn't
> >>>> >> seem to change what I'm seeing.  Doing this still results in 2
> >>>> >> documents 1 on 8983 and 1 on 7574.
> >>>> >>
> >>>> >> String key = "1";
> >>>> >>
> >>>> >>               SolrInputDocument solrDoc = new SolrInputDocument();
> >>>> >>               solrDoc.setField("key", key);
> >>>> >>
> >>>> >>               solrDoc.addField("content_mvtxt", "initial value");
> >>>> >>
> >>>> >>               SolrServer server = servers.get("
> >>>> http://localhost:8983/solr/collection1");
> >>>> >>
> >>>> >>               UpdateRequest ureq = new UpdateRequest();
> >>>> >>               ureq.setParam("update.chain",
> "distrib-update-chain");
> >>>> >>               ureq.add(solrDoc);
> >>>> >>               ureq.setAction(ACTION.COMMIT, true, true);
> >>>> >>               server.request(ureq);
> >>>> >>               server.commit();
> >>>> >>
> >>>> >>               solrDoc = new SolrInputDocument();
> >>>> >>               solrDoc.addField("key", key);
> >>>> >>               solrDoc.addField("content_mvtxt", "updated value");
> >>>> >>
> >>>> >>               server = servers.get("
> >>>> http://localhost:7574/solr/collection1");
> >>>> >>
> >>>> >>               ureq = new UpdateRequest();
> >>>> >>               ureq.setParam("update.chain",
> "distrib-update-chain");
> >>>> >>               ureq.add(solrDoc);
> >>>> >>               ureq.setAction(ACTION.COMMIT, true, true);
> >>>> >>               server.request(ureq);
> >>>> >>               server.commit();
> >>>> >>
> >>>> >>               server = servers.get("
> >>>> http://localhost:8983/solr/collection1");
> >>>> >>
> >>>> >>
> >>>> >>               server.commit();
> >>>> >>               System.out.println("done");
> >>>> >>
> >>>> >> On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller <
> markrmiller@gmail.com>
> >>>> wrote:
> >>>> >>> So I dunno. You are running a zk server and running in zk mode
> right?
> >>>> >>>
> >>>> >>> You don't need to / shouldn't set a shards or self param. The
> shards
> >>>> are
> >>>> >>> figured out from Zookeeper.
> >>>> >>>
> >>>> >>> You always want to use the distrib-update-chain. Eventually it
> will
> >>>> >>> probably be part of the default chain and auto turn in zk mode.
> >>>> >>>
> >>>> >>> If you are running in zk mode attached to a zk server, this should
> >>>> work no
> >>>> >>> problem. You can add docs to any server and they will be
> forwarded to
> >>>> the
> >>>> >>> correct shard leader and then versioned and forwarded to replicas.
> >>>> >>>
> >>>> >>> You can also use the CloudSolrServer solrj client - that way you
> don't
> >>>> even
> >>>> >>> have to choose a server to send docs too - in which case if it
> went
> >>>> down
> >>>> >>> you would have to choose another manually - CloudSolrServer
> >>>> automatically
> >>>> >>> finds one that is up through ZooKeeper. Eventually it will also be
> >>>> smart
> >>>> >>> and do the hashing itself so that it can send directly to the
> shard
> >>>> leader
> >>>> >>> that the doc would be forwarded to anyway.
> >>>> >>>
> >>>> >>> - Mark
> >>>> >>>
> >>>> >>> On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson <jej2003@gmail.com
> >
> >>>> wrote:
> >>>> >>>
> >>>> >>>> Really just trying to do a simple add and update test, the chain
> >>>> >>>> missing is just proof of my not understanding exactly how this is
> >>>> >>>> supposed to work.  I modified the code to this
> >>>> >>>>
> >>>> >>>>                String key = "1";
> >>>> >>>>
> >>>> >>>>                SolrInputDocument solrDoc = new
> SolrInputDocument();
> >>>> >>>>                solrDoc.setField("key", key);
> >>>> >>>>
> >>>> >>>>                 solrDoc.addField("content_mvtxt", "initial
> value");
> >>>> >>>>
> >>>> >>>>                SolrServer server = servers
> >>>> >>>>                                .get("
> >>>> >>>> http://localhost:8983/solr/collection1");
> >>>> >>>>
> >>>> >>>>                 UpdateRequest ureq = new UpdateRequest();
> >>>> >>>>                ureq.setParam("update.chain",
> "distrib-update-chain");
> >>>> >>>>                ureq.add(solrDoc);
> >>>> >>>>                ureq.setParam("shards",
> >>>> >>>>
> >>>> >>>>
>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
> >>>> >>>>                ureq.setParam("self", "foo");
> >>>> >>>>                ureq.setAction(ACTION.COMMIT, true, true);
> >>>> >>>>                server.request(ureq);
> >>>> >>>>                 server.commit();
> >>>> >>>>
> >>>> >>>>                solrDoc = new SolrInputDocument();
> >>>> >>>>                solrDoc.addField("key", key);
> >>>> >>>>                 solrDoc.addField("content_mvtxt", "updated
> value");
> >>>> >>>>
> >>>> >>>>                server = servers.get("
> >>>> >>>> http://localhost:7574/solr/collection1");
> >>>> >>>>
> >>>> >>>>                 ureq = new UpdateRequest();
> >>>> >>>>                ureq.setParam("update.chain",
> "distrib-update-chain");
> >>>> >>>>                 //
> >>>> ureq.deleteById("8060a9eb-9546-43ee-95bb-d18ea26a6285");
> >>>> >>>>                 ureq.add(solrDoc);
> >>>> >>>>                ureq.setParam("shards",
> >>>> >>>>
> >>>> >>>>
>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
> >>>> >>>>                ureq.setParam("self", "foo");
> >>>> >>>>                ureq.setAction(ACTION.COMMIT, true, true);
> >>>> >>>>                server.request(ureq);
> >>>> >>>>                 // server.add(solrDoc);
> >>>> >>>>                server.commit();
> >>>> >>>>                server = servers.get("
> >>>> >>>> http://localhost:8983/solr/collection1");
> >>>> >>>>
> >>>> >>>>
> >>>> >>>>                server.commit();
> >>>> >>>>                System.out.println("done");
> >>>> >>>>
> >>>> >>>> but I'm still seeing the doc appear on both shards.    After the
> first
> >>>> >>>> commit I see the doc on 8983 with "initial value".  after the
> second
> >>>> >>>> commit I see the updated value on 7574 and the old on 8983.
>  After the
> >>>> >>>> final commit the doc on 8983 gets updated.
> >>>> >>>>
> >>>> >>>> Is there something wrong with my test?
> >>>> >>>>
> >>>> >>>> On Thu, Dec 1, 2011 at 11:17 PM, Mark Miller <
> markrmiller@gmail.com>
> >>>> >>>> wrote:
> >>>> >>>>> Getting late - didn't really pay attention to your code I guess
> - why
> >>>> >>>> are you adding the first doc without specifying the distrib
> update
> >>>> chain?
> >>>> >>>> This is not really supported. It's going to just go to the
> server you
> >>>> >>>> specified - even with everything setup right, the update might
> then
> >>>> go to
> >>>> >>>> that same server or the other one depending on how it hashes. You
> >>>> really
> >>>> >>>> want to just always use the distrib update chain.  I guess I
> don't yet
> >>>> >>>> understand what you are trying to test.
> >>>> >>>>>
> >>>> >>>>> Sent from my iPad
> >>>> >>>>>
> >>>> >>>>> On Dec 1, 2011, at 10:57 PM, Mark Miller <markrmiller@gmail.com
> >
> >>>> wrote:
> >>>> >>>>>
> >>>> >>>>>> Not sure offhand - but things will be funky if you don't
> specify the
> >>>> >>>> correct numShards.
> >>>> >>>>>>
> >>>> >>>>>> The instance to shard assignment should be using numShards to
> >>>> assign.
> >>>> >>>> But then the hash to shard mapping actually goes on the number of
> >>>> shards it
> >>>> >>>> finds registered in ZK (it doesn't have to, but really these
> should be
> >>>> >>>> equal).
> >>>> >>>>>>
> >>>> >>>>>> So basically you are saying, I want 3 partitions, but you are
> only
> >>>> >>>> starting up 2 nodes, and the code is just not happy about that
> I'd
> >>>> guess.
> >>>> >>>> For the system to work properly, you have to fire up at least as
> many
> >>>> >>>> servers as numShards.
> >>>> >>>>>>
> >>>> >>>>>> What are you trying to do? 2 partitions with no replicas, or
> one
> >>>> >>>> partition with one replica?
> >>>> >>>>>>
> >>>> >>>>>> In either case, I think you will have better luck if you fire
> up at
> >>>> >>>> least as many servers as the numShards setting. Or lower the
> numShards
> >>>> >>>> setting.
> >>>> >>>>>>
> >>>> >>>>>> This is all a work in progress by the way - what you are
> trying to
> >>>> test
> >>>> >>>> should work if things are setup right though.
> >>>> >>>>>>
> >>>> >>>>>> - Mark
> >>>> >>>>>>
> >>>> >>>>>>
> >>>> >>>>>> On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote:
> >>>> >>>>>>
> >>>> >>>>>>> Thanks for the quick response.  With that change (have not
> done
> >>>> >>>>>>> numShards yet) shard1 got updated.  But now when executing the
> >>>> >>>>>>> following queries I get information back from both, which
> doesn't
> >>>> seem
> >>>> >>>>>>> right
> >>>> >>>>>>>
> >>>> >>>>>>> http://localhost:7574/solr/select/?q=*:*
> >>>> >>>>>>> <doc><str name="key">1</str><str name="content_mvtxt">updated
> >>>> >>>> value</str></doc>
> >>>> >>>>>>>
> >>>> >>>>>>> http://localhost:8983/solr/select?q=*:*
> >>>> >>>>>>> <doc><str name="key">1</str><str name="content_mvtxt">updated
> >>>> >>>> value</str></doc>
> >>>> >>>>>>>
> >>>> >>>>>>>
> >>>> >>>>>>>
> >>>> >>>>>>> On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller <
> >>>> markrmiller@gmail.com>
> >>>> >>>> wrote:
> >>>> >>>>>>>> Hmm...sorry bout that - so my first guess is that right now
> we are
> >>>> >>>> not distributing a commit (easy to add, just have not done it).
> >>>> >>>>>>>>
> >>>> >>>>>>>> Right now I explicitly commit on each server for tests.
> >>>> >>>>>>>>
> >>>> >>>>>>>> Can you try explicitly committing on server1 after updating
> the
> >>>> doc
> >>>> >>>> on server 2?
> >>>> >>>>>>>>
> >>>> >>>>>>>> I can start distributing commits tomorrow - been meaning to
> do it
> >>>> for
> >>>> >>>> my own convenience anyhow.
> >>>> >>>>>>>>
> >>>> >>>>>>>> Also, you want to pass the sys property numShards=1 on
> startup. I
> >>>> >>>> think it defaults to 3. That will give you one leader and one
> replica.
> >>>> >>>>>>>>
> >>>> >>>>>>>> - Mark
> >>>> >>>>>>>>
> >>>> >>>>>>>> On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:
> >>>> >>>>>>>>
> >>>> >>>>>>>>> So I couldn't resist, I attempted to do this tonight, I
> used the
> >>>> >>>>>>>>> solrconfig you mentioned (as is, no modifications), I setup
> a 2
> >>>> shard
> >>>> >>>>>>>>> cluster in collection1, I sent 1 doc to 1 of the shards,
> updated
> >>>> it
> >>>> >>>>>>>>> and sent the update to the other.  I don't see the
> modifications
> >>>> >>>>>>>>> though I only see the original document.  The following is
> the
> >>>> test
> >>>> >>>>>>>>>
> >>>> >>>>>>>>> public void update() throws Exception {
> >>>> >>>>>>>>>
> >>>> >>>>>>>>>              String key = "1";
> >>>> >>>>>>>>>
> >>>> >>>>>>>>>              SolrInputDocument solrDoc = new
> SolrInputDocument();
> >>>> >>>>>>>>>              solrDoc.setField("key", key);
> >>>> >>>>>>>>>
> >>>> >>>>>>>>>              solrDoc.addField("content", "initial value");
> >>>> >>>>>>>>>
> >>>> >>>>>>>>>              SolrServer server = servers
> >>>> >>>>>>>>>                              .get("
> >>>> >>>> http://localhost:8983/solr/collection1");
> >>>> >>>>>>>>>              server.add(solrDoc);
> >>>> >>>>>>>>>
> >>>> >>>>>>>>>              server.commit();
> >>>> >>>>>>>>>
> >>>> >>>>>>>>>              solrDoc = new SolrInputDocument();
> >>>> >>>>>>>>>              solrDoc.addField("key", key);
> >>>> >>>>>>>>>              solrDoc.addField("content", "updated value");
> >>>> >>>>>>>>>
> >>>> >>>>>>>>>              server = servers.get("
> >>>> >>>> http://localhost:7574/solr/collection1");
> >>>> >>>>>>>>>
> >>>> >>>>>>>>>              UpdateRequest ureq = new UpdateRequest();
> >>>> >>>>>>>>>              ureq.setParam("update.chain",
> >>>> "distrib-update-chain");
> >>>> >>>>>>>>>              ureq.add(solrDoc);
> >>>> >>>>>>>>>              ureq.setParam("shards",
> >>>> >>>>>>>>>
> >>>> >>>>
>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
> >>>> >>>>>>>>>              ureq.setParam("self", "foo");
> >>>> >>>>>>>>>              ureq.setAction(ACTION.COMMIT, true, true);
> >>>> >>>>>>>>>              server.request(ureq);
> >>>> >>>>>>>>>              System.out.println("done");
> >>>> >>>>>>>>>      }
> >>>> >>>>>>>>>
> >>>> >>>>>>>>> key is my unique field in schema.xml
> >>>> >>>>>>>>>
> >>>> >>>>>>>>> What am I doing wrong?
> >>>> >>>>>>>>>
> >>>> >>>>>>>>> On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson <
> jej2003@gmail.com
> >>>> >
> >>>> >>>> wrote:
> >>>> >>>>>>>>>> Yes, the ZK method seems much more flexible.  Adding a new
> shard
> >>>> >>>> would
> >>>> >>>>>>>>>> be simply updating the range assignments in ZK.  Where is
> this
> >>>> >>>>>>>>>> currently on the list of things to accomplish?  I don't
> have
> >>>> time to
> >>>> >>>>>>>>>> work on this now, but if you (or anyone) could provide
> >>>> direction I'd
> >>>> >>>>>>>>>> be willing to work on this when I had spare time.  I guess
> a
> >>>> JIRA
> >>>> >>>>>>>>>> detailing where/how to do this could help.  Not sure if the
> >>>> design
> >>>> >>>> has
> >>>> >>>>>>>>>> been thought out that far though.
> >>>> >>>>>>>>>>
> >>>> >>>>>>>>>> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <
> >>>> markrmiller@gmail.com>
> >>>> >>>> wrote:
> >>>> >>>>>>>>>>> Right now lets say you have one shard - everything there
> >>>> hashes to
> >>>> >>>> range X.
> >>>> >>>>>>>>>>>
> >>>> >>>>>>>>>>> Now you want to split that shard with an Index Splitter.
> >>>> >>>>>>>>>>>
> >>>> >>>>>>>>>>> You divide range X in two - giving you two ranges - then
> you
> >>>> start
> >>>> >>>> splitting. This is where the current Splitter needs a little
> >>>> modification.
> >>>> >>>> You decide which doc should go into which new index by rehashing
> each
> >>>> doc
> >>>> >>>> id in the index you are splitting - if its hash is greater than
> X/2,
> >>>> it
> >>>> >>>> goes into index1 - if its less, index2. I think there are a
> couple
> >>>> current
> >>>> >>>> Splitter impls, but one of them does something like, give me an
> id -
> >>>> now if
> >>>> >>>> the id's in the index are above that id, goto index1, if below,
> >>>> index2. We
> >>>> >>>> need to instead do a quick hash rather than simple id compare.
> >>>> >>>>>>>>>>>
> >>>> >>>>>>>>>>> Why do you need to do this on every shard?
> >>>> >>>>>>>>>>>
> >>>> >>>>>>>>>>> The other part we need that we dont have is to store hash
> range
> >>>> >>>> assignments in zookeeper - we don't do that yet because it's not
> >>>> needed
> >>>> >>>> yet. Instead we currently just simply calculate that on the fly
> (too
> >>>> often
> >>>> >>>> at the moment - on every request :) I intend to fix that of
> course).
> >>>> >>>>>>>>>>>
> >>>> >>>>>>>>>>> At the start, zk would say, for range X, goto this shard.
> After
> >>>> >>>> the split, it would say, for range less than X/2 goto the old
> node,
> >>>> for
> >>>> >>>> range greater than X/2 goto the new node.
> >>>> >>>>>>>>>>>
> >>>> >>>>>>>>>>> - Mark
> >>>> >>>>>>>>>>>
> >>>> >>>>>>>>>>> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
> >>>> >>>>>>>>>>>
> >>>> >>>>>>>>>>>> hmmm.....This doesn't sound like the hashing algorithm
> that's
> >>>> on
> >>>> >>>> the
> >>>> >>>>>>>>>>>> branch, right?  The algorithm you're mentioning sounds
> like
> >>>> there
> >>>> >>>> is
> >>>> >>>>>>>>>>>> some logic which is able to tell that a particular range
> >>>> should be
> >>>> >>>>>>>>>>>> distributed between 2 shards instead of 1.  So seems
> like a
> >>>> trade
> >>>> >>>> off
> >>>> >>>>>>>>>>>> between repartitioning the entire index (on every shard)
> and
> >>>> >>>> having a
> >>>> >>>>>>>>>>>> custom hashing algorithm which is able to handle the
> situation
> >>>> >>>> where 2
> >>>> >>>>>>>>>>>> or more shards map to a particular range.
> >>>> >>>>>>>>>>>>
> >>>> >>>>>>>>>>>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <
> >>>> >>>> markrmiller@gmail.com> wrote:
> >>>> >>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
> >>>> >>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>> I am not familiar with the index splitter that is in
> >>>> contrib,
> >>>> >>>> but I'll
> >>>> >>>>>>>>>>>>>> take a look at it soon.  So the process sounds like it
> >>>> would be
> >>>> >>>> to run
> >>>> >>>>>>>>>>>>>> this on all of the current shards indexes based on the
> hash
> >>>> >>>> algorithm.
> >>>> >>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>> Not something I've thought deeply about myself yet, but
> I
> >>>> think
> >>>> >>>> the idea would be to split as many as you felt you needed to.
> >>>> >>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>> If you wanted to keep the full balance always, this
> would
> >>>> mean
> >>>> >>>> splitting every shard at once, yes. But this depends on how many
> boxes
> >>>> >>>> (partitions) you are willing/able to add at a time.
> >>>> >>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>> You might just split one index to start - now it's hash
> range
> >>>> >>>> would be handled by two shards instead of one (if you have 3
> replicas
> >>>> per
> >>>> >>>> shard, this would mean adding 3 more boxes). When you needed to
> expand
> >>>> >>>> again, you would split another index that was still handling its
> full
> >>>> >>>> starting range. As you grow, once you split every original index,
> >>>> you'd
> >>>> >>>> start again, splitting one of the now half ranges.
> >>>> >>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>> Is there also an index merger in contrib which could be
> >>>> used to
> >>>> >>>> merge
> >>>> >>>>>>>>>>>>>> indexes?  I'm assuming this would be the process?
> >>>> >>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>> You can merge with IndexWriter.addIndexes (Solr also
> has an
> >>>> >>>> admin command that can do this). But I'm not sure where this
> fits in?
> >>>> >>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>> - Mark
> >>>> >>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <
> >>>> >>>> markrmiller@gmail.com> wrote:
> >>>> >>>>>>>>>>>>>>> Not yet - we don't plan on working on this until a
> lot of
> >>>> >>>> other stuff is
> >>>> >>>>>>>>>>>>>>> working solid at this point. But someone else could
> jump
> >>>> in!
> >>>> >>>>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>>> There are a couple ways to go about it that I know of:
> >>>> >>>>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>>> A more long term solution may be to start using micro
> >>>> shards -
> >>>> >>>> each index
> >>>> >>>>>>>>>>>>>>> starts as multiple indexes. This makes it pretty fast
> to
> >>>> move
> >>>> >>>> mirco shards
> >>>> >>>>>>>>>>>>>>> around as you decide to change partitions. It's also
> less
> >>>> >>>> flexible as you
> >>>> >>>>>>>>>>>>>>> are limited by the number of micro shards you start
> with.
> >>>> >>>>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>>> A more simple and likely first step is to use an index
> >>>> >>>> splitter . We
> >>>> >>>>>>>>>>>>>>> already have one in lucene contrib - we would just
> need to
> >>>> >>>> modify it so
> >>>> >>>>>>>>>>>>>>> that it splits based on the hash of the document id.
> This
> >>>> is
> >>>> >>>> super
> >>>> >>>>>>>>>>>>>>> flexible, but splitting will obviously take a little
> while
> >>>> on
> >>>> >>>> a huge index.
> >>>> >>>>>>>>>>>>>>> The current index splitter is a multi pass splitter -
> good
> >>>> >>>> enough to start
> >>>> >>>>>>>>>>>>>>> with, but most files under codec control these days,
> we
> >>>> may be
> >>>> >>>> able to make
> >>>> >>>>>>>>>>>>>>> a single pass splitter soon as well.
> >>>> >>>>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>>> Eventually you could imagine using both options -
> micro
> >>>> shards
> >>>> >>>> that could
> >>>> >>>>>>>>>>>>>>> also be split as needed. Though I still wonder if
> micro
> >>>> shards
> >>>> >>>> will be
> >>>> >>>>>>>>>>>>>>> worth the extra complications myself...
> >>>> >>>>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>>> Right now though, the idea is that you should pick a
> good
> >>>> >>>> number of
> >>>> >>>>>>>>>>>>>>> partitions to start given your expected data ;)
> Adding more
> >>>> >>>> replicas is
> >>>> >>>>>>>>>>>>>>> trivial though.
> >>>> >>>>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>>> - Mark
> >>>> >>>>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <
> >>>> >>>> jej2003@gmail.com> wrote:
> >>>> >>>>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>>>> Another question, is there any support for
> repartitioning
> >>>> of
> >>>> >>>> the index
> >>>> >>>>>>>>>>>>>>>> if a new shard is added?  What is the recommended
> >>>> approach for
> >>>> >>>>>>>>>>>>>>>> handling this?  It seemed that the hashing algorithm
> (and
> >>>> >>>> probably
> >>>> >>>>>>>>>>>>>>>> any) would require the index to be repartitioned
> should a
> >>>> new
> >>>> >>>> shard be
> >>>> >>>>>>>>>>>>>>>> added.
> >>>> >>>>>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <
> >>>> >>>> jej2003@gmail.com> wrote:
> >>>> >>>>>>>>>>>>>>>>> Thanks I will try this first thing in the morning.
> >>>> >>>>>>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <
> >>>> >>>> markrmiller@gmail.com>
> >>>> >>>>>>>>>>>>>>>> wrote:
> >>>> >>>>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <
> >>>> >>>> jej2003@gmail.com>
> >>>> >>>>>>>>>>>>>>>> wrote:
> >>>> >>>>>>>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>>>>>>> I am currently looking at the latest solrcloud
> branch
> >>>> and
> >>>> >>>> was
> >>>> >>>>>>>>>>>>>>>>>>> wondering if there was any documentation on
> >>>> configuring the
> >>>> >>>>>>>>>>>>>>>>>>> DistributedUpdateProcessor?  What specifically in
> >>>> >>>> solrconfig.xml needs
> >>>> >>>>>>>>>>>>>>>>>>> to be added/modified to make distributed indexing
> work?
> >>>> >>>>>>>>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>>>>>> Hi Jaime - take a look at
> solrconfig-distrib-update.xml
> >>>> in
> >>>> >>>>>>>>>>>>>>>>>> solr/core/src/test-files
> >>>> >>>>>>>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>>>>>> You need to enable the update log, add an empty
> >>>> replication
> >>>> >>>> handler def,
> >>>> >>>>>>>>>>>>>>>>>> and an update chain with
> >>>> >>>> solr.DistributedUpdateProcessFactory in it.
> >>>> >>>>>>>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>>>>>> --
> >>>> >>>>>>>>>>>>>>>>>> - Mark
> >>>> >>>>>>>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>>>>>> http://www.lucidimagination.com
> >>>> >>>>>>>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>>> --
> >>>> >>>>>>>>>>>>>>> - Mark
> >>>> >>>>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>>> http://www.lucidimagination.com
> >>>> >>>>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>> - Mark Miller
> >>>> >>>>>>>>>>>>> lucidimagination.com
> >>>> >>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>
> >>>> >>>>>>>>>>>>>
> >>>> >>>>>>>>>>>
> >>>> >>>>>>>>>>> - Mark Miller
> >>>> >>>>>>>>>>> lucidimagination.com
> >>>> >>>>>>>>>>>
> >>>> >>>>>>>>>>>
> >>>> >>>>>>>>>>>
> >>>> >>>>>>>>>>>
> >>>> >>>>>>>>>>>
> >>>> >>>>>>>>>>>
> >>>> >>>>>>>>>>>
> >>>> >>>>>>>>>>>
> >>>> >>>>>>>>>>>
> >>>> >>>>>>>>>>>
> >>>> >>>>>>>>>>>
> >>>> >>>>>>>>>>>
> >>>> >>>>>>>>>>
> >>>> >>>>>>>>
> >>>> >>>>>>>> - Mark Miller
> >>>> >>>>>>>> lucidimagination.com
> >>>> >>>>>>>>
> >>>> >>>>>>>>
> >>>> >>>>>>>>
> >>>> >>>>>>>>
> >>>> >>>>>>>>
> >>>> >>>>>>>>
> >>>> >>>>>>>>
> >>>> >>>>>>>>
> >>>> >>>>>>>>
> >>>> >>>>>>>>
> >>>> >>>>>>>>
> >>>> >>>>>>>>
> >>>> >>>>>>
> >>>> >>>>>> - Mark Miller
> >>>> >>>>>> lucidimagination.com
> >>>> >>>>>>
> >>>> >>>>>>
> >>>> >>>>>>
> >>>> >>>>>>
> >>>> >>>>>>
> >>>> >>>>>>
> >>>> >>>>>>
> >>>> >>>>>>
> >>>> >>>>>>
> >>>> >>>>>>
> >>>> >>>>>>
> >>>> >>>>>
> >>>> >>>>
> >>>> >>>
> >>>> >>>
> >>>> >>>
> >>>> >>> --
> >>>> >>> - Mark
> >>>> >>>
> >>>> >>> http://www.lucidimagination.com
> >>>> >>>
> >>>> >
> >>>> > - Mark Miller
> >>>> > lucidimagination.com
> >>>> >
> >>>> >
> >>>> >
> >>>> >
> >>>> >
> >>>> >
> >>>> >
> >>>> >
> >>>> >
> >>>> >
> >>>> >
> >>>> >
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> - Mark
> >>>
> >>> http://www.lucidimagination.com
>



-- 
- Mark

http://www.lucidimagination.com

Re: Configuring the Distributed

Posted by Jamie Johnson <je...@gmail.com>.
So I just tried this out; it seems to do the things I asked about.

Really really cool stuff, it's progressed quite a bit in the time
since I took a snapshot of the branch.

Last question: how do you change numShards? Is there a command you
can use to do this now? I understand there will be implications for
the hashing algorithm, but once the hash ranges are stored in ZK (is
there a separate JIRA for this, or does this fall under 2358?) I assume
it would be a relatively simple index split (JIRA 2595?) plus updating
the hash ranges in Solr, essentially splitting the range
between the new and existing shard.  Is that right?
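
To make that concrete, a minimal sketch of the split idea (the hash function
and range bounds here are placeholders, not the branch's actual hashing code):

// Conceptual sketch only: divide a shard's hash range in half and decide which
// of the two new indexes a document belongs to by rehashing its unique key.
public class HashRangeSplitSketch {

    static long hash(String id) {
        return id.hashCode(); // placeholder; real code would use a stable hash
    }

    // True if the doc falls in the lower half of [rangeStart, rangeEnd) and so
    // would go to the first split index; otherwise it goes to the second.
    static boolean goesToLowerHalf(String id, long rangeStart, long rangeEnd) {
        long mid = rangeStart + (rangeEnd - rangeStart) / 2;
        long h = hash(id);
        return h >= rangeStart && h < mid;
    }

    public static void main(String[] args) {
        // Example: the original shard covers the full signed-int space.
        System.out.println(goesToLowerHalf("doc-42", Integer.MIN_VALUE, Integer.MAX_VALUE));
    }
}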

On Fri, Dec 2, 2011 at 10:08 PM, Jamie Johnson <je...@gmail.com> wrote:
> I think I see it.....so if I understand this correctly you specify
> numShards as a system property, as new nodes come up they check ZK to
> see if they should be a new shard or a replica based on if numShards
> is met.  A few questions if a master goes down does a replica get
> promoted?  If a new shard needs to be added is it just a matter of
> starting a new solr instance with a higher numShards?  (understanding
> that index rebalancing does not happen automatically now, but
> presumably it could).
>
> On Fri, Dec 2, 2011 at 9:56 PM, Jamie Johnson <je...@gmail.com> wrote:
>> How does it determine the number of shards to create?  How many
>> replicas to create?
>>
>> On Fri, Dec 2, 2011 at 4:30 PM, Mark Miller <ma...@gmail.com> wrote:
>>> Ah, okay - you are setting the shards in solr.xml - thats still an option
>>> to force a node to a particular shard - but if you take that out, shards
>>> will be auto assigned.
>>>
>>> By the way, because of the version code, distrib deletes don't work at the
>>> moment - will get to that next week.
>>>
>>> - Mark
>>>
>>> On Fri, Dec 2, 2011 at 1:16 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>
>>>> So I'm a fool.  I did set the numShards, the issue was so trivial it's
>>>> embarrassing.  I did indeed have it setup as a replica, the shard
>>>> names in solr.xml were both shard1.  This worked as I expected now.
>>>>
>>>> On Fri, Dec 2, 2011 at 1:02 PM, Mark Miller <ma...@gmail.com> wrote:
>>>> >
>>>> > They are unused params, so removing them wouldn't help anything.
>>>> >
>>>> > You might just want to wait till we are further along before playing
>>>> with it.
>>>> >
>>>> > Or if you submit your full self contained test, I can see what's going
>>>> on (eg its still unclear if you have started setting numShards?).
>>>> >
>>>> > I can do a similar set of actions in my tests and it works fine. The
>>>> only reason I could see things working like this is if it thinks you have
>>>> one shard - a leader and a replica.
>>>> >
>>>> > - Mark
>>>> >
>>>> > On Dec 2, 2011, at 12:41 PM, Jamie Johnson wrote:
>>>> >
>>>> >> Glad to hear I don't need to set shards/self, but removing them didn't
>>>> >> seem to change what I'm seeing.  Doing this still results in 2
>>>> >> documents 1 on 8983 and 1 on 7574.
>>>> >>
>>>> >> String key = "1";
>>>> >>
>>>> >>               SolrInputDocument solrDoc = new SolrInputDocument();
>>>> >>               solrDoc.setField("key", key);
>>>> >>
>>>> >>               solrDoc.addField("content_mvtxt", "initial value");
>>>> >>
>>>> >>               SolrServer server = servers.get("
>>>> http://localhost:8983/solr/collection1");
>>>> >>
>>>> >>               UpdateRequest ureq = new UpdateRequest();
>>>> >>               ureq.setParam("update.chain", "distrib-update-chain");
>>>> >>               ureq.add(solrDoc);
>>>> >>               ureq.setAction(ACTION.COMMIT, true, true);
>>>> >>               server.request(ureq);
>>>> >>               server.commit();
>>>> >>
>>>> >>               solrDoc = new SolrInputDocument();
>>>> >>               solrDoc.addField("key", key);
>>>> >>               solrDoc.addField("content_mvtxt", "updated value");
>>>> >>
>>>> >>               server = servers.get("
>>>> http://localhost:7574/solr/collection1");
>>>> >>
>>>> >>               ureq = new UpdateRequest();
>>>> >>               ureq.setParam("update.chain", "distrib-update-chain");
>>>> >>               ureq.add(solrDoc);
>>>> >>               ureq.setAction(ACTION.COMMIT, true, true);
>>>> >>               server.request(ureq);
>>>> >>               server.commit();
>>>> >>
>>>> >>               server = servers.get("
>>>> http://localhost:8983/solr/collection1");
>>>> >>
>>>> >>
>>>> >>               server.commit();
>>>> >>               System.out.println("done");
>>>> >>
>>>> >> On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller <ma...@gmail.com>
>>>> wrote:
>>>> >>> So I dunno. You are running a zk server and running in zk mode right?
>>>> >>>
>>>> >>> You don't need to / shouldn't set a shards or self param. The shards
>>>> are
>>>> >>> figured out from Zookeeper.
>>>> >>>
>>>> >>> You always want to use the distrib-update-chain. Eventually it will
>>>> >>> probably be part of the default chain and auto turn in zk mode.
>>>> >>>
>>>> >>> If you are running in zk mode attached to a zk server, this should
>>>> work no
>>>> >>> problem. You can add docs to any server and they will be forwarded to
>>>> the
>>>> >>> correct shard leader and then versioned and forwarded to replicas.
>>>> >>>
>>>> >>> You can also use the CloudSolrServer solrj client - that way you don't
>>>> even
>>>> >>> have to choose a server to send docs too - in which case if it went
>>>> down
>>>> >>> you would have to choose another manually - CloudSolrServer
>>>> automatically
>>>> >>> finds one that is up through ZooKeeper. Eventually it will also be
>>>> smart
>>>> >>> and do the hashing itself so that it can send directly to the shard
>>>> leader
>>>> >>> that the doc would be forwarded to anyway.
>>>> >>>
>>>> >>> - Mark
>>>> >>>
>>>> >>> On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson <je...@gmail.com>
>>>> wrote:
>>>> >>>
>>>> >>>> Really just trying to do a simple add and update test, the chain
>>>> >>>> missing is just proof of my not understanding exactly how this is
>>>> >>>> supposed to work.  I modified the code to this
>>>> >>>>
>>>> >>>>                String key = "1";
>>>> >>>>
>>>> >>>>                SolrInputDocument solrDoc = new SolrInputDocument();
>>>> >>>>                solrDoc.setField("key", key);
>>>> >>>>
>>>> >>>>                 solrDoc.addField("content_mvtxt", "initial value");
>>>> >>>>
>>>> >>>>                SolrServer server = servers
>>>> >>>>                                .get("
>>>> >>>> http://localhost:8983/solr/collection1");
>>>> >>>>
>>>> >>>>                 UpdateRequest ureq = new UpdateRequest();
>>>> >>>>                ureq.setParam("update.chain", "distrib-update-chain");
>>>> >>>>                ureq.add(solrDoc);
>>>> >>>>                ureq.setParam("shards",
>>>> >>>>
>>>> >>>>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>>>> >>>>                ureq.setParam("self", "foo");
>>>> >>>>                ureq.setAction(ACTION.COMMIT, true, true);
>>>> >>>>                server.request(ureq);
>>>> >>>>                 server.commit();
>>>> >>>>
>>>> >>>>                solrDoc = new SolrInputDocument();
>>>> >>>>                solrDoc.addField("key", key);
>>>> >>>>                 solrDoc.addField("content_mvtxt", "updated value");
>>>> >>>>
>>>> >>>>                server = servers.get("
>>>> >>>> http://localhost:7574/solr/collection1");
>>>> >>>>
>>>> >>>>                 ureq = new UpdateRequest();
>>>> >>>>                ureq.setParam("update.chain", "distrib-update-chain");
>>>> >>>>                 //
>>>> ureq.deleteById("8060a9eb-9546-43ee-95bb-d18ea26a6285");
>>>> >>>>                 ureq.add(solrDoc);
>>>> >>>>                ureq.setParam("shards",
>>>> >>>>
>>>> >>>>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>>>> >>>>                ureq.setParam("self", "foo");
>>>> >>>>                ureq.setAction(ACTION.COMMIT, true, true);
>>>> >>>>                server.request(ureq);
>>>> >>>>                 // server.add(solrDoc);
>>>> >>>>                server.commit();
>>>> >>>>                server = servers.get("
>>>> >>>> http://localhost:8983/solr/collection1");
>>>> >>>>
>>>> >>>>
>>>> >>>>                server.commit();
>>>> >>>>                System.out.println("done");
>>>> >>>>
>>>> >>>> but I'm still seeing the doc appear on both shards.    After the first
>>>> >>>> commit I see the doc on 8983 with "initial value".  after the second
>>>> >>>> commit I see the updated value on 7574 and the old on 8983.  After the
>>>> >>>> final commit the doc on 8983 gets updated.
>>>> >>>>
>>>> >>>> Is there something wrong with my test?
>>>> >>>>
>>>> >>>> On Thu, Dec 1, 2011 at 11:17 PM, Mark Miller <ma...@gmail.com>
>>>> >>>> wrote:
>>>> >>>>> Getting late - didn't really pay attention to your code I guess - why
>>>> >>>> are you adding the first doc without specifying the distrib update
>>>> chain?
>>>> >>>> This is not really supported. It's going to just go to the server you
>>>> >>>> specified - even with everything setup right, the update might then
>>>> go to
>>>> >>>> that same server or the other one depending on how it hashes. You
>>>> really
>>>> >>>> want to just always use the distrib update chain.  I guess I don't yet
>>>> >>>> understand what you are trying to test.
>>>> >>>>>
>>>> >>>>> Sent from my iPad
>>>> >>>>>
>>>> >>>>> On Dec 1, 2011, at 10:57 PM, Mark Miller <ma...@gmail.com>
>>>> wrote:
>>>> >>>>>
>>>> >>>>>> Not sure offhand - but things will be funky if you don't specify the
>>>> >>>> correct numShards.
>>>> >>>>>>
>>>> >>>>>> The instance to shard assignment should be using numShards to
>>>> assign.
>>>> >>>> But then the hash to shard mapping actually goes on the number of
>>>> shards it
>>>> >>>> finds registered in ZK (it doesn't have to, but really these should be
>>>> >>>> equal).
>>>> >>>>>>
>>>> >>>>>> So basically you are saying, I want 3 partitions, but you are only
>>>> >>>> starting up 2 nodes, and the code is just not happy about that I'd
>>>> guess.
>>>> >>>> For the system to work properly, you have to fire up at least as many
>>>> >>>> servers as numShards.
>>>> >>>>>>
>>>> >>>>>> What are you trying to do? 2 partitions with no replicas, or one
>>>> >>>> partition with one replica?
>>>> >>>>>>
>>>> >>>>>> In either case, I think you will have better luck if you fire up at
>>>> >>>> least as many servers as the numShards setting. Or lower the numShards
>>>> >>>> setting.
>>>> >>>>>>
>>>> >>>>>> This is all a work in progress by the way - what you are trying to
>>>> test
>>>> >>>> should work if things are setup right though.
>>>> >>>>>>
>>>> >>>>>> - Mark
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote:
>>>> >>>>>>
>>>> >>>>>>> Thanks for the quick response.  With that change (have not done
>>>> >>>>>>> numShards yet) shard1 got updated.  But now when executing the
>>>> >>>>>>> following queries I get information back from both, which doesn't
>>>> seem
>>>> >>>>>>> right
>>>> >>>>>>>
>>>> >>>>>>> http://localhost:7574/solr/select/?q=*:*
>>>> >>>>>>> <doc><str name="key">1</str><str name="content_mvtxt">updated
>>>> >>>> value</str></doc>
>>>> >>>>>>>
>>>> >>>>>>> http://localhost:8983/solr/select?q=*:*
>>>> >>>>>>> <doc><str name="key">1</str><str name="content_mvtxt">updated
>>>> >>>> value</str></doc>
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>> On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller <
>>>> markrmiller@gmail.com>
>>>> >>>> wrote:
>>>> >>>>>>>> Hmm...sorry bout that - so my first guess is that right now we are
>>>> >>>> not distributing a commit (easy to add, just have not done it).
>>>> >>>>>>>>
>>>> >>>>>>>> Right now I explicitly commit on each server for tests.
>>>> >>>>>>>>
>>>> >>>>>>>> Can you try explicitly committing on server1 after updating the
>>>> doc
>>>> >>>> on server 2?
>>>> >>>>>>>>
>>>> >>>>>>>> I can start distributing commits tomorrow - been meaning to do it
>>>> for
>>>> >>>> my own convenience anyhow.
>>>> >>>>>>>>
>>>> >>>>>>>> Also, you want to pass the sys property numShards=1 on startup. I
>>>> >>>> think it defaults to 3. That will give you one leader and one replica.
>>>> >>>>>>>>
>>>> >>>>>>>> - Mark
>>>> >>>>>>>>
>>>> >>>>>>>> On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:
>>>> >>>>>>>>
>>>> >>>>>>>>> So I couldn't resist, I attempted to do this tonight, I used the
>>>> >>>>>>>>> solrconfig you mentioned (as is, no modifications), I setup a 2
>>>> shard
>>>> >>>>>>>>> cluster in collection1, I sent 1 doc to 1 of the shards, updated
>>>> it
>>>> >>>>>>>>> and sent the update to the other.  I don't see the modifications
>>>> >>>>>>>>> though I only see the original document.  The following is the
>>>> test
>>>> >>>>>>>>>
>>>> >>>>>>>>> public void update() throws Exception {
>>>> >>>>>>>>>
>>>> >>>>>>>>>              String key = "1";
>>>> >>>>>>>>>
>>>> >>>>>>>>>              SolrInputDocument solrDoc = new SolrInputDocument();
>>>> >>>>>>>>>              solrDoc.setField("key", key);
>>>> >>>>>>>>>
>>>> >>>>>>>>>              solrDoc.addField("content", "initial value");
>>>> >>>>>>>>>
>>>> >>>>>>>>>              SolrServer server = servers
>>>> >>>>>>>>>                              .get("
>>>> >>>> http://localhost:8983/solr/collection1");
>>>> >>>>>>>>>              server.add(solrDoc);
>>>> >>>>>>>>>
>>>> >>>>>>>>>              server.commit();
>>>> >>>>>>>>>
>>>> >>>>>>>>>              solrDoc = new SolrInputDocument();
>>>> >>>>>>>>>              solrDoc.addField("key", key);
>>>> >>>>>>>>>              solrDoc.addField("content", "updated value");
>>>> >>>>>>>>>
>>>> >>>>>>>>>              server = servers.get("
>>>> >>>> http://localhost:7574/solr/collection1");
>>>> >>>>>>>>>
>>>> >>>>>>>>>              UpdateRequest ureq = new UpdateRequest();
>>>> >>>>>>>>>              ureq.setParam("update.chain",
>>>> "distrib-update-chain");
>>>> >>>>>>>>>              ureq.add(solrDoc);
>>>> >>>>>>>>>              ureq.setParam("shards",
>>>> >>>>>>>>>
>>>> >>>>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>>>> >>>>>>>>>              ureq.setParam("self", "foo");
>>>> >>>>>>>>>              ureq.setAction(ACTION.COMMIT, true, true);
>>>> >>>>>>>>>              server.request(ureq);
>>>> >>>>>>>>>              System.out.println("done");
>>>> >>>>>>>>>      }
>>>> >>>>>>>>>
>>>> >>>>>>>>> key is my unique field in schema.xml
>>>> >>>>>>>>>
>>>> >>>>>>>>> What am I doing wrong?
>>>> >>>>>>>>>
>>>> >>>>>>>>> On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson <jej2003@gmail.com
>>>> >
>>>> >>>> wrote:
>>>> >>>>>>>>>> Yes, the ZK method seems much more flexible.  Adding a new shard
>>>> >>>> would
>>>> >>>>>>>>>> be simply updating the range assignments in ZK.  Where is this
>>>> >>>>>>>>>> currently on the list of things to accomplish?  I don't have
>>>> time to
>>>> >>>>>>>>>> work on this now, but if you (or anyone) could provide
>>>> direction I'd
>>>> >>>>>>>>>> be willing to work on this when I had spare time.  I guess a
>>>> JIRA
>>>> >>>>>>>>>> detailing where/how to do this could help.  Not sure if the
>>>> design
>>>> >>>> has
>>>> >>>>>>>>>> been thought out that far though.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <
>>>> markrmiller@gmail.com>
>>>> >>>> wrote:
>>>> >>>>>>>>>>> Right now lets say you have one shard - everything there
>>>> hashes to
>>>> >>>> range X.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Now you want to split that shard with an Index Splitter.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> You divide range X in two - giving you two ranges - then you
>>>> start
>>>> >>>> splitting. This is where the current Splitter needs a little
>>>> modification.
>>>> >>>> You decide which doc should go into which new index by rehashing each
>>>> doc
>>>> >>>> id in the index you are splitting - if its hash is greater than X/2,
>>>> it
>>>> >>>> goes into index1 - if its less, index2. I think there are a couple
>>>> current
>>>> >>>> Splitter impls, but one of them does something like, give me an id -
>>>> now if
>>>> >>>> the id's in the index are above that id, goto index1, if below,
>>>> index2. We
>>>> >>>> need to instead do a quick hash rather than simple id compare.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Why do you need to do this on every shard?
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> The other part we need that we dont have is to store hash range
>>>> >>>> assignments in zookeeper - we don't do that yet because it's not
>>>> needed
>>>> >>>> yet. Instead we currently just simply calculate that on the fly (too
>>>> often
>>>> >>>> at the moment - on every request :) I intend to fix that of course).
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> At the start, zk would say, for range X, goto this shard. After
>>>> >>>> the split, it would say, for range less than X/2 goto the old node,
>>>> for
>>>> >>>> range greater than X/2 goto the new node.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> - Mark
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>> hmmm.....This doesn't sound like the hashing algorithm that's
>>>> on
>>>> >>>> the
>>>> >>>>>>>>>>>> branch, right?  The algorithm you're mentioning sounds like
>>>> there
>>>> >>>> is
>>>> >>>>>>>>>>>> some logic which is able to tell that a particular range
>>>> should be
>>>> >>>>>>>>>>>> distributed between 2 shards instead of 1.  So seems like a
>>>> trade
>>>> >>>> off
>>>> >>>>>>>>>>>> between repartitioning the entire index (on every shard) and
>>>> >>>> having a
>>>> >>>>>>>>>>>> custom hashing algorithm which is able to handle the situation
>>>> >>>> where 2
>>>> >>>>>>>>>>>> or more shards map to a particular range.
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <
>>>> >>>> markrmiller@gmail.com> wrote:
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>> I am not familiar with the index splitter that is in
>>>> contrib,
>>>> >>>> but I'll
>>>> >>>>>>>>>>>>>> take a look at it soon.  So the process sounds like it
>>>> would be
>>>> >>>> to run
>>>> >>>>>>>>>>>>>> this on all of the current shards indexes based on the hash
>>>> >>>> algorithm.
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>> Not something I've thought deeply about myself yet, but I
>>>> think
>>>> >>>> the idea would be to split as many as you felt you needed to.
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>> If you wanted to keep the full balance always, this would
>>>> mean
>>>> >>>> splitting every shard at once, yes. But this depends on how many boxes
>>>> >>>> (partitions) you are willing/able to add at a time.
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>> You might just split one index to start - now its hash range
>>>> >>>> would be handled by two shards instead of one (if you have 3 replicas
>>>> per
>>>> >>>> shard, this would mean adding 3 more boxes). When you needed to expand
>>>> >>>> again, you would split another index that was still handling its full
>>>> >>>> starting range. As you grow, once you split every original index,
>>>> you'd
>>>> >>>> start again, splitting one of the now half ranges.
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>> Is there also an index merger in contrib which could be
>>>> used to
>>>> >>>> merge
>>>> >>>>>>>>>>>>>> indexes?  I'm assuming this would be the process?
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>> You can merge with IndexWriter.addIndexes (Solr also has an
>>>> >>>> admin command that can do this). But I'm not sure where this fits in?
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>> - Mark
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <
>>>> >>>> markrmiller@gmail.com> wrote:
>>>> >>>>>>>>>>>>>>> Not yet - we don't plan on working on this until a lot of
>>>> >>>> other stuff is
>>>> >>>>>>>>>>>>>>> working solid at this point. But someone else could jump
>>>> in!
>>>> >>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>> There are a couple ways to go about it that I know of:
>>>> >>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>> A more long term solution may be to start using micro
>>>> shards -
>>>> >>>> each index
>>>> >>>>>>>>>>>>>>> starts as multiple indexes. This makes it pretty fast to
>>>> move
>>>> >>>> micro shards
>>>> >>>>>>>>>>>>>>> around as you decide to change partitions. It's also less
>>>> >>>> flexible as you
>>>> >>>>>>>>>>>>>>> are limited by the number of micro shards you start with.
>>>> >>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>> A more simple and likely first step is to use an index
>>>> >>>> splitter . We
>>>> >>>>>>>>>>>>>>> already have one in lucene contrib - we would just need to
>>>> >>>> modify it so
>>>> >>>>>>>>>>>>>>> that it splits based on the hash of the document id. This
>>>> is
>>>> >>>> super
>>>> >>>>>>>>>>>>>>> flexible, but splitting will obviously take a little while
>>>> on
>>>> >>>> a huge index.
>>>> >>>>>>>>>>>>>>> The current index splitter is a multi pass splitter - good
>>>> >>>> enough to start
>>>> >>>>>>>>>>>>>>> with, but with most files under codec control these days, we
>>>> may be
>>>> >>>> able to make
>>>> >>>>>>>>>>>>>>> a single pass splitter soon as well.
>>>> >>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>> Eventually you could imagine using both options - micro
>>>> shards
>>>> >>>> that could
>>>> >>>>>>>>>>>>>>> also be split as needed. Though I still wonder if micro
>>>> shards
>>>> >>>> will be
>>>> >>>>>>>>>>>>>>> worth the extra complications myself...
>>>> >>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>> Right now though, the idea is that you should pick a good
>>>> >>>> number of
>>>> >>>>>>>>>>>>>>> partitions to start given your expected data ;) Adding more
>>>> >>>> replicas is
>>>> >>>>>>>>>>>>>>> trivial though.
>>>> >>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>> - Mark
>>>> >>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <
>>>> >>>> jej2003@gmail.com> wrote:
>>>> >>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>> Another question, is there any support for repartitioning
>>>> of
>>>> >>>> the index
>>>> >>>>>>>>>>>>>>>> if a new shard is added?  What is the recommended
>>>> approach for
>>>> >>>>>>>>>>>>>>>> handling this?  It seemed that the hashing algorithm (and
>>>> >>>> probably
>>>> >>>>>>>>>>>>>>>> any) would require the index to be repartitioned should a
>>>> new
>>>> >>>> shard be
>>>> >>>>>>>>>>>>>>>> added.
>>>> >>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <
>>>> >>>> jej2003@gmail.com> wrote:
>>>> >>>>>>>>>>>>>>>>> Thanks I will try this first thing in the morning.
>>>> >>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <
>>>> >>>> markrmiller@gmail.com>
>>>> >>>>>>>>>>>>>>>> wrote:
>>>> >>>>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <
>>>> >>>> jej2003@gmail.com>
>>>> >>>>>>>>>>>>>>>> wrote:
>>>> >>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>> I am currently looking at the latest solrcloud branch
>>>> and
>>>> >>>> was
>>>> >>>>>>>>>>>>>>>>>>> wondering if there was any documentation on
>>>> configuring the
>>>> >>>>>>>>>>>>>>>>>>> DistributedUpdateProcessor?  What specifically in
>>>> >>>> solrconfig.xml needs
>>>> >>>>>>>>>>>>>>>>>>> to be added/modified to make distributed indexing work?
>>>> >>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>> Hi Jaime - take a look at solrconfig-distrib-update.xml
>>>> in
>>>> >>>>>>>>>>>>>>>>>> solr/core/src/test-files
>>>> >>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>> You need to enable the update log, add an empty
>>>> replication
>>>> >>>> handler def,
>>>> >>>>>>>>>>>>>>>>>> and an update chain with
>>>> >>>> solr.DistributedUpdateProcessFactory in it.
>>>> >>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>> --
>>>> >>>>>>>>>>>>>>>>>> - Mark
>>>> >>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>> http://www.lucidimagination.com
>>>> >>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>> --
>>>> >>>>>>>>>>>>>>> - Mark
>>>> >>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>> http://www.lucidimagination.com
>>>> >>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>> - Mark Miller
>>>> >>>>>>>>>>>>> lucidimagination.com
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> - Mark Miller
>>>> >>>>>>>>>>> lucidimagination.com
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>> - Mark Miller
>>>> >>>>>>>> lucidimagination.com
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>
>>>> >>>>>> - Mark Miller
>>>> >>>>>> lucidimagination.com
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>
>>>> >>>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> --
>>>> >>> - Mark
>>>> >>>
>>>> >>> http://www.lucidimagination.com
>>>> >>>
>>>> >
>>>> > - Mark Miller
>>>> > lucidimagination.com
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>>
>>>
>>>
>>>
>>> --
>>> - Mark
>>>
>>> http://www.lucidimagination.com

Re: Configuring the Distributed

Posted by Jamie Johnson <je...@gmail.com>.
I think I see it... so if I understand this correctly, you specify
numShards as a system property, and as new nodes come up they check
ZK to see whether they should become a new shard or a replica, based
on whether numShards has been met.  A few questions: if a master goes
down, does a replica get promoted?  If a new shard needs to be added,
is it just a matter of starting a new Solr instance with a higher
numShards?  (Understanding that index rebalancing does not happen
automatically now, but presumably it could.)
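
For illustration, the assignment behaviour described here might look
roughly like the sketch below. This is a paraphrase of the thread, not
the branch's actual ZooKeeper code, and the round-robin choice of which
shard a replica joins is a made-up placeholder.

    public class ShardAssignmentSketch {

        // A node coming up checks how many shards are already registered in ZK.
        // Until numShards is met it claims the next shard slot; after that it
        // joins an existing shard as a replica (placeholder round-robin pick).
        public static String roleForNewNode(int numShards, int shardsRegisteredInZk) {
            if (shardsRegisteredInZk < numShards) {
                return "shard" + (shardsRegisteredInZk + 1) + " (leader)";
            }
            int shardToJoin = (shardsRegisteredInZk % numShards) + 1;
            return "shard" + shardToJoin + " (replica)";
        }
    }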

On Fri, Dec 2, 2011 at 9:56 PM, Jamie Johnson <je...@gmail.com> wrote:
> How does it determine the number of shards to create?  How many
> replicas to create?
>
> On Fri, Dec 2, 2011 at 4:30 PM, Mark Miller <ma...@gmail.com> wrote:
>> Ah, okay - you are setting the shards in solr.xml - that's still an option
>> to force a node to a particular shard - but if you take that out, shards
>> will be auto assigned.
>>
>> By the way, because of the version code, distrib deletes don't work at the
>> moment - will get to that next week.
>>
>> - Mark
>>
>> On Fri, Dec 2, 2011 at 1:16 PM, Jamie Johnson <je...@gmail.com> wrote:
>>
>>> So I'm a fool.  I did set the numShards, the issue was so trivial it's
>>> embarrassing.  I did indeed have it setup as a replica, the shard
>>> names in solr.xml were both shard1.  This worked as I expected now.
>>>
>>> On Fri, Dec 2, 2011 at 1:02 PM, Mark Miller <ma...@gmail.com> wrote:
>>> >
>>> > They are unused params, so removing them wouldn't help anything.
>>> >
>>> > You might just want to wait till we are further along before playing
>>> with it.
>>> >
>>> > Or if you submit your full self contained test, I can see what's going
>>> on (eg its still unclear if you have started setting numShards?).
>>> >
>>> > I can do a similar set of actions in my tests and it works fine. The
>>> only reason I could see things working like this is if it thinks you have
>>> one shard - a leader and a replica.
>>> >
>>> > - Mark
>>> >
>>> > On Dec 2, 2011, at 12:41 PM, Jamie Johnson wrote:
>>> >
>>> >> Glad to hear I don't need to set shards/self, but removing them didn't
>>> >> seem to change what I'm seeing.  Doing this still results in 2
>>> >> documents 1 on 8983 and 1 on 7574.
>>> >>
>>> >> String key = "1";
>>> >>
>>> >>               SolrInputDocument solrDoc = new SolrInputDocument();
>>> >>               solrDoc.setField("key", key);
>>> >>
>>> >>               solrDoc.addField("content_mvtxt", "initial value");
>>> >>
>>> >>               SolrServer server = servers.get("
>>> http://localhost:8983/solr/collection1");
>>> >>
>>> >>               UpdateRequest ureq = new UpdateRequest();
>>> >>               ureq.setParam("update.chain", "distrib-update-chain");
>>> >>               ureq.add(solrDoc);
>>> >>               ureq.setAction(ACTION.COMMIT, true, true);
>>> >>               server.request(ureq);
>>> >>               server.commit();
>>> >>
>>> >>               solrDoc = new SolrInputDocument();
>>> >>               solrDoc.addField("key", key);
>>> >>               solrDoc.addField("content_mvtxt", "updated value");
>>> >>
>>> >>               server = servers.get("
>>> http://localhost:7574/solr/collection1");
>>> >>
>>> >>               ureq = new UpdateRequest();
>>> >>               ureq.setParam("update.chain", "distrib-update-chain");
>>> >>               ureq.add(solrDoc);
>>> >>               ureq.setAction(ACTION.COMMIT, true, true);
>>> >>               server.request(ureq);
>>> >>               server.commit();
>>> >>
>>> >>               server = servers.get("
>>> http://localhost:8983/solr/collection1");
>>> >>
>>> >>
>>> >>               server.commit();
>>> >>               System.out.println("done");
>>> >>
>>> >> On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller <ma...@gmail.com>
>>> wrote:
>>> >>> So I dunno. You are running a zk server and running in zk mode right?
>>> >>>
>>> >>> You don't need to / shouldn't set a shards or self param. The shards
>>> are
>>> >>> figured out from Zookeeper.
>>> >>>
>>> >>> You always want to use the distrib-update-chain. Eventually it will
>>> >>> probably be part of the default chain and auto turn on in zk mode.
>>> >>>
>>> >>> If you are running in zk mode attached to a zk server, this should
>>> work no
>>> >>> problem. You can add docs to any server and they will be forwarded to
>>> the
>>> >>> correct shard leader and then versioned and forwarded to replicas.
>>> >>>
>>> >>> You can also use the CloudSolrServer solrj client - that way you don't
>>> even
>>> >>> have to choose a server to send docs to - in which case if it went
>>> down
>>> >>> you would have to choose another manually - CloudSolrServer
>>> automatically
>>> >>> finds one that is up through ZooKeeper. Eventually it will also be
>>> smart
>>> >>> and do the hashing itself so that it can send directly to the shard
>>> leader
>>> >>> that the doc would be forwarded to anyway.
>>> >>>
>>> >>> - Mark
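
A minimal sketch of the CloudSolrServer usage described above, assuming
the SolrJ classes on the solrcloud branch at the time. The ZooKeeper
address (localhost:9983) and the setDefaultCollection call are
assumptions - adjust them to your setup.

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CloudClientSketch {
        public static void main(String[] args) throws Exception {
            // Point at ZooKeeper rather than a particular Solr node (address assumed).
            CloudSolrServer server = new CloudSolrServer("localhost:9983");
            server.setDefaultCollection("collection1"); // assumed present on the branch

            SolrInputDocument doc = new SolrInputDocument();
            doc.setField("key", "1");
            doc.addField("content_mvtxt", "some value");

            server.add(doc);   // routed to a live node found via ZooKeeper
            server.commit();
        }
    }
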
>>> >>>
>>> >>> On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson <je...@gmail.com>
>>> wrote:
>>> >>>
>>> >>>> Really just trying to do a simple add and update test, the chain
>>> >>>> missing is just proof of my not understanding exactly how this is
>>> >>>> supposed to work.  I modified the code to this
>>> >>>>
>>> >>>>                String key = "1";
>>> >>>>
>>> >>>>                SolrInputDocument solrDoc = new SolrInputDocument();
>>> >>>>                solrDoc.setField("key", key);
>>> >>>>
>>> >>>>                 solrDoc.addField("content_mvtxt", "initial value");
>>> >>>>
>>> >>>>                SolrServer server = servers
>>> >>>>                                .get("
>>> >>>> http://localhost:8983/solr/collection1");
>>> >>>>
>>> >>>>                 UpdateRequest ureq = new UpdateRequest();
>>> >>>>                ureq.setParam("update.chain", "distrib-update-chain");
>>> >>>>                ureq.add(solrDoc);
>>> >>>>                ureq.setParam("shards",
>>> >>>>
>>> >>>>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>>> >>>>                ureq.setParam("self", "foo");
>>> >>>>                ureq.setAction(ACTION.COMMIT, true, true);
>>> >>>>                server.request(ureq);
>>> >>>>                 server.commit();
>>> >>>>
>>> >>>>                solrDoc = new SolrInputDocument();
>>> >>>>                solrDoc.addField("key", key);
>>> >>>>                 solrDoc.addField("content_mvtxt", "updated value");
>>> >>>>
>>> >>>>                server = servers.get("
>>> >>>> http://localhost:7574/solr/collection1");
>>> >>>>
>>> >>>>                 ureq = new UpdateRequest();
>>> >>>>                ureq.setParam("update.chain", "distrib-update-chain");
>>> >>>>                 //
>>> ureq.deleteById("8060a9eb-9546-43ee-95bb-d18ea26a6285");
>>> >>>>                 ureq.add(solrDoc);
>>> >>>>                ureq.setParam("shards",
>>> >>>>
>>> >>>>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>>> >>>>                ureq.setParam("self", "foo");
>>> >>>>                ureq.setAction(ACTION.COMMIT, true, true);
>>> >>>>                server.request(ureq);
>>> >>>>                 // server.add(solrDoc);
>>> >>>>                server.commit();
>>> >>>>                server = servers.get("
>>> >>>> http://localhost:8983/solr/collection1");
>>> >>>>
>>> >>>>
>>> >>>>                server.commit();
>>> >>>>                System.out.println("done");
>>> >>>>
>>> >>>> but I'm still seeing the doc appear on both shards.    After the first
>>> >>>> commit I see the doc on 8983 with "initial value".  after the second
>>> >>>> commit I see the updated value on 7574 and the old on 8983.  After the
>>> >>>> final commit the doc on 8983 gets updated.
>>> >>>>
>>> >>>> Is there something wrong with my test?
>>> >>>>
>>> >>>> On Thu, Dec 1, 2011 at 11:17 PM, Mark Miller <ma...@gmail.com>
>>> >>>> wrote:
>>> >>>>> Getting late - didn't really pay attention to your code I guess - why
>>> >>>> are you adding the first doc without specifying the distrib update
>>> chain?
>>> >>>> This is not really supported. It's going to just go to the server you
>>> >>>> specified - even with everything setup right, the update might then
>>> go to
>>> >>>> that same server or the other one depending on how it hashes. You
>>> really
>>> >>>> want to just always use the distrib update chain.  I guess I don't yet
>>> >>>> understand what you are trying to test.
>>> >>>>>
>>> >>>>> Sent from my iPad
>>> >>>>>
>>> >>>>> On Dec 1, 2011, at 10:57 PM, Mark Miller <ma...@gmail.com>
>>> wrote:
>>> >>>>>
>>> >>>>>> Not sure offhand - but things will be funky if you don't specify the
>>> >>>> correct numShards.
>>> >>>>>>
>>> >>>>>> The instance to shard assignment should be using numShards to
>>> assign.
>>> >>>> But then the hash to shard mapping actually goes on the number of
>>> shards it
>>> >>>> finds registered in ZK (it doesn't have to, but really these should be
>>> >>>> equal).
>>> >>>>>>
>>> >>>>>> So basically you are saying, I want 3 partitions, but you are only
>>> >>>> starting up 2 nodes, and the code is just not happy about that I'd
>>> guess.
>>> >>>> For the system to work properly, you have to fire up at least as many
>>> >>>> servers as numShards.
>>> >>>>>>
>>> >>>>>> What are you trying to do? 2 partitions with no replicas, or one
>>> >>>> partition with one replica?
>>> >>>>>>
>>> >>>>>> In either case, I think you will have better luck if you fire up at
>>> >>>> least as many servers as the numShards setting. Or lower the numShards
>>> >>>> setting.
>>> >>>>>>
>>> >>>>>> This is all a work in progress by the way - what you are trying to
>>> test
>>> >>>> should work if things are setup right though.
>>> >>>>>>
>>> >>>>>> - Mark
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote:
>>> >>>>>>
>>> >>>>>>> Thanks for the quick response.  With that change (have not done
>>> >>>>>>> numShards yet) shard1 got updated.  But now when executing the
>>> >>>>>>> following queries I get information back from both, which doesn't
>>> seem
>>> >>>>>>> right
>>> >>>>>>>
>>> >>>>>>> http://localhost:7574/solr/select/?q=*:*
>>> >>>>>>> <doc><str name="key">1</str><str name="content_mvtxt">updated
>>> >>>> value</str></doc>
>>> >>>>>>>
>>> >>>>>>> http://localhost:8983/solr/select?q=*:*
>>> >>>>>>> <doc><str name="key">1</str><str name="content_mvtxt">updated
>>> >>>> value</str></doc>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller <
>>> markrmiller@gmail.com>
>>> >>>> wrote:
>>> >>>>>>>> Hmm...sorry bout that - so my first guess is that right now we are
>>> >>>> not distributing a commit (easy to add, just have not done it).
>>> >>>>>>>>
>>> >>>>>>>> Right now I explicitly commit on each server for tests.
>>> >>>>>>>>
>>> >>>>>>>> Can you try explicitly committing on server1 after updating the
>>> doc
>>> >>>> on server 2?
>>> >>>>>>>>
>>> >>>>>>>> I can start distributing commits tomorrow - been meaning to do it
>>> for
>>> >>>> my own convenience anyhow.
>>> >>>>>>>>
>>> >>>>>>>> Also, you want to pass the sys property numShards=1 on startup. I
>>> >>>> think it defaults to 3. That will give you one leader and one replica.
>>> >>>>>>>>
>>> >>>>>>>> - Mark
>>> >>>>>>>>
>>> >>>>>>>> On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:
>>> >>>>>>>>
>>> >>>>>>>>> So I couldn't resist, I attempted to do this tonight, I used the
>>> >>>>>>>>> solrconfig you mentioned (as is, no modifications), I setup a 2
>>> shard
>>> >>>>>>>>> cluster in collection1, I sent 1 doc to 1 of the shards, updated
>>> it
>>> >>>>>>>>> and sent the update to the other.  I don't see the modifications
>>> >>>>>>>>> though I only see the original document.  The following is the
>>> test
>>> >>>>>>>>>
>>> >>>>>>>>> public void update() throws Exception {
>>> >>>>>>>>>
>>> >>>>>>>>>              String key = "1";
>>> >>>>>>>>>
>>> >>>>>>>>>              SolrInputDocument solrDoc = new SolrInputDocument();
>>> >>>>>>>>>              solrDoc.setField("key", key);
>>> >>>>>>>>>
>>> >>>>>>>>>              solrDoc.addField("content", "initial value");
>>> >>>>>>>>>
>>> >>>>>>>>>              SolrServer server = servers
>>> >>>>>>>>>                              .get("
>>> >>>> http://localhost:8983/solr/collection1");
>>> >>>>>>>>>              server.add(solrDoc);
>>> >>>>>>>>>
>>> >>>>>>>>>              server.commit();
>>> >>>>>>>>>
>>> >>>>>>>>>              solrDoc = new SolrInputDocument();
>>> >>>>>>>>>              solrDoc.addField("key", key);
>>> >>>>>>>>>              solrDoc.addField("content", "updated value");
>>> >>>>>>>>>
>>> >>>>>>>>>              server = servers.get("
>>> >>>> http://localhost:7574/solr/collection1");
>>> >>>>>>>>>
>>> >>>>>>>>>              UpdateRequest ureq = new UpdateRequest();
>>> >>>>>>>>>              ureq.setParam("update.chain",
>>> "distrib-update-chain");
>>> >>>>>>>>>              ureq.add(solrDoc);
>>> >>>>>>>>>              ureq.setParam("shards",
>>> >>>>>>>>>
>>> >>>>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>>> >>>>>>>>>              ureq.setParam("self", "foo");
>>> >>>>>>>>>              ureq.setAction(ACTION.COMMIT, true, true);
>>> >>>>>>>>>              server.request(ureq);
>>> >>>>>>>>>              System.out.println("done");
>>> >>>>>>>>>      }
>>> >>>>>>>>>
>>> >>>>>>>>> key is my unique field in schema.xml
>>> >>>>>>>>>
>>> >>>>>>>>> What am I doing wrong?
>>> >>>>>>>>>
>>> >>>>>>>>> On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson <jej2003@gmail.com
>>> >
>>> >>>> wrote:
>>> >>>>>>>>>> Yes, the ZK method seems much more flexible.  Adding a new shard
>>> >>>> would
>>> >>>>>>>>>> be simply updating the range assignments in ZK.  Where is this
>>> >>>>>>>>>> currently on the list of things to accomplish?  I don't have
>>> time to
>>> >>>>>>>>>> work on this now, but if you (or anyone) could provide
>>> direction I'd
>>> >>>>>>>>>> be willing to work on this when I had spare time.  I guess a
>>> JIRA
>>> >>>>>>>>>> detailing where/how to do this could help.  Not sure if the
>>> design
>>> >>>> has
>>> >>>>>>>>>> been thought out that far though.
>>> >>>>>>>>>>
>>> >>>>>>>>>> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <
>>> markrmiller@gmail.com>
>>> >>>> wrote:
>>> >>>>>>>>>>> Right now lets say you have one shard - everything there
>>> hashes to
>>> >>>> range X.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Now you want to split that shard with an Index Splitter.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> You divide range X in two - giving you two ranges - then you
>>> start
>>> >>>> splitting. This is where the current Splitter needs a little
>>> modification.
>>> >>>> You decide which doc should go into which new index by rehashing each
>>> doc
>>> >>>> id in the index you are splitting - if its hash is greater than X/2,
>>> it
>>> >>>> goes into index1 - if its less, index2. I think there are a couple
>>> current
>>> >>>> Splitter impls, but one of them does something like, give me an id -
>>> now if
>>> >>>> the id's in the index are above that id, goto index1, if below,
>>> index2. We
>>> >>>> need to instead do a quick hash rather than simple id compare.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Why do you need to do this on every shard?
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> The other part we need that we dont have is to store hash range
>>> >>>> assignments in zookeeper - we don't do that yet because it's not
>>> needed
>>> >>>> yet. Instead we currently just simply calculate that on the fly (too
>>> often
>>> >>>> at the moment - on every request :) I intend to fix that of course).
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> At the start, zk would say, for range X, goto this shard. After
>>> >>>> the split, it would say, for range less than X/2 goto the old node,
>>> for
>>> >>>> range greater than X/2 goto the new node.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> - Mark
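
To illustrate what storing hash range assignments in ZooKeeper would
enable, a toy routing lookup might look like the sketch below. The data
structure and the hash are placeholders, not what the branch does today
(which, as noted above, still recomputes the ranges on the fly).

    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    public class RangeRoutingSketch {

        // Lower bound of each hash range -> the shard that owns that range.
        // After a split, one entry is replaced by two (below and above X/2).
        private final NavigableMap<Integer, String> rangeStartToShard =
                new TreeMap<Integer, String>();

        public void assign(int rangeStart, String shardUrl) {
            rangeStartToShard.put(rangeStart, shardUrl);
        }

        // Find the shard whose range contains the document's hash.
        // Assumes at least one range has already been assigned.
        public String shardFor(String uniqueKey) {
            int hash = uniqueKey.hashCode(); // placeholder hash
            Map.Entry<Integer, String> entry = rangeStartToShard.floorEntry(hash);
            return entry != null ? entry.getValue()
                                 : rangeStartToShard.firstEntry().getValue();
        }
    }
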
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>> hmmm.....This doesn't sound like the hashing algorithm that's
>>> on
>>> >>>> the
>>> >>>>>>>>>>>> branch, right?  The algorithm you're mentioning sounds like
>>> there
>>> >>>> is
>>> >>>>>>>>>>>> some logic which is able to tell that a particular range
>>> should be
>>> >>>>>>>>>>>> distributed between 2 shards instead of 1.  So seems like a
>>> trade
>>> >>>> off
>>> >>>>>>>>>>>> between repartitioning the entire index (on every shard) and
>>> >>>> having a
>>> >>>>>>>>>>>> custom hashing algorithm which is able to handle the situation
>>> >>>> where 2
>>> >>>>>>>>>>>> or more shards map to a particular range.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <
>>> >>>> markrmiller@gmail.com> wrote:
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> I am not familiar with the index splitter that is in
>>> contrib,
>>> >>>> but I'll
>>> >>>>>>>>>>>>>> take a look at it soon.  So the process sounds like it
>>> would be
>>> >>>> to run
>>> >>>>>>>>>>>>>> this on all of the current shards indexes based on the hash
>>> >>>> algorithm.
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>> Not something I've thought deeply about myself yet, but I
>>> think
>>> >>>> the idea would be to split as many as you felt you needed to.
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>> If you wanted to keep the full balance always, this would
>>> mean
>>> >>>> splitting every shard at once, yes. But this depends on how many boxes
>>> >>>> (partitions) you are willing/able to add at a time.
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>> You might just split one index to start - now its hash range
>>> >>>> would be handled by two shards instead of one (if you have 3 replicas
>>> per
>>> >>>> shard, this would mean adding 3 more boxes). When you needed to expand
>>> >>>> again, you would split another index that was still handling its full
>>> >>>> starting range. As you grow, once you split every original index,
>>> you'd
>>> >>>> start again, splitting one of the now half ranges.
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> Is there also an index merger in contrib which could be
>>> used to
>>> >>>> merge
>>> >>>>>>>>>>>>>> indexes?  I'm assuming this would be the process?
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>> You can merge with IndexWriter.addIndexes (Solr also has an
>>> >>>> admin command that can do this). But I'm not sure where this fits in?
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>> - Mark
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <
>>> >>>> markrmiller@gmail.com> wrote:
>>> >>>>>>>>>>>>>>> Not yet - we don't plan on working on this until a lot of
>>> >>>> other stuff is
>>> >>>>>>>>>>>>>>> working solid at this point. But someone else could jump
>>> in!
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> There are a couple ways to go about it that I know of:
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> A more long term solution may be to start using micro
>>> shards -
>>> >>>> each index
>>> >>>>>>>>>>>>>>> starts as multiple indexes. This makes it pretty fast to
>>> move
>>> >>>> micro shards
>>> >>>>>>>>>>>>>>> around as you decide to change partitions. It's also less
>>> >>>> flexible as you
>>> >>>>>>>>>>>>>>> are limited by the number of micro shards you start with.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> A more simple and likely first step is to use an index
>>> >>>> splitter . We
>>> >>>>>>>>>>>>>>> already have one in lucene contrib - we would just need to
>>> >>>> modify it so
>>> >>>>>>>>>>>>>>> that it splits based on the hash of the document id. This
>>> is
>>> >>>> super
>>> >>>>>>>>>>>>>>> flexible, but splitting will obviously take a little while
>>> on
>>> >>>> a huge index.
>>> >>>>>>>>>>>>>>> The current index splitter is a multi pass splitter - good
>>> >>>> enough to start
>>> >>>>>>>>>>>>>>> with, but with most files under codec control these days, we
>>> may be
>>> >>>> able to make
>>> >>>>>>>>>>>>>>> a single pass splitter soon as well.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> Eventually you could imagine using both options - micro
>>> shards
>>> >>>> that could
>>> >>>>>>>>>>>>>>> also be split as needed. Though I still wonder if micro
>>> shards
>>> >>>> will be
>>> >>>>>>>>>>>>>>> worth the extra complications myself...
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> Right now though, the idea is that you should pick a good
>>> >>>> number of
>>> >>>>>>>>>>>>>>> partitions to start given your expected data ;) Adding more
>>> >>>> replicas is
>>> >>>>>>>>>>>>>>> trivial though.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> - Mark
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <
>>> >>>> jej2003@gmail.com> wrote:
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> Another question, is there any support for repartitioning
>>> of
>>> >>>> the index
>>> >>>>>>>>>>>>>>>> if a new shard is added?  What is the recommended
>>> approach for
>>> >>>>>>>>>>>>>>>> handling this?  It seemed that the hashing algorithm (and
>>> >>>> probably
>>> >>>>>>>>>>>>>>>> any) would require the index to be repartitioned should a
>>> new
>>> >>>> shard be
>>> >>>>>>>>>>>>>>>> added.
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <
>>> >>>> jej2003@gmail.com> wrote:
>>> >>>>>>>>>>>>>>>>> Thanks I will try this first thing in the morning.
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <
>>> >>>> markrmiller@gmail.com>
>>> >>>>>>>>>>>>>>>> wrote:
>>> >>>>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <
>>> >>>> jej2003@gmail.com>
>>> >>>>>>>>>>>>>>>> wrote:
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>> I am currently looking at the latest solrcloud branch
>>> and
>>> >>>> was
>>> >>>>>>>>>>>>>>>>>>> wondering if there was any documentation on
>>> configuring the
>>> >>>>>>>>>>>>>>>>>>> DistributedUpdateProcessor?  What specifically in
>>> >>>> solrconfig.xml needs
>>> >>>>>>>>>>>>>>>>>>> to be added/modified to make distributed indexing work?
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>> Hi Jaime - take a look at solrconfig-distrib-update.xml
>>> in
>>> >>>>>>>>>>>>>>>>>> solr/core/src/test-files
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>> You need to enable the update log, add an empty
>>> replication
>>> >>>> handler def,
>>> >>>>>>>>>>>>>>>>>> and an update chain with
>>> >>>> solr.DistributedUpdateProcessFactory in it.
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>> --
>>> >>>>>>>>>>>>>>>>>> - Mark
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>> http://www.lucidimagination.com
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> --
>>> >>>>>>>>>>>>>>> - Mark
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> http://www.lucidimagination.com
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>> - Mark Miller
>>> >>>>>>>>>>>>> lucidimagination.com
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> - Mark Miller
>>> >>>>>>>>>>> lucidimagination.com
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> - Mark Miller
>>> >>>>>>>> lucidimagination.com
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>
>>> >>>>>> - Mark Miller
>>> >>>>>> lucidimagination.com
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>
>>> >>>>
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> - Mark
>>> >>>
>>> >>> http://www.lucidimagination.com
>>> >>>
>>> >
>>> > - Mark Miller
>>> > lucidimagination.com
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>>
>>
>>
>>
>> --
>> - Mark
>>
>> http://www.lucidimagination.com

Re: Configuring the Distributed

Posted by Jamie Johnson <je...@gmail.com>.
How does it determine the number of shards to create?  How many
replicas to create?

On Fri, Dec 2, 2011 at 4:30 PM, Mark Miller <ma...@gmail.com> wrote:
> Ah, okay - you are setting the shards in solr.xml - that's still an option
> to force a node to a particular shard - but if you take that out, shards
> will be auto assigned.
>
> By the way, because of the version code, distrib deletes don't work at the
> moment - will get to that next week.
>
> - Mark
>
> On Fri, Dec 2, 2011 at 1:16 PM, Jamie Johnson <je...@gmail.com> wrote:
>
>> So I'm a fool.  I did set the numShards, the issue was so trivial it's
>> embarrassing.  I did indeed have it setup as a replica, the shard
>> names in solr.xml were both shard1.  This worked as I expected now.
>>
>> On Fri, Dec 2, 2011 at 1:02 PM, Mark Miller <ma...@gmail.com> wrote:
>> >
>> > They are unused params, so removing them wouldn't help anything.
>> >
>> > You might just want to wait till we are further along before playing
>> with it.
>> >
>> > Or if you submit your full self contained test, I can see what's going
>> on (eg its still unclear if you have started setting numShards?).
>> >
>> > I can do a similar set of actions in my tests and it works fine. The
>> only reason I could see things working like this is if it thinks you have
>> one shard - a leader and a replica.
>> >
>> > - Mark
>> >
>> > On Dec 2, 2011, at 12:41 PM, Jamie Johnson wrote:
>> >
>> >> Glad to hear I don't need to set shards/self, but removing them didn't
>> >> seem to change what I'm seeing.  Doing this still results in 2
>> >> documents 1 on 8983 and 1 on 7574.
>> >>
>> >> String key = "1";
>> >>
>> >>               SolrInputDocument solrDoc = new SolrInputDocument();
>> >>               solrDoc.setField("key", key);
>> >>
>> >>               solrDoc.addField("content_mvtxt", "initial value");
>> >>
>> >>               SolrServer server = servers.get("
>> http://localhost:8983/solr/collection1");
>> >>
>> >>               UpdateRequest ureq = new UpdateRequest();
>> >>               ureq.setParam("update.chain", "distrib-update-chain");
>> >>               ureq.add(solrDoc);
>> >>               ureq.setAction(ACTION.COMMIT, true, true);
>> >>               server.request(ureq);
>> >>               server.commit();
>> >>
>> >>               solrDoc = new SolrInputDocument();
>> >>               solrDoc.addField("key", key);
>> >>               solrDoc.addField("content_mvtxt", "updated value");
>> >>
>> >>               server = servers.get("
>> http://localhost:7574/solr/collection1");
>> >>
>> >>               ureq = new UpdateRequest();
>> >>               ureq.setParam("update.chain", "distrib-update-chain");
>> >>               ureq.add(solrDoc);
>> >>               ureq.setAction(ACTION.COMMIT, true, true);
>> >>               server.request(ureq);
>> >>               server.commit();
>> >>
>> >>               server = servers.get("
>> http://localhost:8983/solr/collection1");
>> >>
>> >>
>> >>               server.commit();
>> >>               System.out.println("done");
>> >>
>> >> On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller <ma...@gmail.com>
>> wrote:
>> >>> So I dunno. You are running a zk server and running in zk mode right?
>> >>>
>> >>> You don't need to / shouldn't set a shards or self param. The shards
>> are
>> >>> figured out from Zookeeper.
>> >>>
>> >>> You always want to use the distrib-update-chain. Eventually it will
>> >>> probably be part of the default chain and auto turn on in zk mode.
>> >>>
>> >>> If you are running in zk mode attached to a zk server, this should
>> work no
>> >>> problem. You can add docs to any server and they will be forwarded to
>> the
>> >>> correct shard leader and then versioned and forwarded to replicas.
>> >>>
>> >>> You can also use the CloudSolrServer solrj client - that way you don't
>> even
>> >>> have to choose a server to send docs to - in which case if it went
>> down
>> >>> you would have to choose another manually - CloudSolrServer
>> automatically
>> >>> finds one that is up through ZooKeeper. Eventually it will also be
>> smart
>> >>> and do the hashing itself so that it can send directly to the shard
>> leader
>> >>> that the doc would be forwarded to anyway.
>> >>>
>> >>> - Mark
>> >>>
>> >>> On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson <je...@gmail.com>
>> wrote:
>> >>>
>> >>>> Really just trying to do a simple add and update test, the chain
>> >>>> missing is just proof of my not understanding exactly how this is
>> >>>> supposed to work.  I modified the code to this
>> >>>>
>> >>>>                String key = "1";
>> >>>>
>> >>>>                SolrInputDocument solrDoc = new SolrInputDocument();
>> >>>>                solrDoc.setField("key", key);
>> >>>>
>> >>>>                 solrDoc.addField("content_mvtxt", "initial value");
>> >>>>
>> >>>>                SolrServer server = servers
>> >>>>                                .get("
>> >>>> http://localhost:8983/solr/collection1");
>> >>>>
>> >>>>                 UpdateRequest ureq = new UpdateRequest();
>> >>>>                ureq.setParam("update.chain", "distrib-update-chain");
>> >>>>                ureq.add(solrDoc);
>> >>>>                ureq.setParam("shards",
>> >>>>
>> >>>>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>> >>>>                ureq.setParam("self", "foo");
>> >>>>                ureq.setAction(ACTION.COMMIT, true, true);
>> >>>>                server.request(ureq);
>> >>>>                 server.commit();
>> >>>>
>> >>>>                solrDoc = new SolrInputDocument();
>> >>>>                solrDoc.addField("key", key);
>> >>>>                 solrDoc.addField("content_mvtxt", "updated value");
>> >>>>
>> >>>>                server = servers.get("
>> >>>> http://localhost:7574/solr/collection1");
>> >>>>
>> >>>>                 ureq = new UpdateRequest();
>> >>>>                ureq.setParam("update.chain", "distrib-update-chain");
>> >>>>                 //
>> ureq.deleteById("8060a9eb-9546-43ee-95bb-d18ea26a6285");
>> >>>>                 ureq.add(solrDoc);
>> >>>>                ureq.setParam("shards",
>> >>>>
>> >>>>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>> >>>>                ureq.setParam("self", "foo");
>> >>>>                ureq.setAction(ACTION.COMMIT, true, true);
>> >>>>                server.request(ureq);
>> >>>>                 // server.add(solrDoc);
>> >>>>                server.commit();
>> >>>>                server = servers.get("
>> >>>> http://localhost:8983/solr/collection1");
>> >>>>
>> >>>>
>> >>>>                server.commit();
>> >>>>                System.out.println("done");
>> >>>>
>> >>>> but I'm still seeing the doc appear on both shards.    After the first
>> >>>> commit I see the doc on 8983 with "initial value".  after the second
>> >>>> commit I see the updated value on 7574 and the old on 8983.  After the
>> >>>> final commit the doc on 8983 gets updated.
>> >>>>
>> >>>> Is there something wrong with my test?
>> >>>>
>> >>>> On Thu, Dec 1, 2011 at 11:17 PM, Mark Miller <ma...@gmail.com>
>> >>>> wrote:
>> >>>>> Getting late - didn't really pay attention to your code I guess - why
>> >>>> are you adding the first doc without specifying the distrib update
>> chain?
>> >>>> This is not really supported. It's going to just go to the server you
>> >>>> specified - even with everything setup right, the update might then
>> go to
>> >>>> that same server or the other one depending on how it hashes. You
>> really
>> >>>> want to just always use the distrib update chain.  I guess I don't yet
>> >>>> understand what you are trying to test.
>> >>>>>
>> >>>>> Sent from my iPad
>> >>>>>
>> >>>>> On Dec 1, 2011, at 10:57 PM, Mark Miller <ma...@gmail.com>
>> wrote:
>> >>>>>
>> >>>>>> Not sure offhand - but things will be funky if you don't specify the
>> >>>> correct numShards.
>> >>>>>>
>> >>>>>> The instance to shard assignment should be using numShards to
>> assign.
>> >>>> But then the hash to shard mapping actually goes on the number of
>> shards it
>> >>>> finds registered in ZK (it doesn't have to, but really these should be
>> >>>> equal).
>> >>>>>>
>> >>>>>> So basically you are saying, I want 3 partitions, but you are only
>> >>>> starting up 2 nodes, and the code is just not happy about that I'd
>> guess.
>> >>>> For the system to work properly, you have to fire up at least as many
>> >>>> servers as numShards.
>> >>>>>>
>> >>>>>> What are you trying to do? 2 partitions with no replicas, or one
>> >>>> partition with one replica?
>> >>>>>>
>> >>>>>> In either case, I think you will have better luck if you fire up at
>> >>>> least as many servers as the numShards setting. Or lower the numShards
>> >>>> setting.
>> >>>>>>
>> >>>>>> This is all a work in progress by the way - what you are trying to
>> test
>> >>>> should work if things are setup right though.
>> >>>>>>
>> >>>>>> - Mark
>> >>>>>>
>> >>>>>>
>> >>>>>> On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote:
>> >>>>>>
>> >>>>>>> Thanks for the quick response.  With that change (have not done
>> >>>>>>> numShards yet) shard1 got updated.  But now when executing the
>> >>>>>>> following queries I get information back from both, which doesn't
>> seem
>> >>>>>>> right
>> >>>>>>>
>> >>>>>>> http://localhost:7574/solr/select/?q=*:*
>> >>>>>>> <doc><str name="key">1</str><str name="content_mvtxt">updated
>> >>>> value</str></doc>
>> >>>>>>>
>> >>>>>>> http://localhost:8983/solr/select?q=*:*
>> >>>>>>> <doc><str name="key">1</str><str name="content_mvtxt">updated
>> >>>> value</str></doc>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller <
>> markrmiller@gmail.com>
>> >>>> wrote:
>> >>>>>>>> Hmm...sorry bout that - so my first guess is that right now we are
>> >>>> not distributing a commit (easy to add, just have not done it).
>> >>>>>>>>
>> >>>>>>>> Right now I explicitly commit on each server for tests.
>> >>>>>>>>
>> >>>>>>>> Can you try explicitly committing on server1 after updating the
>> doc
>> >>>> on server 2?
>> >>>>>>>>
>> >>>>>>>> I can start distributing commits tomorrow - been meaning to do it
>> for
>> >>>> my own convenience anyhow.
>> >>>>>>>>
>> >>>>>>>> Also, you want to pass the sys property numShards=1 on startup. I
>> >>>> think it defaults to 3. That will give you one leader and one replica.
>> >>>>>>>>
>> >>>>>>>> - Mark
>> >>>>>>>>
>> >>>>>>>> On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:
>> >>>>>>>>
>> >>>>>>>>> So I couldn't resist, I attempted to do this tonight, I used the
>> >>>>>>>>> solrconfig you mentioned (as is, no modifications), I setup a 2
>> shard
>> >>>>>>>>> cluster in collection1, I sent 1 doc to 1 of the shards, updated
>> it
>> >>>>>>>>> and sent the update to the other.  I don't see the modifications
>> >>>>>>>>> though I only see the original document.  The following is the
>> test
>> >>>>>>>>>
>> >>>>>>>>> public void update() throws Exception {
>> >>>>>>>>>
>> >>>>>>>>>              String key = "1";
>> >>>>>>>>>
>> >>>>>>>>>              SolrInputDocument solrDoc = new SolrInputDocument();
>> >>>>>>>>>              solrDoc.setField("key", key);
>> >>>>>>>>>
>> >>>>>>>>>              solrDoc.addField("content", "initial value");
>> >>>>>>>>>
>> >>>>>>>>>              SolrServer server = servers
>> >>>>>>>>>                              .get("
>> >>>> http://localhost:8983/solr/collection1");
>> >>>>>>>>>              server.add(solrDoc);
>> >>>>>>>>>
>> >>>>>>>>>              server.commit();
>> >>>>>>>>>
>> >>>>>>>>>              solrDoc = new SolrInputDocument();
>> >>>>>>>>>              solrDoc.addField("key", key);
>> >>>>>>>>>              solrDoc.addField("content", "updated value");
>> >>>>>>>>>
>> >>>>>>>>>              server = servers.get("
>> >>>> http://localhost:7574/solr/collection1");
>> >>>>>>>>>
>> >>>>>>>>>              UpdateRequest ureq = new UpdateRequest();
>> >>>>>>>>>              ureq.setParam("update.chain",
>> "distrib-update-chain");
>> >>>>>>>>>              ureq.add(solrDoc);
>> >>>>>>>>>              ureq.setParam("shards",
>> >>>>>>>>>
>> >>>>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>> >>>>>>>>>              ureq.setParam("self", "foo");
>> >>>>>>>>>              ureq.setAction(ACTION.COMMIT, true, true);
>> >>>>>>>>>              server.request(ureq);
>> >>>>>>>>>              System.out.println("done");
>> >>>>>>>>>      }
>> >>>>>>>>>
>> >>>>>>>>> key is my unique field in schema.xml
>> >>>>>>>>>
>> >>>>>>>>> What am I doing wrong?
>> >>>>>>>>>
>> >>>>>>>>> On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson <jej2003@gmail.com
>> >
>> >>>> wrote:
>> >>>>>>>>>> Yes, the ZK method seems much more flexible.  Adding a new shard
>> >>>> would
>> >>>>>>>>>> be simply updating the range assignments in ZK.  Where is this
>> >>>>>>>>>> currently on the list of things to accomplish?  I don't have
>> time to
>> >>>>>>>>>> work on this now, but if you (or anyone) could provide
>> direction I'd
>> >>>>>>>>>> be willing to work on this when I had spare time.  I guess a
>> JIRA
>> >>>>>>>>>> detailing where/how to do this could help.  Not sure if the
>> design
>> >>>> has
>> >>>>>>>>>> been thought out that far though.
>> >>>>>>>>>>
>> >>>>>>>>>> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <
>> markrmiller@gmail.com>
>> >>>> wrote:
>> >>>>>>>>>>> Right now lets say you have one shard - everything there
>> hashes to
>> >>>> range X.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Now you want to split that shard with an Index Splitter.
>> >>>>>>>>>>>
>> >>>>>>>>>>> You divide range X in two - giving you two ranges - then you
>> start
>> >>>> splitting. This is where the current Splitter needs a little
>> modification.
>> >>>> You decide which doc should go into which new index by rehashing each
>> doc
>> >>>> id in the index you are splitting - if its hash is greater than X/2,
>> it
>> >>>> goes into index1 - if its less, index2. I think there are a couple
>> current
>> >>>> Splitter impls, but one of them does something like, give me an id -
>> now if
>> >>>> the id's in the index are above that id, goto index1, if below,
>> index2. We
>> >>>> need to instead do a quick hash rather than simple id compare.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Why do you need to do this on every shard?
>> >>>>>>>>>>>
>> >>>>>>>>>>> The other part we need that we dont have is to store hash range
>> >>>> assignments in zookeeper - we don't do that yet because it's not
>> needed
>> >>>> yet. Instead we currently just simply calculate that on the fly (too
>> often
>> >>>> at the moment - on every request :) I intend to fix that of course).
>> >>>>>>>>>>>
>> >>>>>>>>>>> At the start, zk would say, for range X, goto this shard. After
>> >>>> the split, it would say, for range less than X/2 goto the old node,
>> for
>> >>>> range greater than X/2 goto the new node.
>> >>>>>>>>>>>
>> >>>>>>>>>>> - Mark
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> hmmm.....This doesn't sound like the hashing algorithm that's
>> on
>> >>>> the
>> >>>>>>>>>>>> branch, right?  The algorithm you're mentioning sounds like
>> there
>> >>>> is
>> >>>>>>>>>>>> some logic which is able to tell that a particular range
>> should be
>> >>>>>>>>>>>> distributed between 2 shards instead of 1.  So seems like a
>> trade
>> >>>> off
>> >>>>>>>>>>>> between repartitioning the entire index (on every shard) and
>> >>>> having a
>> >>>>>>>>>>>> custom hashing algorithm which is able to handle the situation
>> >>>> where 2
>> >>>>>>>>>>>> or more shards map to a particular range.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <
>> >>>> markrmiller@gmail.com> wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>> I am not familiar with the index splitter that is in
>> contrib,
>> >>>> but I'll
>> >>>>>>>>>>>>>> take a look at it soon.  So the process sounds like it
>> would be
>> >>>> to run
>> >>>>>>>>>>>>>> this on all of the current shards indexes based on the hash
>> >>>> algorithm.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Not something I've thought deeply about myself yet, but I
>> think
>> >>>> the idea would be to split as many as you felt you needed to.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> If you wanted to keep the full balance always, this would
>> mean
>> >>>> splitting every shard at once, yes. But this depends on how many boxes
>> >>>> (partitions) you are willing/able to add at a time.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> You might just split one index to start - now its hash range
>> >>>> would be handled by two shards instead of one (if you have 3 replicas
>> per
>> >>>> shard, this would mean adding 3 more boxes). When you needed to expand
>> >>>> again, you would split another index that was still handling its full
>> >>>> starting range. As you grow, once you split every original index,
>> you'd
>> >>>> start again, splitting one of the now half ranges.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Is there also an index merger in contrib which could be
>> used to
>> >>>> merge
>> >>>>>>>>>>>>>> indexes?  I'm assuming this would be the process?
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> You can merge with IndexWriter.addIndexes (Solr also has an
>> >>>> admin command that can do this). But I'm not sure where this fits in?
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> - Mark
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <
>> >>>> markrmiller@gmail.com> wrote:
>> >>>>>>>>>>>>>>> Not yet - we don't plan on working on this until a lot of
>> >>>> other stuff is
>> >>>>>>>>>>>>>>> working solid at this point. But someone else could jump
>> in!
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> There are a couple ways to go about it that I know of:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> A more long term solution may be to start using micro
>> shards -
>> >>>> each index
>> >>>>>>>>>>>>>>> starts as multiple indexes. This makes it pretty fast to
>> move
>> >>>> micro shards
>> >>>>>>>>>>>>>>> around as you decide to change partitions. It's also less
>> >>>> flexible as you
>> >>>>>>>>>>>>>>> are limited by the number of micro shards you start with.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> A more simple and likely first step is to use an index
>> >>>> splitter . We
>> >>>>>>>>>>>>>>> already have one in lucene contrib - we would just need to
>> >>>> modify it so
>> >>>>>>>>>>>>>>> that it splits based on the hash of the document id. This
>> is
>> >>>> super
>> >>>>>>>>>>>>>>> flexible, but splitting will obviously take a little while
>> on
>> >>>> a huge index.
>> >>>>>>>>>>>>>>> The current index splitter is a multi pass splitter - good
>> >>>> enough to start
>> >>>>>>>>>>>>>>> with, but with most files under codec control these days, we
>> may be
>> >>>> able to make
>> >>>>>>>>>>>>>>> a single pass splitter soon as well.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Eventually you could imagine using both options - micro
>> shards
>> >>>> that could
>> >>>>>>>>>>>>>>> also be split as needed. Though I still wonder if micro
>> shards
>> >>>> will be
>> >>>>>>>>>>>>>>> worth the extra complications myself...
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Right now though, the idea is that you should pick a good
>> >>>> number of
>> >>>>>>>>>>>>>>> partitions to start given your expected data ;) Adding more
>> >>>> replicas is
>> >>>>>>>>>>>>>>> trivial though.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> - Mark
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <
>> >>>> jej2003@gmail.com> wrote:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Another question, is there any support for repartitioning
>> of
>> >>>> the index
>> >>>>>>>>>>>>>>>> if a new shard is added?  What is the recommended
>> approach for
>> >>>>>>>>>>>>>>>> handling this?  It seemed that the hashing algorithm (and
>> >>>> probably
>> >>>>>>>>>>>>>>>> any) would require the index to be repartitioned should a
>> new
>> >>>> shard be
>> >>>>>>>>>>>>>>>> added.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <
>> >>>> jej2003@gmail.com> wrote:
>> >>>>>>>>>>>>>>>>> Thanks I will try this first thing in the morning.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <
>> >>>> markrmiller@gmail.com>
>> >>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <
>> >>>> jej2003@gmail.com>
>> >>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> I am currently looking at the latest solrcloud branch
>> and
>> >>>> was
>> >>>>>>>>>>>>>>>>>>> wondering if there was any documentation on
>> configuring the
>> >>>>>>>>>>>>>>>>>>> DistributedUpdateProcessor?  What specifically in
>> >>>> solrconfig.xml needs
>> >>>>>>>>>>>>>>>>>>> to be added/modified to make distributed indexing work?
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Hi Jaime - take a look at solrconfig-distrib-update.xml
>> in
>> >>>>>>>>>>>>>>>>>> solr/core/src/test-files
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> You need to enable the update log, add an empty
>> replication
>> >>>> handler def,
>> >>>>>>>>>>>>>>>>>> and an update chain with
>> >>>> solr.DistributedUpdateProcessFactory in it.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> --
>> >>>>>>>>>>>>>>>>>> - Mark
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> http://www.lucidimagination.com
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> --
>> >>>>>>>>>>>>>>> - Mark
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> http://www.lucidimagination.com
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> - Mark Miller
>> >>>>>>>>>>>>> lucidimagination.com
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> - Mark Miller
>> >>>>>>>>>>> lucidimagination.com
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>
>> >>>>>>>> - Mark Miller
>> >>>>>>>> lucidimagination.com
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>
>> >>>>>> - Mark Miller
>> >>>>>> lucidimagination.com
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> - Mark
>> >>>
>> >>> http://www.lucidimagination.com
>> >>>
>> >
>> > - Mark Miller
>> > lucidimagination.com
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>>
>
>
>
> --
> - Mark
>
> http://www.lucidimagination.com

Re: Configuring the Distributed

Posted by Mark Miller <ma...@gmail.com>.
Ah, okay - you are setting the shards in solr.xml - that's still an option
to force a node to a particular shard - but if you take that out, shards
will be auto assigned.

By the way, because of the version code, distrib deletes don't work at the
moment - will get to that next week.

- Mark

On Fri, Dec 2, 2011 at 1:16 PM, Jamie Johnson <je...@gmail.com> wrote:

> So I'm a fool.  I did set the numShards, the issue was so trivial it's
> embarrassing.  I did indeed have it setup as a replica, the shard
> names in solr.xml were both shard1.  This worked as I expected now.
>
> On Fri, Dec 2, 2011 at 1:02 PM, Mark Miller <ma...@gmail.com> wrote:
> >
> > They are unused params, so removing them wouldn't help anything.
> >
> > You might just want to wait till we are further along before playing
> with it.
> >
> > Or if you submit your full self contained test, I can see what's going
> on (eg its still unclear if you have started setting numShards?).
> >
> > I can do a similar set of actions in my tests and it works fine. The
> only reason I could see things working like this is if it thinks you have
> one shard - a leader and a replica.
> >
> > - Mark
> >
> > On Dec 2, 2011, at 12:41 PM, Jamie Johnson wrote:
> >
> >> Glad to hear I don't need to set shards/self, but removing them didn't
> >> seem to change what I'm seeing.  Doing this still results in 2
> >> documents 1 on 8983 and 1 on 7574.
> >>
> >> String key = "1";
> >>
> >>               SolrInputDocument solrDoc = new SolrInputDocument();
> >>               solrDoc.setField("key", key);
> >>
> >>               solrDoc.addField("content_mvtxt", "initial value");
> >>
> >>               SolrServer server = servers.get("
> http://localhost:8983/solr/collection1");
> >>
> >>               UpdateRequest ureq = new UpdateRequest();
> >>               ureq.setParam("update.chain", "distrib-update-chain");
> >>               ureq.add(solrDoc);
> >>               ureq.setAction(ACTION.COMMIT, true, true);
> >>               server.request(ureq);
> >>               server.commit();
> >>
> >>               solrDoc = new SolrInputDocument();
> >>               solrDoc.addField("key", key);
> >>               solrDoc.addField("content_mvtxt", "updated value");
> >>
> >>               server = servers.get("
> http://localhost:7574/solr/collection1");
> >>
> >>               ureq = new UpdateRequest();
> >>               ureq.setParam("update.chain", "distrib-update-chain");
> >>               ureq.add(solrDoc);
> >>               ureq.setAction(ACTION.COMMIT, true, true);
> >>               server.request(ureq);
> >>               server.commit();
> >>
> >>               server = servers.get("
> http://localhost:8983/solr/collection1");
> >>
> >>
> >>               server.commit();
> >>               System.out.println("done");
> >>
> >> On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller <ma...@gmail.com>
> wrote:
> >>> So I dunno. You are running a zk server and running in zk mode right?
> >>>
> >>> You don't need to / shouldn't set a shards or self param. The shards
> are
> >>> figured out from Zookeeper.
> >>>
> >>> You always want to use the distrib-update-chain. Eventually it will
> >>> probably be part of the default chain and auto turn in zk mode.
> >>>
> >>> If you are running in zk mode attached to a zk server, this should
> work no
> >>> problem. You can add docs to any server and they will be forwarded to
> the
> >>> correct shard leader and then versioned and forwarded to replicas.
> >>>
> >>> You can also use the CloudSolrServer solrj client - that way you don't
> even
> >>> have to choose a server to send docs too - in which case if it went
> down
> >>> you would have to choose another manually - CloudSolrServer
> automatically
> >>> finds one that is up through ZooKeeper. Eventually it will also be
> smart
> >>> and do the hashing itself so that it can send directly to the shard
> leader
> >>> that the doc would be forwarded to anyway.
> >>>
> >>> - Mark
> >>>
> >>> On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson <je...@gmail.com>
> wrote:
> >>>
> >>>> Really just trying to do a simple add and update test, the chain
> >>>> missing is just proof of my not understanding exactly how this is
> >>>> supposed to work.  I modified the code to this
> >>>>
> >>>>                String key = "1";
> >>>>
> >>>>                SolrInputDocument solrDoc = new SolrInputDocument();
> >>>>                solrDoc.setField("key", key);
> >>>>
> >>>>                 solrDoc.addField("content_mvtxt", "initial value");
> >>>>
> >>>>                SolrServer server = servers
> >>>>                                .get("
> >>>> http://localhost:8983/solr/collection1");
> >>>>
> >>>>                 UpdateRequest ureq = new UpdateRequest();
> >>>>                ureq.setParam("update.chain", "distrib-update-chain");
> >>>>                ureq.add(solrDoc);
> >>>>                ureq.setParam("shards",
> >>>>
> >>>>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
> >>>>                ureq.setParam("self", "foo");
> >>>>                ureq.setAction(ACTION.COMMIT, true, true);
> >>>>                server.request(ureq);
> >>>>                 server.commit();
> >>>>
> >>>>                solrDoc = new SolrInputDocument();
> >>>>                solrDoc.addField("key", key);
> >>>>                 solrDoc.addField("content_mvtxt", "updated value");
> >>>>
> >>>>                server = servers.get("
> >>>> http://localhost:7574/solr/collection1");
> >>>>
> >>>>                 ureq = new UpdateRequest();
> >>>>                ureq.setParam("update.chain", "distrib-update-chain");
> >>>>                 //
> ureq.deleteById("8060a9eb-9546-43ee-95bb-d18ea26a6285");
> >>>>                 ureq.add(solrDoc);
> >>>>                ureq.setParam("shards",
> >>>>
> >>>>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
> >>>>                ureq.setParam("self", "foo");
> >>>>                ureq.setAction(ACTION.COMMIT, true, true);
> >>>>                server.request(ureq);
> >>>>                 // server.add(solrDoc);
> >>>>                server.commit();
> >>>>                server = servers.get("
> >>>> http://localhost:8983/solr/collection1");
> >>>>
> >>>>
> >>>>                server.commit();
> >>>>                System.out.println("done");
> >>>>
> >>>> but I'm still seeing the doc appear on both shards.    After the first
> >>>> commit I see the doc on 8983 with "initial value".  after the second
> >>>> commit I see the updated value on 7574 and the old on 8983.  After the
> >>>> final commit the doc on 8983 gets updated.
> >>>>
> >>>> Is there something wrong with my test?
> >>>>
> >>>> On Thu, Dec 1, 2011 at 11:17 PM, Mark Miller <ma...@gmail.com>
> >>>> wrote:
> >>>>> Getting late - didn't really pay attention to your code I guess - why
> >>>> are you adding the first doc without specifying the distrib update
> chain?
> >>>> This is not really supported. It's going to just go to the server you
> >>>> specified - even with everything setup right, the update might then
> go to
> >>>> that same server or the other one depending on how it hashes. You
> really
> >>>> want to just always use the distrib update chain.  I guess I don't yet
> >>>> understand what you are trying to test.
> >>>>>
> >>>>> Sent from my iPad
> >>>>>
> >>>>> On Dec 1, 2011, at 10:57 PM, Mark Miller <ma...@gmail.com>
> wrote:
> >>>>>
> >>>>>> Not sure offhand - but things will be funky if you don't specify the
> >>>> correct numShards.
> >>>>>>
> >>>>>> The instance to shard assignment should be using numShards to
> assign.
> >>>> But then the hash to shard mapping actually goes on the number of
> shards it
> >>>> finds registered in ZK (it doesn't have to, but really these should be
> >>>> equal).
> >>>>>>
> >>>>>> So basically you are saying, I want 3 partitions, but you are only
> >>>> starting up 2 nodes, and the code is just not happy about that I'd
> guess.
> >>>> For the system to work properly, you have to fire up at least as many
> >>>> servers as numShards.
> >>>>>>
> >>>>>> What are you trying to do? 2 partitions with no replicas, or one
> >>>> partition with one replica?
> >>>>>>
> >>>>>> In either case, I think you will have better luck if you fire up at
> >>>> least as many servers as the numShards setting. Or lower the numShards
> >>>> setting.
> >>>>>>
> >>>>>> This is all a work in progress by the way - what you are trying to
> test
> >>>> should work if things are setup right though.
> >>>>>>
> >>>>>> - Mark
> >>>>>>
> >>>>>>
> >>>>>> On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote:
> >>>>>>
> >>>>>>> Thanks for the quick response.  With that change (have not done
> >>>>>>> numShards yet) shard1 got updated.  But now when executing the
> >>>>>>> following queries I get information back from both, which doesn't
> seem
> >>>>>>> right
> >>>>>>>
> >>>>>>> http://localhost:7574/solr/select/?q=*:*
> >>>>>>> <doc><str name="key">1</str><str name="content_mvtxt">updated
> >>>> value</str></doc>
> >>>>>>>
> >>>>>>> http://localhost:8983/solr/select?q=*:*
> >>>>>>> <doc><str name="key">1</str><str name="content_mvtxt">updated
> >>>> value</str></doc>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller <
> markrmiller@gmail.com>
> >>>> wrote:
> >>>>>>>> Hmm...sorry bout that - so my first guess is that right now we are
> >>>> not distributing a commit (easy to add, just have not done it).
> >>>>>>>>
> >>>>>>>> Right now I explicitly commit on each server for tests.
> >>>>>>>>
> >>>>>>>> Can you try explicitly committing on server1 after updating the
> doc
> >>>> on server 2?
> >>>>>>>>
> >>>>>>>> I can start distributing commits tomorrow - been meaning to do it
> for
> >>>> my own convenience anyhow.
> >>>>>>>>
> >>>>>>>> Also, you want to pass the sys property numShards=1 on startup. I
> >>>> think it defaults to 3. That will give you one leader and one replica.
> >>>>>>>>
> >>>>>>>> - Mark
> >>>>>>>>
> >>>>>>>> On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:
> >>>>>>>>
> >>>>>>>>> So I couldn't resist, I attempted to do this tonight, I used the
> >>>>>>>>> solrconfig you mentioned (as is, no modifications), I setup a 2
> shard
> >>>>>>>>> cluster in collection1, I sent 1 doc to 1 of the shards, updated
> it
> >>>>>>>>> and sent the update to the other.  I don't see the modifications
> >>>>>>>>> though I only see the original document.  The following is the
> test
> >>>>>>>>>
> >>>>>>>>> public void update() throws Exception {
> >>>>>>>>>
> >>>>>>>>>              String key = "1";
> >>>>>>>>>
> >>>>>>>>>              SolrInputDocument solrDoc = new SolrInputDocument();
> >>>>>>>>>              solrDoc.setField("key", key);
> >>>>>>>>>
> >>>>>>>>>              solrDoc.addField("content", "initial value");
> >>>>>>>>>
> >>>>>>>>>              SolrServer server = servers
> >>>>>>>>>                              .get("
> >>>> http://localhost:8983/solr/collection1");
> >>>>>>>>>              server.add(solrDoc);
> >>>>>>>>>
> >>>>>>>>>              server.commit();
> >>>>>>>>>
> >>>>>>>>>              solrDoc = new SolrInputDocument();
> >>>>>>>>>              solrDoc.addField("key", key);
> >>>>>>>>>              solrDoc.addField("content", "updated value");
> >>>>>>>>>
> >>>>>>>>>              server = servers.get("
> >>>> http://localhost:7574/solr/collection1");
> >>>>>>>>>
> >>>>>>>>>              UpdateRequest ureq = new UpdateRequest();
> >>>>>>>>>              ureq.setParam("update.chain",
> "distrib-update-chain");
> >>>>>>>>>              ureq.add(solrDoc);
> >>>>>>>>>              ureq.setParam("shards",
> >>>>>>>>>
> >>>>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
> >>>>>>>>>              ureq.setParam("self", "foo");
> >>>>>>>>>              ureq.setAction(ACTION.COMMIT, true, true);
> >>>>>>>>>              server.request(ureq);
> >>>>>>>>>              System.out.println("done");
> >>>>>>>>>      }
> >>>>>>>>>
> >>>>>>>>> key is my unique field in schema.xml
> >>>>>>>>>
> >>>>>>>>> What am I doing wrong?
> >>>>>>>>>
> >>>>>>>>> On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson <jej2003@gmail.com
> >
> >>>> wrote:
> >>>>>>>>>> Yes, the ZK method seems much more flexible.  Adding a new shard
> >>>> would
> >>>>>>>>>> be simply updating the range assignments in ZK.  Where is this
> >>>>>>>>>> currently on the list of things to accomplish?  I don't have
> time to
> >>>>>>>>>> work on this now, but if you (or anyone) could provide
> direction I'd
> >>>>>>>>>> be willing to work on this when I had spare time.  I guess a
> JIRA
> >>>>>>>>>> detailing where/how to do this could help.  Not sure if the
> design
> >>>> has
> >>>>>>>>>> been thought out that far though.
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <
> markrmiller@gmail.com>
> >>>> wrote:
> >>>>>>>>>>> Right now lets say you have one shard - everything there
> hashes to
> >>>> range X.
> >>>>>>>>>>>
> >>>>>>>>>>> Now you want to split that shard with an Index Splitter.
> >>>>>>>>>>>
> >>>>>>>>>>> You divide range X in two - giving you two ranges - then you
> start
> >>>> splitting. This is where the current Splitter needs a little
> modification.
> >>>> You decide which doc should go into which new index by rehashing each
> doc
> >>>> id in the index you are splitting - if its hash is greater than X/2,
> it
> >>>> goes into index1 - if its less, index2. I think there are a couple
> current
> >>>> Splitter impls, but one of them does something like, give me an id -
> now if
> >>>> the id's in the index are above that id, goto index1, if below,
> index2. We
> >>>> need to instead do a quick hash rather than simple id compare.
> >>>>>>>>>>>
> >>>>>>>>>>> Why do you need to do this on every shard?
> >>>>>>>>>>>
> >>>>>>>>>>> The other part we need that we dont have is to store hash range
> >>>> assignments in zookeeper - we don't do that yet because it's not
> needed
> >>>> yet. Instead we currently just simply calculate that on the fly (too
> often
> >>>> at the moment - on every request :) I intend to fix that of course).
> >>>>>>>>>>>
> >>>>>>>>>>> At the start, zk would say, for range X, goto this shard. After
> >>>> the split, it would say, for range less than X/2 goto the old node,
> for
> >>>> range greater than X/2 goto the new node.
> >>>>>>>>>>>
> >>>>>>>>>>> - Mark
> >>>>>>>>>>>
> >>>>>>>>>>> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> hmmm.....This doesn't sound like the hashing algorithm that's
> on
> >>>> the
> >>>>>>>>>>>> branch, right?  The algorithm you're mentioning sounds like
> there
> >>>> is
> >>>>>>>>>>>> some logic which is able to tell that a particular range
> should be
> >>>>>>>>>>>> distributed between 2 shards instead of 1.  So seems like a
> trade
> >>>> off
> >>>>>>>>>>>> between repartitioning the entire index (on every shard) and
> >>>> having a
> >>>>>>>>>>>> custom hashing algorithm which is able to handle the situation
> >>>> where 2
> >>>>>>>>>>>> or more shards map to a particular range.
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <
> >>>> markrmiller@gmail.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> I am not familiar with the index splitter that is in
> contrib,
> >>>> but I'll
> >>>>>>>>>>>>>> take a look at it soon.  So the process sounds like it
> would be
> >>>> to run
> >>>>>>>>>>>>>> this on all of the current shards indexes based on the hash
> >>>> algorithm.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Not something I've thought deeply about myself yet, but I
> think
> >>>> the idea would be to split as many as you felt you needed to.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> If you wanted to keep the full balance always, this would
> mean
> >>>> splitting every shard at once, yes. But this depends on how many boxes
> >>>> (partitions) you are willing/able to add at a time.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> You might just split one index to start - now it's hash range
> >>>> would be handled by two shards instead of one (if you have 3 replicas
> per
> >>>> shard, this would mean adding 3 more boxes). When you needed to expand
> >>>> again, you would split another index that was still handling its full
> >>>> starting range. As you grow, once you split every original index,
> you'd
> >>>> start again, splitting one of the now half ranges.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Is there also an index merger in contrib which could be
> used to
> >>>> merge
> >>>>>>>>>>>>>> indexes?  I'm assuming this would be the process?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> You can merge with IndexWriter.addIndexes (Solr also has an
> >>>> admin command that can do this). But I'm not sure where this fits in?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Mark
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <
> >>>> markrmiller@gmail.com> wrote:
> >>>>>>>>>>>>>>> Not yet - we don't plan on working on this until a lot of
> >>>> other stuff is
> >>>>>>>>>>>>>>> working solid at this point. But someone else could jump
> in!
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> There are a couple ways to go about it that I know of:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> A more long term solution may be to start using micro
> shards -
> >>>> each index
> >>>>>>>>>>>>>>> starts as multiple indexes. This makes it pretty fast to
> move
> >>>> mirco shards
> >>>>>>>>>>>>>>> around as you decide to change partitions. It's also less
> >>>> flexible as you
> >>>>>>>>>>>>>>> are limited by the number of micro shards you start with.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> A more simple and likely first step is to use an index
> >>>> splitter . We
> >>>>>>>>>>>>>>> already have one in lucene contrib - we would just need to
> >>>> modify it so
> >>>>>>>>>>>>>>> that it splits based on the hash of the document id. This
> is
> >>>> super
> >>>>>>>>>>>>>>> flexible, but splitting will obviously take a little while
> on
> >>>> a huge index.
> >>>>>>>>>>>>>>> The current index splitter is a multi pass splitter - good
> >>>> enough to start
> >>>>>>>>>>>>>>> with, but most files under codec control these days, we
> may be
> >>>> able to make
> >>>>>>>>>>>>>>> a single pass splitter soon as well.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Eventually you could imagine using both options - micro
> shards
> >>>> that could
> >>>>>>>>>>>>>>> also be split as needed. Though I still wonder if micro
> shards
> >>>> will be
> >>>>>>>>>>>>>>> worth the extra complications myself...
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Right now though, the idea is that you should pick a good
> >>>> number of
> >>>>>>>>>>>>>>> partitions to start given your expected data ;) Adding more
> >>>> replicas is
> >>>>>>>>>>>>>>> trivial though.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> - Mark
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <
> >>>> jej2003@gmail.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Another question, is there any support for repartitioning
> of
> >>>> the index
> >>>>>>>>>>>>>>>> if a new shard is added?  What is the recommended
> approach for
> >>>>>>>>>>>>>>>> handling this?  It seemed that the hashing algorithm (and
> >>>> probably
> >>>>>>>>>>>>>>>> any) would require the index to be repartitioned should a
> new
> >>>> shard be
> >>>>>>>>>>>>>>>> added.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <
> >>>> jej2003@gmail.com> wrote:
> >>>>>>>>>>>>>>>>> Thanks I will try this first thing in the morning.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <
> >>>> markrmiller@gmail.com>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <
> >>>> jej2003@gmail.com>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I am currently looking at the latest solrcloud branch
> and
> >>>> was
> >>>>>>>>>>>>>>>>>>> wondering if there was any documentation on
> configuring the
> >>>>>>>>>>>>>>>>>>> DistributedUpdateProcessor?  What specifically in
> >>>> solrconfig.xml needs
> >>>>>>>>>>>>>>>>>>> to be added/modified to make distributed indexing work?
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi Jaime - take a look at solrconfig-distrib-update.xml
> in
> >>>>>>>>>>>>>>>>>> solr/core/src/test-files
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> You need to enable the update log, add an empty
> replication
> >>>> handler def,
> >>>>>>>>>>>>>>>>>> and an update chain with
> >>>> solr.DistributedUpdateProcessFactory in it.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>>>> - Mark
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> http://www.lucidimagination.com
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>> - Mark
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> http://www.lucidimagination.com
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Mark Miller
> >>>>>>>>>>>>> lucidimagination.com
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> - Mark Miller
> >>>>>>>>>>> lucidimagination.com
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>> - Mark Miller
> >>>>>>>> lucidimagination.com
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>> - Mark Miller
> >>>>>> lucidimagination.com
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> - Mark
> >>>
> >>> http://www.lucidimagination.com
> >>>
> >
> > - Mark Miller
> > lucidimagination.com
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
>



-- 
- Mark

http://www.lucidimagination.com

Re: Configuring the Distributed

Posted by Jamie Johnson <je...@gmail.com>.
So I'm a fool.  I did set numShards; the issue was so trivial it's
embarrassing.  I did indeed have it set up as a replica - the shard
names in solr.xml were both shard1.  With that fixed, this works as I
expected now.
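
For anyone hitting the same thing, the fix was just giving the two cores
distinct shard names (or dropping the attribute entirely, per Mark's note).
Roughly, with placeholder names:

  <!-- solr.xml on the 8983 node -->
  <core name="collection1" instanceDir="." shard="shard1" />

  <!-- solr.xml on the 7574 node -->
  <core name="collection1" instanceDir="." shard="shard2" />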

On Fri, Dec 2, 2011 at 1:02 PM, Mark Miller <ma...@gmail.com> wrote:
>
> They are unused params, so removing them wouldn't help anything.
>
> You might just want to wait till we are further along before playing with it.
>
> Or if you submit your full self contained test, I can see what's going on (eg its still unclear if you have started setting numShards?).
>
> I can do a similar set of actions in my tests and it works fine. The only reason I could see things working like this is if it thinks you have one shard - a leader and a replica.
>
> - Mark
>
> On Dec 2, 2011, at 12:41 PM, Jamie Johnson wrote:
>
>> Glad to hear I don't need to set shards/self, but removing them didn't
>> seem to change what I'm seeing.  Doing this still results in 2
>> documents 1 on 8983 and 1 on 7574.
>>
>> String key = "1";
>>
>>               SolrInputDocument solrDoc = new SolrInputDocument();
>>               solrDoc.setField("key", key);
>>
>>               solrDoc.addField("content_mvtxt", "initial value");
>>
>>               SolrServer server = servers.get("http://localhost:8983/solr/collection1");
>>
>>               UpdateRequest ureq = new UpdateRequest();
>>               ureq.setParam("update.chain", "distrib-update-chain");
>>               ureq.add(solrDoc);
>>               ureq.setAction(ACTION.COMMIT, true, true);
>>               server.request(ureq);
>>               server.commit();
>>
>>               solrDoc = new SolrInputDocument();
>>               solrDoc.addField("key", key);
>>               solrDoc.addField("content_mvtxt", "updated value");
>>
>>               server = servers.get("http://localhost:7574/solr/collection1");
>>
>>               ureq = new UpdateRequest();
>>               ureq.setParam("update.chain", "distrib-update-chain");
>>               ureq.add(solrDoc);
>>               ureq.setAction(ACTION.COMMIT, true, true);
>>               server.request(ureq);
>>               server.commit();
>>
>>               server = servers.get("http://localhost:8983/solr/collection1");
>>
>>
>>               server.commit();
>>               System.out.println("done");
>>
>> On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller <ma...@gmail.com> wrote:
>>> So I dunno. You are running a zk server and running in zk mode right?
>>>
>>> You don't need to / shouldn't set a shards or self param. The shards are
>>> figured out from Zookeeper.
>>>
>>> You always want to use the distrib-update-chain. Eventually it will
>>> probably be part of the default chain and auto turn in zk mode.
>>>
>>> If you are running in zk mode attached to a zk server, this should work no
>>> problem. You can add docs to any server and they will be forwarded to the
>>> correct shard leader and then versioned and forwarded to replicas.
>>>
>>> You can also use the CloudSolrServer solrj client - that way you don't even
>>> have to choose a server to send docs too - in which case if it went down
>>> you would have to choose another manually - CloudSolrServer automatically
>>> finds one that is up through ZooKeeper. Eventually it will also be smart
>>> and do the hashing itself so that it can send directly to the shard leader
>>> that the doc would be forwarded to anyway.
>>>
>>> - Mark
>>>
>>> On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson <je...@gmail.com> wrote:
>>>
>>>> Really just trying to do a simple add and update test, the chain
>>>> missing is just proof of my not understanding exactly how this is
>>>> supposed to work.  I modified the code to this
>>>>
>>>>                String key = "1";
>>>>
>>>>                SolrInputDocument solrDoc = new SolrInputDocument();
>>>>                solrDoc.setField("key", key);
>>>>
>>>>                 solrDoc.addField("content_mvtxt", "initial value");
>>>>
>>>>                SolrServer server = servers
>>>>                                .get("
>>>> http://localhost:8983/solr/collection1");
>>>>
>>>>                 UpdateRequest ureq = new UpdateRequest();
>>>>                ureq.setParam("update.chain", "distrib-update-chain");
>>>>                ureq.add(solrDoc);
>>>>                ureq.setParam("shards",
>>>>
>>>>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>>>>                ureq.setParam("self", "foo");
>>>>                ureq.setAction(ACTION.COMMIT, true, true);
>>>>                server.request(ureq);
>>>>                 server.commit();
>>>>
>>>>                solrDoc = new SolrInputDocument();
>>>>                solrDoc.addField("key", key);
>>>>                 solrDoc.addField("content_mvtxt", "updated value");
>>>>
>>>>                server = servers.get("
>>>> http://localhost:7574/solr/collection1");
>>>>
>>>>                 ureq = new UpdateRequest();
>>>>                ureq.setParam("update.chain", "distrib-update-chain");
>>>>                 // ureq.deleteById("8060a9eb-9546-43ee-95bb-d18ea26a6285");
>>>>                 ureq.add(solrDoc);
>>>>                ureq.setParam("shards",
>>>>
>>>>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>>>>                ureq.setParam("self", "foo");
>>>>                ureq.setAction(ACTION.COMMIT, true, true);
>>>>                server.request(ureq);
>>>>                 // server.add(solrDoc);
>>>>                server.commit();
>>>>                server = servers.get("
>>>> http://localhost:8983/solr/collection1");
>>>>
>>>>
>>>>                server.commit();
>>>>                System.out.println("done");
>>>>
>>>> but I'm still seeing the doc appear on both shards.    After the first
>>>> commit I see the doc on 8983 with "initial value".  after the second
>>>> commit I see the updated value on 7574 and the old on 8983.  After the
>>>> final commit the doc on 8983 gets updated.
>>>>
>>>> Is there something wrong with my test?
>>>>
>>>> On Thu, Dec 1, 2011 at 11:17 PM, Mark Miller <ma...@gmail.com>
>>>> wrote:
>>>>> Getting late - didn't really pay attention to your code I guess - why
>>>> are you adding the first doc without specifying the distrib update chain?
>>>> This is not really supported. It's going to just go to the server you
>>>> specified - even with everything setup right, the update might then go to
>>>> that same server or the other one depending on how it hashes. You really
>>>> want to just always use the distrib update chain.  I guess I don't yet
>>>> understand what you are trying to test.
>>>>>
>>>>> Sent from my iPad
>>>>>
>>>>> On Dec 1, 2011, at 10:57 PM, Mark Miller <ma...@gmail.com> wrote:
>>>>>
>>>>>> Not sure offhand - but things will be funky if you don't specify the
>>>> correct numShards.
>>>>>>
>>>>>> The instance to shard assignment should be using numShards to assign.
>>>> But then the hash to shard mapping actually goes on the number of shards it
>>>> finds registered in ZK (it doesn't have to, but really these should be
>>>> equal).
>>>>>>
>>>>>> So basically you are saying, I want 3 partitions, but you are only
>>>> starting up 2 nodes, and the code is just not happy about that I'd guess.
>>>> For the system to work properly, you have to fire up at least as many
>>>> servers as numShards.
>>>>>>
>>>>>> What are you trying to do? 2 partitions with no replicas, or one
>>>> partition with one replica?
>>>>>>
>>>>>> In either case, I think you will have better luck if you fire up at
>>>> least as many servers as the numShards setting. Or lower the numShards
>>>> setting.
>>>>>>
>>>>>> This is all a work in progress by the way - what you are trying to test
>>>> should work if things are setup right though.
>>>>>>
>>>>>> - Mark
>>>>>>
>>>>>>
>>>>>> On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote:
>>>>>>
>>>>>>> Thanks for the quick response.  With that change (have not done
>>>>>>> numShards yet) shard1 got updated.  But now when executing the
>>>>>>> following queries I get information back from both, which doesn't seem
>>>>>>> right
>>>>>>>
>>>>>>> http://localhost:7574/solr/select/?q=*:*
>>>>>>> <doc><str name="key">1</str><str name="content_mvtxt">updated
>>>> value</str></doc>
>>>>>>>
>>>>>>> http://localhost:8983/solr/select?q=*:*
>>>>>>> <doc><str name="key">1</str><str name="content_mvtxt">updated
>>>> value</str></doc>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller <ma...@gmail.com>
>>>> wrote:
>>>>>>>> Hmm...sorry bout that - so my first guess is that right now we are
>>>> not distributing a commit (easy to add, just have not done it).
>>>>>>>>
>>>>>>>> Right now I explicitly commit on each server for tests.
>>>>>>>>
>>>>>>>> Can you try explicitly committing on server1 after updating the doc
>>>> on server 2?
>>>>>>>>
>>>>>>>> I can start distributing commits tomorrow - been meaning to do it for
>>>> my own convenience anyhow.
>>>>>>>>
>>>>>>>> Also, you want to pass the sys property numShards=1 on startup. I
>>>> think it defaults to 3. That will give you one leader and one replica.
>>>>>>>>
>>>>>>>> - Mark
>>>>>>>>
>>>>>>>> On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:
>>>>>>>>
>>>>>>>>> So I couldn't resist, I attempted to do this tonight, I used the
>>>>>>>>> solrconfig you mentioned (as is, no modifications), I setup a 2 shard
>>>>>>>>> cluster in collection1, I sent 1 doc to 1 of the shards, updated it
>>>>>>>>> and sent the update to the other.  I don't see the modifications
>>>>>>>>> though I only see the original document.  The following is the test
>>>>>>>>>
>>>>>>>>> public void update() throws Exception {
>>>>>>>>>
>>>>>>>>>              String key = "1";
>>>>>>>>>
>>>>>>>>>              SolrInputDocument solrDoc = new SolrInputDocument();
>>>>>>>>>              solrDoc.setField("key", key);
>>>>>>>>>
>>>>>>>>>              solrDoc.addField("content", "initial value");
>>>>>>>>>
>>>>>>>>>              SolrServer server = servers
>>>>>>>>>                              .get("
>>>> http://localhost:8983/solr/collection1");
>>>>>>>>>              server.add(solrDoc);
>>>>>>>>>
>>>>>>>>>              server.commit();
>>>>>>>>>
>>>>>>>>>              solrDoc = new SolrInputDocument();
>>>>>>>>>              solrDoc.addField("key", key);
>>>>>>>>>              solrDoc.addField("content", "updated value");
>>>>>>>>>
>>>>>>>>>              server = servers.get("
>>>> http://localhost:7574/solr/collection1");
>>>>>>>>>
>>>>>>>>>              UpdateRequest ureq = new UpdateRequest();
>>>>>>>>>              ureq.setParam("update.chain", "distrib-update-chain");
>>>>>>>>>              ureq.add(solrDoc);
>>>>>>>>>              ureq.setParam("shards",
>>>>>>>>>
>>>>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>>>>>>>>>              ureq.setParam("self", "foo");
>>>>>>>>>              ureq.setAction(ACTION.COMMIT, true, true);
>>>>>>>>>              server.request(ureq);
>>>>>>>>>              System.out.println("done");
>>>>>>>>>      }
>>>>>>>>>
>>>>>>>>> key is my unique field in schema.xml
>>>>>>>>>
>>>>>>>>> What am I doing wrong?
>>>>>>>>>
>>>>>>>>> On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson <je...@gmail.com>
>>>> wrote:
>>>>>>>>>> Yes, the ZK method seems much more flexible.  Adding a new shard
>>>> would
>>>>>>>>>> be simply updating the range assignments in ZK.  Where is this
>>>>>>>>>> currently on the list of things to accomplish?  I don't have time to
>>>>>>>>>> work on this now, but if you (or anyone) could provide direction I'd
>>>>>>>>>> be willing to work on this when I had spare time.  I guess a JIRA
>>>>>>>>>> detailing where/how to do this could help.  Not sure if the design
>>>> has
>>>>>>>>>> been thought out that far though.
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <ma...@gmail.com>
>>>> wrote:
>>>>>>>>>>> Right now lets say you have one shard - everything there hashes to
>>>> range X.
>>>>>>>>>>>
>>>>>>>>>>> Now you want to split that shard with an Index Splitter.
>>>>>>>>>>>
>>>>>>>>>>> You divide range X in two - giving you two ranges - then you start
>>>> splitting. This is where the current Splitter needs a little modification.
>>>> You decide which doc should go into which new index by rehashing each doc
>>>> id in the index you are splitting - if its hash is greater than X/2, it
>>>> goes into index1 - if its less, index2. I think there are a couple current
>>>> Splitter impls, but one of them does something like, give me an id - now if
>>>> the id's in the index are above that id, goto index1, if below, index2. We
>>>> need to instead do a quick hash rather than simple id compare.
>>>>>>>>>>>
>>>>>>>>>>> Why do you need to do this on every shard?
>>>>>>>>>>>
>>>>>>>>>>> The other part we need that we dont have is to store hash range
>>>> assignments in zookeeper - we don't do that yet because it's not needed
>>>> yet. Instead we currently just simply calculate that on the fly (too often
>>>> at the moment - on every request :) I intend to fix that of course).
>>>>>>>>>>>
>>>>>>>>>>> At the start, zk would say, for range X, goto this shard. After
>>>> the split, it would say, for range less than X/2 goto the old node, for
>>>> range greater than X/2 goto the new node.
>>>>>>>>>>>
>>>>>>>>>>> - Mark
>>>>>>>>>>>
>>>>>>>>>>> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
>>>>>>>>>>>
>>>>>>>>>>>> hmmm.....This doesn't sound like the hashing algorithm that's on
>>>> the
>>>>>>>>>>>> branch, right?  The algorithm you're mentioning sounds like there
>>>> is
>>>>>>>>>>>> some logic which is able to tell that a particular range should be
>>>>>>>>>>>> distributed between 2 shards instead of 1.  So seems like a trade
>>>> off
>>>>>>>>>>>> between repartitioning the entire index (on every shard) and
>>>> having a
>>>>>>>>>>>> custom hashing algorithm which is able to handle the situation
>>>> where 2
>>>>>>>>>>>> or more shards map to a particular range.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <
>>>> markrmiller@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am not familiar with the index splitter that is in contrib,
>>>> but I'll
>>>>>>>>>>>>>> take a look at it soon.  So the process sounds like it would be
>>>> to run
>>>>>>>>>>>>>> this on all of the current shards indexes based on the hash
>>>> algorithm.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Not something I've thought deeply about myself yet, but I think
>>>> the idea would be to split as many as you felt you needed to.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you wanted to keep the full balance always, this would mean
>>>> splitting every shard at once, yes. But this depends on how many boxes
>>>> (partitions) you are willing/able to add at a time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> You might just split one index to start - now it's hash range
>>>> would be handled by two shards instead of one (if you have 3 replicas per
>>>> shard, this would mean adding 3 more boxes). When you needed to expand
>>>> again, you would split another index that was still handling its full
>>>> starting range. As you grow, once you split every original index, you'd
>>>> start again, splitting one of the now half ranges.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there also an index merger in contrib which could be used to
>>>> merge
>>>>>>>>>>>>>> indexes?  I'm assuming this would be the process?
>>>>>>>>>>>>>
>>>>>>>>>>>>> You can merge with IndexWriter.addIndexes (Solr also has an
>>>> admin command that can do this). But I'm not sure where this fits in?
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <
>>>> markrmiller@gmail.com> wrote:
>>>>>>>>>>>>>>> Not yet - we don't plan on working on this until a lot of
>>>> other stuff is
>>>>>>>>>>>>>>> working solid at this point. But someone else could jump in!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There are a couple ways to go about it that I know of:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> A more long term solution may be to start using micro shards -
>>>> each index
>>>>>>>>>>>>>>> starts as multiple indexes. This makes it pretty fast to move
>>>> mirco shards
>>>>>>>>>>>>>>> around as you decide to change partitions. It's also less
>>>> flexible as you
>>>>>>>>>>>>>>> are limited by the number of micro shards you start with.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> A more simple and likely first step is to use an index
>>>> splitter . We
>>>>>>>>>>>>>>> already have one in lucene contrib - we would just need to
>>>> modify it so
>>>>>>>>>>>>>>> that it splits based on the hash of the document id. This is
>>>> super
>>>>>>>>>>>>>>> flexible, but splitting will obviously take a little while on
>>>> a huge index.
>>>>>>>>>>>>>>> The current index splitter is a multi pass splitter - good
>>>> enough to start
>>>>>>>>>>>>>>> with, but most files under codec control these days, we may be
>>>> able to make
>>>>>>>>>>>>>>> a single pass splitter soon as well.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Eventually you could imagine using both options - micro shards
>>>> that could
>>>>>>>>>>>>>>> also be split as needed. Though I still wonder if micro shards
>>>> will be
>>>>>>>>>>>>>>> worth the extra complications myself...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Right now though, the idea is that you should pick a good
>>>> number of
>>>>>>>>>>>>>>> partitions to start given your expected data ;) Adding more
>>>> replicas is
>>>>>>>>>>>>>>> trivial though.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <
>>>> jej2003@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Another question, is there any support for repartitioning of
>>>> the index
>>>>>>>>>>>>>>>> if a new shard is added?  What is the recommended approach for
>>>>>>>>>>>>>>>> handling this?  It seemed that the hashing algorithm (and
>>>> probably
>>>>>>>>>>>>>>>> any) would require the index to be repartitioned should a new
>>>> shard be
>>>>>>>>>>>>>>>> added.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <
>>>> jej2003@gmail.com> wrote:
>>>>>>>>>>>>>>>>> Thanks I will try this first thing in the morning.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <
>>>> markrmiller@gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <
>>>> jej2003@gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I am currently looking at the latest solrcloud branch and
>>>> was
>>>>>>>>>>>>>>>>>>> wondering if there was any documentation on configuring the
>>>>>>>>>>>>>>>>>>> DistributedUpdateProcessor?  What specifically in
>>>> solrconfig.xml needs
>>>>>>>>>>>>>>>>>>> to be added/modified to make distributed indexing work?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Jaime - take a look at solrconfig-distrib-update.xml in
>>>>>>>>>>>>>>>>>> solr/core/src/test-files
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> You need to enable the update log, add an empty replication
>>>> handler def,
>>>>>>>>>>>>>>>>>> and an update chain with
>>>> solr.DistributedUpdateProcessFactory in it.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> http://www.lucidimagination.com
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> http://www.lucidimagination.com
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Mark Miller
>>>>>>>>>>>>> lucidimagination.com
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> - Mark Miller
>>>>>>>>>>> lucidimagination.com
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>> - Mark Miller
>>>>>>>> lucidimagination.com
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> - Mark Miller
>>>>>> lucidimagination.com
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> - Mark
>>>
>>> http://www.lucidimagination.com
>>>
>
> - Mark Miller
> lucidimagination.com
>
>
>
>
>
>
>
>
>
>
>
>

Re: Configuring the Distributed

Posted by Mark Miller <ma...@gmail.com>.
They are unused params, so removing them wouldn't help anything.

You might just want to wait till we are further along before playing with it.

Or if you submit your full self-contained test, I can see what's going on (e.g. it's still unclear if you have started setting numShards?).

I can do a similar set of actions in my tests and it works fine. The only reason I could see things working like this is if it thinks you have one shard - a leader and a replica.

- Mark

On Dec 2, 2011, at 12:41 PM, Jamie Johnson wrote:

> Glad to hear I don't need to set shards/self, but removing them didn't
> seem to change what I'm seeing.  Doing this still results in 2
> documents 1 on 8983 and 1 on 7574.
> 
> String key = "1";
> 
> 		SolrInputDocument solrDoc = new SolrInputDocument();
> 		solrDoc.setField("key", key);
> 
> 		solrDoc.addField("content_mvtxt", "initial value");
> 
> 		SolrServer server = servers.get("http://localhost:8983/solr/collection1");
> 
> 		UpdateRequest ureq = new UpdateRequest();
> 		ureq.setParam("update.chain", "distrib-update-chain");
> 		ureq.add(solrDoc);
> 		ureq.setAction(ACTION.COMMIT, true, true);
> 		server.request(ureq);
> 		server.commit();
> 
> 		solrDoc = new SolrInputDocument();
> 		solrDoc.addField("key", key);
> 		solrDoc.addField("content_mvtxt", "updated value");
> 
> 		server = servers.get("http://localhost:7574/solr/collection1");
> 
> 		ureq = new UpdateRequest();
> 		ureq.setParam("update.chain", "distrib-update-chain");
> 		ureq.add(solrDoc);
> 		ureq.setAction(ACTION.COMMIT, true, true);
> 		server.request(ureq);
> 		server.commit();
> 
> 		server = servers.get("http://localhost:8983/solr/collection1");
> 	
> 
> 		server.commit();
> 		System.out.println("done");
> 
> On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller <ma...@gmail.com> wrote:
>> So I dunno. You are running a zk server and running in zk mode right?
>> 
>> You don't need to / shouldn't set a shards or self param. The shards are
>> figured out from Zookeeper.
>> 
>> You always want to use the distrib-update-chain. Eventually it will
>> probably be part of the default chain and auto turn in zk mode.
>> 
>> If you are running in zk mode attached to a zk server, this should work no
>> problem. You can add docs to any server and they will be forwarded to the
>> correct shard leader and then versioned and forwarded to replicas.
>> 
>> You can also use the CloudSolrServer solrj client - that way you don't even
>> have to choose a server to send docs too - in which case if it went down
>> you would have to choose another manually - CloudSolrServer automatically
>> finds one that is up through ZooKeeper. Eventually it will also be smart
>> and do the hashing itself so that it can send directly to the shard leader
>> that the doc would be forwarded to anyway.
>> 
>> - Mark
>> 
>> On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson <je...@gmail.com> wrote:
>> 
>>> Really just trying to do a simple add and update test, the chain
>>> missing is just proof of my not understanding exactly how this is
>>> supposed to work.  I modified the code to this
>>> 
>>>                String key = "1";
>>> 
>>>                SolrInputDocument solrDoc = new SolrInputDocument();
>>>                solrDoc.setField("key", key);
>>> 
>>>                 solrDoc.addField("content_mvtxt", "initial value");
>>> 
>>>                SolrServer server = servers
>>>                                .get("
>>> http://localhost:8983/solr/collection1");
>>> 
>>>                 UpdateRequest ureq = new UpdateRequest();
>>>                ureq.setParam("update.chain", "distrib-update-chain");
>>>                ureq.add(solrDoc);
>>>                ureq.setParam("shards",
>>> 
>>>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>>>                ureq.setParam("self", "foo");
>>>                ureq.setAction(ACTION.COMMIT, true, true);
>>>                server.request(ureq);
>>>                 server.commit();
>>> 
>>>                solrDoc = new SolrInputDocument();
>>>                solrDoc.addField("key", key);
>>>                 solrDoc.addField("content_mvtxt", "updated value");
>>> 
>>>                server = servers.get("
>>> http://localhost:7574/solr/collection1");
>>> 
>>>                 ureq = new UpdateRequest();
>>>                ureq.setParam("update.chain", "distrib-update-chain");
>>>                 // ureq.deleteById("8060a9eb-9546-43ee-95bb-d18ea26a6285");
>>>                 ureq.add(solrDoc);
>>>                ureq.setParam("shards",
>>> 
>>>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>>>                ureq.setParam("self", "foo");
>>>                ureq.setAction(ACTION.COMMIT, true, true);
>>>                server.request(ureq);
>>>                 // server.add(solrDoc);
>>>                server.commit();
>>>                server = servers.get("
>>> http://localhost:8983/solr/collection1");
>>> 
>>> 
>>>                server.commit();
>>>                System.out.println("done");
>>> 
>>> but I'm still seeing the doc appear on both shards.    After the first
>>> commit I see the doc on 8983 with "initial value".  after the second
>>> commit I see the updated value on 7574 and the old on 8983.  After the
>>> final commit the doc on 8983 gets updated.
>>> 
>>> Is there something wrong with my test?
>>> 
>>> On Thu, Dec 1, 2011 at 11:17 PM, Mark Miller <ma...@gmail.com>
>>> wrote:
>>>> Getting late - didn't really pay attention to your code I guess - why
>>> are you adding the first doc without specifying the distrib update chain?
>>> This is not really supported. It's going to just go to the server you
>>> specified - even with everything setup right, the update might then go to
>>> that same server or the other one depending on how it hashes. You really
>>> want to just always use the distrib update chain.  I guess I don't yet
>>> understand what you are trying to test.
>>>> 
>>>> Sent from my iPad
>>>> 
>>>> On Dec 1, 2011, at 10:57 PM, Mark Miller <ma...@gmail.com> wrote:
>>>> 
>>>>> Not sure offhand - but things will be funky if you don't specify the
>>> correct numShards.
>>>>> 
>>>>> The instance to shard assignment should be using numShards to assign.
>>> But then the hash to shard mapping actually goes on the number of shards it
>>> finds registered in ZK (it doesn't have to, but really these should be
>>> equal).
>>>>> 
>>>>> So basically you are saying, I want 3 partitions, but you are only
>>> starting up 2 nodes, and the code is just not happy about that I'd guess.
>>> For the system to work properly, you have to fire up at least as many
>>> servers as numShards.
>>>>> 
>>>>> What are you trying to do? 2 partitions with no replicas, or one
>>> partition with one replica?
>>>>> 
>>>>> In either case, I think you will have better luck if you fire up at
>>> least as many servers as the numShards setting. Or lower the numShards
>>> setting.
>>>>> 
>>>>> This is all a work in progress by the way - what you are trying to test
>>> should work if things are setup right though.
>>>>> 
>>>>> - Mark
>>>>> 
>>>>> 
>>>>> On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote:
>>>>> 
>>>>>> Thanks for the quick response.  With that change (have not done
>>>>>> numShards yet) shard1 got updated.  But now when executing the
>>>>>> following queries I get information back from both, which doesn't seem
>>>>>> right
>>>>>> 
>>>>>> http://localhost:7574/solr/select/?q=*:*
>>>>>> <doc><str name="key">1</str><str name="content_mvtxt">updated
>>> value</str></doc>
>>>>>> 
>>>>>> http://localhost:8983/solr/select?q=*:*
>>>>>> <doc><str name="key">1</str><str name="content_mvtxt">updated
>>> value</str></doc>
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller <ma...@gmail.com>
>>> wrote:
>>>>>>> Hmm...sorry bout that - so my first guess is that right now we are
>>> not distributing a commit (easy to add, just have not done it).
>>>>>>> 
>>>>>>> Right now I explicitly commit on each server for tests.
>>>>>>> 
>>>>>>> Can you try explicitly committing on server1 after updating the doc
>>> on server 2?
>>>>>>> 
>>>>>>> I can start distributing commits tomorrow - been meaning to do it for
>>> my own convenience anyhow.
>>>>>>> 
>>>>>>> Also, you want to pass the sys property numShards=1 on startup. I
>>> think it defaults to 3. That will give you one leader and one replica.
>>>>>>> 
>>>>>>> - Mark
>>>>>>> 
>>>>>>> On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:
>>>>>>> 
>>>>>>>> So I couldn't resist, I attempted to do this tonight, I used the
>>>>>>>> solrconfig you mentioned (as is, no modifications), I setup a 2 shard
>>>>>>>> cluster in collection1, I sent 1 doc to 1 of the shards, updated it
>>>>>>>> and sent the update to the other.  I don't see the modifications
>>>>>>>> though I only see the original document.  The following is the test
>>>>>>>> 
>>>>>>>> public void update() throws Exception {
>>>>>>>> 
>>>>>>>>              String key = "1";
>>>>>>>> 
>>>>>>>>              SolrInputDocument solrDoc = new SolrInputDocument();
>>>>>>>>              solrDoc.setField("key", key);
>>>>>>>> 
>>>>>>>>              solrDoc.addField("content", "initial value");
>>>>>>>> 
>>>>>>>>              SolrServer server = servers
>>>>>>>>                              .get("
>>> http://localhost:8983/solr/collection1");
>>>>>>>>              server.add(solrDoc);
>>>>>>>> 
>>>>>>>>              server.commit();
>>>>>>>> 
>>>>>>>>              solrDoc = new SolrInputDocument();
>>>>>>>>              solrDoc.addField("key", key);
>>>>>>>>              solrDoc.addField("content", "updated value");
>>>>>>>> 
>>>>>>>>              server = servers.get("
>>> http://localhost:7574/solr/collection1");
>>>>>>>> 
>>>>>>>>              UpdateRequest ureq = new UpdateRequest();
>>>>>>>>              ureq.setParam("update.chain", "distrib-update-chain");
>>>>>>>>              ureq.add(solrDoc);
>>>>>>>>              ureq.setParam("shards",
>>>>>>>> 
>>>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>>>>>>>>              ureq.setParam("self", "foo");
>>>>>>>>              ureq.setAction(ACTION.COMMIT, true, true);
>>>>>>>>              server.request(ureq);
>>>>>>>>              System.out.println("done");
>>>>>>>>      }
>>>>>>>> 
>>>>>>>> key is my unique field in schema.xml
>>>>>>>> 
>>>>>>>> What am I doing wrong?
>>>>>>>> 
>>>>>>>> On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson <je...@gmail.com>
>>> wrote:
>>>>>>>>> Yes, the ZK method seems much more flexible.  Adding a new shard
>>> would
>>>>>>>>> be simply updating the range assignments in ZK.  Where is this
>>>>>>>>> currently on the list of things to accomplish?  I don't have time to
>>>>>>>>> work on this now, but if you (or anyone) could provide direction I'd
>>>>>>>>> be willing to work on this when I had spare time.  I guess a JIRA
>>>>>>>>> detailing where/how to do this could help.  Not sure if the design
>>> has
>>>>>>>>> been thought out that far though.
>>>>>>>>> 
>>>>>>>>> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <ma...@gmail.com>
>>> wrote:
>>>>>>>>>> Right now lets say you have one shard - everything there hashes to
>>> range X.
>>>>>>>>>> 
>>>>>>>>>> Now you want to split that shard with an Index Splitter.
>>>>>>>>>> 
>>>>>>>>>> You divide range X in two - giving you two ranges - then you start
>>> splitting. This is where the current Splitter needs a little modification.
>>> You decide which doc should go into which new index by rehashing each doc
>>> id in the index you are splitting - if its hash is greater than X/2, it
>>> goes into index1 - if its less, index2. I think there are a couple current
>>> Splitter impls, but one of them does something like, give me an id - now if
>>> the id's in the index are above that id, goto index1, if below, index2. We
>>> need to instead do a quick hash rather than simple id compare.
>>>>>>>>>> 
>>>>>>>>>> Why do you need to do this on every shard?
>>>>>>>>>> 
>>>>>>>>>> The other part we need that we dont have is to store hash range
>>> assignments in zookeeper - we don't do that yet because it's not needed
>>> yet. Instead we currently just simply calculate that on the fly (too often
>>> at the moment - on every request :) I intend to fix that of course).
>>>>>>>>>> 
>>>>>>>>>> At the start, zk would say, for range X, goto this shard. After
>>> the split, it would say, for range less than X/2 goto the old node, for
>>> range greater than X/2 goto the new node.
>>>>>>>>>> 
>>>>>>>>>> - Mark
>>>>>>>>>> 
>>>>>>>>>> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
>>>>>>>>>> 
>>>>>>>>>>> hmmm.....This doesn't sound like the hashing algorithm that's on
>>> the
>>>>>>>>>>> branch, right?  The algorithm you're mentioning sounds like there
>>> is
>>>>>>>>>>> some logic which is able to tell that a particular range should be
>>>>>>>>>>> distributed between 2 shards instead of 1.  So seems like a trade
>>> off
>>>>>>>>>>> between repartitioning the entire index (on every shard) and
>>> having a
>>>>>>>>>>> custom hashing algorithm which is able to handle the situation
>>> where 2
>>>>>>>>>>> or more shards map to a particular range.
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <
>>> markrmiller@gmail.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> I am not familiar with the index splitter that is in contrib,
>>> but I'll
>>>>>>>>>>>>> take a look at it soon.  So the process sounds like it would be
>>> to run
>>>>>>>>>>>>> this on all of the current shards indexes based on the hash
>>> algorithm.
>>>>>>>>>>>> 
>>>>>>>>>>>> Not something I've thought deeply about myself yet, but I think
>>> the idea would be to split as many as you felt you needed to.
>>>>>>>>>>>> 
>>>>>>>>>>>> If you wanted to keep the full balance always, this would mean
>>> splitting every shard at once, yes. But this depends on how many boxes
>>> (partitions) you are willing/able to add at a time.
>>>>>>>>>>>> 
>>>>>>>>>>>> You might just split one index to start - now it's hash range
>>> would be handled by two shards instead of one (if you have 3 replicas per
>>> shard, this would mean adding 3 more boxes). When you needed to expand
>>> again, you would split another index that was still handling its full
>>> starting range. As you grow, once you split every original index, you'd
>>> start again, splitting one of the now half ranges.
>>>>>>>>>>>> 
>>>>>>>>>>>>> Is there also an index merger in contrib which could be used to
>>> merge
>>>>>>>>>>>>> indexes?  I'm assuming this would be the process?
>>>>>>>>>>>> 
>>>>>>>>>>>> You can merge with IndexWriter.addIndexes (Solr also has an
>>> admin command that can do this). But I'm not sure where this fits in?
>>>>>>>>>>>> 
>>>>>>>>>>>> - Mark
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <
>>> markrmiller@gmail.com> wrote:
>>>>>>>>>>>>>> Not yet - we don't plan on working on this until a lot of
>>> other stuff is
>>>>>>>>>>>>>> working solid at this point. But someone else could jump in!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> There are a couple ways to go about it that I know of:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> A more long term solution may be to start using micro shards -
>>> each index
>>>>>>>>>>>>>> starts as multiple indexes. This makes it pretty fast to move
>>> mirco shards
>>>>>>>>>>>>>> around as you decide to change partitions. It's also less
>>> flexible as you
>>>>>>>>>>>>>> are limited by the number of micro shards you start with.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> A more simple and likely first step is to use an index
>>> splitter . We
>>>>>>>>>>>>>> already have one in lucene contrib - we would just need to
>>> modify it so
>>>>>>>>>>>>>> that it splits based on the hash of the document id. This is
>>> super
>>>>>>>>>>>>>> flexible, but splitting will obviously take a little while on
>>> a huge index.
>>>>>>>>>>>>>> The current index splitter is a multi pass splitter - good
>>> enough to start
>>>>>>>>>>>>>> with, but most files under codec control these days, we may be
>>> able to make
>>>>>>>>>>>>>> a single pass splitter soon as well.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Eventually you could imagine using both options - micro shards
>>> that could
>>>>>>>>>>>>>> also be split as needed. Though I still wonder if micro shards
>>> will be
>>>>>>>>>>>>>> worth the extra complications myself...
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Right now though, the idea is that you should pick a good
>>> number of
>>>>>>>>>>>>>> partitions to start given your expected data ;) Adding more
>>> replicas is
>>>>>>>>>>>>>> trivial though.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <
>>> jej2003@gmail.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Another question, is there any support for repartitioning of
>>> the index
>>>>>>>>>>>>>>> if a new shard is added?  What is the recommended approach for
>>>>>>>>>>>>>>> handling this?  It seemed that the hashing algorithm (and
>>> probably
>>>>>>>>>>>>>>> any) would require the index to be repartitioned should a new
>>> shard be
>>>>>>>>>>>>>>> added.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <
>>> jej2003@gmail.com> wrote:
>>>>>>>>>>>>>>>> Thanks I will try this first thing in the morning.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <
>>> markrmiller@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <
>>> jej2003@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I am currently looking at the latest solrcloud branch and
>>> was
>>>>>>>>>>>>>>>>>> wondering if there was any documentation on configuring the
>>>>>>>>>>>>>>>>>> DistributedUpdateProcessor?  What specifically in
>>> solrconfig.xml needs
>>>>>>>>>>>>>>>>>> to be added/modified to make distributed indexing work?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi Jamie - take a look at solrconfig-distrib-update.xml in
>>>>>>>>>>>>>>>>> solr/core/src/test-files
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> You need to enable the update log, add an empty replication
>>> handler def,
>>>>>>>>>>>>>>>>> and an update chain with
>>> solr.DistributedUpdateProcessFactory in it.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> http://www.lucidimagination.com
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> http://www.lucidimagination.com
>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> - Mark Miller
>>>>>>>>>>>> lucidimagination.com
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> - Mark Miller
>>>>>>>>>> lucidimagination.com
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> - Mark Miller
>>>>>>> lucidimagination.com
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> - Mark Miller
>>>>> lucidimagination.com
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>> 
>> 
>> --
>> - Mark
>> 
>> http://www.lucidimagination.com
>> 

- Mark Miller
lucidimagination.com












Re: Configuring the Distributed

Posted by Jamie Johnson <je...@gmail.com>.
Glad to hear I don't need to set shards/self, but removing them didn't
seem to change what I'm seeing.  Doing this still results in 2
documents: 1 on 8983 and 1 on 7574.

String key = "1";

		SolrInputDocument solrDoc = new SolrInputDocument();
		solrDoc.setField("key", key);

		solrDoc.addField("content_mvtxt", "initial value");

		SolrServer server = servers.get("http://localhost:8983/solr/collection1");

		UpdateRequest ureq = new UpdateRequest();
		ureq.setParam("update.chain", "distrib-update-chain");
		ureq.add(solrDoc);
		ureq.setAction(ACTION.COMMIT, true, true);
		server.request(ureq);
		server.commit();

		solrDoc = new SolrInputDocument();
		solrDoc.addField("key", key);
		solrDoc.addField("content_mvtxt", "updated value");

		server = servers.get("http://localhost:7574/solr/collection1");

		ureq = new UpdateRequest();
		ureq.setParam("update.chain", "distrib-update-chain");
		ureq.add(solrDoc);
		ureq.setAction(ACTION.COMMIT, true, true);
		server.request(ureq);
		server.commit();

		server = servers.get("http://localhost:8983/solr/collection1");
	

		server.commit();
		System.out.println("done");

On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller <ma...@gmail.com> wrote:
> So I dunno. You are running a zk server and running in zk mode right?
>
> You don't need to / shouldn't set a shards or self param. The shards are
> figured out from Zookeeper.
>
> You always want to use the distrib-update-chain. Eventually it will
> probably be part of the default chain and auto turn on in zk mode.
>
> If you are running in zk mode attached to a zk server, this should work no
> problem. You can add docs to any server and they will be forwarded to the
> correct shard leader and then versioned and forwarded to replicas.
>
> You can also use the CloudSolrServer solrj client - that way you don't even
> have to choose a server to send docs to - in which case if it went down
> you would have to choose another manually - CloudSolrServer automatically
> finds one that is up through ZooKeeper. Eventually it will also be smart
> and do the hashing itself so that it can send directly to the shard leader
> that the doc would be forwarded to anyway.
>
> - Mark
>
> On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson <je...@gmail.com> wrote:
>
>> Really just trying to do a simple add and update test, the chain
>> missing is just proof of my not understanding exactly how this is
>> supposed to work.  I modified the code to this
>>
>>                String key = "1";
>>
>>                SolrInputDocument solrDoc = new SolrInputDocument();
>>                solrDoc.setField("key", key);
>>
>>                 solrDoc.addField("content_mvtxt", "initial value");
>>
>>                SolrServer server = servers
>>                                .get("
>> http://localhost:8983/solr/collection1");
>>
>>                 UpdateRequest ureq = new UpdateRequest();
>>                ureq.setParam("update.chain", "distrib-update-chain");
>>                ureq.add(solrDoc);
>>                ureq.setParam("shards",
>>
>>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>>                ureq.setParam("self", "foo");
>>                ureq.setAction(ACTION.COMMIT, true, true);
>>                server.request(ureq);
>>                 server.commit();
>>
>>                solrDoc = new SolrInputDocument();
>>                solrDoc.addField("key", key);
>>                 solrDoc.addField("content_mvtxt", "updated value");
>>
>>                server = servers.get("
>> http://localhost:7574/solr/collection1");
>>
>>                 ureq = new UpdateRequest();
>>                ureq.setParam("update.chain", "distrib-update-chain");
>>                 // ureq.deleteById("8060a9eb-9546-43ee-95bb-d18ea26a6285");
>>                 ureq.add(solrDoc);
>>                ureq.setParam("shards",
>>
>>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>>                ureq.setParam("self", "foo");
>>                ureq.setAction(ACTION.COMMIT, true, true);
>>                server.request(ureq);
>>                 // server.add(solrDoc);
>>                server.commit();
>>                server = servers.get("
>> http://localhost:8983/solr/collection1");
>>
>>
>>                server.commit();
>>                System.out.println("done");
>>
>> but I'm still seeing the doc appear on both shards.    After the first
>> commit I see the doc on 8983 with "initial value".  After the second
>> commit I see the updated value on 7574 and the old on 8983.  After the
>> final commit the doc on 8983 gets updated.
>>
>> Is there something wrong with my test?
>>
>> On Thu, Dec 1, 2011 at 11:17 PM, Mark Miller <ma...@gmail.com>
>> wrote:
>> > Getting late - didn't really pay attention to your code I guess - why
>> are you adding the first doc without specifying the distrib update chain?
>> This is not really supported. It's going to just go to the server you
>> specified - even with everything setup right, the update might then go to
>> that same server or the other one depending on how it hashes. You really
>> want to just always use the distrib update chain.  I guess I don't yet
>> understand what you are trying to test.
>> >
>> > Sent from my iPad
>> >
>> > On Dec 1, 2011, at 10:57 PM, Mark Miller <ma...@gmail.com> wrote:
>> >
>> >> Not sure offhand - but things will be funky if you don't specify the
>> correct numShards.
>> >>
>> >> The instance to shard assignment should be using numShards to assign.
>> But then the hash to shard mapping is actually based on the number of shards it
>> finds registered in ZK (it doesn't have to, but really these should be
>> equal).
>> >>
>> >> So basically you are saying, I want 3 partitions, but you are only
>> starting up 2 nodes, and the code is just not happy about that I'd guess.
>> For the system to work properly, you have to fire up at least as many
>> servers as numShards.
>> >>
>> >> What are you trying to do? 2 partitions with no replicas, or one
>> partition with one replica?
>> >>
>> >> In either case, I think you will have better luck if you fire up at
>> least as many servers as the numShards setting. Or lower the numShards
>> setting.
>> >>
>> >> This is all a work in progress by the way - what you are trying to test
>> should work if things are setup right though.
>> >>
>> >> - Mark
>> >>
>> >>
>> >> On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote:
>> >>
>> >>> Thanks for the quick response.  With that change (have not done
>> >>> numShards yet) shard1 got updated.  But now when executing the
>> >>> following queries I get information back from both, which doesn't seem
>> >>> right
>> >>>
>> >>> http://localhost:7574/solr/select/?q=*:*
>> >>> <doc><str name="key">1</str><str name="content_mvtxt">updated
>> value</str></doc>
>> >>>
>> >>> http://localhost:8983/solr/select?q=*:*
>> >>> <doc><str name="key">1</str><str name="content_mvtxt">updated
>> value</str></doc>
>> >>>
>> >>>
>> >>>
>> >>> On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller <ma...@gmail.com>
>> wrote:
>> >>>> Hmm... sorry about that - so my first guess is that right now we are
>> not distributing a commit (easy to add, just have not done it).
>> >>>>
>> >>>> Right now I explicitly commit on each server for tests.
>> >>>>
>> >>>> Can you try explicitly committing on server1 after updating the doc
>> on server 2?
>> >>>>
>> >>>> I can start distributing commits tomorrow - been meaning to do it for
>> my own convenience anyhow.
>> >>>>
>> >>>> Also, you want to pass the sys property numShards=1 on startup. I
>> think it defaults to 3. That will give you one leader and one replica.
>> >>>>
>> >>>> - Mark
>> >>>>
>> >>>> On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:
>> >>>>
>> >>>>> So I couldn't resist, I attempted to do this tonight, I used the
>> >>>>> solrconfig you mentioned (as is, no modifications), I setup a 2 shard
>> >>>>> cluster in collection1, I sent 1 doc to 1 of the shards, updated it
>> >>>>> and sent the update to the other.  I don't see the modifications
>> >>>>> though I only see the original document.  The following is the test
>> >>>>>
>> >>>>> public void update() throws Exception {
>> >>>>>
>> >>>>>              String key = "1";
>> >>>>>
>> >>>>>              SolrInputDocument solrDoc = new SolrInputDocument();
>> >>>>>              solrDoc.setField("key", key);
>> >>>>>
>> >>>>>              solrDoc.addField("content", "initial value");
>> >>>>>
>> >>>>>              SolrServer server = servers
>> >>>>>                              .get("
>> http://localhost:8983/solr/collection1");
>> >>>>>              server.add(solrDoc);
>> >>>>>
>> >>>>>              server.commit();
>> >>>>>
>> >>>>>              solrDoc = new SolrInputDocument();
>> >>>>>              solrDoc.addField("key", key);
>> >>>>>              solrDoc.addField("content", "updated value");
>> >>>>>
>> >>>>>              server = servers.get("
>> http://localhost:7574/solr/collection1");
>> >>>>>
>> >>>>>              UpdateRequest ureq = new UpdateRequest();
>> >>>>>              ureq.setParam("update.chain", "distrib-update-chain");
>> >>>>>              ureq.add(solrDoc);
>> >>>>>              ureq.setParam("shards",
>> >>>>>
>>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>> >>>>>              ureq.setParam("self", "foo");
>> >>>>>              ureq.setAction(ACTION.COMMIT, true, true);
>> >>>>>              server.request(ureq);
>> >>>>>              System.out.println("done");
>> >>>>>      }
>> >>>>>
>> >>>>> key is my unique field in schema.xml
>> >>>>>
>> >>>>> What am I doing wrong?
>> >>>>>
>> >>>>> On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson <je...@gmail.com>
>> wrote:
>> >>>>>> Yes, the ZK method seems much more flexible.  Adding a new shard
>> would
>> >>>>>> be simply updating the range assignments in ZK.  Where is this
>> >>>>>> currently on the list of things to accomplish?  I don't have time to
>> >>>>>> work on this now, but if you (or anyone) could provide direction I'd
>> >>>>>> be willing to work on this when I had spare time.  I guess a JIRA
>> >>>>>> detailing where/how to do this could help.  Not sure if the design
>> has
>> >>>>>> been thought out that far though.
>> >>>>>>
>> >>>>>> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <ma...@gmail.com>
>> wrote:
>> >>>>>>> Right now let's say you have one shard - everything there hashes to
>> range X.
>> >>>>>>>
>> >>>>>>> Now you want to split that shard with an Index Splitter.
>> >>>>>>>
>> >>>>>>> You divide range X in two - giving you two ranges - then you start
>> splitting. This is where the current Splitter needs a little modification.
>> You decide which doc should go into which new index by rehashing each doc
>> id in the index you are splitting - if its hash is greater than X/2, it
>> goes into index1 - if its less, index2. I think there are a couple current
>> Splitter impls, but one of them does something like, give me an id - now if
>> the id's in the index are above that id, goto index1, if below, index2. We
>> need to instead do a quick hash rather than simple id compare.
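
A rough sketch of that per-document decision, assuming a hash(id) helper that matches whatever hash the distribution code uses and a precomputed range for the shard being split (this is not the contrib splitter's actual API):

    // Illustrative only: route each doc to the lower or upper half of a shard's
    // hash range by hashing its id instead of comparing raw ids.
    public class SplitByHashSketch {

        // Stand-in for the real hash function used at distribution time.
        static int hash(String docId) {
            return docId.hashCode();
        }

        // True -> doc belongs to the new (upper-half) index, false -> it stays put.
        static boolean goesToUpperHalf(String docId, int rangeMin, int rangeMax) {
            long mid = ((long) rangeMin + (long) rangeMax) / 2; // the "X/2" described above
            return hash(docId) > mid;
        }

        public static void main(String[] args) {
            // Splitting a shard that currently owns the whole 32-bit hash space:
            System.out.println(goesToUpperHalf("doc-42", Integer.MIN_VALUE, Integer.MAX_VALUE));
        }
    }
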
>> >>>>>>>
>> >>>>>>> Why do you need to do this on every shard?
>> >>>>>>>
>> >>>>>>> The other part we need that we don't have is to store hash range
>> assignments in zookeeper - we don't do that yet because it's not needed
>> yet. Instead we currently just simply calculate that on the fly (too often
>> at the moment - on every request :) I intend to fix that of course).
>> >>>>>>>
>> >>>>>>> At the start, zk would say, for range X, goto this shard. After
>> the split, it would say, for range less than X/2 goto the old node, for
>> range greater than X/2 goto the new node.
>> >>>>>>>
>> >>>>>>> - Mark
>> >>>>>>>
>> >>>>>>> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
>> >>>>>>>
>> >>>>>>>> hmmm.....This doesn't sound like the hashing algorithm that's on
>> the
>> >>>>>>>> branch, right?  The algorithm you're mentioning sounds like there
>> is
>> >>>>>>>> some logic which is able to tell that a particular range should be
>> >>>>>>>> distributed between 2 shards instead of 1.  So seems like a trade
>> off
>> >>>>>>>> between repartitioning the entire index (on every shard) and
>> having a
>> >>>>>>>> custom hashing algorithm which is able to handle the situation
>> where 2
>> >>>>>>>> or more shards map to a particular range.
>> >>>>>>>>
>> >>>>>>>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <
>> markrmiller@gmail.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>> >>>>>>>>>
>> >>>>>>>>>> I am not familiar with the index splitter that is in contrib,
>> but I'll
>> >>>>>>>>>> take a look at it soon.  So the process sounds like it would be
>> to run
>> >>>>>>>>>> this on all of the current shards indexes based on the hash
>> algorithm.
>> >>>>>>>>>
>> >>>>>>>>> Not something I've thought deeply about myself yet, but I think
>> the idea would be to split as many as you felt you needed to.
>> >>>>>>>>>
>> >>>>>>>>> If you wanted to keep the full balance always, this would mean
>> splitting every shard at once, yes. But this depends on how many boxes
>> (partitions) you are willing/able to add at a time.
>> >>>>>>>>>
>> >>>>>>>>> You might just split one index to start - now its hash range
>> would be handled by two shards instead of one (if you have 3 replicas per
>> shard, this would mean adding 3 more boxes). When you needed to expand
>> again, you would split another index that was still handling its full
>> starting range. As you grow, once you split every original index, you'd
>> start again, splitting one of the now half ranges.
>> >>>>>>>>>
>> >>>>>>>>>> Is there also an index merger in contrib which could be used to
>> merge
>> >>>>>>>>>> indexes?  I'm assuming this would be the process?
>> >>>>>>>>>
>> >>>>>>>>> You can merge with IndexWriter.addIndexes (Solr also has an
>> admin command that can do this). But I'm not sure where this fits in?
>> >>>>>>>>>
>> >>>>>>>>> - Mark
>> >>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <
>> markrmiller@gmail.com> wrote:
>> >>>>>>>>>>> Not yet - we don't plan on working on this until a lot of
>> other stuff is
>> >>>>>>>>>>> working solid at this point. But someone else could jump in!
>> >>>>>>>>>>>
>> >>>>>>>>>>> There are a couple ways to go about it that I know of:
>> >>>>>>>>>>>
>> >>>>>>>>>>> A more long term solution may be to start using micro shards -
>> each index
>> >>>>>>>>>>> starts as multiple indexes. This makes it pretty fast to move
>> micro shards
>> >>>>>>>>>>> around as you decide to change partitions. It's also less
>> flexible as you
>> >>>>>>>>>>> are limited by the number of micro shards you start with.
>> >>>>>>>>>>>
>> >>>>>>>>>>> A more simple and likely first step is to use an index
>> splitter . We
>> >>>>>>>>>>> already have one in lucene contrib - we would just need to
>> modify it so
>> >>>>>>>>>>> that it splits based on the hash of the document id. This is
>> super
>> >>>>>>>>>>> flexible, but splitting will obviously take a little while on
>> a huge index.
>> >>>>>>>>>>> The current index splitter is a multi pass splitter - good
>> enough to start
>> >>>>>>>>>>> with, but most files under codec control these days, we may be
>> able to make
>> >>>>>>>>>>> a single pass splitter soon as well.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Eventually you could imagine using both options - micro shards
>> that could
>> >>>>>>>>>>> also be split as needed. Though I still wonder if micro shards
>> will be
>> >>>>>>>>>>> worth the extra complications myself...
>> >>>>>>>>>>>
>> >>>>>>>>>>> Right now though, the idea is that you should pick a good
>> number of
>> >>>>>>>>>>> partitions to start given your expected data ;) Adding more
>> replicas is
>> >>>>>>>>>>> trivial though.
>> >>>>>>>>>>>
>> >>>>>>>>>>> - Mark
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <
>> jej2003@gmail.com> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> Another question, is there any support for repartitioning of
>> the index
>> >>>>>>>>>>>> if a new shard is added?  What is the recommended approach for
>> >>>>>>>>>>>> handling this?  It seemed that the hashing algorithm (and
>> probably
>> >>>>>>>>>>>> any) would require the index to be repartitioned should a new
>> shard be
>> >>>>>>>>>>>> added.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <
>> jej2003@gmail.com> wrote:
>> >>>>>>>>>>>>> Thanks I will try this first thing in the morning.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <
>> markrmiller@gmail.com>
>> >>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <
>> jej2003@gmail.com>
>> >>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> I am currently looking at the latest solrcloud branch and
>> was
>> >>>>>>>>>>>>>>> wondering if there was any documentation on configuring the
>> >>>>>>>>>>>>>>> DistributedUpdateProcessor?  What specifically in
>> solrconfig.xml needs
>> >>>>>>>>>>>>>>> to be added/modified to make distributed indexing work?
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Hi Jamie - take a look at solrconfig-distrib-update.xml in
>> >>>>>>>>>>>>>> solr/core/src/test-files
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> You need to enable the update log, add an empty replication
>> handler def,
>> >>>>>>>>>>>>>> and an update chain with
>> solr.DistributedUpdateProcessFactory in it.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> --
>> >>>>>>>>>>>>>> - Mark
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> http://www.lucidimagination.com
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> --
>> >>>>>>>>>>> - Mark
>> >>>>>>>>>>>
>> >>>>>>>>>>> http://www.lucidimagination.com
>> >>>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> - Mark Miller
>> >>>>>>>>> lucidimagination.com
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>
>> >>>>>>> - Mark Miller
>> >>>>>>> lucidimagination.com
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>
>> >>>> - Mark Miller
>> >>>> lucidimagination.com
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>
>> >> - Mark Miller
>> >> lucidimagination.com
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >
>>
>
>
>
> --
> - Mark
>
> http://www.lucidimagination.com
>

Re: Configuring the Distributed

Posted by Mark Miller <ma...@gmail.com>.
So I dunno. You are running a zk server and running in zk mode right?

You don't need to / shouldn't set a shards or self param. The shards are
figured out from Zookeeper.

You always want to use the distrib-update-chain. Eventually it will
probably be part of the default chain and auto turn on in zk mode.

If you are running in zk mode attached to a zk server, this should work no
problem. You can add docs to any server and they will be forwarded to the
correct shard leader and then versioned and forwarded to replicas.

You can also use the CloudSolrServer solrj client - that way you don't even
have to choose a server to send docs to - in which case if it went down
you would have to choose another manually - CloudSolrServer automatically
finds one that is up through ZooKeeper. Eventually it will also be smart
and do the hashing itself so that it can send directly to the shard leader
that the doc would be forwarded to anyway.
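
A minimal sketch of what that client usage might look like, assuming the branch's CloudSolrServer takes the ZooKeeper address in its constructor and lets you set a default collection (method names may differ on the branch, so treat this as illustrative only):

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest.ACTION;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;

    // Sketch only: index through ZooKeeper instead of pointing at one Solr node.
    public class CloudClientSketch {
        public static void main(String[] args) throws Exception {
            // 9983 is the embedded ZooKeeper port when Solr is started with -DzkRun on 8983.
            CloudSolrServer server = new CloudSolrServer("localhost:9983");
            server.setDefaultCollection("collection1"); // assumption: may not exist yet on the branch

            SolrInputDocument doc = new SolrInputDocument();
            doc.setField("key", "1");
            doc.addField("content_mvtxt", "initial value");

            UpdateRequest ureq = new UpdateRequest();
            ureq.setParam("update.chain", "distrib-update-chain");
            ureq.add(doc);
            ureq.setAction(ACTION.COMMIT, true, true);
            server.request(ureq); // forwarded to the right shard leader, then on to replicas
        }
    }
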

- Mark

On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson <je...@gmail.com> wrote:

> Really just trying to do a simple add and update test, the chain
> missing is just proof of my not understanding exactly how this is
> supposed to work.  I modified the code to this
>
>                String key = "1";
>
>                SolrInputDocument solrDoc = new SolrInputDocument();
>                solrDoc.setField("key", key);
>
>                 solrDoc.addField("content_mvtxt", "initial value");
>
>                SolrServer server = servers
>                                .get("
> http://localhost:8983/solr/collection1");
>
>                 UpdateRequest ureq = new UpdateRequest();
>                ureq.setParam("update.chain", "distrib-update-chain");
>                ureq.add(solrDoc);
>                ureq.setParam("shards",
>
>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>                ureq.setParam("self", "foo");
>                ureq.setAction(ACTION.COMMIT, true, true);
>                server.request(ureq);
>                 server.commit();
>
>                solrDoc = new SolrInputDocument();
>                solrDoc.addField("key", key);
>                 solrDoc.addField("content_mvtxt", "updated value");
>
>                server = servers.get("
> http://localhost:7574/solr/collection1");
>
>                 ureq = new UpdateRequest();
>                ureq.setParam("update.chain", "distrib-update-chain");
>                 // ureq.deleteById("8060a9eb-9546-43ee-95bb-d18ea26a6285");
>                 ureq.add(solrDoc);
>                ureq.setParam("shards",
>
>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>                ureq.setParam("self", "foo");
>                ureq.setAction(ACTION.COMMIT, true, true);
>                server.request(ureq);
>                 // server.add(solrDoc);
>                server.commit();
>                server = servers.get("
> http://localhost:8983/solr/collection1");
>
>
>                server.commit();
>                System.out.println("done");
>
> but I'm still seeing the doc appear on both shards.    After the first
> commit I see the doc on 8983 with "initial value".  After the second
> commit I see the updated value on 7574 and the old on 8983.  After the
> final commit the doc on 8983 gets updated.
>
> Is there something wrong with my test?
>
> On Thu, Dec 1, 2011 at 11:17 PM, Mark Miller <ma...@gmail.com>
> wrote:
> > Getting late - didn't really pay attention to your code I guess - why
> are you adding the first doc without specifying the distrib update chain?
> This is not really supported. It's going to just go to the server you
> specified - even with everything setup right, the update might then go to
> that same server or the other one depending on how it hashes. You really
> want to just always use the distrib update chain.  I guess I don't yet
> understand what you are trying to test.
> >
> > Sent from my iPad
> >
> > On Dec 1, 2011, at 10:57 PM, Mark Miller <ma...@gmail.com> wrote:
> >
> >> Not sure offhand - but things will be funky if you don't specify the
> correct numShards.
> >>
> >> The instance to shard assignment should be using numShards to assign.
> But then the hash to shard mapping is actually based on the number of shards it
> finds registered in ZK (it doesn't have to, but really these should be
> equal).
> >>
> >> So basically you are saying, I want 3 partitions, but you are only
> starting up 2 nodes, and the code is just not happy about that I'd guess.
> For the system to work properly, you have to fire up at least as many
> servers as numShards.
> >>
> >> What are you trying to do? 2 partitions with no replicas, or one
> partition with one replica?
> >>
> >> In either case, I think you will have better luck if you fire up at
> least as many servers as the numShards setting. Or lower the numShards
> setting.
> >>
> >> This is all a work in progress by the way - what you are trying to test
> should work if things are setup right though.
> >>
> >> - Mark
> >>
> >>
> >> On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote:
> >>
> >>> Thanks for the quick response.  With that change (have not done
> >>> numShards yet) shard1 got updated.  But now when executing the
> >>> following queries I get information back from both, which doesn't seem
> >>> right
> >>>
> >>> http://localhost:7574/solr/select/?q=*:*
> >>> <doc><str name="key">1</str><str name="content_mvtxt">updated
> value</str></doc>
> >>>
> >>> http://localhost:8983/solr/select?q=*:*
> >>> <doc><str name="key">1</str><str name="content_mvtxt">updated
> value</str></doc>
> >>>
> >>>
> >>>
> >>> On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller <ma...@gmail.com>
> wrote:
> >>>> Hmm... sorry about that - so my first guess is that right now we are
> not distributing a commit (easy to add, just have not done it).
> >>>>
> >>>> Right now I explicitly commit on each server for tests.
> >>>>
> >>>> Can you try explicitly committing on server1 after updating the doc
> on server 2?
> >>>>
> >>>> I can start distributing commits tomorrow - been meaning to do it for
> my own convenience anyhow.
> >>>>
> >>>> Also, you want to pass the sys property numShards=1 on startup. I
> think it defaults to 3. That will give you one leader and one replica.
> >>>>
> >>>> - Mark
> >>>>
> >>>> On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:
> >>>>
> >>>>> So I couldn't resist, I attempted to do this tonight, I used the
> >>>>> solrconfig you mentioned (as is, no modifications), I setup a 2 shard
> >>>>> cluster in collection1, I sent 1 doc to 1 of the shards, updated it
> >>>>> and sent the update to the other.  I don't see the modifications
> >>>>> though I only see the original document.  The following is the test
> >>>>>
> >>>>> public void update() throws Exception {
> >>>>>
> >>>>>              String key = "1";
> >>>>>
> >>>>>              SolrInputDocument solrDoc = new SolrInputDocument();
> >>>>>              solrDoc.setField("key", key);
> >>>>>
> >>>>>              solrDoc.addField("content", "initial value");
> >>>>>
> >>>>>              SolrServer server = servers
> >>>>>                              .get("
> http://localhost:8983/solr/collection1");
> >>>>>              server.add(solrDoc);
> >>>>>
> >>>>>              server.commit();
> >>>>>
> >>>>>              solrDoc = new SolrInputDocument();
> >>>>>              solrDoc.addField("key", key);
> >>>>>              solrDoc.addField("content", "updated value");
> >>>>>
> >>>>>              server = servers.get("
> http://localhost:7574/solr/collection1");
> >>>>>
> >>>>>              UpdateRequest ureq = new UpdateRequest();
> >>>>>              ureq.setParam("update.chain", "distrib-update-chain");
> >>>>>              ureq.add(solrDoc);
> >>>>>              ureq.setParam("shards",
> >>>>>
>  "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
> >>>>>              ureq.setParam("self", "foo");
> >>>>>              ureq.setAction(ACTION.COMMIT, true, true);
> >>>>>              server.request(ureq);
> >>>>>              System.out.println("done");
> >>>>>      }
> >>>>>
> >>>>> key is my unique field in schema.xml
> >>>>>
> >>>>> What am I doing wrong?
> >>>>>
> >>>>> On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson <je...@gmail.com>
> wrote:
> >>>>>> Yes, the ZK method seems much more flexible.  Adding a new shard
> would
> >>>>>> be simply updating the range assignments in ZK.  Where is this
> >>>>>> currently on the list of things to accomplish?  I don't have time to
> >>>>>> work on this now, but if you (or anyone) could provide direction I'd
> >>>>>> be willing to work on this when I had spare time.  I guess a JIRA
> >>>>>> detailing where/how to do this could help.  Not sure if the design
> has
> >>>>>> been thought out that far though.
> >>>>>>
> >>>>>> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <ma...@gmail.com>
> wrote:
> >>>>>>> Right now let's say you have one shard - everything there hashes to
> range X.
> >>>>>>>
> >>>>>>> Now you want to split that shard with an Index Splitter.
> >>>>>>>
> >>>>>>> You divide range X in two - giving you two ranges - then you start
> splitting. This is where the current Splitter needs a little modification.
> You decide which doc should go into which new index by rehashing each doc
> id in the index you are splitting - if its hash is greater than X/2, it
> goes into index1 - if its less, index2. I think there are a couple current
> Splitter impls, but one of them does something like, give me an id - now if
> the id's in the index are above that id, goto index1, if below, index2. We
> need to instead do a quick hash rather than simple id compare.
> >>>>>>>
> >>>>>>> Why do you need to do this on every shard?
> >>>>>>>
> >>>>>>> The other part we need that we don't have is to store hash range
> assignments in zookeeper - we don't do that yet because it's not needed
> yet. Instead we currently just simply calculate that on the fly (too often
> at the moment - on every request :) I intend to fix that of course).
> >>>>>>>
> >>>>>>> At the start, zk would say, for range X, goto this shard. After
> the split, it would say, for range less than X/2 goto the old node, for
> range greater than X/2 goto the new node.
> >>>>>>>
> >>>>>>> - Mark
> >>>>>>>
> >>>>>>> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
> >>>>>>>
> >>>>>>>> hmmm.....This doesn't sound like the hashing algorithm that's on
> the
> >>>>>>>> branch, right?  The algorithm you're mentioning sounds like there
> is
> >>>>>>>> some logic which is able to tell that a particular range should be
> >>>>>>>> distributed between 2 shards instead of 1.  So seems like a trade
> off
> >>>>>>>> between repartitioning the entire index (on every shard) and
> having a
> >>>>>>>> custom hashing algorithm which is able to handle the situation
> where 2
> >>>>>>>> or more shards map to a particular range.
> >>>>>>>>
> >>>>>>>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <
> markrmiller@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
> >>>>>>>>>
> >>>>>>>>>> I am not familiar with the index splitter that is in contrib,
> but I'll
> >>>>>>>>>> take a look at it soon.  So the process sounds like it would be
> to run
> >>>>>>>>>> this on all of the current shards indexes based on the hash
> algorithm.
> >>>>>>>>>
> >>>>>>>>> Not something I've thought deeply about myself yet, but I think
> the idea would be to split as many as you felt you needed to.
> >>>>>>>>>
> >>>>>>>>> If you wanted to keep the full balance always, this would mean
> splitting every shard at once, yes. But this depends on how many boxes
> (partitions) you are willing/able to add at a time.
> >>>>>>>>>
> >>>>>>>>> You might just split one index to start - now its hash range
> would be handled by two shards instead of one (if you have 3 replicas per
> shard, this would mean adding 3 more boxes). When you needed to expand
> again, you would split another index that was still handling its full
> starting range. As you grow, once you split every original index, you'd
> start again, splitting one of the now half ranges.
> >>>>>>>>>
> >>>>>>>>>> Is there also an index merger in contrib which could be used to
> merge
> >>>>>>>>>> indexes?  I'm assuming this would be the process?
> >>>>>>>>>
> >>>>>>>>> You can merge with IndexWriter.addIndexes (Solr also has an
> admin command that can do this). But I'm not sure where this fits in?
> >>>>>>>>>
> >>>>>>>>> - Mark
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <
> markrmiller@gmail.com> wrote:
> >>>>>>>>>>> Not yet - we don't plan on working on this until a lot of
> other stuff is
> >>>>>>>>>>> working solid at this point. But someone else could jump in!
> >>>>>>>>>>>
> >>>>>>>>>>> There are a couple ways to go about it that I know of:
> >>>>>>>>>>>
> >>>>>>>>>>> A more long term solution may be to start using micro shards -
> each index
> >>>>>>>>>>> starts as multiple indexes. This makes it pretty fast to move
> micro shards
> >>>>>>>>>>> around as you decide to change partitions. It's also less
> flexible as you
> >>>>>>>>>>> are limited by the number of micro shards you start with.
> >>>>>>>>>>>
> >>>>>>>>>>> A more simple and likely first step is to use an index
> splitter . We
> >>>>>>>>>>> already have one in lucene contrib - we would just need to
> modify it so
> >>>>>>>>>>> that it splits based on the hash of the document id. This is
> super
> >>>>>>>>>>> flexible, but splitting will obviously take a little while on
> a huge index.
> >>>>>>>>>>> The current index splitter is a multi pass splitter - good
> enough to start
> >>>>>>>>>>> with, but most files under codec control these days, we may be
> able to make
> >>>>>>>>>>> a single pass splitter soon as well.
> >>>>>>>>>>>
> >>>>>>>>>>> Eventually you could imagine using both options - micro shards
> that could
> >>>>>>>>>>> also be split as needed. Though I still wonder if micro shards
> will be
> >>>>>>>>>>> worth the extra complications myself...
> >>>>>>>>>>>
> >>>>>>>>>>> Right now though, the idea is that you should pick a good
> number of
> >>>>>>>>>>> partitions to start given your expected data ;) Adding more
> replicas is
> >>>>>>>>>>> trivial though.
> >>>>>>>>>>>
> >>>>>>>>>>> - Mark
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <
> jej2003@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Another question, is there any support for repartitioning of
> the index
> >>>>>>>>>>>> if a new shard is added?  What is the recommended approach for
> >>>>>>>>>>>> handling this?  It seemed that the hashing algorithm (and
> probably
> >>>>>>>>>>>> any) would require the index to be repartitioned should a new
> shard be
> >>>>>>>>>>>> added.
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <
> jej2003@gmail.com> wrote:
> >>>>>>>>>>>>> Thanks I will try this first thing in the morning.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <
> markrmiller@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <
> jej2003@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I am currently looking at the latest solrcloud branch and
> was
> >>>>>>>>>>>>>>> wondering if there was any documentation on configuring the
> >>>>>>>>>>>>>>> DistributedUpdateProcessor?  What specifically in
> solrconfig.xml needs
> >>>>>>>>>>>>>>> to be added/modified to make distributed indexing work?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Jamie - take a look at solrconfig-distrib-update.xml in
> >>>>>>>>>>>>>> solr/core/src/test-files
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> You need to enable the update log, add an empty replication
> handler def,
> >>>>>>>>>>>>>> and an update chain with
> solr.DistributedUpdateProcessFactory in it.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> --
> >>>>>>>>>>>>>> - Mark
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> http://www.lucidimagination.com
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> - Mark
> >>>>>>>>>>>
> >>>>>>>>>>> http://www.lucidimagination.com
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> - Mark Miller
> >>>>>>>>> lucidimagination.com
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>> - Mark Miller
> >>>>>>> lucidimagination.com
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>>> - Mark Miller
> >>>> lucidimagination.com
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>
> >> - Mark Miller
> >> lucidimagination.com
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >
>



-- 
- Mark

http://www.lucidimagination.com

Re: Configuring the Distributed

Posted by Jamie Johnson <je...@gmail.com>.
Really just trying to do a simple add and update test, the chain
missing is just proof of my not understanding exactly how this is
supposed to work.  I modified the code to this

		String key = "1";

		SolrInputDocument solrDoc = new SolrInputDocument();
		solrDoc.setField("key", key);

		solrDoc.addField("content_mvtxt", "initial value");

		SolrServer server = servers
				.get("http://localhost:8983/solr/collection1");

		UpdateRequest ureq = new UpdateRequest();
		ureq.setParam("update.chain", "distrib-update-chain");
		ureq.add(solrDoc);
		ureq.setParam("shards",
				"localhost:8983/solr/collection1,localhost:7574/solr/collection1");
		ureq.setParam("self", "foo");
		ureq.setAction(ACTION.COMMIT, true, true);
		server.request(ureq);
		server.commit();

		solrDoc = new SolrInputDocument();
		solrDoc.addField("key", key);
		solrDoc.addField("content_mvtxt", "updated value");

		server = servers.get("http://localhost:7574/solr/collection1");

		ureq = new UpdateRequest();
		ureq.setParam("update.chain", "distrib-update-chain");
		// ureq.deleteById("8060a9eb-9546-43ee-95bb-d18ea26a6285");
		ureq.add(solrDoc);
		ureq.setParam("shards",
				"localhost:8983/solr/collection1,localhost:7574/solr/collection1");
		ureq.setParam("self", "foo");
		ureq.setAction(ACTION.COMMIT, true, true);
		server.request(ureq);
		// server.add(solrDoc);
		server.commit();
		server = servers.get("http://localhost:8983/solr/collection1");
	

		server.commit();
		System.out.println("done");

but I'm still seeing the doc appear on both shards.    After the first
commit I see the doc on 8983 with "initial value".  After the second
commit I see the updated value on 7574 and the old on 8983.  After the
final commit the doc on 8983 gets updated.

Is there something wrong with my test?

On Thu, Dec 1, 2011 at 11:17 PM, Mark Miller <ma...@gmail.com> wrote:
> Getting late - didn't really pay attention to your code I guess - why are you adding the first doc without specifying the distrib update chain? This is not really supported. It's going to just go to the server you specified - even with everything setup right, the update might then go to that same server or the other one depending on how it hashes. You really want to just always use the distrib update chain.  I guess I don't yet understand what you are trying to test.
>
> Sent from my iPad
>
> On Dec 1, 2011, at 10:57 PM, Mark Miller <ma...@gmail.com> wrote:
>
>> Not sure offhand - but things will be funky if you don't specify the correct numShards.
>>
>> The instance to shard assignment should be using numShards to assign. But then the hash to shard mapping is actually based on the number of shards it finds registered in ZK (it doesn't have to, but really these should be equal).
>>
>> So basically you are saying, I want 3 partitions, but you are only starting up 2 nodes, and the code is just not happy about that I'd guess. For the system to work properly, you have to fire up at least as many servers as numShards.
>>
>> What are you trying to do? 2 partitions with no replicas, or one partition with one replica?
>>
>> In either case, I think you will have better luck if you fire up at least as many servers as the numShards setting. Or lower the numShards setting.
>>
>> This is all a work in progress by the way - what you are trying to test should work if things are setup right though.
>>
>> - Mark
>>
>>
>> On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote:
>>
>>> Thanks for the quick response.  With that change (have not done
>>> numShards yet) shard1 got updated.  But now when executing the
>>> following queries I get information back from both, which doesn't seem
>>> right
>>>
>>> http://localhost:7574/solr/select/?q=*:*
>>> <doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>
>>>
>>> http://localhost:8983/solr/select?q=*:*
>>> <doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>
>>>
>>>
>>>
>>> On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller <ma...@gmail.com> wrote:
>>>> Hmm... sorry about that - so my first guess is that right now we are not distributing a commit (easy to add, just have not done it).
>>>>
>>>> Right now I explicitly commit on each server for tests.
>>>>
>>>> Can you try explicitly committing on server1 after updating the doc on server 2?
>>>>
>>>> I can start distributing commits tomorrow - been meaning to do it for my own convenience anyhow.
>>>>
>>>> Also, you want to pass the sys property numShards=1 on startup. I think it defaults to 3. That will give you one leader and one replica.
>>>>
>>>> - Mark
>>>>
>>>> On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:
>>>>
>>>>> So I couldn't resist, I attempted to do this tonight, I used the
>>>>> solrconfig you mentioned (as is, no modifications), I setup a 2 shard
>>>>> cluster in collection1, I sent 1 doc to 1 of the shards, updated it
>>>>> and sent the update to the other.  I don't see the modifications
>>>>> though I only see the original document.  The following is the test
>>>>>
>>>>> public void update() throws Exception {
>>>>>
>>>>>              String key = "1";
>>>>>
>>>>>              SolrInputDocument solrDoc = new SolrInputDocument();
>>>>>              solrDoc.setField("key", key);
>>>>>
>>>>>              solrDoc.addField("content", "initial value");
>>>>>
>>>>>              SolrServer server = servers
>>>>>                              .get("http://localhost:8983/solr/collection1");
>>>>>              server.add(solrDoc);
>>>>>
>>>>>              server.commit();
>>>>>
>>>>>              solrDoc = new SolrInputDocument();
>>>>>              solrDoc.addField("key", key);
>>>>>              solrDoc.addField("content", "updated value");
>>>>>
>>>>>              server = servers.get("http://localhost:7574/solr/collection1");
>>>>>
>>>>>              UpdateRequest ureq = new UpdateRequest();
>>>>>              ureq.setParam("update.chain", "distrib-update-chain");
>>>>>              ureq.add(solrDoc);
>>>>>              ureq.setParam("shards",
>>>>>                              "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>>>>>              ureq.setParam("self", "foo");
>>>>>              ureq.setAction(ACTION.COMMIT, true, true);
>>>>>              server.request(ureq);
>>>>>              System.out.println("done");
>>>>>      }
>>>>>
>>>>> key is my unique field in schema.xml
>>>>>
>>>>> What am I doing wrong?
>>>>>
>>>>> On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>>> Yes, the ZK method seems much more flexible.  Adding a new shard would
>>>>>> be simply updating the range assignments in ZK.  Where is this
>>>>>> currently on the list of things to accomplish?  I don't have time to
>>>>>> work on this now, but if you (or anyone) could provide direction I'd
>>>>>> be willing to work on this when I had spare time.  I guess a JIRA
>>>>>> detailing where/how to do this could help.  Not sure if the design has
>>>>>> been thought out that far though.
>>>>>>
>>>>>> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <ma...@gmail.com> wrote:
>>>>>>> Right now let's say you have one shard - everything there hashes to range X.
>>>>>>>
>>>>>>> Now you want to split that shard with an Index Splitter.
>>>>>>>
>>>>>>> You divide range X in two - giving you two ranges - then you start splitting. This is where the current Splitter needs a little modification. You decide which doc should go into which new index by rehashing each doc id in the index you are splitting - if its hash is greater than X/2, it goes into index1 - if its less, index2. I think there are a couple current Splitter impls, but one of them does something like, give me an id - now if the id's in the index are above that id, goto index1, if below, index2. We need to instead do a quick hash rather than simple id compare.
>>>>>>>
>>>>>>> Why do you need to do this on every shard?
>>>>>>>
>>>>>>> The other part we need that we don't have is to store hash range assignments in zookeeper - we don't do that yet because it's not needed yet. Instead we currently just simply calculate that on the fly (too often at the moment - on every request :) I intend to fix that of course).
>>>>>>>
>>>>>>> At the start, zk would say, for range X, goto this shard. After the split, it would say, for range less than X/2 goto the old node, for range greater than X/2 goto the new node.
>>>>>>>
>>>>>>> - Mark
>>>>>>>
>>>>>>> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
>>>>>>>
>>>>>>>> hmmm.....This doesn't sound like the hashing algorithm that's on the
>>>>>>>> branch, right?  The algorithm you're mentioning sounds like there is
>>>>>>>> some logic which is able to tell that a particular range should be
>>>>>>>> distributed between 2 shards instead of 1.  So seems like a trade off
>>>>>>>> between repartitioning the entire index (on every shard) and having a
>>>>>>>> custom hashing algorithm which is able to handle the situation where 2
>>>>>>>> or more shards map to a particular range.
>>>>>>>>
>>>>>>>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <ma...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>>>>>>>>>
>>>>>>>>>> I am not familiar with the index splitter that is in contrib, but I'll
>>>>>>>>>> take a look at it soon.  So the process sounds like it would be to run
>>>>>>>>>> this on all of the current shards indexes based on the hash algorithm.
>>>>>>>>>
>>>>>>>>> Not something I've thought deeply about myself yet, but I think the idea would be to split as many as you felt you needed to.
>>>>>>>>>
>>>>>>>>> If you wanted to keep the full balance always, this would mean splitting every shard at once, yes. But this depends on how many boxes (partitions) you are willing/able to add at a time.
>>>>>>>>>
>>>>>>>>> You might just split one index to start - now its hash range would be handled by two shards instead of one (if you have 3 replicas per shard, this would mean adding 3 more boxes). When you needed to expand again, you would split another index that was still handling its full starting range. As you grow, once you split every original index, you'd start again, splitting one of the now half ranges.
>>>>>>>>>
>>>>>>>>>> Is there also an index merger in contrib which could be used to merge
>>>>>>>>>> indexes?  I'm assuming this would be the process?
>>>>>>>>>
>>>>>>>>> You can merge with IndexWriter.addIndexes (Solr also has an admin command that can do this). But I'm not sure where this fits in?
>>>>>>>>>
>>>>>>>>> - Mark
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <ma...@gmail.com> wrote:
>>>>>>>>>>> Not yet - we don't plan on working on this until a lot of other stuff is
>>>>>>>>>>> working solid at this point. But someone else could jump in!
>>>>>>>>>>>
>>>>>>>>>>> There are a couple ways to go about it that I know of:
>>>>>>>>>>>
>>>>>>>>>>> A more long term solution may be to start using micro shards - each index
>>>>>>>>>>> starts as multiple indexes. This makes it pretty fast to move micro shards
>>>>>>>>>>> around as you decide to change partitions. It's also less flexible as you
>>>>>>>>>>> are limited by the number of micro shards you start with.
>>>>>>>>>>>
>>>>>>>>>>> A more simple and likely first step is to use an index splitter . We
>>>>>>>>>>> already have one in lucene contrib - we would just need to modify it so
>>>>>>>>>>> that it splits based on the hash of the document id. This is super
>>>>>>>>>>> flexible, but splitting will obviously take a little while on a huge index.
>>>>>>>>>>> The current index splitter is a multi pass splitter - good enough to start
>>>>>>>>>>> with, but with most files under codec control these days, we may be able to make
>>>>>>>>>>> a single pass splitter soon as well.
>>>>>>>>>>>
>>>>>>>>>>> Eventually you could imagine using both options - micro shards that could
>>>>>>>>>>> also be split as needed. Though I still wonder if micro shards will be
>>>>>>>>>>> worth the extra complications myself...
>>>>>>>>>>>
>>>>>>>>>>> Right now though, the idea is that you should pick a good number of
>>>>>>>>>>> partitions to start given your expected data ;) Adding more replicas is
>>>>>>>>>>> trivial though.
>>>>>>>>>>>
>>>>>>>>>>> - Mark
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Another question, is there any support for repartitioning of the index
>>>>>>>>>>>> if a new shard is added?  What is the recommended approach for
>>>>>>>>>>>> handling this?  It seemed that the hashing algorithm (and probably
>>>>>>>>>>>> any) would require the index to be repartitioned should a new shard be
>>>>>>>>>>>> added.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>>>>>>>>>> Thanks I will try this first thing in the morning.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <ma...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <je...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am currently looking at the latest solrcloud branch and was
>>>>>>>>>>>>>>> wondering if there was any documentation on configuring the
>>>>>>>>>>>>>>> DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
>>>>>>>>>>>>>>> to be added/modified to make distributed indexing work?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Jamie - take a look at solrconfig-distrib-update.xml in
>>>>>>>>>>>>>> solr/core/src/test-files
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You need to enable the update log, add an empty replication handler def,
>>>>>>>>>>>>>> and an update chain with solr.DistributedUpdateProcessFactory in it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://www.lucidimagination.com
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> - Mark
>>>>>>>>>>>
>>>>>>>>>>> http://www.lucidimagination.com
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> - Mark Miller
>>>>>>>>> lucidimagination.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>> - Mark Miller
>>>>>>> lucidimagination.com
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>> - Mark Miller
>>>> lucidimagination.com
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>
>> - Mark Miller
>> lucidimagination.com
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>

Re: Configuring the Distributed

Posted by Mark Miller <ma...@gmail.com>.
Getting late - didn't really pay attention to your code I guess - why are you adding the first doc without specifying the distrib update chain? This is not really supported. It's going to just go to the server you specified - even with everything setup right, the update might then go to that same server or the other one depending on how it hashes. You really want to just always use the distrib update chain.  I guess I don't yet understand what you are trying to test. 

Sent from my iPad

On Dec 1, 2011, at 10:57 PM, Mark Miller <ma...@gmail.com> wrote:

> Not sure offhand - but things will be funky if you don't specify the correct numShards.
> 
> The instance to shard assignment should be using numShards to assign. But then the hash to shard mapping is actually based on the number of shards it finds registered in ZK (it doesn't have to, but really these should be equal).
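
A toy illustration of that mapping - cutting the 32-bit hash space into numShards equal slices (the branch's real hashing code may differ):

    // Illustrative only: which slice of the hash space a document's hash falls into.
    public class HashToShardSketch {

        static int shardFor(int hash, int numShards) {
            long offset = (long) hash - Integer.MIN_VALUE;     // 0 .. 2^32 - 1
            long sliceWidth = (1L << 32) / numShards;          // width of each shard's range
            int slice = (int) (offset / sliceWidth);
            return Math.min(slice, numShards - 1);             // last slice absorbs any remainder
        }

        public static void main(String[] args) {
            System.out.println(shardFor("somedoc".hashCode(), 3)); // prints 0, 1 or 2
        }
    }
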
> 
> So basically you are saying, I want 3 partitions, but you are only starting up 2 nodes, and the code is just not happy about that I'd guess. For the system to work properly, you have to fire up at least as many servers as numShards.
> 
> What are you trying to do? 2 partitions with no replicas, or one partition with one replica?
> 
> In either case, I think you will have better luck if you fire up at least as many servers as the numShards setting. Or lower the numShards setting.
> 
> This is all a work in progress by the way - what you are trying to test should work if things are setup right though.
> 
> - Mark
> 
> 
> On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote:
> 
>> Thanks for the quick response.  With that change (have not done
>> numShards yet) shard1 got updated.  But now when executing the
>> following queries I get information back from both, which doesn't seem
>> right
>> 
>> http://localhost:7574/solr/select/?q=*:*
>> <doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>
>> 
>> http://localhost:8983/solr/select?q=*:*
>> <doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>
>> 
>> 
>> 
>> On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller <ma...@gmail.com> wrote:
>>> Hmm... sorry about that - so my first guess is that right now we are not distributing a commit (easy to add, just have not done it).
>>> 
>>> Right now I explicitly commit on each server for tests.
>>> 
>>> Can you try explicitly committing on server1 after updating the doc on server 2?
>>> 
>>> I can start distributing commits tomorrow - been meaning to do it for my own convenience anyhow.
>>> 
>>> Also, you want to pass the sys property numShards=1 on startup. I think it defaults to 3. That will give you one leader and one replica.
>>> 
>>> - Mark
>>> 
>>> On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:
>>> 
>>>> So I couldn't resist, I attempted to do this tonight, I used the
>>>> solrconfig you mentioned (as is, no modifications), I setup a 2 shard
>>>> cluster in collection1, I sent 1 doc to 1 of the shards, updated it
>>>> and sent the update to the other.  I don't see the modifications
>>>> though I only see the original document.  The following is the test
>>>> 
>>>> public void update() throws Exception {
>>>> 
>>>>              String key = "1";
>>>> 
>>>>              SolrInputDocument solrDoc = new SolrInputDocument();
>>>>              solrDoc.setField("key", key);
>>>> 
>>>>              solrDoc.addField("content", "initial value");
>>>> 
>>>>              SolrServer server = servers
>>>>                              .get("http://localhost:8983/solr/collection1");
>>>>              server.add(solrDoc);
>>>> 
>>>>              server.commit();
>>>> 
>>>>              solrDoc = new SolrInputDocument();
>>>>              solrDoc.addField("key", key);
>>>>              solrDoc.addField("content", "updated value");
>>>> 
>>>>              server = servers.get("http://localhost:7574/solr/collection1");
>>>> 
>>>>              UpdateRequest ureq = new UpdateRequest();
>>>>              ureq.setParam("update.chain", "distrib-update-chain");
>>>>              ureq.add(solrDoc);
>>>>              ureq.setParam("shards",
>>>>                              "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>>>>              ureq.setParam("self", "foo");
>>>>              ureq.setAction(ACTION.COMMIT, true, true);
>>>>              server.request(ureq);
>>>>              System.out.println("done");
>>>>      }
>>>> 
>>>> key is my unique field in schema.xml
>>>> 
>>>> What am I doing wrong?
>>>> 
>>>> On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>> Yes, the ZK method seems much more flexible.  Adding a new shard would
>>>>> be simply updating the range assignments in ZK.  Where is this
>>>>> currently on the list of things to accomplish?  I don't have time to
>>>>> work on this now, but if you (or anyone) could provide direction I'd
>>>>> be willing to work on this when I had spare time.  I guess a JIRA
>>>>> detailing where/how to do this could help.  Not sure if the design has
>>>>> been thought out that far though.
>>>>> 
>>>>> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <ma...@gmail.com> wrote:
>>>>>> Right now let's say you have one shard - everything there hashes to range X.
>>>>>> 
>>>>>> Now you want to split that shard with an Index Splitter.
>>>>>> 
>>>>>> You divide range X in two - giving you two ranges - then you start splitting. This is where the current Splitter needs a little modification. You decide which doc should go into which new index by rehashing each doc id in the index you are splitting - if its hash is greater than X/2, it goes into index1 - if its less, index2. I think there are a couple current Splitter impls, but one of them does something like, give me an id - now if the id's in the index are above that id, goto index1, if below, index2. We need to instead do a quick hash rather than simple id compare.
>>>>>> 
>>>>>> Why do you need to do this on every shard?
>>>>>> 
>>>>>> The other part we need that we don't have is to store hash range assignments in zookeeper - we don't do that yet because it's not needed yet. Instead we currently just simply calculate that on the fly (too often at the moment - on every request :) I intend to fix that of course).
>>>>>> 
>>>>>> At the start, zk would say, for range X, goto this shard. After the split, it would say, for range less than X/2 goto the old node, for range greater than X/2 goto the new node.
>>>>>> 
>>>>>> - Mark
>>>>>> 
>>>>>> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
>>>>>> 
>>>>>>> hmmm.....This doesn't sound like the hashing algorithm that's on the
>>>>>>> branch, right?  The algorithm you're mentioning sounds like there is
>>>>>>> some logic which is able to tell that a particular range should be
>>>>>>> distributed between 2 shards instead of 1.  So seems like a trade off
>>>>>>> between repartitioning the entire index (on every shard) and having a
>>>>>>> custom hashing algorithm which is able to handle the situation where 2
>>>>>>> or more shards map to a particular range.
>>>>>>> 
>>>>>>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <ma...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>>>>>>>> 
>>>>>>>>> I am not familiar with the index splitter that is in contrib, but I'll
>>>>>>>>> take a look at it soon.  So the process sounds like it would be to run
>>>>>>>>> this on all of the current shards indexes based on the hash algorithm.
>>>>>>>> 
>>>>>>>> Not something I've thought deeply about myself yet, but I think the idea would be to split as many as you felt you needed to.
>>>>>>>> 
>>>>>>>> If you wanted to keep the full balance always, this would mean splitting every shard at once, yes. But this depends on how many boxes (partitions) you are willing/able to add at a time.
>>>>>>>> 
>>>>>>>> You might just split one index to start - now its hash range would be handled by two shards instead of one (if you have 3 replicas per shard, this would mean adding 3 more boxes). When you needed to expand again, you would split another index that was still handling its full starting range. As you grow, once you split every original index, you'd start again, splitting one of the now half ranges.
>>>>>>>> 
>>>>>>>>> Is there also an index merger in contrib which could be used to merge
>>>>>>>>> indexes?  I'm assuming this would be the process?
>>>>>>>> 
>>>>>>>> You can merge with IndexWriter.addIndexes (Solr also has an admin command that can do this). But I'm not sure where this fits in?
>>>>>>>> 
>>>>>>>> - Mark
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <ma...@gmail.com> wrote:
>>>>>>>>>> Not yet - we don't plan on working on this until a lot of other stuff is
>>>>>>>>>> working solid at this point. But someone else could jump in!
>>>>>>>>>> 
>>>>>>>>>> There are a couple ways to go about it that I know of:
>>>>>>>>>> 
>>>>>>>>>> A more long term solution may be to start using micro shards - each index
>>>>>>>>>> starts as multiple indexes. This makes it pretty fast to move micro shards
>>>>>>>>>> around as you decide to change partitions. It's also less flexible as you
>>>>>>>>>> are limited by the number of micro shards you start with.
>>>>>>>>>> 
>>>>>>>>>> A more simple and likely first step is to use an index splitter . We
>>>>>>>>>> already have one in lucene contrib - we would just need to modify it so
>>>>>>>>>> that it splits based on the hash of the document id. This is super
>>>>>>>>>> flexible, but splitting will obviously take a little while on a huge index.
>>>>>>>>>> The current index splitter is a multi pass splitter - good enough to start
>>>>>>>>>> with, but most files under codec control these days, we may be able to make
>>>>>>>>>> a single pass splitter soon as well.
>>>>>>>>>> 
>>>>>>>>>> Eventually you could imagine using both options - micro shards that could
>>>>>>>>>> also be split as needed. Though I still wonder if micro shards will be
>>>>>>>>>> worth the extra complications myself...
>>>>>>>>>> 
>>>>>>>>>> Right now though, the idea is that you should pick a good number of
>>>>>>>>>> partitions to start given your expected data ;) Adding more replicas is
>>>>>>>>>> trivial though.
>>>>>>>>>> 
>>>>>>>>>> - Mark
>>>>>>>>>> 
>>>>>>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Another question, is there any support for repartitioning of the index
>>>>>>>>>>> if a new shard is added?  What is the recommended approach for
>>>>>>>>>>> handling this?  It seemed that the hashing algorithm (and probably
>>>>>>>>>>> any) would require the index to be repartitioned should a new shard be
>>>>>>>>>>> added.
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>>>>>>>>> Thanks I will try this first thing in the morning.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <ma...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <je...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I am currently looking at the latest solrcloud branch and was
>>>>>>>>>>>>>> wondering if there was any documentation on configuring the
>>>>>>>>>>>>>> DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
>>>>>>>>>>>>>> to be added/modified to make distributed indexing work?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Jaime - take a look at solrconfig-distrib-update.xml in
>>>>>>>>>>>>> solr/core/src/test-files
>>>>>>>>>>>>> 
>>>>>>>>>>>>> You need to enable the update log, add an empty replication handler def,
>>>>>>>>>>>>> and an update chain with solr.DistributedUpdateProcessFactory in it.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>> 
>>>>>>>>>>>>> http://www.lucidimagination.com
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> - Mark
>>>>>>>>>> 
>>>>>>>>>> http://www.lucidimagination.com
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> - Mark Miller
>>>>>>>> lucidimagination.com
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> - Mark Miller
>>>>>> lucidimagination.com
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>> 
>>> - Mark Miller
>>> lucidimagination.com
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
> 
> - Mark Miller
> lucidimagination.com
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 

Re: Configuring the Distributed

Posted by Mark Miller <ma...@gmail.com>.
Not sure offhand - but things will be funky if you don't specify the correct numShards.

The instance to shard assignment should be using numShards to assign. But then the hash to shard mapping actually goes on the number of shards it finds registered in ZK (it doesn't have to, but really these should be equal).

So basically you are saying, I want 3 partitions, but you are only starting up 2 nodes, and the code is just not happy about that I'd guess. For the system to work properly, you have to fire up at least as many servers as numShards.

What are you trying to do? 2 partitions with no replicas, or one partition with one replica?

In either case, I think you will have better luck if you fire up at least as many servers as the numShards setting. Or lower the numShards setting.

This is all a work in progress by the way - what you are trying to test should work if things are setup right though.

- Mark


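For illustration, a toy sketch of the kind of bucketing being described - this is not the branch's actual hash function or its ZK lookup, just the idea that every doc id lands in one of numShards fixed ranges, so starting fewer nodes than numShards leaves ranges with no shard to receive them:

import java.util.zip.CRC32;

public class HashToShardSketch {

    // toy stand-in for the real hash - the branch's implementation differs
    static long hash(String id) {
        CRC32 crc = new CRC32();
        crc.update(id.getBytes());
        return crc.getValue(); // 0 .. 2^32 - 1
    }

    // map the hash into one of numShards equal ranges
    static int shardFor(String id, int numShards) {
        long rangeSize = (1L << 32) / numShards;
        return (int) Math.min(numShards - 1, hash(id) / rangeSize);
    }

    public static void main(String[] args) {
        // with numShards=3, doc "1" always falls in the same one of three ranges
        System.out.println(shardFor("1", 3));
    }
}
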
On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote:

> Thanks for the quick response.  With that change (have not done
> numShards yet) shard1 got updated.  But now when executing the
> following queries I get information back from both, which doesn't seem
> right
> 
> http://localhost:7574/solr/select/?q=*:*
> <doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>
> 
> http://localhost:8983/solr/select?q=*:*
> <doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>
> 
> 
> 
> On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller <ma...@gmail.com> wrote:
>> Hmm...sorry bout that - so my first guess is that right now we are not distributing a commit (easy to add, just have not done it).
>> 
>> Right now I explicitly commit on each server for tests.
>> 
>> Can you try explicitly committing on server1 after updating the doc on server 2?
>> 
>> I can start distributing commits tomorrow - been meaning to do it for my own convenience anyhow.
>> 
>> Also, you want to pass the sys property numShards=1 on startup. I think it defaults to 3. That will give you one leader and one replica.
>> 
>> - Mark
>> 
>> On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:
>> 
>>> So I couldn't resist, I attempted to do this tonight, I used the
>>> solrconfig you mentioned (as is, no modifications), I setup a 2 shard
>>> cluster in collection1, I sent 1 doc to 1 of the shards, updated it
>>> and sent the update to the other.  I don't see the modifications
>>> though I only see the original document.  The following is the test
>>> 
>>> public void update() throws Exception {
>>> 
>>>               String key = "1";
>>> 
>>>               SolrInputDocument solrDoc = new SolrInputDocument();
>>>               solrDoc.setField("key", key);
>>> 
>>>               solrDoc.addField("content", "initial value");
>>> 
>>>               SolrServer server = servers
>>>                               .get("http://localhost:8983/solr/collection1");
>>>               server.add(solrDoc);
>>> 
>>>               server.commit();
>>> 
>>>               solrDoc = new SolrInputDocument();
>>>               solrDoc.addField("key", key);
>>>               solrDoc.addField("content", "updated value");
>>> 
>>>               server = servers.get("http://localhost:7574/solr/collection1");
>>> 
>>>               UpdateRequest ureq = new UpdateRequest();
>>>               ureq.setParam("update.chain", "distrib-update-chain");
>>>               ureq.add(solrDoc);
>>>               ureq.setParam("shards",
>>>                               "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>>>               ureq.setParam("self", "foo");
>>>               ureq.setAction(ACTION.COMMIT, true, true);
>>>               server.request(ureq);
>>>               System.out.println("done");
>>>       }
>>> 
>>> key is my unique field in schema.xml
>>> 
>>> What am I doing wrong?
>>> 
>>> On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>> Yes, the ZK method seems much more flexible.  Adding a new shard would
>>>> be simply updating the range assignments in ZK.  Where is this
>>>> currently on the list of things to accomplish?  I don't have time to
>>>> work on this now, but if you (or anyone) could provide direction I'd
>>>> be willing to work on this when I had spare time.  I guess a JIRA
>>>> detailing where/how to do this could help.  Not sure if the design has
>>>> been thought out that far though.
>>>> 
>>>> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <ma...@gmail.com> wrote:
>>>>> Right now let's say you have one shard - everything there hashes to range X.
>>>>> 
>>>>> Now you want to split that shard with an Index Splitter.
>>>>> 
>>>>> You divide range X in two - giving you two ranges - then you start splitting. This is where the current Splitter needs a little modification. You decide which doc should go into which new index by rehashing each doc id in the index you are splitting - if its hash is greater than X/2, it goes into index1 - if its less, index2. I think there are a couple current Splitter impls, but one of them does something like, give me an id - now if the id's in the index are above that id, goto index1, if below, index2. We need to instead do a quick hash rather than simple id compare.
>>>>> 
>>>>> Why do you need to do this on every shard?
>>>>> 
>>>>> The other part we need that we don't have is to store hash range assignments in zookeeper - we don't do that yet because it's not needed yet. Instead we currently just simply calculate that on the fly (too often at the moment - on every request :) I intend to fix that of course).
>>>>> 
>>>>> At the start, zk would say, for range X, goto this shard. After the split, it would say, for range less than X/2 goto the old node, for range greater than X/2 goto the new node.
>>>>> 
>>>>> - Mark
>>>>> 
>>>>> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
>>>>> 
>>>>>> hmmm.....This doesn't sound like the hashing algorithm that's on the
>>>>>> branch, right?  The algorithm you're mentioning sounds like there is
>>>>>> some logic which is able to tell that a particular range should be
>>>>>> distributed between 2 shards instead of 1.  So seems like a trade off
>>>>>> between repartitioning the entire index (on every shard) and having a
>>>>>> custom hashing algorithm which is able to handle the situation where 2
>>>>>> or more shards map to a particular range.
>>>>>> 
>>>>>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <ma...@gmail.com> wrote:
>>>>>>> 
>>>>>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>>>>>>> 
>>>>>>>> I am not familiar with the index splitter that is in contrib, but I'll
>>>>>>>> take a look at it soon.  So the process sounds like it would be to run
>>>>>>>> this on all of the current shards indexes based on the hash algorithm.
>>>>>>> 
>>>>>>> Not something I've thought deeply about myself yet, but I think the idea would be to split as many as you felt you needed to.
>>>>>>> 
>>>>>>> If you wanted to keep the full balance always, this would mean splitting every shard at once, yes. But this depends on how many boxes (partitions) you are willing/able to add at a time.
>>>>>>> 
>>>>>>> You might just split one index to start - now its hash range would be handled by two shards instead of one (if you have 3 replicas per shard, this would mean adding 3 more boxes). When you needed to expand again, you would split another index that was still handling its full starting range. As you grow, once you split every original index, you'd start again, splitting one of the now half ranges.
>>>>>>> 
>>>>>>>> Is there also an index merger in contrib which could be used to merge
>>>>>>>> indexes?  I'm assuming this would be the process?
>>>>>>> 
>>>>>>> You can merge with IndexWriter.addIndexes (Solr also has an admin command that can do this). But I'm not sure where this fits in?
>>>>>>> 
>>>>>>> - Mark
>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <ma...@gmail.com> wrote:
>>>>>>>>> Not yet - we don't plan on working on this until a lot of other stuff is
>>>>>>>>> working solid at this point. But someone else could jump in!
>>>>>>>>> 
>>>>>>>>> There are a couple ways to go about it that I know of:
>>>>>>>>> 
>>>>>>>>> A more long term solution may be to start using micro shards - each index
>>>>>>>>> starts as multiple indexes. This makes it pretty fast to move micro shards
>>>>>>>>> around as you decide to change partitions. It's also less flexible as you
>>>>>>>>> are limited by the number of micro shards you start with.
>>>>>>>>> 
>>>>>>>>> A more simple and likely first step is to use an index splitter . We
>>>>>>>>> already have one in lucene contrib - we would just need to modify it so
>>>>>>>>> that it splits based on the hash of the document id. This is super
>>>>>>>>> flexible, but splitting will obviously take a little while on a huge index.
>>>>>>>>> The current index splitter is a multi pass splitter - good enough to start
>>>>>>>>> with, but most files under codec control these days, we may be able to make
>>>>>>>>> a single pass splitter soon as well.
>>>>>>>>> 
>>>>>>>>> Eventually you could imagine using both options - micro shards that could
>>>>>>>>> also be split as needed. Though I still wonder if micro shards will be
>>>>>>>>> worth the extra complications myself...
>>>>>>>>> 
>>>>>>>>> Right now though, the idea is that you should pick a good number of
>>>>>>>>> partitions to start given your expected data ;) Adding more replicas is
>>>>>>>>> trivial though.
>>>>>>>>> 
>>>>>>>>> - Mark
>>>>>>>>> 
>>>>>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Another question, is there any support for repartitioning of the index
>>>>>>>>>> if a new shard is added?  What is the recommended approach for
>>>>>>>>>> handling this?  It seemed that the hashing algorithm (and probably
>>>>>>>>>> any) would require the index to be repartitioned should a new shard be
>>>>>>>>>> added.
>>>>>>>>>> 
>>>>>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>>>>>>>> Thanks I will try this first thing in the morning.
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <ma...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <je...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> I am currently looking at the latest solrcloud branch and was
>>>>>>>>>>>>> wondering if there was any documentation on configuring the
>>>>>>>>>>>>> DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
>>>>>>>>>>>>> to be added/modified to make distributed indexing work?
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi Jaime - take a look at solrconfig-distrib-update.xml in
>>>>>>>>>>>> solr/core/src/test-files
>>>>>>>>>>>> 
>>>>>>>>>>>> You need to enable the update log, add an empty replication handler def,
>>>>>>>>>>>> and an update chain with solr.DistributedUpdateProcessFactory in it.
>>>>>>>>>>>> 
>>>>>>>>>>>> --
>>>>>>>>>>>> - Mark
>>>>>>>>>>>> 
>>>>>>>>>>>> http://www.lucidimagination.com
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> - Mark
>>>>>>>>> 
>>>>>>>>> http://www.lucidimagination.com
>>>>>>>>> 
>>>>>>> 
>>>>>>> - Mark Miller
>>>>>>> lucidimagination.com
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> - Mark Miller
>>>>> lucidimagination.com
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>> 
>> - Mark Miller
>> lucidimagination.com
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 

- Mark Miller
lucidimagination.com












Re: Configuring the Distributed

Posted by Jamie Johnson <je...@gmail.com>.
Thanks for the quick response.  With that change (have not done
numShards yet) shard1 got updated.  But now when executing the
following queries I get information back from both, which doesn't seem
right

http://localhost:7574/solr/select/?q=*:*
<doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>

http://localhost:8983/solr/select?q=*:*
<doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>



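For comparison, a small SolrJ sketch of the two query styles in play here - hitting each core directly shows what that core's local index holds, while a single request with a shards param merges results across both (same core URLs as above; CommonsHttpSolrServer is an assumption about the client of this era, and the snippet is untested):

public void compareQueries() throws Exception {

    SolrServer core1 = new CommonsHttpSolrServer(
            "http://localhost:8983/solr/collection1");
    SolrServer core2 = new CommonsHttpSolrServer(
            "http://localhost:7574/solr/collection1");

    // per-core view: what each local index actually contains
    SolrQuery local = new SolrQuery("*:*");
    System.out.println("8983: " + core1.query(local).getResults().getNumFound());
    System.out.println("7574: " + core2.query(local).getResults().getNumFound());

    // distributed view: one query merged across both shards
    SolrQuery merged = new SolrQuery("*:*");
    merged.set("shards",
            "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
    System.out.println("merged: " + core1.query(merged).getResults().getNumFound());
}
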
On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller <ma...@gmail.com> wrote:
> Hmm...sorry bout that - so my first guess is that right now we are not distributing a commit (easy to add, just have not done it).
>
> Right now I explicitly commit on each server for tests.
>
> Can you try explicitly committing on server1 after updating the doc on server 2?
>
> I can start distributing commits tomorrow - been meaning to do it for my own convenience anyhow.
>
> Also, you want to pass the sys property numShards=1 on startup. I think it defaults to 3. That will give you one leader and one replica.
>
> - Mark
>
> On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:
>
>> So I couldn't resist, I attempted to do this tonight, I used the
>> solrconfig you mentioned (as is, no modifications), I setup a 2 shard
>> cluster in collection1, I sent 1 doc to 1 of the shards, updated it
>> and sent the update to the other.  I don't see the modifications
>> though I only see the original document.  The following is the test
>>
>> public void update() throws Exception {
>>
>>               String key = "1";
>>
>>               SolrInputDocument solrDoc = new SolrInputDocument();
>>               solrDoc.setField("key", key);
>>
>>               solrDoc.addField("content", "initial value");
>>
>>               SolrServer server = servers
>>                               .get("http://localhost:8983/solr/collection1");
>>               server.add(solrDoc);
>>
>>               server.commit();
>>
>>               solrDoc = new SolrInputDocument();
>>               solrDoc.addField("key", key);
>>               solrDoc.addField("content", "updated value");
>>
>>               server = servers.get("http://localhost:7574/solr/collection1");
>>
>>               UpdateRequest ureq = new UpdateRequest();
>>               ureq.setParam("update.chain", "distrib-update-chain");
>>               ureq.add(solrDoc);
>>               ureq.setParam("shards",
>>                               "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>>               ureq.setParam("self", "foo");
>>               ureq.setAction(ACTION.COMMIT, true, true);
>>               server.request(ureq);
>>               System.out.println("done");
>>       }
>>
>> key is my unique field in schema.xml
>>
>> What am I doing wrong?
>>
>> On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson <je...@gmail.com> wrote:
>>> Yes, the ZK method seems much more flexible.  Adding a new shard would
>>> be simply updating the range assignments in ZK.  Where is this
>>> currently on the list of things to accomplish?  I don't have time to
>>> work on this now, but if you (or anyone) could provide direction I'd
>>> be willing to work on this when I had spare time.  I guess a JIRA
>>> detailing where/how to do this could help.  Not sure if the design has
>>> been thought out that far though.
>>>
>>> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <ma...@gmail.com> wrote:
>>>> Right now let's say you have one shard - everything there hashes to range X.
>>>>
>>>> Now you want to split that shard with an Index Splitter.
>>>>
>>>> You divide range X in two - giving you two ranges - then you start splitting. This is where the current Splitter needs a little modification. You decide which doc should go into which new index by rehashing each doc id in the index you are splitting - if its hash is greater than X/2, it goes into index1 - if its less, index2. I think there are a couple current Splitter impls, but one of them does something like, give me an id - now if the id's in the index are above that id, goto index1, if below, index2. We need to instead do a quick hash rather than simple id compare.
>>>>
>>>> Why do you need to do this on every shard?
>>>>
>>>> The other part we need that we don't have is to store hash range assignments in zookeeper - we don't do that yet because it's not needed yet. Instead we currently just simply calculate that on the fly (too often at the moment - on every request :) I intend to fix that of course).
>>>>
>>>> At the start, zk would say, for range X, goto this shard. After the split, it would say, for range less than X/2 goto the old node, for range greater than X/2 goto the new node.
>>>>
>>>> - Mark
>>>>
>>>> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
>>>>
>>>>> hmmm.....This doesn't sound like the hashing algorithm that's on the
>>>>> branch, right?  The algorithm you're mentioning sounds like there is
>>>>> some logic which is able to tell that a particular range should be
>>>>> distributed between 2 shards instead of 1.  So seems like a trade off
>>>>> between repartitioning the entire index (on every shard) and having a
>>>>> custom hashing algorithm which is able to handle the situation where 2
>>>>> or more shards map to a particular range.
>>>>>
>>>>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <ma...@gmail.com> wrote:
>>>>>>
>>>>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>>>>>>
>>>>>>> I am not familiar with the index splitter that is in contrib, but I'll
>>>>>>> take a look at it soon.  So the process sounds like it would be to run
>>>>>>> this on all of the current shards indexes based on the hash algorithm.
>>>>>>
>>>>>> Not something I've thought deeply about myself yet, but I think the idea would be to split as many as you felt you needed to.
>>>>>>
>>>>>> If you wanted to keep the full balance always, this would mean splitting every shard at once, yes. But this depends on how many boxes (partitions) you are willing/able to add at a time.
>>>>>>
>>>>>> You might just split one index to start - now its hash range would be handled by two shards instead of one (if you have 3 replicas per shard, this would mean adding 3 more boxes). When you needed to expand again, you would split another index that was still handling its full starting range. As you grow, once you split every original index, you'd start again, splitting one of the now half ranges.
>>>>>>
>>>>>>> Is there also an index merger in contrib which could be used to merge
>>>>>>> indexes?  I'm assuming this would be the process?
>>>>>>
>>>>>> You can merge with IndexWriter.addIndexes (Solr also has an admin command that can do this). But I'm not sure where this fits in?
>>>>>>
>>>>>> - Mark
>>>>>>
>>>>>>>
>>>>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <ma...@gmail.com> wrote:
>>>>>>>> Not yet - we don't plan on working on this until a lot of other stuff is
>>>>>>>> working solid at this point. But someone else could jump in!
>>>>>>>>
>>>>>>>> There are a couple ways to go about it that I know of:
>>>>>>>>
>>>>>>>> A more long term solution may be to start using micro shards - each index
>>>>>>>> starts as multiple indexes. This makes it pretty fast to move micro shards
>>>>>>>> around as you decide to change partitions. It's also less flexible as you
>>>>>>>> are limited by the number of micro shards you start with.
>>>>>>>>
>>>>>>>> A more simple and likely first step is to use an index splitter . We
>>>>>>>> already have one in lucene contrib - we would just need to modify it so
>>>>>>>> that it splits based on the hash of the document id. This is super
>>>>>>>> flexible, but splitting will obviously take a little while on a huge index.
>>>>>>>> The current index splitter is a multi pass splitter - good enough to start
>>>>>>>> with, but most files under codec control these days, we may be able to make
>>>>>>>> a single pass splitter soon as well.
>>>>>>>>
>>>>>>>> Eventually you could imagine using both options - micro shards that could
>>>>>>>> also be split as needed. Though I still wonder if micro shards will be
>>>>>>>> worth the extra complications myself...
>>>>>>>>
>>>>>>>> Right now though, the idea is that you should pick a good number of
>>>>>>>> partitions to start given your expected data ;) Adding more replicas is
>>>>>>>> trivial though.
>>>>>>>>
>>>>>>>> - Mark
>>>>>>>>
>>>>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Another question, is there any support for repartitioning of the index
>>>>>>>>> if a new shard is added?  What is the recommended approach for
>>>>>>>>> handling this?  It seemed that the hashing algorithm (and probably
>>>>>>>>> any) would require the index to be repartitioned should a new shard be
>>>>>>>>> added.
>>>>>>>>>
>>>>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>>>>>>> Thanks I will try this first thing in the morning.
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <ma...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <je...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I am currently looking at the latest solrcloud branch and was
>>>>>>>>>>>> wondering if there was any documentation on configuring the
>>>>>>>>>>>> DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
>>>>>>>>>>>> to be added/modified to make distributed indexing work?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi Jaime - take a look at solrconfig-distrib-update.xml in
>>>>>>>>>>> solr/core/src/test-files
>>>>>>>>>>>
>>>>>>>>>>> You need to enable the update log, add an empty replication handler def,
>>>>>>>>>>> and an update chain with solr.DistributedUpdateProcessFactory in it.
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> - Mark
>>>>>>>>>>>
>>>>>>>>>>> http://www.lucidimagination.com
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> - Mark
>>>>>>>>
>>>>>>>> http://www.lucidimagination.com
>>>>>>>>
>>>>>>
>>>>>> - Mark Miller
>>>>>> lucidimagination.com
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>> - Mark Miller
>>>> lucidimagination.com
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>
> - Mark Miller
> lucidimagination.com
>
>
>
>
>
>
>
>
>
>
>
>

Re: Configuring the Distributed

Posted by Mark Miller <ma...@gmail.com>.
Hmm...sorry bout that - so my first guess is that right now we are not distributing a commit (easy to add, just have not done it).

Right now I explicitly commit on each server for tests.

Can you try explicitly committing on server1 after updating the doc on server 2?

I can start distributing commits tomorrow - been meaning to do it for my own convenience anyhow.

Also, you want to pass the sys property numShards=1 on startup. I think it defaults to 3. That will give you one leader and one replica.

- Mark

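A small sketch of the workaround being described - send the update through the distrib-update-chain, then commit on each core by hand since commits are not forwarded yet (reuses the servers map from the test quoted below; an untested sketch, not confirmed against the branch):

public void updateThenCommitEachCore() throws Exception {

    SolrInputDocument solrDoc = new SolrInputDocument();
    solrDoc.addField("key", "1");
    solrDoc.addField("content", "updated value");

    UpdateRequest ureq = new UpdateRequest();
    ureq.setParam("update.chain", "distrib-update-chain");
    ureq.add(solrDoc);
    // either node will do - the chain forwards the doc to the shard its key hashes to
    servers.get("http://localhost:7574/solr/collection1").request(ureq);

    // commits are not distributed yet, so commit explicitly on every core
    servers.get("http://localhost:8983/solr/collection1").commit();
    servers.get("http://localhost:7574/solr/collection1").commit();
}
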
On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:

> So I couldn't resist, I attempted to do this tonight, I used the
> solrconfig you mentioned (as is, no modifications), I setup a 2 shard
> cluster in collection1, I sent 1 doc to 1 of the shards, updated it
> and sent the update to the other.  I don't see the modifications
> though I only see the original document.  The following is the test
> 
> public void update() throws Exception {
> 
> 		String key = "1";
> 
> 		SolrInputDocument solrDoc = new SolrInputDocument();
> 		solrDoc.setField("key", key);
> 
> 		solrDoc.addField("content", "initial value");
> 
> 		SolrServer server = servers
> 				.get("http://localhost:8983/solr/collection1");
> 		server.add(solrDoc);
> 
> 		server.commit();
> 
> 		solrDoc = new SolrInputDocument();
> 		solrDoc.addField("key", key);
> 		solrDoc.addField("content", "updated value");
> 
> 		server = servers.get("http://localhost:7574/solr/collection1");
> 
> 		UpdateRequest ureq = new UpdateRequest();
> 		ureq.setParam("update.chain", "distrib-update-chain");
> 		ureq.add(solrDoc);
> 		ureq.setParam("shards",
> 				"localhost:8983/solr/collection1,localhost:7574/solr/collection1");
> 		ureq.setParam("self", "foo");
> 		ureq.setAction(ACTION.COMMIT, true, true);
> 		server.request(ureq);
> 		System.out.println("done");
> 	}
> 
> key is my unique field in schema.xml
> 
> What am I doing wrong?
> 
> On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson <je...@gmail.com> wrote:
>> Yes, the ZK method seems much more flexible.  Adding a new shard would
>> be simply updating the range assignments in ZK.  Where is this
>> currently on the list of things to accomplish?  I don't have time to
>> work on this now, but if you (or anyone) could provide direction I'd
>> be willing to work on this when I had spare time.  I guess a JIRA
>> detailing where/how to do this could help.  Not sure if the design has
>> been thought out that far though.
>> 
>> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <ma...@gmail.com> wrote:
>>> Right now let's say you have one shard - everything there hashes to range X.
>>> 
>>> Now you want to split that shard with an Index Splitter.
>>> 
>>> You divide range X in two - giving you two ranges - then you start splitting. This is where the current Splitter needs a little modification. You decide which doc should go into which new index by rehashing each doc id in the index you are splitting - if its hash is greater than X/2, it goes into index1 - if its less, index2. I think there are a couple current Splitter impls, but one of them does something like, give me an id - now if the id's in the index are above that id, goto index1, if below, index2. We need to instead do a quick hash rather than simple id compare.
>>> 
>>> Why do you need to do this on every shard?
>>> 
>>> The other part we need that we don't have is to store hash range assignments in zookeeper - we don't do that yet because it's not needed yet. Instead we currently just simply calculate that on the fly (too often at the moment - on every request :) I intend to fix that of course).
>>> 
>>> At the start, zk would say, for range X, goto this shard. After the split, it would say, for range less than X/2 goto the old node, for range greater than X/2 goto the new node.
>>> 
>>> - Mark
>>> 
>>> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
>>> 
>>>> hmmm.....This doesn't sound like the hashing algorithm that's on the
>>>> branch, right?  The algorithm you're mentioning sounds like there is
>>>> some logic which is able to tell that a particular range should be
>>>> distributed between 2 shards instead of 1.  So seems like a trade off
>>>> between repartitioning the entire index (on every shard) and having a
>>>> custom hashing algorithm which is able to handle the situation where 2
>>>> or more shards map to a particular range.
>>>> 
>>>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <ma...@gmail.com> wrote:
>>>>> 
>>>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>>>>> 
>>>>>> I am not familiar with the index splitter that is in contrib, but I'll
>>>>>> take a look at it soon.  So the process sounds like it would be to run
>>>>>> this on all of the current shards indexes based on the hash algorithm.
>>>>> 
>>>>> Not something I've thought deeply about myself yet, but I think the idea would be to split as many as you felt you needed to.
>>>>> 
>>>>> If you wanted to keep the full balance always, this would mean splitting every shard at once, yes. But this depends on how many boxes (partitions) you are willing/able to add at a time.
>>>>> 
>>>>> You might just split one index to start - now its hash range would be handled by two shards instead of one (if you have 3 replicas per shard, this would mean adding 3 more boxes). When you needed to expand again, you would split another index that was still handling its full starting range. As you grow, once you split every original index, you'd start again, splitting one of the now half ranges.
>>>>> 
>>>>>> Is there also an index merger in contrib which could be used to merge
>>>>>> indexes?  I'm assuming this would be the process?
>>>>> 
>>>>> You can merge with IndexWriter.addIndexes (Solr also has an admin command that can do this). But I'm not sure where this fits in?
>>>>> 
>>>>> - Mark
>>>>> 
>>>>>> 
>>>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <ma...@gmail.com> wrote:
>>>>>>> Not yet - we don't plan on working on this until a lot of other stuff is
>>>>>>> working solid at this point. But someone else could jump in!
>>>>>>> 
>>>>>>> There are a couple ways to go about it that I know of:
>>>>>>> 
>>>>>>> A more long term solution may be to start using micro shards - each index
>>>>>>> starts as multiple indexes. This makes it pretty fast to move micro shards
>>>>>>> around as you decide to change partitions. It's also less flexible as you
>>>>>>> are limited by the number of micro shards you start with.
>>>>>>> 
>>>>>>> A more simple and likely first step is to use an index splitter . We
>>>>>>> already have one in lucene contrib - we would just need to modify it so
>>>>>>> that it splits based on the hash of the document id. This is super
>>>>>>> flexible, but splitting will obviously take a little while on a huge index.
>>>>>>> The current index splitter is a multi pass splitter - good enough to start
>>>>>>> with, but most files under codec control these days, we may be able to make
>>>>>>> a single pass splitter soon as well.
>>>>>>> 
>>>>>>> Eventually you could imagine using both options - micro shards that could
>>>>>>> also be split as needed. Though I still wonder if micro shards will be
>>>>>>> worth the extra complications myself...
>>>>>>> 
>>>>>>> Right now though, the idea is that you should pick a good number of
>>>>>>> partitions to start given your expected data ;) Adding more replicas is
>>>>>>> trivial though.
>>>>>>> 
>>>>>>> - Mark
>>>>>>> 
>>>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Another question, is there any support for repartitioning of the index
>>>>>>>> if a new shard is added?  What is the recommended approach for
>>>>>>>> handling this?  It seemed that the hashing algorithm (and probably
>>>>>>>> any) would require the index to be repartitioned should a new shard be
>>>>>>>> added.
>>>>>>>> 
>>>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>>>>>> Thanks I will try this first thing in the morning.
>>>>>>>>> 
>>>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <ma...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <je...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> I am currently looking at the latest solrcloud branch and was
>>>>>>>>>>> wondering if there was any documentation on configuring the
>>>>>>>>>>> DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
>>>>>>>>>>> to be added/modified to make distributed indexing work?
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Hi Jaime - take a look at solrconfig-distrib-update.xml in
>>>>>>>>>> solr/core/src/test-files
>>>>>>>>>> 
>>>>>>>>>> You need to enable the update log, add an empty replication handler def,
>>>>>>>>>> and an update chain with solr.DistributedUpdateProcessFactory in it.
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> - Mark
>>>>>>>>>> 
>>>>>>>>>> http://www.lucidimagination.com
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> - Mark
>>>>>>> 
>>>>>>> http://www.lucidimagination.com
>>>>>>> 
>>>>> 
>>>>> - Mark Miller
>>>>> lucidimagination.com
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>> 
>>> - Mark Miller
>>> lucidimagination.com
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>> 

- Mark Miller
lucidimagination.com












Re: Configuring the Distributed

Posted by Jamie Johnson <je...@gmail.com>.
So I couldn't resist, I attempted to do this tonight, I used the
solrconfig you mentioned (as is, no modifications), I setup a 2 shard
cluster in collection1, I sent 1 doc to 1 of the shards, updated it
and sent the update to the other.  I don't see the modifications
though I only see the original document.  The following is the test

public void update() throws Exception {

		String key = "1";

		SolrInputDocument solrDoc = new SolrInputDocument();
		solrDoc.setField("key", key);

		solrDoc.addField("content", "initial value");

		SolrServer server = servers
				.get("http://localhost:8983/solr/collection1");
		server.add(solrDoc);

		server.commit();

		solrDoc = new SolrInputDocument();
		solrDoc.addField("key", key);
		solrDoc.addField("content", "updated value");

		server = servers.get("http://localhost:7574/solr/collection1");

		UpdateRequest ureq = new UpdateRequest();
		ureq.setParam("update.chain", "distrib-update-chain");
		ureq.add(solrDoc);
		ureq.setParam("shards",
				"localhost:8983/solr/collection1,localhost:7574/solr/collection1");
		ureq.setParam("self", "foo");
		ureq.setAction(ACTION.COMMIT, true, true);
		server.request(ureq);
		System.out.println("done");
	}

key is my unique field in schema.xml

What am I doing wrong?

On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson <je...@gmail.com> wrote:
> Yes, the ZK method seems much more flexible.  Adding a new shard would
> be simply updating the range assignments in ZK.  Where is this
> currently on the list of things to accomplish?  I don't have time to
> work on this now, but if you (or anyone) could provide direction I'd
> be willing to work on this when I had spare time.  I guess a JIRA
> detailing where/how to do this could help.  Not sure if the design has
> been thought out that far though.
>
> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <ma...@gmail.com> wrote:
>> Right now let's say you have one shard - everything there hashes to range X.
>>
>> Now you want to split that shard with an Index Splitter.
>>
>> You divide range X in two - giving you two ranges - then you start splitting. This is where the current Splitter needs a little modification. You decide which doc should go into which new index by rehashing each doc id in the index you are splitting - if its hash is greater than X/2, it goes into index1 - if its less, index2. I think there are a couple current Splitter impls, but one of them does something like, give me an id - now if the id's in the index are above that id, goto index1, if below, index2. We need to instead do a quick hash rather than simple id compare.
>>
>> Why do you need to do this on every shard?
>>
>> The other part we need that we don't have is to store hash range assignments in zookeeper - we don't do that yet because it's not needed yet. Instead we currently just simply calculate that on the fly (too often at the moment - on every request :) I intend to fix that of course).
>>
>> At the start, zk would say, for range X, goto this shard. After the split, it would say, for range less than X/2 goto the old node, for range greater than X/2 goto the new node.
>>
>> - Mark
>>
>> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
>>
>>> hmmm.....This doesn't sound like the hashing algorithm that's on the
>>> branch, right?  The algorithm you're mentioning sounds like there is
>>> some logic which is able to tell that a particular range should be
>>> distributed between 2 shards instead of 1.  So seems like a trade off
>>> between repartitioning the entire index (on every shard) and having a
>>> custom hashing algorithm which is able to handle the situation where 2
>>> or more shards map to a particular range.
>>>
>>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <ma...@gmail.com> wrote:
>>>>
>>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>>>>
>>>>> I am not familiar with the index splitter that is in contrib, but I'll
>>>>> take a look at it soon.  So the process sounds like it would be to run
>>>>> this on all of the current shards indexes based on the hash algorithm.
>>>>
>>>> Not something I've thought deeply about myself yet, but I think the idea would be to split as many as you felt you needed to.
>>>>
>>>> If you wanted to keep the full balance always, this would mean splitting every shard at once, yes. But this depends on how many boxes (partitions) you are willing/able to add at a time.
>>>>
>>>> You might just split one index to start - now its hash range would be handled by two shards instead of one (if you have 3 replicas per shard, this would mean adding 3 more boxes). When you needed to expand again, you would split another index that was still handling its full starting range. As you grow, once you split every original index, you'd start again, splitting one of the now half ranges.
>>>>
>>>>> Is there also an index merger in contrib which could be used to merge
>>>>> indexes?  I'm assuming this would be the process?
>>>>
>>>> You can merge with IndexWriter.addIndexes (Solr also has an admin command that can do this). But I'm not sure where this fits in?
>>>>
>>>> - Mark
>>>>
>>>>>
>>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <ma...@gmail.com> wrote:
>>>>>> Not yet - we don't plan on working on this until a lot of other stuff is
>>>>>> working solid at this point. But someone else could jump in!
>>>>>>
>>>>>> There are a couple ways to go about it that I know of:
>>>>>>
>>>>>> A more long term solution may be to start using micro shards - each index
>>>>>> starts as multiple indexes. This makes it pretty fast to move micro shards
>>>>>> around as you decide to change partitions. It's also less flexible as you
>>>>>> are limited by the number of micro shards you start with.
>>>>>>
>>>>>> A more simple and likely first step is to use an index splitter . We
>>>>>> already have one in lucene contrib - we would just need to modify it so
>>>>>> that it splits based on the hash of the document id. This is super
>>>>>> flexible, but splitting will obviously take a little while on a huge index.
>>>>>> The current index splitter is a multi pass splitter - good enough to start
>>>>>> with, but most files under codec control these days, we may be able to make
>>>>>> a single pass splitter soon as well.
>>>>>>
>>>>>> Eventually you could imagine using both options - micro shards that could
>>>>>> also be split as needed. Though I still wonder if micro shards will be
>>>>>> worth the extra complications myself...
>>>>>>
>>>>>> Right now though, the idea is that you should pick a good number of
>>>>>> partitions to start given your expected data ;) Adding more replicas is
>>>>>> trivial though.
>>>>>>
>>>>>> - Mark
>>>>>>
>>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>>>
>>>>>>> Another question, is there any support for repartitioning of the index
>>>>>>> if a new shard is added?  What is the recommended approach for
>>>>>>> handling this?  It seemed that the hashing algorithm (and probably
>>>>>>> any) would require the index to be repartitioned should a new shard be
>>>>>>> added.
>>>>>>>
>>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>>>>> Thanks I will try this first thing in the morning.
>>>>>>>>
>>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <ma...@gmail.com>
>>>>>>> wrote:
>>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <je...@gmail.com>
>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I am currently looking at the latest solrcloud branch and was
>>>>>>>>>> wondering if there was any documentation on configuring the
>>>>>>>>>> DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
>>>>>>>>>> to be added/modified to make distributed indexing work?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Jaime - take a look at solrconfig-distrib-update.xml in
>>>>>>>>> solr/core/src/test-files
>>>>>>>>>
>>>>>>>>> You need to enable the update log, add an empty replication handler def,
>>>>>>>>> and an update chain with solr.DistributedUpdateProcessFactory in it.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> - Mark
>>>>>>>>>
>>>>>>>>> http://www.lucidimagination.com
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> - Mark
>>>>>>
>>>>>> http://www.lucidimagination.com
>>>>>>
>>>>
>>>> - Mark Miller
>>>> lucidimagination.com
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>
>> - Mark Miller
>> lucidimagination.com
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>

Re: Configuring the Distributed

Posted by Ted Dunning <te...@gmail.com>.
Well, this goes both ways.

It is not that unusual to take a node down for maintenance of some kind or
even to have a node failure.  In that case, it is very nice to have the
load from the lost node be spread fairly evenly across the remaining
cluster.

Regarding the cost of having several micro-shards, they are also an
opportunity for threading the search.  Most sites don't have enough queries
coming in to occupy all of the cores in modern machines so threading each
query can actually be a substantial benefit in terms of query time.

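For illustration, a bare-bones sketch of that idea - fan the same query out over several local micro-shards on a thread pool and merge the hit counts (plain Java with made-up names, not tied to any Solr or Lucene API):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FanOutSearchSketch {

    // stand-in for a searcher over one micro-shard
    interface MicroShard {
        long count(String query);
    }

    // run the query against every micro-shard in parallel and sum the hits
    static long countAll(List<MicroShard> shards, final String query) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        try {
            List<Future<Long>> results = new ArrayList<Future<Long>>();
            for (final MicroShard shard : shards) {
                results.add(pool.submit(new Callable<Long>() {
                    public Long call() {
                        return shard.count(query);
                    }
                }));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
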
On Thu, Dec 1, 2011 at 6:37 PM, Mark Miller <ma...@gmail.com> wrote:

> To kick things off though, adding another partition should be a rare event
> if you plan carefully, and I think many will be able to handle the cost of
> splitting (you might even mark the replica you are splitting on so that
> it's not part of queries while it's 'busy' splitting).
>

Re: Configuring the Distributed

Posted by Ted Dunning <te...@gmail.com>.
With micro-shards, you can use random numbers for all placements with minor
constraints like avoiding replicas sitting in the same rack.  Since the
number of shards never changes, things stay very simple.

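For illustration, a toy sketch of that placement rule - pick a random node for each replica of a micro-shard and reject any pick that would reuse a rack (hypothetical code with made-up names, no relation to anything in Solr; it assumes there are at least as many racks as replicas):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

public class RandomPlacementSketch {

    // choose nodes for one micro-shard's replicas at random,
    // never putting two replicas of the same shard in the same rack
    static List<String> place(List<String> nodes, Map<String, String> rackOf,
            int replicas, Random random) {
        List<String> chosen = new ArrayList<String>();
        Set<String> usedRacks = new HashSet<String>();
        while (chosen.size() < replicas) {
            String node = nodes.get(random.nextInt(nodes.size()));
            // add() returns false when that rack already holds a replica - retry
            if (usedRacks.add(rackOf.get(node))) {
                chosen.add(node);
            }
        }
        return chosen;
    }
}
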
On Thu, Dec 1, 2011 at 6:44 PM, Mark Miller <ma...@gmail.com> wrote:

> Sorry - missed something - you also have the added cost of shipping the
> new half index to all of the replicas of the original shard with the
> splitting method. Unless you somehow split on every replica at the same
> time - then of course you wouldn't be able to avoid the 'busy' replica, and
> it would probably be fairly hard to juggle.
>
>
> On Dec 1, 2011, at 9:37 PM, Mark Miller wrote:
>
> > In this case we are still talking about moving a whole index at a time
> rather than lots of little documents. You split the index into two, and
> then ship one of them off.
> >
> > The extra cost you can avoid with micro sharding will be the cost of
> splitting the index - which could be significant for a very large index. I
> have not done any tests though.
> >
> > The cost of 20 micro-shards is that you will always have tons of
> segments unless you are very heavily merging - and even in the very unusual
> case of each micro shard being optimized, you have essentially 20 segments.
> That's best case - normal case is likely in the hundreds.
> >
> > This can be a fairly significant % hit at search time.
> >
> > You also have the added complexity of managing 20 indexes per node in
> solr code.
> >
> > I think that both options have their +/-'s and eventually we could
> perhaps support both.
> >
> > To kick things off though, adding another partition should be a rare
> event if you plan carefully, and I think many will be able to handle the
> cost of splitting (you might even mark the replica you are splitting on so
> that it's not part of queries while it's 'busy' splitting).
> >
> > - Mark
> >
> > On Dec 1, 2011, at 9:17 PM, Ted Dunning wrote:
> >
> >> Of course, resharding is almost never necessary if you use micro-shards.
> >> Micro-shards are shards small enough that you can fit 20 or more on a
> >> node.  If you have that many on each node, then adding a new node
> consists
> >> of moving some shards to the new machine rather than moving lots of
> little
> >> documents.
> >>
> >> Much faster.  As in thousands of times faster.
> >>
> >> On Thu, Dec 1, 2011 at 5:51 PM, Jamie Johnson <je...@gmail.com>
> wrote:
> >>
> >>> Yes, the ZK method seems much more flexible.  Adding a new shard would
> >>> be simply updating the range assignments in ZK.  Where is this
> >>> currently on the list of things to accomplish?  I don't have time to
> >>> work on this now, but if you (or anyone) could provide direction I'd
> >>> be willing to work on this when I had spare time.  I guess a JIRA
> >>> detailing where/how to do this could help.  Not sure if the design has
> >>> been thought out that far though.
> >>>
> >>> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <ma...@gmail.com>
> wrote:
> >>>> Right now lets say you have one shard - everything there hashes to
> range
> >>> X.
> >>>>
> >>>> Now you want to split that shard with an Index Splitter.
> >>>>
> >>>> You divide range X in two - giving you two ranges - then you start
> >>> splitting. This is where the current Splitter needs a little
> modification.
> >>> You decide which doc should go into which new index by rehashing each
> doc
> >>> id in the index you are splitting - if its hash is greater than X/2, it
> >>> goes into index1 - if its less, index2. I think there are a couple
> current
> >>> Splitter impls, but one of them does something like, give me an id -
> now if
> >>> the id's in the index are above that id, goto index1, if below,
> index2. We
> >>> need to instead do a quick hash rather than simple id compare.
> >>>>
> >>>> Why do you need to do this on every shard?
> >>>>
> >>>> The other part we need that we dont have is to store hash range
> >>> assignments in zookeeper - we don't do that yet because it's not needed
> >>> yet. Instead we currently just simply calculate that on the fly (too
> often
> >>> at the moment - on every request :) I intend to fix that of course).
> >>>>
> >>>> At the start, zk would say, for range X, goto this shard. After the
> >>> split, it would say, for range less than X/2 goto the old node, for
> range
> >>> greater than X/2 goto the new node.
> >>>>
> >>>> - Mark
> >>>>
> >>>> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
> >>>>
> >>>>> hmmm.....This doesn't sound like the hashing algorithm that's on the
> >>>>> branch, right?  The algorithm you're mentioning sounds like there is
> >>>>> some logic which is able to tell that a particular range should be
> >>>>> distributed between 2 shards instead of 1.  So seems like a trade off
> >>>>> between repartitioning the entire index (on every shard) and having a
> >>>>> custom hashing algorithm which is able to handle the situation where
> 2
> >>>>> or more shards map to a particular range.
> >>>>>
> >>>>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <ma...@gmail.com>
> >>> wrote:
> >>>>>>
> >>>>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
> >>>>>>
> >>>>>>> I am not familiar with the index splitter that is in contrib, but
> I'll
> >>>>>>> take a look at it soon.  So the process sounds like it would be to
> run
> >>>>>>> this on all of the current shards indexes based on the hash
> algorithm.
> >>>>>>
> >>>>>> Not something I've thought deeply about myself yet, but I think the
> >>> idea would be to split as many as you felt you needed to.
> >>>>>>
> >>>>>> If you wanted to keep the full balance always, this would mean
> >>> splitting every shard at once, yes. But this depends on how many boxes
> >>> (partitions) you are willing/able to add at a time.
> >>>>>>
> >>>>>> You might just split one index to start - now it's hash range would
> be
> >>> handled by two shards instead of one (if you have 3 replicas per shard,
> >>> this would mean adding 3 more boxes). When you needed to expand again,
> you
> >>> would split another index that was still handling its full starting
> range.
> >>> As you grow, once you split every original index, you'd start again,
> >>> splitting one of the now half ranges.
> >>>>>>
> >>>>>>> Is there also an index merger in contrib which could be used to
> merge
> >>>>>>> indexes?  I'm assuming this would be the process?
> >>>>>>
> >>>>>> You can merge with IndexWriter.addIndexes (Solr also has an admin
> >>> command that can do this). But I'm not sure where this fits in?
> >>>>>>
> >>>>>> - Mark
> >>>>>>
> >>>>>>>
> >>>>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <markrmiller@gmail.com
> >
> >>> wrote:
> >>>>>>>> Not yet - we don't plan on working on this until a lot of other
> >>> stuff is
> >>>>>>>> working solid at this point. But someone else could jump in!
> >>>>>>>>
> >>>>>>>> There are a couple ways to go about it that I know of:
> >>>>>>>>
> >>>>>>>> A more long term solution may be to start using micro shards -
> each
> >>> index
> >>>>>>>> starts as multiple indexes. This makes it pretty fast to move
> mirco
> >>> shards
> >>>>>>>> around as you decide to change partitions. It's also less flexible
> >>> as you
> >>>>>>>> are limited by the number of micro shards you start with.
> >>>>>>>>
> >>>>>>>> A more simple and likely first step is to use an index splitter .
> We
> >>>>>>>> already have one in lucene contrib - we would just need to modify
> it
> >>> so
> >>>>>>>> that it splits based on the hash of the document id. This is super
> >>>>>>>> flexible, but splitting will obviously take a little while on a
> huge
> >>> index.
> >>>>>>>> The current index splitter is a multi pass splitter - good enough
> to
> >>> start
> >>>>>>>> with, but most files under codec control these days, we may be
> able
> >>> to make
> >>>>>>>> a single pass splitter soon as well.
> >>>>>>>>
> >>>>>>>> Eventually you could imagine using both options - micro shards
> that
> >>> could
> >>>>>>>> also be split as needed. Though I still wonder if micro shards
> will
> >>> be
> >>>>>>>> worth the extra complications myself...
> >>>>>>>>
> >>>>>>>> Right now though, the idea is that you should pick a good number
> of
> >>>>>>>> partitions to start given your expected data ;) Adding more
> replicas
> >>> is
> >>>>>>>> trivial though.
> >>>>>>>>
> >>>>>>>> - Mark
> >>>>>>>>
> >>>>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <je...@gmail.com>
> >>> wrote:
> >>>>>>>>
> >>>>>>>>> Another question, is there any support for repartitioning of the
> >>> index
> >>>>>>>>> if a new shard is added?  What is the recommended approach for
> >>>>>>>>> handling this?  It seemed that the hashing algorithm (and
> probably
> >>>>>>>>> any) would require the index to be repartitioned should a new
> shard
> >>> be
> >>>>>>>>> added.
> >>>>>>>>>
> >>>>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <jej2003@gmail.com
> >
> >>> wrote:
> >>>>>>>>>> Thanks I will try this first thing in the morning.
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <
> markrmiller@gmail.com
> >>>>
> >>>>>>>>> wrote:
> >>>>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <
> jej2003@gmail.com
> >>>>
> >>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> I am currently looking at the latest solrcloud branch and was
> >>>>>>>>>>>> wondering if there was any documentation on configuring the
> >>>>>>>>>>>> DistributedUpdateProcessor?  What specifically in
> solrconfig.xml
> >>> needs
> >>>>>>>>>>>> to be added/modified to make distributed indexing work?
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Jaime - take a look at solrconfig-distrib-update.xml in
> >>>>>>>>>>> solr/core/src/test-files
> >>>>>>>>>>>
> >>>>>>>>>>> You need to enable the update log, add an empty replication
> >>> handler def,
> >>>>>>>>>>> and an update chain with solr.DistributedUpdateProcessFactory
> in
> >>> it.
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> - Mark
> >>>>>>>>>>>
> >>>>>>>>>>> http://www.lucidimagination.com
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> - Mark
> >>>>>>>>
> >>>>>>>> http://www.lucidimagination.com
> >>>>>>>>
> >>>>>>
> >>>>>> - Mark Miller
> >>>>>> lucidimagination.com
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>> - Mark Miller
> >>>> lucidimagination.com
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >
> > - Mark Miller
> > lucidimagination.com
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
>
> - Mark Miller
> lucidimagination.com
>
>
>
>
>
>
>
>
>
>
>
>

Re: Configuring the Distributed

Posted by Mark Miller <ma...@gmail.com>.
Sorry - missed something: with the splitting method you also have the added cost of shipping the new half index to all of the replicas of the original shard. Unless you somehow split on every replica at the same time - but then of course you couldn't avoid the 'busy' replica, and it would probably be fairly hard to coordinate.


On Dec 1, 2011, at 9:37 PM, Mark Miller wrote:

> In this case we are still talking about moving a whole index at a time rather than lots of little documents. You split the index into two, and then ship one of them off.
> 
> The extra cost you can avoid with micro sharding will be the cost of splitting the index - which could be significant for a very large index. I have not done any tests though.
> 
> The cost of 20 micro-shards is that you will always have tons of segments unless you are very heavily merging - and even in the very unusual case of each micro shard being optimized, you have essentially 20 segments. Thats best case - normal case is likely in the hundreds.
> 
> This can be a fairly significant % hit at search time.
> 
> You also have the added complexity of managing 20 indexes per node in solr code.
> 
> I think that both options have there +/-'s and eventually we could perhaps support both.
> 
> To kick things off though, adding another partition should be a rare event if you plan carefully, and I think many will be able to handle the cost of splitting (you might even mark the replica you are splitting on so that it's not part of queries while its 'busy' splitting).
> 
> - Mark
> 
> On Dec 1, 2011, at 9:17 PM, Ted Dunning wrote:
> 
>> Of course, resharding is almost never necessary if you use micro-shards.
>> Micro-shards are shards small enough that you can fit 20 or more on a
>> node.  If you have that many on each node, then adding a new node consists
>> of moving some shards to the new machine rather than moving lots of little
>> documents.
>> 
>> Much faster.  As in thousands of times faster.
>> 
>> On Thu, Dec 1, 2011 at 5:51 PM, Jamie Johnson <je...@gmail.com> wrote:
>> 
>>> Yes, the ZK method seems much more flexible.  Adding a new shard would
>>> be simply updating the range assignments in ZK.  Where is this
>>> currently on the list of things to accomplish?  I don't have time to
>>> work on this now, but if you (or anyone) could provide direction I'd
>>> be willing to work on this when I had spare time.  I guess a JIRA
>>> detailing where/how to do this could help.  Not sure if the design has
>>> been thought out that far though.
>>> 
>>> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <ma...@gmail.com> wrote:
>>>> Right now lets say you have one shard - everything there hashes to range
>>> X.
>>>> 
>>>> Now you want to split that shard with an Index Splitter.
>>>> 
>>>> You divide range X in two - giving you two ranges - then you start
>>> splitting. This is where the current Splitter needs a little modification.
>>> You decide which doc should go into which new index by rehashing each doc
>>> id in the index you are splitting - if its hash is greater than X/2, it
>>> goes into index1 - if its less, index2. I think there are a couple current
>>> Splitter impls, but one of them does something like, give me an id - now if
>>> the id's in the index are above that id, goto index1, if below, index2. We
>>> need to instead do a quick hash rather than simple id compare.
>>>> 
>>>> Why do you need to do this on every shard?
>>>> 
>>>> The other part we need that we dont have is to store hash range
>>> assignments in zookeeper - we don't do that yet because it's not needed
>>> yet. Instead we currently just simply calculate that on the fly (too often
>>> at the moment - on every request :) I intend to fix that of course).
>>>> 
>>>> At the start, zk would say, for range X, goto this shard. After the
>>> split, it would say, for range less than X/2 goto the old node, for range
>>> greater than X/2 goto the new node.
>>>> 
>>>> - Mark
>>>> 
>>>> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
>>>> 
>>>>> hmmm.....This doesn't sound like the hashing algorithm that's on the
>>>>> branch, right?  The algorithm you're mentioning sounds like there is
>>>>> some logic which is able to tell that a particular range should be
>>>>> distributed between 2 shards instead of 1.  So seems like a trade off
>>>>> between repartitioning the entire index (on every shard) and having a
>>>>> custom hashing algorithm which is able to handle the situation where 2
>>>>> or more shards map to a particular range.
>>>>> 
>>>>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <ma...@gmail.com>
>>> wrote:
>>>>>> 
>>>>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>>>>>> 
>>>>>>> I am not familiar with the index splitter that is in contrib, but I'll
>>>>>>> take a look at it soon.  So the process sounds like it would be to run
>>>>>>> this on all of the current shards indexes based on the hash algorithm.
>>>>>> 
>>>>>> Not something I've thought deeply about myself yet, but I think the
>>> idea would be to split as many as you felt you needed to.
>>>>>> 
>>>>>> If you wanted to keep the full balance always, this would mean
>>> splitting every shard at once, yes. But this depends on how many boxes
>>> (partitions) you are willing/able to add at a time.
>>>>>> 
>>>>>> You might just split one index to start - now it's hash range would be
>>> handled by two shards instead of one (if you have 3 replicas per shard,
>>> this would mean adding 3 more boxes). When you needed to expand again, you
>>> would split another index that was still handling its full starting range.
>>> As you grow, once you split every original index, you'd start again,
>>> splitting one of the now half ranges.
>>>>>> 
>>>>>>> Is there also an index merger in contrib which could be used to merge
>>>>>>> indexes?  I'm assuming this would be the process?
>>>>>> 
>>>>>> You can merge with IndexWriter.addIndexes (Solr also has an admin
>>> command that can do this). But I'm not sure where this fits in?
>>>>>> 
>>>>>> - Mark
>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <ma...@gmail.com>
>>> wrote:
>>>>>>>> Not yet - we don't plan on working on this until a lot of other
>>> stuff is
>>>>>>>> working solid at this point. But someone else could jump in!
>>>>>>>> 
>>>>>>>> There are a couple ways to go about it that I know of:
>>>>>>>> 
>>>>>>>> A more long term solution may be to start using micro shards - each
>>> index
>>>>>>>> starts as multiple indexes. This makes it pretty fast to move mirco
>>> shards
>>>>>>>> around as you decide to change partitions. It's also less flexible
>>> as you
>>>>>>>> are limited by the number of micro shards you start with.
>>>>>>>> 
>>>>>>>> A more simple and likely first step is to use an index splitter . We
>>>>>>>> already have one in lucene contrib - we would just need to modify it
>>> so
>>>>>>>> that it splits based on the hash of the document id. This is super
>>>>>>>> flexible, but splitting will obviously take a little while on a huge
>>> index.
>>>>>>>> The current index splitter is a multi pass splitter - good enough to
>>> start
>>>>>>>> with, but most files under codec control these days, we may be able
>>> to make
>>>>>>>> a single pass splitter soon as well.
>>>>>>>> 
>>>>>>>> Eventually you could imagine using both options - micro shards that
>>> could
>>>>>>>> also be split as needed. Though I still wonder if micro shards will
>>> be
>>>>>>>> worth the extra complications myself...
>>>>>>>> 
>>>>>>>> Right now though, the idea is that you should pick a good number of
>>>>>>>> partitions to start given your expected data ;) Adding more replicas
>>> is
>>>>>>>> trivial though.
>>>>>>>> 
>>>>>>>> - Mark
>>>>>>>> 
>>>>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <je...@gmail.com>
>>> wrote:
>>>>>>>> 
>>>>>>>>> Another question, is there any support for repartitioning of the
>>> index
>>>>>>>>> if a new shard is added?  What is the recommended approach for
>>>>>>>>> handling this?  It seemed that the hashing algorithm (and probably
>>>>>>>>> any) would require the index to be repartitioned should a new shard
>>> be
>>>>>>>>> added.
>>>>>>>>> 
>>>>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <je...@gmail.com>
>>> wrote:
>>>>>>>>>> Thanks I will try this first thing in the morning.
>>>>>>>>>> 
>>>>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <markrmiller@gmail.com
>>>> 
>>>>>>>>> wrote:
>>>>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <jej2003@gmail.com
>>>> 
>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> I am currently looking at the latest solrcloud branch and was
>>>>>>>>>>>> wondering if there was any documentation on configuring the
>>>>>>>>>>>> DistributedUpdateProcessor?  What specifically in solrconfig.xml
>>> needs
>>>>>>>>>>>> to be added/modified to make distributed indexing work?
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Hi Jaime - take a look at solrconfig-distrib-update.xml in
>>>>>>>>>>> solr/core/src/test-files
>>>>>>>>>>> 
>>>>>>>>>>> You need to enable the update log, add an empty replication
>>> handler def,
>>>>>>>>>>> and an update chain with solr.DistributedUpdateProcessFactory in
>>> it.
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> - Mark
>>>>>>>>>>> 
>>>>>>>>>>> http://www.lucidimagination.com
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> - Mark
>>>>>>>> 
>>>>>>>> http://www.lucidimagination.com
>>>>>>>> 
>>>>>> 
>>>>>> - Mark Miller
>>>>>> lucidimagination.com
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> - Mark Miller
>>>> lucidimagination.com
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
> 
> - Mark Miller
> lucidimagination.com
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 

- Mark Miller
lucidimagination.com












Re: Configuring the Distributed

Posted by Mark Miller <ma...@gmail.com>.
In this case we are still talking about moving a whole index at a time rather than lots of little documents. You split the index into two, and then ship one of them off.

The extra cost you can avoid with micro-sharding is the cost of splitting the index - which could be significant for a very large index. I have not done any tests though.

The cost of 20 micro-shards is that you will always have tons of segments unless you are merging very heavily - and even in the very unusual case of each micro-shard being optimized, you still have essentially 20 segments. That's the best case - the normal case is likely in the hundreds.

This can be a fairly significant % hit at search time.

You also have the added complexity of managing 20 indexes per node in Solr code.

I think both options have their +/-'s, and eventually we could perhaps support both.

To kick things off though, adding another partition should be a rare event if you plan carefully, and I think many will be able to handle the cost of splitting (you might even mark the replica you are splitting on so that it's not part of queries while it's 'busy' splitting).

- Mark

On Dec 1, 2011, at 9:17 PM, Ted Dunning wrote:

> Of course, resharding is almost never necessary if you use micro-shards.
> Micro-shards are shards small enough that you can fit 20 or more on a
> node.  If you have that many on each node, then adding a new node consists
> of moving some shards to the new machine rather than moving lots of little
> documents.
> 
> Much faster.  As in thousands of times faster.
> 
> On Thu, Dec 1, 2011 at 5:51 PM, Jamie Johnson <je...@gmail.com> wrote:
> 
>> Yes, the ZK method seems much more flexible.  Adding a new shard would
>> be simply updating the range assignments in ZK.  Where is this
>> currently on the list of things to accomplish?  I don't have time to
>> work on this now, but if you (or anyone) could provide direction I'd
>> be willing to work on this when I had spare time.  I guess a JIRA
>> detailing where/how to do this could help.  Not sure if the design has
>> been thought out that far though.
>> 
>> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <ma...@gmail.com> wrote:
>>> Right now lets say you have one shard - everything there hashes to range
>> X.
>>> 
>>> Now you want to split that shard with an Index Splitter.
>>> 
>>> You divide range X in two - giving you two ranges - then you start
>> splitting. This is where the current Splitter needs a little modification.
>> You decide which doc should go into which new index by rehashing each doc
>> id in the index you are splitting - if its hash is greater than X/2, it
>> goes into index1 - if its less, index2. I think there are a couple current
>> Splitter impls, but one of them does something like, give me an id - now if
>> the id's in the index are above that id, goto index1, if below, index2. We
>> need to instead do a quick hash rather than simple id compare.
>>> 
>>> Why do you need to do this on every shard?
>>> 
>>> The other part we need that we dont have is to store hash range
>> assignments in zookeeper - we don't do that yet because it's not needed
>> yet. Instead we currently just simply calculate that on the fly (too often
>> at the moment - on every request :) I intend to fix that of course).
>>> 
>>> At the start, zk would say, for range X, goto this shard. After the
>> split, it would say, for range less than X/2 goto the old node, for range
>> greater than X/2 goto the new node.
>>> 
>>> - Mark
>>> 
>>> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
>>> 
>>>> hmmm.....This doesn't sound like the hashing algorithm that's on the
>>>> branch, right?  The algorithm you're mentioning sounds like there is
>>>> some logic which is able to tell that a particular range should be
>>>> distributed between 2 shards instead of 1.  So seems like a trade off
>>>> between repartitioning the entire index (on every shard) and having a
>>>> custom hashing algorithm which is able to handle the situation where 2
>>>> or more shards map to a particular range.
>>>> 
>>>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <ma...@gmail.com>
>> wrote:
>>>>> 
>>>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>>>>> 
>>>>>> I am not familiar with the index splitter that is in contrib, but I'll
>>>>>> take a look at it soon.  So the process sounds like it would be to run
>>>>>> this on all of the current shards indexes based on the hash algorithm.
>>>>> 
>>>>> Not something I've thought deeply about myself yet, but I think the
>> idea would be to split as many as you felt you needed to.
>>>>> 
>>>>> If you wanted to keep the full balance always, this would mean
>> splitting every shard at once, yes. But this depends on how many boxes
>> (partitions) you are willing/able to add at a time.
>>>>> 
>>>>> You might just split one index to start - now it's hash range would be
>> handled by two shards instead of one (if you have 3 replicas per shard,
>> this would mean adding 3 more boxes). When you needed to expand again, you
>> would split another index that was still handling its full starting range.
>> As you grow, once you split every original index, you'd start again,
>> splitting one of the now half ranges.
>>>>> 
>>>>>> Is there also an index merger in contrib which could be used to merge
>>>>>> indexes?  I'm assuming this would be the process?
>>>>> 
>>>>> You can merge with IndexWriter.addIndexes (Solr also has an admin
>> command that can do this). But I'm not sure where this fits in?
>>>>> 
>>>>> - Mark
>>>>> 
>>>>>> 
>>>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <ma...@gmail.com>
>> wrote:
>>>>>>> Not yet - we don't plan on working on this until a lot of other
>> stuff is
>>>>>>> working solid at this point. But someone else could jump in!
>>>>>>> 
>>>>>>> There are a couple ways to go about it that I know of:
>>>>>>> 
>>>>>>> A more long term solution may be to start using micro shards - each
>> index
>>>>>>> starts as multiple indexes. This makes it pretty fast to move mirco
>> shards
>>>>>>> around as you decide to change partitions. It's also less flexible
>> as you
>>>>>>> are limited by the number of micro shards you start with.
>>>>>>> 
>>>>>>> A more simple and likely first step is to use an index splitter . We
>>>>>>> already have one in lucene contrib - we would just need to modify it
>> so
>>>>>>> that it splits based on the hash of the document id. This is super
>>>>>>> flexible, but splitting will obviously take a little while on a huge
>> index.
>>>>>>> The current index splitter is a multi pass splitter - good enough to
>> start
>>>>>>> with, but most files under codec control these days, we may be able
>> to make
>>>>>>> a single pass splitter soon as well.
>>>>>>> 
>>>>>>> Eventually you could imagine using both options - micro shards that
>> could
>>>>>>> also be split as needed. Though I still wonder if micro shards will
>> be
>>>>>>> worth the extra complications myself...
>>>>>>> 
>>>>>>> Right now though, the idea is that you should pick a good number of
>>>>>>> partitions to start given your expected data ;) Adding more replicas
>> is
>>>>>>> trivial though.
>>>>>>> 
>>>>>>> - Mark
>>>>>>> 
>>>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <je...@gmail.com>
>> wrote:
>>>>>>> 
>>>>>>>> Another question, is there any support for repartitioning of the
>> index
>>>>>>>> if a new shard is added?  What is the recommended approach for
>>>>>>>> handling this?  It seemed that the hashing algorithm (and probably
>>>>>>>> any) would require the index to be repartitioned should a new shard
>> be
>>>>>>>> added.
>>>>>>>> 
>>>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <je...@gmail.com>
>> wrote:
>>>>>>>>> Thanks I will try this first thing in the morning.
>>>>>>>>> 
>>>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <markrmiller@gmail.com
>>> 
>>>>>>>> wrote:
>>>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <jej2003@gmail.com
>>> 
>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> I am currently looking at the latest solrcloud branch and was
>>>>>>>>>>> wondering if there was any documentation on configuring the
>>>>>>>>>>> DistributedUpdateProcessor?  What specifically in solrconfig.xml
>> needs
>>>>>>>>>>> to be added/modified to make distributed indexing work?
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Hi Jaime - take a look at solrconfig-distrib-update.xml in
>>>>>>>>>> solr/core/src/test-files
>>>>>>>>>> 
>>>>>>>>>> You need to enable the update log, add an empty replication
>> handler def,
>>>>>>>>>> and an update chain with solr.DistributedUpdateProcessFactory in
>> it.
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> - Mark
>>>>>>>>>> 
>>>>>>>>>> http://www.lucidimagination.com
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> - Mark
>>>>>>> 
>>>>>>> http://www.lucidimagination.com
>>>>>>> 
>>>>> 
>>>>> - Mark Miller
>>>>> lucidimagination.com
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>> 
>>> - Mark Miller
>>> lucidimagination.com
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>> 

- Mark Miller
lucidimagination.com












Re: Configuring the Distributed

Posted by Ted Dunning <te...@gmail.com>.
Of course, resharding is almost never necessary if you use micro-shards.
 Micro-shards are shards small enough that you can fit 20 or more on a
node.  If you have that many on each node, then adding a new node consists
of moving some shards to the new machine rather than moving lots of little
documents.

Much faster.  As in thousands of times faster.
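
A minimal sketch of the rebalancing idea described above - the shard names, node count, and round-robin assignment are illustrative assumptions only, not anything taken from the SolrCloud branch:

    import java.util.*;

    // Illustrative only: with many small shards per node, adding a node means
    // reassigning whole micro-shards rather than re-indexing individual documents.
    public class MicroShardRebalance {
      public static void main(String[] args) {
        int shardsPerNode = 20;
        List<String> nodes = new ArrayList<String>(Arrays.asList("node1", "node2"));

        // Hypothetical starting layout: 40 micro-shards spread over 2 nodes.
        Map<String, String> shardToNode = new LinkedHashMap<String, String>();
        for (int i = 0; i < shardsPerNode * nodes.size(); i++) {
          shardToNode.put("shard" + i, nodes.get(i % nodes.size()));
        }

        // Add a third node: move just enough whole shards to roughly even things out.
        nodes.add("node3");
        int target = shardToNode.size() / nodes.size();
        int moved = 0;
        for (Map.Entry<String, String> e : shardToNode.entrySet()) {
          if (moved == target) break;
          e.setValue("node3"); // ship the whole micro-shard index, no per-doc work
          moved++;
        }
        System.out.println(shardToNode);
      }
    }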

On Thu, Dec 1, 2011 at 5:51 PM, Jamie Johnson <je...@gmail.com> wrote:

> Yes, the ZK method seems much more flexible.  Adding a new shard would
> be simply updating the range assignments in ZK.  Where is this
> currently on the list of things to accomplish?  I don't have time to
> work on this now, but if you (or anyone) could provide direction I'd
> be willing to work on this when I had spare time.  I guess a JIRA
> detailing where/how to do this could help.  Not sure if the design has
> been thought out that far though.
>
> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <ma...@gmail.com> wrote:
> > Right now lets say you have one shard - everything there hashes to range
> X.
> >
> > Now you want to split that shard with an Index Splitter.
> >
> > You divide range X in two - giving you two ranges - then you start
> splitting. This is where the current Splitter needs a little modification.
> You decide which doc should go into which new index by rehashing each doc
> id in the index you are splitting - if its hash is greater than X/2, it
> goes into index1 - if its less, index2. I think there are a couple current
> Splitter impls, but one of them does something like, give me an id - now if
> the id's in the index are above that id, goto index1, if below, index2. We
> need to instead do a quick hash rather than simple id compare.
> >
> > Why do you need to do this on every shard?
> >
> > The other part we need that we dont have is to store hash range
> assignments in zookeeper - we don't do that yet because it's not needed
> yet. Instead we currently just simply calculate that on the fly (too often
> at the moment - on every request :) I intend to fix that of course).
> >
> > At the start, zk would say, for range X, goto this shard. After the
> split, it would say, for range less than X/2 goto the old node, for range
> greater than X/2 goto the new node.
> >
> > - Mark
> >
> > On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
> >
> >> hmmm.....This doesn't sound like the hashing algorithm that's on the
> >> branch, right?  The algorithm you're mentioning sounds like there is
> >> some logic which is able to tell that a particular range should be
> >> distributed between 2 shards instead of 1.  So seems like a trade off
> >> between repartitioning the entire index (on every shard) and having a
> >> custom hashing algorithm which is able to handle the situation where 2
> >> or more shards map to a particular range.
> >>
> >> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <ma...@gmail.com>
> wrote:
> >>>
> >>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
> >>>
> >>>> I am not familiar with the index splitter that is in contrib, but I'll
> >>>> take a look at it soon.  So the process sounds like it would be to run
> >>>> this on all of the current shards indexes based on the hash algorithm.
> >>>
> >>> Not something I've thought deeply about myself yet, but I think the
> idea would be to split as many as you felt you needed to.
> >>>
> >>> If you wanted to keep the full balance always, this would mean
> splitting every shard at once, yes. But this depends on how many boxes
> (partitions) you are willing/able to add at a time.
> >>>
> >>> You might just split one index to start - now it's hash range would be
> handled by two shards instead of one (if you have 3 replicas per shard,
> this would mean adding 3 more boxes). When you needed to expand again, you
> would split another index that was still handling its full starting range.
> As you grow, once you split every original index, you'd start again,
> splitting one of the now half ranges.
> >>>
> >>>> Is there also an index merger in contrib which could be used to merge
> >>>> indexes?  I'm assuming this would be the process?
> >>>
> >>> You can merge with IndexWriter.addIndexes (Solr also has an admin
> command that can do this). But I'm not sure where this fits in?
> >>>
> >>> - Mark
> >>>
> >>>>
> >>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <ma...@gmail.com>
> wrote:
> >>>>> Not yet - we don't plan on working on this until a lot of other
> stuff is
> >>>>> working solid at this point. But someone else could jump in!
> >>>>>
> >>>>> There are a couple ways to go about it that I know of:
> >>>>>
> >>>>> A more long term solution may be to start using micro shards - each
> index
> >>>>> starts as multiple indexes. This makes it pretty fast to move mirco
> shards
> >>>>> around as you decide to change partitions. It's also less flexible
> as you
> >>>>> are limited by the number of micro shards you start with.
> >>>>>
> >>>>> A more simple and likely first step is to use an index splitter . We
> >>>>> already have one in lucene contrib - we would just need to modify it
> so
> >>>>> that it splits based on the hash of the document id. This is super
> >>>>> flexible, but splitting will obviously take a little while on a huge
> index.
> >>>>> The current index splitter is a multi pass splitter - good enough to
> start
> >>>>> with, but most files under codec control these days, we may be able
> to make
> >>>>> a single pass splitter soon as well.
> >>>>>
> >>>>> Eventually you could imagine using both options - micro shards that
> could
> >>>>> also be split as needed. Though I still wonder if micro shards will
> be
> >>>>> worth the extra complications myself...
> >>>>>
> >>>>> Right now though, the idea is that you should pick a good number of
> >>>>> partitions to start given your expected data ;) Adding more replicas
> is
> >>>>> trivial though.
> >>>>>
> >>>>> - Mark
> >>>>>
> >>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <je...@gmail.com>
> wrote:
> >>>>>
> >>>>>> Another question, is there any support for repartitioning of the
> index
> >>>>>> if a new shard is added?  What is the recommended approach for
> >>>>>> handling this?  It seemed that the hashing algorithm (and probably
> >>>>>> any) would require the index to be repartitioned should a new shard
> be
> >>>>>> added.
> >>>>>>
> >>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <je...@gmail.com>
> wrote:
> >>>>>>> Thanks I will try this first thing in the morning.
> >>>>>>>
> >>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <markrmiller@gmail.com
> >
> >>>>>> wrote:
> >>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <jej2003@gmail.com
> >
> >>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> I am currently looking at the latest solrcloud branch and was
> >>>>>>>>> wondering if there was any documentation on configuring the
> >>>>>>>>> DistributedUpdateProcessor?  What specifically in solrconfig.xml
> needs
> >>>>>>>>> to be added/modified to make distributed indexing work?
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Jaime - take a look at solrconfig-distrib-update.xml in
> >>>>>>>> solr/core/src/test-files
> >>>>>>>>
> >>>>>>>> You need to enable the update log, add an empty replication
> handler def,
> >>>>>>>> and an update chain with solr.DistributedUpdateProcessFactory in
> it.
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> - Mark
> >>>>>>>>
> >>>>>>>> http://www.lucidimagination.com
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> - Mark
> >>>>>
> >>>>> http://www.lucidimagination.com
> >>>>>
> >>>
> >>> - Mark Miller
> >>> lucidimagination.com
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >
> > - Mark Miller
> > lucidimagination.com
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
>

Re: Configuring the Distributed

Posted by Mark Miller <ma...@gmail.com>.
It's not full of details yet, but there is a JIRA issue here:
https://issues.apache.org/jira/browse/SOLR-2595

On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson <je...@gmail.com> wrote:

> Yes, the ZK method seems much more flexible.  Adding a new shard would
> be simply updating the range assignments in ZK.  Where is this
> currently on the list of things to accomplish?  I don't have time to
> work on this now, but if you (or anyone) could provide direction I'd
> be willing to work on this when I had spare time.  I guess a JIRA
> detailing where/how to do this could help.  Not sure if the design has
> been thought out that far though.
>
> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <ma...@gmail.com> wrote:
> > Right now lets say you have one shard - everything there hashes to range
> X.
> >
> > Now you want to split that shard with an Index Splitter.
> >
> > You divide range X in two - giving you two ranges - then you start
> splitting. This is where the current Splitter needs a little modification.
> You decide which doc should go into which new index by rehashing each doc
> id in the index you are splitting - if its hash is greater than X/2, it
> goes into index1 - if its less, index2. I think there are a couple current
> Splitter impls, but one of them does something like, give me an id - now if
> the id's in the index are above that id, goto index1, if below, index2. We
> need to instead do a quick hash rather than simple id compare.
> >
> > Why do you need to do this on every shard?
> >
> > The other part we need that we dont have is to store hash range
> assignments in zookeeper - we don't do that yet because it's not needed
> yet. Instead we currently just simply calculate that on the fly (too often
> at the moment - on every request :) I intend to fix that of course).
> >
> > At the start, zk would say, for range X, goto this shard. After the
> split, it would say, for range less than X/2 goto the old node, for range
> greater than X/2 goto the new node.
> >
> > - Mark
> >
> > On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
> >
> >> hmmm.....This doesn't sound like the hashing algorithm that's on the
> >> branch, right?  The algorithm you're mentioning sounds like there is
> >> some logic which is able to tell that a particular range should be
> >> distributed between 2 shards instead of 1.  So seems like a trade off
> >> between repartitioning the entire index (on every shard) and having a
> >> custom hashing algorithm which is able to handle the situation where 2
> >> or more shards map to a particular range.
> >>
> >> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <ma...@gmail.com>
> wrote:
> >>>
> >>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
> >>>
> >>>> I am not familiar with the index splitter that is in contrib, but I'll
> >>>> take a look at it soon.  So the process sounds like it would be to run
> >>>> this on all of the current shards indexes based on the hash algorithm.
> >>>
> >>> Not something I've thought deeply about myself yet, but I think the
> idea would be to split as many as you felt you needed to.
> >>>
> >>> If you wanted to keep the full balance always, this would mean
> splitting every shard at once, yes. But this depends on how many boxes
> (partitions) you are willing/able to add at a time.
> >>>
> >>> You might just split one index to start - now it's hash range would be
> handled by two shards instead of one (if you have 3 replicas per shard,
> this would mean adding 3 more boxes). When you needed to expand again, you
> would split another index that was still handling its full starting range.
> As you grow, once you split every original index, you'd start again,
> splitting one of the now half ranges.
> >>>
> >>>> Is there also an index merger in contrib which could be used to merge
> >>>> indexes?  I'm assuming this would be the process?
> >>>
> >>> You can merge with IndexWriter.addIndexes (Solr also has an admin
> command that can do this). But I'm not sure where this fits in?
> >>>
> >>> - Mark
> >>>
> >>>>
> >>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <ma...@gmail.com>
> wrote:
> >>>>> Not yet - we don't plan on working on this until a lot of other
> stuff is
> >>>>> working solid at this point. But someone else could jump in!
> >>>>>
> >>>>> There are a couple ways to go about it that I know of:
> >>>>>
> >>>>> A more long term solution may be to start using micro shards - each
> index
> >>>>> starts as multiple indexes. This makes it pretty fast to move mirco
> shards
> >>>>> around as you decide to change partitions. It's also less flexible
> as you
> >>>>> are limited by the number of micro shards you start with.
> >>>>>
> >>>>> A more simple and likely first step is to use an index splitter . We
> >>>>> already have one in lucene contrib - we would just need to modify it
> so
> >>>>> that it splits based on the hash of the document id. This is super
> >>>>> flexible, but splitting will obviously take a little while on a huge
> index.
> >>>>> The current index splitter is a multi pass splitter - good enough to
> start
> >>>>> with, but most files under codec control these days, we may be able
> to make
> >>>>> a single pass splitter soon as well.
> >>>>>
> >>>>> Eventually you could imagine using both options - micro shards that
> could
> >>>>> also be split as needed. Though I still wonder if micro shards will
> be
> >>>>> worth the extra complications myself...
> >>>>>
> >>>>> Right now though, the idea is that you should pick a good number of
> >>>>> partitions to start given your expected data ;) Adding more replicas
> is
> >>>>> trivial though.
> >>>>>
> >>>>> - Mark
> >>>>>
> >>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <je...@gmail.com>
> wrote:
> >>>>>
> >>>>>> Another question, is there any support for repartitioning of the
> index
> >>>>>> if a new shard is added?  What is the recommended approach for
> >>>>>> handling this?  It seemed that the hashing algorithm (and probably
> >>>>>> any) would require the index to be repartitioned should a new shard
> be
> >>>>>> added.
> >>>>>>
> >>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <je...@gmail.com>
> wrote:
> >>>>>>> Thanks I will try this first thing in the morning.
> >>>>>>>
> >>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <markrmiller@gmail.com
> >
> >>>>>> wrote:
> >>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <jej2003@gmail.com
> >
> >>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> I am currently looking at the latest solrcloud branch and was
> >>>>>>>>> wondering if there was any documentation on configuring the
> >>>>>>>>> DistributedUpdateProcessor?  What specifically in solrconfig.xml
> needs
> >>>>>>>>> to be added/modified to make distributed indexing work?
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Jaime - take a look at solrconfig-distrib-update.xml in
> >>>>>>>> solr/core/src/test-files
> >>>>>>>>
> >>>>>>>> You need to enable the update log, add an empty replication
> handler def,
> >>>>>>>> and an update chain with solr.DistributedUpdateProcessFactory in
> it.
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> - Mark
> >>>>>>>>
> >>>>>>>> http://www.lucidimagination.com
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> - Mark
> >>>>>
> >>>>> http://www.lucidimagination.com
> >>>>>
> >>>
> >>> - Mark Miller
> >>> lucidimagination.com
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >
> > - Mark Miller
> > lucidimagination.com
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
>



-- 
- Mark

http://www.lucidimagination.com

Re: Configuring the Distributed

Posted by Jamie Johnson <je...@gmail.com>.
Yes, the ZK method seems much more flexible.  Adding a new shard would
be simply updating the range assignments in ZK.  Where is this
currently on the list of things to accomplish?  I don't have time to
work on this now, but if you (or anyone) could provide direction I'd
be willing to work on this when I had spare time.  I guess a JIRA
detailing where/how to do this could help.  Not sure if the design has
been thought out that far though.

On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <ma...@gmail.com> wrote:
> Right now lets say you have one shard - everything there hashes to range X.
>
> Now you want to split that shard with an Index Splitter.
>
> You divide range X in two - giving you two ranges - then you start splitting. This is where the current Splitter needs a little modification. You decide which doc should go into which new index by rehashing each doc id in the index you are splitting - if its hash is greater than X/2, it goes into index1 - if its less, index2. I think there are a couple current Splitter impls, but one of them does something like, give me an id - now if the id's in the index are above that id, goto index1, if below, index2. We need to instead do a quick hash rather than simple id compare.
>
> Why do you need to do this on every shard?
>
> The other part we need that we dont have is to store hash range assignments in zookeeper - we don't do that yet because it's not needed yet. Instead we currently just simply calculate that on the fly (too often at the moment - on every request :) I intend to fix that of course).
>
> At the start, zk would say, for range X, goto this shard. After the split, it would say, for range less than X/2 goto the old node, for range greater than X/2 goto the new node.
>
> - Mark
>
> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
>
>> hmmm.....This doesn't sound like the hashing algorithm that's on the
>> branch, right?  The algorithm you're mentioning sounds like there is
>> some logic which is able to tell that a particular range should be
>> distributed between 2 shards instead of 1.  So seems like a trade off
>> between repartitioning the entire index (on every shard) and having a
>> custom hashing algorithm which is able to handle the situation where 2
>> or more shards map to a particular range.
>>
>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <ma...@gmail.com> wrote:
>>>
>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>>>
>>>> I am not familiar with the index splitter that is in contrib, but I'll
>>>> take a look at it soon.  So the process sounds like it would be to run
>>>> this on all of the current shards indexes based on the hash algorithm.
>>>
>>> Not something I've thought deeply about myself yet, but I think the idea would be to split as many as you felt you needed to.
>>>
>>> If you wanted to keep the full balance always, this would mean splitting every shard at once, yes. But this depends on how many boxes (partitions) you are willing/able to add at a time.
>>>
>>> You might just split one index to start - now it's hash range would be handled by two shards instead of one (if you have 3 replicas per shard, this would mean adding 3 more boxes). When you needed to expand again, you would split another index that was still handling its full starting range. As you grow, once you split every original index, you'd start again, splitting one of the now half ranges.
>>>
>>>> Is there also an index merger in contrib which could be used to merge
>>>> indexes?  I'm assuming this would be the process?
>>>
>>> You can merge with IndexWriter.addIndexes (Solr also has an admin command that can do this). But I'm not sure where this fits in?
>>>
>>> - Mark
>>>
>>>>
>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <ma...@gmail.com> wrote:
>>>>> Not yet - we don't plan on working on this until a lot of other stuff is
>>>>> working solid at this point. But someone else could jump in!
>>>>>
>>>>> There are a couple ways to go about it that I know of:
>>>>>
>>>>> A more long term solution may be to start using micro shards - each index
>>>>> starts as multiple indexes. This makes it pretty fast to move mirco shards
>>>>> around as you decide to change partitions. It's also less flexible as you
>>>>> are limited by the number of micro shards you start with.
>>>>>
>>>>> A more simple and likely first step is to use an index splitter . We
>>>>> already have one in lucene contrib - we would just need to modify it so
>>>>> that it splits based on the hash of the document id. This is super
>>>>> flexible, but splitting will obviously take a little while on a huge index.
>>>>> The current index splitter is a multi pass splitter - good enough to start
>>>>> with, but most files under codec control these days, we may be able to make
>>>>> a single pass splitter soon as well.
>>>>>
>>>>> Eventually you could imagine using both options - micro shards that could
>>>>> also be split as needed. Though I still wonder if micro shards will be
>>>>> worth the extra complications myself...
>>>>>
>>>>> Right now though, the idea is that you should pick a good number of
>>>>> partitions to start given your expected data ;) Adding more replicas is
>>>>> trivial though.
>>>>>
>>>>> - Mark
>>>>>
>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>>
>>>>>> Another question, is there any support for repartitioning of the index
>>>>>> if a new shard is added?  What is the recommended approach for
>>>>>> handling this?  It seemed that the hashing algorithm (and probably
>>>>>> any) would require the index to be repartitioned should a new shard be
>>>>>> added.
>>>>>>
>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>>>> Thanks I will try this first thing in the morning.
>>>>>>>
>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <ma...@gmail.com>
>>>>>> wrote:
>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <je...@gmail.com>
>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I am currently looking at the latest solrcloud branch and was
>>>>>>>>> wondering if there was any documentation on configuring the
>>>>>>>>> DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
>>>>>>>>> to be added/modified to make distributed indexing work?
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Jaime - take a look at solrconfig-distrib-update.xml in
>>>>>>>> solr/core/src/test-files
>>>>>>>>
>>>>>>>> You need to enable the update log, add an empty replication handler def,
>>>>>>>> and an update chain with solr.DistributedUpdateProcessFactory in it.
>>>>>>>>
>>>>>>>> --
>>>>>>>> - Mark
>>>>>>>>
>>>>>>>> http://www.lucidimagination.com
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> - Mark
>>>>>
>>>>> http://www.lucidimagination.com
>>>>>
>>>
>>> - Mark Miller
>>> lucidimagination.com
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>
> - Mark Miller
> lucidimagination.com
>
>
>
>
>
>
>
>
>
>
>
>

Re: Configuring the Distributed

Posted by Mark Miller <ma...@gmail.com>.
Right now let's say you have one shard - everything there hashes to range X.

Now you want to split that shard with an Index Splitter.

You divide range X in two - giving you two ranges - then you start splitting. This is where the current Splitter needs a little modification. You decide which doc should go into which new index by rehashing each doc id in the index you are splitting - if its hash is greater than X/2, it goes into index1 - if it's less, index2. I think there are a couple of current Splitter impls, but one of them does something like: give me an id - now if the ids in the index are above that id, go to index1, if below, index2. We need to instead do a quick hash rather than a simple id compare.
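
A minimal sketch of that split decision, assuming the shard's docs already hash into the range being split; String.hashCode below is only a stand-in for whatever hash function the branch actually uses:

    // Illustrative sketch only: deciding which half of a split a document goes to.
    public class SplitDecision {

      static int targetIndex(String docId, long rangeStart, long rangeEnd) {
        int hash = docId.hashCode();                 // stand-in hash of the unique key
        long midpoint = (rangeStart + rangeEnd) / 2; // the "X/2" point of this shard's range
        return hash > midpoint ? 1 : 2;              // 1 -> index1, 2 -> index2
      }

      public static void main(String[] args) {
        // Hypothetical single starting shard owning the full int hash space.
        long start = Integer.MIN_VALUE, end = Integer.MAX_VALUE;
        for (String id : new String[] {"doc-1", "doc-2", "doc-3"}) {
          System.out.println(id + " -> index" + targetIndex(id, start, end));
        }
      }
    }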
 
Why do you need to do this on every shard?

The other part we need that we don't have is to store hash range assignments in ZooKeeper - we don't do that yet because it's not needed yet. Instead we currently just calculate that on the fly (too often at the moment - on every request :) I intend to fix that of course).

At the start, ZK would say: for range X, go to this shard. After the split, it would say: for range less than X/2, go to the old node; for range greater than X/2, go to the new node.
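
A toy illustration of that mapping before and after a split - the node names and the idea of holding it as a simple map are assumptions for the sketch, since the actual ZooKeeper layout hadn't been designed at this point:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Illustrative only: what the range -> shard assignments might conceptually hold.
    public class RangeAssignments {
      public static void main(String[] args) {
        // Before the split: one shard owns the whole hash range X.
        Map<String, String> before = new LinkedHashMap<String, String>();
        before.put("[min .. max]", "shard1");

        // After the split: the lower half stays put, the upper half moves to the new node.
        Map<String, String> after = new LinkedHashMap<String, String>();
        after.put("[min .. mid]", "shard1");       // "less than X/2" -> old node
        after.put("(mid .. max]", "shard1_split"); // "greater than X/2" -> new node

        System.out.println("before: " + before);
        System.out.println("after:  " + after);
      }
    }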

- Mark

On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:

> hmmm.....This doesn't sound like the hashing algorithm that's on the
> branch, right?  The algorithm you're mentioning sounds like there is
> some logic which is able to tell that a particular range should be
> distributed between 2 shards instead of 1.  So seems like a trade off
> between repartitioning the entire index (on every shard) and having a
> custom hashing algorithm which is able to handle the situation where 2
> or more shards map to a particular range.
> 
> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <ma...@gmail.com> wrote:
>> 
>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>> 
>>> I am not familiar with the index splitter that is in contrib, but I'll
>>> take a look at it soon.  So the process sounds like it would be to run
>>> this on all of the current shards indexes based on the hash algorithm.
>> 
>> Not something I've thought deeply about myself yet, but I think the idea would be to split as many as you felt you needed to.
>> 
>> If you wanted to keep the full balance always, this would mean splitting every shard at once, yes. But this depends on how many boxes (partitions) you are willing/able to add at a time.
>> 
>> You might just split one index to start - now it's hash range would be handled by two shards instead of one (if you have 3 replicas per shard, this would mean adding 3 more boxes). When you needed to expand again, you would split another index that was still handling its full starting range. As you grow, once you split every original index, you'd start again, splitting one of the now half ranges.
>> 
>>> Is there also an index merger in contrib which could be used to merge
>>> indexes?  I'm assuming this would be the process?
>> 
>> You can merge with IndexWriter.addIndexes (Solr also has an admin command that can do this). But I'm not sure where this fits in?
>> 
>> - Mark
>> 
>>> 
>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <ma...@gmail.com> wrote:
>>>> Not yet - we don't plan on working on this until a lot of other stuff is
>>>> working solid at this point. But someone else could jump in!
>>>> 
>>>> There are a couple ways to go about it that I know of:
>>>> 
>>>> A more long term solution may be to start using micro shards - each index
>>>> starts as multiple indexes. This makes it pretty fast to move mirco shards
>>>> around as you decide to change partitions. It's also less flexible as you
>>>> are limited by the number of micro shards you start with.
>>>> 
>>>> A more simple and likely first step is to use an index splitter . We
>>>> already have one in lucene contrib - we would just need to modify it so
>>>> that it splits based on the hash of the document id. This is super
>>>> flexible, but splitting will obviously take a little while on a huge index.
>>>> The current index splitter is a multi pass splitter - good enough to start
>>>> with, but most files under codec control these days, we may be able to make
>>>> a single pass splitter soon as well.
>>>> 
>>>> Eventually you could imagine using both options - micro shards that could
>>>> also be split as needed. Though I still wonder if micro shards will be
>>>> worth the extra complications myself...
>>>> 
>>>> Right now though, the idea is that you should pick a good number of
>>>> partitions to start given your expected data ;) Adding more replicas is
>>>> trivial though.
>>>> 
>>>> - Mark
>>>> 
>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>> 
>>>>> Another question, is there any support for repartitioning of the index
>>>>> if a new shard is added?  What is the recommended approach for
>>>>> handling this?  It seemed that the hashing algorithm (and probably
>>>>> any) would require the index to be repartitioned should a new shard be
>>>>> added.
>>>>> 
>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>>> Thanks I will try this first thing in the morning.
>>>>>> 
>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <ma...@gmail.com>
>>>>> wrote:
>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <je...@gmail.com>
>>>>> wrote:
>>>>>>> 
>>>>>>>> I am currently looking at the latest solrcloud branch and was
>>>>>>>> wondering if there was any documentation on configuring the
>>>>>>>> DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
>>>>>>>> to be added/modified to make distributed indexing work?
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Hi Jaime - take a look at solrconfig-distrib-update.xml in
>>>>>>> solr/core/src/test-files
>>>>>>> 
>>>>>>> You need to enable the update log, add an empty replication handler def,
>>>>>>> and an update chain with solr.DistributedUpdateProcessFactory in it.
>>>>>>> 
>>>>>>> --
>>>>>>> - Mark
>>>>>>> 
>>>>>>> http://www.lucidimagination.com
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> - Mark
>>>> 
>>>> http://www.lucidimagination.com
>>>> 
>> 
>> - Mark Miller
>> lucidimagination.com
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 

- Mark Miller
lucidimagination.com












Re: Configuring the Distributed

Posted by Jamie Johnson <je...@gmail.com>.
hmmm.....This doesn't sound like the hashing algorithm that's on the
branch, right?  The algorithm you're mentioning sounds like there is
some logic which is able to tell that a particular range should be
distributed between 2 shards instead of 1.  So it seems like a trade-off
between repartitioning the entire index (on every shard) and having a
custom hashing algorithm which is able to handle the situation where 2
or more shards map to a particular range.

On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <ma...@gmail.com> wrote:
>
> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>
>> I am not familiar with the index splitter that is in contrib, but I'll
>> take a look at it soon.  So the process sounds like it would be to run
>> this on all of the current shards indexes based on the hash algorithm.
>
> Not something I've thought deeply about myself yet, but I think the idea would be to split as many as you felt you needed to.
>
> If you wanted to keep the full balance always, this would mean splitting every shard at once, yes. But this depends on how many boxes (partitions) you are willing/able to add at a time.
>
> You might just split one index to start - now it's hash range would be handled by two shards instead of one (if you have 3 replicas per shard, this would mean adding 3 more boxes). When you needed to expand again, you would split another index that was still handling its full starting range. As you grow, once you split every original index, you'd start again, splitting one of the now half ranges.
>
>> Is there also an index merger in contrib which could be used to merge
>> indexes?  I'm assuming this would be the process?
>
> You can merge with IndexWriter.addIndexes (Solr also has an admin command that can do this). But I'm not sure where this fits in?
>
> - Mark
>
>>
>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <ma...@gmail.com> wrote:
>>> Not yet - we don't plan on working on this until a lot of other stuff is
>>> working solid at this point. But someone else could jump in!
>>>
>>> There are a couple ways to go about it that I know of:
>>>
>>> A more long term solution may be to start using micro shards - each index
>>> starts as multiple indexes. This makes it pretty fast to move mirco shards
>>> around as you decide to change partitions. It's also less flexible as you
>>> are limited by the number of micro shards you start with.
>>>
>>> A more simple and likely first step is to use an index splitter . We
>>> already have one in lucene contrib - we would just need to modify it so
>>> that it splits based on the hash of the document id. This is super
>>> flexible, but splitting will obviously take a little while on a huge index.
>>> The current index splitter is a multi pass splitter - good enough to start
>>> with, but most files under codec control these days, we may be able to make
>>> a single pass splitter soon as well.
>>>
>>> Eventually you could imagine using both options - micro shards that could
>>> also be split as needed. Though I still wonder if micro shards will be
>>> worth the extra complications myself...
>>>
>>> Right now though, the idea is that you should pick a good number of
>>> partitions to start given your expected data ;) Adding more replicas is
>>> trivial though.
>>>
>>> - Mark
>>>
>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>
>>>> Another question, is there any support for repartitioning of the index
>>>> if a new shard is added?  What is the recommended approach for
>>>> handling this?  It seemed that the hashing algorithm (and probably
>>>> any) would require the index to be repartitioned should a new shard be
>>>> added.
>>>>
>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>> Thanks I will try this first thing in the morning.
>>>>>
>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <ma...@gmail.com>
>>>> wrote:
>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <je...@gmail.com>
>>>> wrote:
>>>>>>
>>>>>>> I am currently looking at the latest solrcloud branch and was
>>>>>>> wondering if there was any documentation on configuring the
>>>>>>> DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
>>>>>>> to be added/modified to make distributed indexing work?
>>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi Jaime - take a look at solrconfig-distrib-update.xml in
>>>>>> solr/core/src/test-files
>>>>>>
>>>>>> You need to enable the update log, add an empty replication handler def,
>>>>>> and an update chain with solr.DistributedUpdateProcessFactory in it.
>>>>>>
>>>>>> --
>>>>>> - Mark
>>>>>>
>>>>>> http://www.lucidimagination.com
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> - Mark
>>>
>>> http://www.lucidimagination.com
>>>
>
> - Mark Miller
> lucidimagination.com
>
>
>
>
>
>
>
>
>
>
>
>

Re: Configuring the Distributed

Posted by Mark Miller <ma...@gmail.com>.
On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:

> I am not familiar with the index splitter that is in contrib, but I'll
> take a look at it soon.  So the process sounds like it would be to run
> this on all of the current shards indexes based on the hash algorithm.

Not something I've thought deeply about myself yet, but I think the idea would be to split as many as you felt you needed to.

If you wanted to keep everything fully balanced at all times, this would mean splitting every shard at once, yes. But this depends on how many boxes (partitions) you are willing/able to add at a time.

You might just split one index to start - now its hash range would be handled by two shards instead of one (if you have 3 replicas per shard, this would mean adding 3 more boxes). When you needed to expand again, you would split another index that was still handling its full starting range. As you grow, once you split every original index, you'd start again, splitting one of the now-halved ranges.
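
Roughly, splitting a shard's range can be pictured like this (an illustrative sketch only - the class below is hypothetical, not code from the branch):

    // Hypothetical sketch: a shard owns a half-open hash range, and splitting it
    // divides that range at its midpoint. Documents hashing into [lower, mid)
    // stay on the old shard; documents hashing into [mid, upper) move to the new one.
    public class HashRange {
      final int lower;  // inclusive
      final int upper;  // exclusive

      HashRange(int lower, int upper) {
        this.lower = lower;
        this.upper = upper;
      }

      HashRange[] split() {
        // use long arithmetic so the midpoint doesn't overflow for wide ranges
        int mid = (int) (((long) lower + (long) upper) / 2);
        return new HashRange[] { new HashRange(lower, mid), new HashRange(mid, upper) };
      }

      boolean contains(int hash) {
        return hash >= lower && hash < upper;
      }
    }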

> Is there also an index merger in contrib which could be used to merge
> indexes?  I'm assuming this would be the process?

You can merge with IndexWriter.addIndexes (Solr also has an admin command that can do this). But I'm not sure where this fits in?
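
For reference, a minimal sketch of such a merge with IndexWriter.addIndexes (the class name, paths, and Lucene version below are placeholders, not anything from the branch):

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    // Fold one or more source indexes into a destination index.
    // Usage: java MergeIndexes /path/to/dest /path/to/src1 [/path/to/src2 ...]
    public class MergeIndexes {
      public static void main(String[] args) throws Exception {
        Directory dest = FSDirectory.open(new File(args[0]));
        IndexWriterConfig cfg = new IndexWriterConfig(
            Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35));
        IndexWriter writer = new IndexWriter(dest, cfg);

        Directory[] sources = new Directory[args.length - 1];
        for (int i = 1; i < args.length; i++) {
          sources[i - 1] = FSDirectory.open(new File(args[i]));
        }

        writer.addIndexes(sources);  // copies the source segments into dest
        writer.close();
      }
    }

The admin command mentioned above is, if memory serves, the CoreAdmin mergeindexes action (action=mergeindexes with one or more indexDir parameters); the exact parameters are worth checking on the Solr wiki.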

- Mark

> 
> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <ma...@gmail.com> wrote:
>> Not yet - we don't plan on working on this until a lot of other stuff is
>> working solid at this point. But someone else could jump in!
>> 
>> There are a couple ways to go about it that I know of:
>> 
>> A longer-term solution may be to start using micro shards - each index
>> starts as multiple indexes. This makes it pretty fast to move micro shards
>> around as you decide to change partitions. It's also less flexible as you
>> are limited by the number of micro shards you start with.
>> 
>> A simpler and likely first step is to use an index splitter. We
>> already have one in lucene contrib - we would just need to modify it so
>> that it splits based on the hash of the document id. This is super
>> flexible, but splitting will obviously take a little while on a huge index.
>> The current index splitter is a multi-pass splitter - good enough to start
>> with, but with most files under codec control these days, we may be able to
>> make a single-pass splitter soon as well.
>> 
>> Eventually you could imagine using both options - micro shards that could
>> also be split as needed. Though I still wonder if micro shards will be
>> worth the extra complications myself...
>> 
>> Right now though, the idea is that you should pick a good number of
>> partitions to start given your expected data ;) Adding more replicas is
>> trivial though.
>> 
>> - Mark
>> 
>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <je...@gmail.com> wrote:
>> 
>>> Another question, is there any support for repartitioning of the index
>>> if a new shard is added?  What is the recommended approach for
>>> handling this?  It seemed that the hashing algorithm (and probably
>>> any) would require the index to be repartitioned should a new shard be
>>> added.
>>> 
>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>> Thanks I will try this first thing in the morning.
>>>> 
>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <ma...@gmail.com>
>>> wrote:
>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <je...@gmail.com>
>>> wrote:
>>>>> 
>>>>>> I am currently looking at the latest solrcloud branch and was
>>>>>> wondering if there was any documentation on configuring the
>>>>>> DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
>>>>>> to be added/modified to make distributed indexing work?
>>>>>> 
>>>>> 
>>>>> 
>>>>> Hi Jaime - take a look at solrconfig-distrib-update.xml in
>>>>> solr/core/src/test-files
>>>>> 
>>>>> You need to enable the update log, add an empty replication handler def,
>>>>> and an update chain with solr.DistributedUpdateProcessFactory in it.
>>>>> 
>>>>> --
>>>>> - Mark
>>>>> 
>>>>> http://www.lucidimagination.com
>>>>> 
>>>> 
>>> 
>> 
>> 
>> 
>> --
>> - Mark
>> 
>> http://www.lucidimagination.com
>> 

- Mark Miller
lucidimagination.com


Re: Configuring the Distributed

Posted by Jamie Johnson <je...@gmail.com>.
I am not familiar with the index splitter that is in contrib, but I'll
take a look at it soon.  So the process sounds like it would be to run
this on all of the current shards' indexes based on the hash algorithm.
 Is there also an index merger in contrib which could be used to merge
indexes?  I'm assuming this would be the process?

On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <ma...@gmail.com> wrote:
> Not yet - we don't plan on working on this until a lot of other stuff is
> working solid at this point. But someone else could jump in!
>
> There are a couple ways to go about it that I know of:
>
> A longer-term solution may be to start using micro shards - each index
> starts as multiple indexes. This makes it pretty fast to move micro shards
> around as you decide to change partitions. It's also less flexible as you
> are limited by the number of micro shards you start with.
>
> A simpler and likely first step is to use an index splitter. We
> already have one in lucene contrib - we would just need to modify it so
> that it splits based on the hash of the document id. This is super
> flexible, but splitting will obviously take a little while on a huge index.
> The current index splitter is a multi-pass splitter - good enough to start
> with, but with most files under codec control these days, we may be able to
> make a single-pass splitter soon as well.
>
> Eventually you could imagine using both options - micro shards that could
> also be split as needed. Though I still wonder if micro shards will be
> worth the extra complications myself...
>
> Right now though, the idea is that you should pick a good number of
> partitions to start given your expected data ;) Adding more replicas is
> trivial though.
>
> - Mark
>
> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <je...@gmail.com> wrote:
>
>> Another question, is there any support for repartitioning of the index
>> if a new shard is added?  What is the recommended approach for
>> handling this?  It seemed that the hashing algorithm (and probably
>> any) would require the index to be repartitioned should a new shard be
>> added.
>>
>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <je...@gmail.com> wrote:
>> > Thanks I will try this first thing in the morning.
>> >
>> > On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <ma...@gmail.com>
>> wrote:
>> >> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <je...@gmail.com>
>> wrote:
>> >>
>> >>> I am currently looking at the latest solrcloud branch and was
>> >>> wondering if there was any documentation on configuring the
>> >>> DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
>> >>> to be added/modified to make distributed indexing work?
>> >>>
>> >>
>> >>
>> >> Hi Jaime - take a look at solrconfig-distrib-update.xml in
>> >> solr/core/src/test-files
>> >>
>> >> You need to enable the update log, add an empty replication handler def,
>> >> and an update chain with solr.DistributedUpdateProcessFactory in it.
>> >>
>> >> --
>> >> - Mark
>> >>
>> >> http://www.lucidimagination.com
>> >>
>> >
>>
>
>
>
> --
> - Mark
>
> http://www.lucidimagination.com
>

Re: Configuring the Distributed

Posted by Mark Miller <ma...@gmail.com>.
Not yet - we don't plan on working on this until a lot of other stuff is
working solid at this point. But someone else could jump in!

There are a couple ways to go about it that I know of:

A longer-term solution may be to start using micro shards - each index
starts as multiple indexes. This makes it pretty fast to move micro shards
around as you decide to change partitions. It's also less flexible as you
are limited by the number of micro shards you start with.

A simpler and likely first step is to use an index splitter. We
already have one in lucene contrib - we would just need to modify it so
that it splits based on the hash of the document id. This is super
flexible, but splitting will obviously take a little while on a huge index.
The current index splitter is a multi-pass splitter - good enough to start
with, but with most files under codec control these days, we may be able to
make a single-pass splitter soon as well.
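
As a rough illustration of that splitting criterion (the helper below is hypothetical, not the branch's actual hashing code):

    // Hypothetical sketch of hash-based partitioning: each document lands in the
    // split whose bucket its unique id hashes to, so a splitter only needs to keep
    // the documents whose bucket matches the target partition.
    public class HashPartitioner {

      // Which of numParts partitions does this document id belong to?
      static int partitionFor(String docId, int numParts) {
        int hash = docId.hashCode();                  // any stable hash of the unique key
        return (hash & Integer.MAX_VALUE) % numParts; // clear the sign bit, then bucket
      }

      public static void main(String[] args) {
        // e.g. decide which of 2 splits keeps document "doc-42"
        System.out.println(partitionFor("doc-42", 2));
      }
    }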

Eventually you could imagine using both options - micro shards that could
also be split as needed. Though I still wonder if micro shards will be
worth the extra complications myself...

Right now though, the idea is that you should pick a good number of
partitions to start given your expected data ;) Adding more replicas is
trivial though.

- Mark

On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <je...@gmail.com> wrote:

> Another question, is there any support for repartitioning of the index
> if a new shard is added?  What is the recommended approach for
> handling this?  It seemed that the hashing algorithm (and probably
> any) would require the index to be repartitioned should a new shard be
> added.
>
> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <je...@gmail.com> wrote:
> > Thanks I will try this first thing in the morning.
> >
> > On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <ma...@gmail.com>
> wrote:
> >> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <je...@gmail.com>
> wrote:
> >>
> >>> I am currently looking at the latest solrcloud branch and was
> >>> wondering if there was any documentation on configuring the
> >>> DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
> >>> to be added/modified to make distributed indexing work?
> >>>
> >>
> >>
> >> Hi Jaime - take a look at solrconfig-distrib-update.xml in
> >> solr/core/src/test-files
> >>
> >> You need to enable the update log, add an empty replication handler def,
> >> and an update chain with solr.DistributedUpdateProcessFactory in it.
> >>
> >> --
> >> - Mark
> >>
> >> http://www.lucidimagination.com
> >>
> >
>



-- 
- Mark

http://www.lucidimagination.com

Re: Configuring the Distributed

Posted by Jamie Johnson <je...@gmail.com>.
Another question, is there any support for repartitioning of the index
if a new shard is added?  What is the recommended approach for
handling this?  It seemed that the hashing algorithm (and probably
any) would require the index to be repartitioned should a new shard be
added.

On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <je...@gmail.com> wrote:
> Thanks I will try this first thing in the morning.
>
> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <ma...@gmail.com> wrote:
>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <je...@gmail.com> wrote:
>>
>>> I am currently looking at the latest solrcloud branch and was
>>> wondering if there was any documentation on configuring the
>>> DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
>>> to be added/modified to make distributed indexing work?
>>>
>>
>>
>> Hi Jaime - take a look at solrconfig-distrib-update.xml in
>> solr/core/src/test-files
>>
>> You need to enable the update log, add an empty replication handler def,
>> and an update chain with solr.DistributedUpdateProcessFactory in it.
>>
>> --
>> - Mark
>>
>> http://www.lucidimagination.com
>>
>

Re: Configuring the Distributed

Posted by Jamie Johnson <je...@gmail.com>.
Thanks I will try this first thing in the morning.

On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <ma...@gmail.com> wrote:
> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <je...@gmail.com> wrote:
>
>> I am currently looking at the latest solrcloud branch and was
>> wondering if there was any documentation on configuring the
>> DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
>> to be added/modified to make distributed indexing work?
>>
>
>
> Hi Jaime - take a look at solrconfig-distrib-update.xml in
> solr/core/src/test-files
>
> You need to enable the update log, add an empty replication handler def,
> and an update chain with solr.DistributedUpdateProcessFactory in it.
>
> --
> - Mark
>
> http://www.lucidimagination.com
>

Re: Configuring the Distributed

Posted by Mark Miller <ma...@gmail.com>.
On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <je...@gmail.com> wrote:

> I am currently looking at the latest solrcloud branch and was
> wondering if there was any documentation on configuring the
> DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
> to be added/modified to make distributed indexing work?
>


Hi Jaime - take a look at solrconfig-distrib-update.xml in
solr/core/src/test-files

You need to enable the update log, add an empty replication handler def,
and an update chain with solr.DistributedUpdateProcessFactory in it.
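
As a rough sketch of what those three pieces might look like in solrconfig.xml (element names and attributes here are from memory - check solrconfig-distrib-update.xml on the branch for the exact form):

    <!-- 1. enable the update log -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <updateLog>
        <str name="dir">${solr.data.dir:}</str>
      </updateLog>
    </updateHandler>

    <!-- 2. an empty replication handler definition -->
    <requestHandler name="/replication" class="solr.ReplicationHandler" />

    <!-- 3. an update chain that includes the distributed update processor -->
    <updateRequestProcessorChain name="distrib-update-chain">
      <processor class="solr.DistributedUpdateProcessFactory" />
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>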

-- 
- Mark

http://www.lucidimagination.com