Posted to solr-user@lucene.apache.org by Thierry Thelliez <th...@gmail.com> on 2013/08/13 00:50:18 UTC

What gets written to the other shards?

Hello,  I am trying to set up a four-shard system for the first time.  I do
not understand why the data of all the shards grows at about the same rate
when I push the documents to only one shard.

The four shards represent four calendar years.  And for now, on a
development machine, these four shards run on four different ports.

The first shard is started with Zookeeper.

The log of the other shards is filled with something like:

7882051 [qtp1154079020-1245] INFO
org.apache.solr.update.processor.LogUpdateProcessor – [collection1]
webapp=/solr path=/update params={distrib.from=
http://x.y.z.4:50121/solr/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
{add=[14939-96467-304 (1443204912169091072), 14939-96467-308
(1443204912179576832), 14939-96467-310 (1443204912185868288),
14939-96467-311 (1443204912192159744), 14939-96467-313
(1443204912204742656), 14939-96467-314 (1443204912220471296),
14939-96467-318 (1443204912239345664), 14939-96467-319
(1443204912250880000), 14939-96467-322 (1443204912257171456),
14939-96467-324 (1443204912263462912)]} 0 282

What is getting written to the other shards? Is a separate index computed
on all four shards?  I thought that when pushing a document to one shard,
only that shard would update its index.


Thanks,
Thierry

Re: What gets written to the other shards?

Posted by Thierry Thelliez <th...@gmail.com>.
Greg,  what do you mean by 'manually setting the shard on each document'?
 I explicitly push the documents to their respective shard/port numbers.
 Something like
curl http://localhost:shardport/solr/update --data-binary @file.csv -H
'Content-type:text/csv; charset=ISO-8859-1'

I guess that this is making the routing 'implicit'?

For example, I am pushing 2M rows from CSV files to one shard (about 2 GB).
The upload is not finished, but I already see the data directory of every
shard at about 14 GB.  I was expecting only the current shard to grow.

Thierry


On Mon, Aug 12, 2013 at 4:59 PM, Greg Preston <gp...@marinsoftware.com> wrote:

> Are you manually setting the shard on each document?  If not, documents
> will be hashed across all the shards.
>
> -Greg

Re: What gets written to the other shards?

Posted by Greg Preston <gp...@marinsoftware.com>.
Are you manually setting the shard on each document?  If not, documents
will be hashed across all the shards.

-Greg

Re: What gets written to the other shards?

Posted by Chris Hostetter <ho...@fucit.org>.
: "Collections that do not specify numShards at collection creation time use
: custom sharding and default to the "implicit" router. Document updates
: received by a shard will be indexed to that shard, unless a "*shard*"
: parameter or document field names a different shard."

The word "implicit" there may be somewhat confusing -- but yes, otherwise
that statement is correct (as i understand it).  If you don't tell Solr in
advance how many shards you want the collection to have, then it can't
compute the shard assignments for you, and you are responsible for sending
each doc to the appropriate shard.

The other approach i was referring to is the way to explicitly override the
document routing, even if you have a fixed number of shards -- there's a
Java API you can use to do this (i don't recall what it is off the top of
my head), but there is also a mechanism using the default router code
where having a "!" in your uniqueKey field will cause only the portion of
the ID preceding the "!" to be used in computing the shard -- see the link
i mentioned earlier.
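
The "!" mechanism described above can be sketched roughly as follows. This is
a simplified illustration, not Solr's implementation: routing_key and
pick_shard are hypothetical names, the shard list is made up, and zlib.crc32
is only a stand-in for whatever hash the real router uses.

```python
import zlib

# Illustrative shard names; a real collection defines its own.
SHARDS = ["shard1", "shard2", "shard3", "shard4"]


def routing_key(doc_id):
    """Return the part of the id before the first '!', or the whole id."""
    prefix, sep, _rest = doc_id.partition("!")
    return prefix if sep else doc_id


def pick_shard(doc_id, shards=SHARDS):
    """Choose a shard from a hash of the routing key (stand-in hash)."""
    key = routing_key(doc_id)
    return shards[zlib.crc32(key.encode("utf-8")) % len(shards)]


if __name__ == "__main__":
    # Two ids sharing the "2012" prefix always land on the same shard.
    print(pick_shard("2012!14939-96467-304"))
    print(pick_shard("2012!14939-96467-308"))
```

Under this scheme, ids that share a prefix are co-located, but the prefix does
not let you choose which shard that is, and two different prefixes may still
hash to the same shard.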

: I do specify the numShards parameter and that might be the confusion. The
: admin UI is telling me that the routing is 'implicit' when it might not be?
: Shouldn't it be compositeId routing if I use numShards?

Again -- the terminology might be confusing ... i'm certainly confused, and
i'm not sure off the top of my head which one is referred to as "implicit".


: > https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
: > https://cwiki.apache.org/confluence/display/solr/Distributed+Requests


-Hoss

Re: What gets written to the other shards?

Posted by Thierry Thelliez <th...@gmail.com>.
Hoss,

From the Solr 4.1 release highlights, under the SolrCloud enhancements
section:

"Collections that do not specify numShards at collection creation time use
custom sharding and default to the "implicit" router. Document updates
received by a shard will be indexed to that shard, unless a "*shard*"
parameter or document field names a different shard."


So it does appear that it is possible to use implicit document routing
under SolrCloud.

I do specify the numShards parameter and that might be the confusion. The
admin UI is telling me that the routing is 'implicit' when it might not be?
Shouldn't it be compositeId routing if I use numShards?

It might be that the change in 4.1 was not clear?
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201303.mbox/%3C0AA0B422-F1DE-4915-B602-53CB1849204A@gmail.com%3E

I think that I need to experiment without the numShards parameter.

Ultimately, what I want is for all the shards (= years) to be queried if no
shard is explicitly specified.  Eventually, though, the users will be able to
pick a given date range, and then we want to query only the matching shards.
Are there better ways to do it?


-Thierry

Re: What gets written to the other shards?

Posted by Chris Hostetter <ho...@fucit.org>.
: If that is the case, I think that my settings are correct.   I still cannot
: explain why I have such growth on all the shards at the same time.

You are misunderstanding how SolrCloud works.

Unless you go out of your way to override the document routing, Solr will
compute a logical shard to assign each doc to using a hash on the id -- it
doesn't matter which physical node you send the doc to, Solr will
internally forward it to the correct physical nodes of the logical shard
it belongs to.

If it is important to you that a single shard represents a calendar year,
then you need to override the shard assignment algorithm -- either that,
or use a distinct *collection* per calendar year, and then do
multi-collection queries when you want to execute queries across multiple
years ... it all depends on what your "common case" queries are going to
look like...

https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
https://cwiki.apache.org/confluence/display/solr/Distributed+Requests
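
A minimal sketch of the collection-per-year idea, assuming the "collection"
request parameter described on the Distributed Requests page is what you use
to search several collections at once. The base URL, the docs_YYYY naming
scheme, and both function names are assumptions made for illustration.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumed base URL, adjust to your setup


def collections_for_years(first_year, last_year, prefix="docs_"):
    """Hypothetical naming scheme: one collection per year, e.g. docs_2012."""
    return [f"{prefix}{year}" for year in range(first_year, last_year + 1)]


def select_url(collections, query="*:*", base=SOLR_BASE):
    """Build a select URL that lists every collection to search."""
    params = {"q": query, "collection": ",".join(collections)}
    # safe=... keeps ',', ':' and '*' unescaped so the URL stays readable.
    return f"{base}/{collections[0]}/select?{urlencode(params, safe=',:*')}"


if __name__ == "__main__":
    # A user picked the date range 2011-2012: query only those two years.
    print(select_url(collections_for_years(2011, 2012)))
```

With no date range selected, the same helper can be called with the full span
of years so that every collection is searched.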

: One thing I noticed is that three of them are leaders in the SolrCloud
: admin UI graph.  Is that normal?

If you have 4 shards, then there should be 4 leaders -- leaders are about
coordinating the duplicate physical copies of each doc in each replica of
the logical shard -- if you only have 1 physical replica of each logical
shard, then every replica is its own leader.


-Hoss

Re: What gets written to the other shards?

Posted by Thierry Thelliez <th...@gmail.com>.
To answer my own post about the subtle difference between the shard and
replica examples, it looks like the difference is in the numShards
parameter.

If you set numShards=2 and then start more than two nodes, the extra nodes
become replicas.  Is that correct?

If that is the case, I think that my settings are correct.   I still cannot
explain why I have such growth on all the shards at the same time.

One thing I noticed is that three of them are leaders in the SolrCloud
admin UI graph.  Is that normal?


Thierry

Re: What gets written to the other shards?

Posted by Thierry Thelliez <th...@gmail.com>.
Thanks Shawn for the detailed instructions.

About the router: it is implicit.

About the replicas: I followed the example at
http://wiki.apache.org/solr/SolrCloud

I start the shards with the following (paths and ports simplified):

cd /.../solr/shard1/
/usr/bin/java -Djetty.port=1 -Dbootstrap_confdir=./solr/collection1/conf
-Dcollection.configName=myconf -DzkRun=localhost:0 -DnumShards=4 -jar
start.jar > /.../log/shard_1.log

cd /.../solr/shard2/
/usr/bin/java -Djetty.port=2 -DzkHost=localhost:0 -jar start.jar >
/.../log/shard_2.log

and same thing for the two other shards on their own ports.


To post a document (CSV file), I use:

curl http://localhost:shardport/solr/update --data-binary @file.csv
-H 'Content-type:text/csv; charset=ISO-8859-1'


I just re-read the example page at http://wiki.apache.org/solr/SolrCloud
and I see that there is no difference between starting a shard and starting
a replica.  I must be missing something:

From exampleA (two shards):

cd example2

java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar

From exampleB (two shards with replicas):

cd exampleB

java -Djetty.port=8900 -DzkHost=localhost:9983 -jar start.jar

Thanks.
Thierry

Re: What gets written to the other shards?

Posted by Shawn Heisey <so...@elyograg.org>.
There are two possibilities.

1) You don't have four shards, you have four replicas of one shard.  If 
this is happening, then they all will receive all documents.

2) You are using a router like compositeId instead of implicit.  This 
will calculate the hash of the id field and evenly divide the documents 
among all the shards in the collection according to the hash value.  If 
you create the collection with the implicit router, then documents 
should be indexed by the shard that received them.
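
The difference between the two routers in possibility 2 can be sketched
roughly like this. It is a simplified model, not Solr's code: route_by_hash
and route_implicit are hypothetical names, the shard list is made up, and
zlib.crc32 with a modulo is only a stand-in for the real hash and its
per-shard hash ranges.

```python
import zlib

# Illustrative shard names; a real collection defines its own.
SHARDS = ["shard1", "shard2", "shard3", "shard4"]


def route_by_hash(doc_id, shards=SHARDS):
    """Hash-based routing: the target depends only on the document id."""
    h = zlib.crc32(doc_id.encode("utf-8"))  # stand-in hash, not Solr's
    return shards[h % len(shards)]


def route_implicit(doc_id, receiving_shard):
    """Implicit routing: the document stays on the shard that received it."""
    return receiving_shard


if __name__ == "__main__":
    ids = ["14939-96467-%d" % n for n in range(300, 340)]
    # Every document is "sent" to shard1 in both cases.
    counts = {}
    for doc_id in ids:
        target = route_by_hash(doc_id)
        counts[target] = counts.get(target, 0) + 1
    print("hash routing:", sorted(counts.items()))
    print("implicit routing:", {route_implicit(d, "shard1") for d in ids})
```

With hash routing, the node that receives the update does not decide where the
document ends up, which would explain growth on every shard even when all
uploads go to one port.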

To see what router you have, click on Cloud in the admin UI, then click 
on Tree.  Click the arrow to the left of '/collections' to open it. 
Click on collection1 (or whichever you are actually using) -- the actual 
name, not the arrow.  Underneath the table that appears to the right 
will be "router" and its value.

Thanks,
Shawn