You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Thierry Thelliez <th...@gmail.com> on 2013/08/13 00:50:18 UTC
What gets written to the other shards?
Hello, I am trying to set a four shard system for the first time. I do
not understand why all the shards data are growing at about the same rate
when I push the documents to only one shard.
The four shards represent four calendar years. And for now, on a
development machine, these four shards run on four different ports.
The first shard is started with Zookeeper.
The log of the other shards is filed with something like:
7882051 [qtp1154079020-1245] INFO
org.apache.solr.update.processor.LogUpdateProcessor – [collection1]
webapp=/solr path=/update params={distrib.from=
http://x.y.z.4:50121/solr/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
{add=[14939-96467-304 (1443204912169091072), 14939-96467-308
(1443204912179576832), 14939-96467-310 (1443204912185868288),
14939-96467-311 (1443204912192159744), 14939-96467-313
(1443204912204742656), 14939-96467-314 (1443204912220471296),
14939-96467-318 (1443204912239345664), 14939-96467-319
(1443204912250880000), 14939-96467-322 (1443204912257171456),
14939-96467-324 (1443204912263462912)]} 0 282
What is getting written to the other shards? Is a separate index computed
on all four shards? I thought that when pushing a document to one shard,
only that shard would update its index.
Thanks,
Thierry
Re: What gets written to the other shards?
Posted by Thierry Thelliez <th...@gmail.com>.
Greg, what do you mean by 'manually setting the shard on each document'?
I explicitly push the documents to their respective shard/port numbers.
Something like
curl http://localhost:shardport/solr/update --data-binary file.csv -H
'Content-type:text/csv; charset=ISO-8859-1'
I guess that this is making the routing 'implicit'?
For example I am pushing 2M rows from CSV files to one shard (About 2GB).
The upload is not finished but I am seeing all the shards data directory
at about 14GB... I was expecting only the current shard to grow.
Thierry
On Mon, Aug 12, 2013 at 4:59 PM, Greg Preston <gp...@marinsoftware.com>wrote:
> Are you manually setting the shard on each document? If not, documents
> will be hashed across all the shards.
>
> -Greg
>
>
> On Mon, Aug 12, 2013 at 3:50 PM, Thierry Thelliez <
> thierry.thelliez.tech@gmail.com> wrote:
>
> > Hello, I am trying to set a four shard system for the first time. I do
> > not understand why all the shards data are growing at about the same rate
> > when I push the documents to only one shard.
> >
> > The four shards represent four calendar years. And for now, on a
> > development machine, these four shards run on four different ports.
> >
> > The first shard is started with Zookeeper.
> >
> > The log of the other shards is filed with something like:
> >
> > 7882051 [qtp1154079020-1245] INFO
> > org.apache.solr.update.processor.LogUpdateProcessor – [collection1]
> > webapp=/solr path=/update params={distrib.from=
> >
> >
> http://x.y.z.4:50121/solr/collection1/&update.distrib=TOLEADER&wt=javabin&version=2
> > }
> > {add=[14939-96467-304 (1443204912169091072), 14939-96467-308
> > (1443204912179576832), 14939-96467-310 (1443204912185868288),
> > 14939-96467-311 (1443204912192159744), 14939-96467-313
> > (1443204912204742656), 14939-96467-314 (1443204912220471296),
> > 14939-96467-318 (1443204912239345664), 14939-96467-319
> > (1443204912250880000), 14939-96467-322 (1443204912257171456),
> > 14939-96467-324 (1443204912263462912)]} 0 282
> >
> > What is getting written to the other shards? Is a separate index computed
> > on all four shards? I thought that when pushing a document to one shard,
> > only that shard would update its index.
> >
> >
> > Thanks,
> > Thierry
> >
>
Re: What gets written to the other shards?
Posted by Greg Preston <gp...@marinsoftware.com>.
Are you manually setting the shard on each document? If not, documents
will be hashed across all the shards.
-Greg
On Mon, Aug 12, 2013 at 3:50 PM, Thierry Thelliez <
thierry.thelliez.tech@gmail.com> wrote:
> Hello, I am trying to set a four shard system for the first time. I do
> not understand why all the shards data are growing at about the same rate
> when I push the documents to only one shard.
>
> The four shards represent four calendar years. And for now, on a
> development machine, these four shards run on four different ports.
>
> The first shard is started with Zookeeper.
>
> The log of the other shards is filed with something like:
>
> 7882051 [qtp1154079020-1245] INFO
> org.apache.solr.update.processor.LogUpdateProcessor – [collection1]
> webapp=/solr path=/update params={distrib.from=
>
> http://x.y.z.4:50121/solr/collection1/&update.distrib=TOLEADER&wt=javabin&version=2
> }
> {add=[14939-96467-304 (1443204912169091072), 14939-96467-308
> (1443204912179576832), 14939-96467-310 (1443204912185868288),
> 14939-96467-311 (1443204912192159744), 14939-96467-313
> (1443204912204742656), 14939-96467-314 (1443204912220471296),
> 14939-96467-318 (1443204912239345664), 14939-96467-319
> (1443204912250880000), 14939-96467-322 (1443204912257171456),
> 14939-96467-324 (1443204912263462912)]} 0 282
>
> What is getting written to the other shards? Is a separate index computed
> on all four shards? I thought that when pushing a document to one shard,
> only that shard would update its index.
>
>
> Thanks,
> Thierry
>
Re: What gets written to the other shards?
Posted by Chris Hostetter <ho...@fucit.org>.
: "Collections that do not specify numShards at collection creation time use
: custom sharding and default to the "implicit" router. Document updates
: received by a shard will be indexed to that shard, unless a "*shard*"
: parameter or document field names a different shard."
the word "implicit" there may be somewhat confusin -- but yes otherwise
that statement is correct (as i understand it). If you don't tell Solr in
advance how many shards you want the collection to have, then it can't
compute the shard assignments for you, and you are responsible for sending
each doc to the appropriate shard.
The other approach i was refering to is the way to explicitly override the
document routing, even if you have a fixed number of shards -- there's a
Java API you can use to do this (i don't recall what it is off the top of
my head) but there is also a a mechanism using the default router code
where having a "!" in your uniqueKey field will cause only the portion of
the ID preceding the "!" to be used in computing the shard -- see the link
i mentioned earlier.
: I do specify the numShards parameter and that might be the confusion. The
: admin UI is telling me that the routing is 'implicit' when it might not be?
: Shouldn't it be compositeID routing if I use numShards?
Again -- the terminology might be confusion .. i'm certainly confused, i'm
not sure off the top of my head which one is refered to as "implicit".
: > https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
: > https://cwiki.apache.org/confluence/display/solr/Distributed+Requests
-Hoss
Re: What gets written to the other shards?
Posted by Thierry Thelliez <th...@gmail.com>.
Hoss,
>From the Solr 4.1 release highlights, under the SolrCloud enhancements
section:
"Collections that do not specify numShards at collection creation time use
custom sharding and default to the "implicit" router. Document updates
received by a shard will be indexed to that shard, unless a "*shard*"
parameter or document field names a different shard."
So it does appear that it is possible to use implicit document routing
under SolrCloud.
I do specify the numShards parameter and that might be the confusion. The
admin UI is telling me that the routing is 'implicit' when it might not be?
Shouldn't it be compositeID routing if I use numShards?
It might be that the change in 4.1 was not clear?
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201303.mbox/%3C0AA0B422-F1DE-4915-B602-53CB1849204A@gmail.com%3E
I think that I need to experiment without the numShards parameter.
Ultimately what I want is that all the shards (=years) be queried if no
shard is explicitly specified. But eventually the users will be able to
pick a given date range. Then we want to query only the matching shards.
Better ways to do it?
-Thierry
On Mon, Aug 12, 2013 at 7:39 PM, Chris Hostetter
<ho...@fucit.org>wrote:
>
> : If that is the case, I think that my settings are correct. I still do
> not
> : explain why I have such growth on all the shards at the same time.
>
> you are missunderstanding how SolrCLoud works.
>
> Unless you go out of your way to override hte document routing, Solr will
> compute a logical shard to assign each doc to using a hash on the id -- it
> doesn't matter which physical node you send the doc to, solr will
> internally forward it to the correct physical nodes of the logical shard
> it belongs to.
>
> If it is important to you that a single shard represents a calander year,
> then you need to override the shard assignemnt algorithm -- either that,
> or use a distinct *collection* per claander year, and then do
> multi-collection queries when you want to execute queries across multiple
> years ... it all depends on what your "common case" queries are going to
> look like...
>
>
> https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
> https://cwiki.apache.org/confluence/display/solr/Distributed+Requests
>
> : One thing I noticed is that three of them are leaders in the SolrCloud
> : admin UI graph. Is that normal?
>
> if you have 4 shards, then there should be 4 leaders -- leaders are about
> coordinating the duplicate physical copies of each doc in each replica of
> hte logical shard -- if you only have 1 phyiscal replica of each logical
> shard, then every replica is it's own leader.
>
>
> -Hoss
>
Re: What gets written to the other shards?
Posted by Chris Hostetter <ho...@fucit.org>.
: If that is the case, I think that my settings are correct. I still do not
: explain why I have such growth on all the shards at the same time.
you are missunderstanding how SolrCLoud works.
Unless you go out of your way to override hte document routing, Solr will
compute a logical shard to assign each doc to using a hash on the id -- it
doesn't matter which physical node you send the doc to, solr will
internally forward it to the correct physical nodes of the logical shard
it belongs to.
If it is important to you that a single shard represents a calander year,
then you need to override the shard assignemnt algorithm -- either that,
or use a distinct *collection* per claander year, and then do
multi-collection queries when you want to execute queries across multiple
years ... it all depends on what your "common case" queries are going to
look like...
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
https://cwiki.apache.org/confluence/display/solr/Distributed+Requests
: One thing I noticed is that three of them are leaders in the SolrCloud
: admin UI graph. Is that normal?
if you have 4 shards, then there should be 4 leaders -- leaders are about
coordinating the duplicate physical copies of each doc in each replica of
hte logical shard -- if you only have 1 phyiscal replica of each logical
shard, then every replica is it's own leader.
-Hoss
Re: What gets written to the other shards?
Posted by Thierry Thelliez <th...@gmail.com>.
To answer my own post about the subtle difference between the shard and
replicate examples, it looks like the difference is in the numShards
parameter.
If you define numShards to be = 2, and then creating more shards than 2
will give you replicates. Is that correct?
If that is the case, I think that my settings are correct. I still do not
explain why I have such growth on all the shards at the same time.
One thing I noticed is that three of them are leaders in the SolrCloud
admin UI graph. Is that normal?
Thierry
On Mon, Aug 12, 2013 at 5:39 PM, Thierry Thelliez <
thierry.thelliez.tech@gmail.com> wrote:
>
> Thanks Shawn for the detailed instructions.
>
> About the router: it is implicit.
>
> About the replicas: I followed the example at
> http://wiki.apache.org/solr/SolrCloud
>
> I start the shards with the following (paths and ports simplified):
>
> cd /.../solr/shard1/
> /usr/bin/java -Djetty.port=1 -Dbootstrap_confdir=./solr/collection1/conf
> -Dcollection.configName=myconf -DzkRun=localhost:0 -DnumShards=4 -jar
> start.jar > /.../log/shard_1.log
>
> cd /.../solr/shard2/
> /usr/bin/java -Djetty.port=2 -DzkHost=localhost:0 -jar start.jar >
> /.../log/shard_2.log
>
> and same thing for the two other shards on their own ports.
>
>
> To post a document (CSV file), I use:
>
> curl http://localhost:shardport/solr/update --data-binary file.csv
> -H 'Content-type:text/csv; charset=ISO-8859-1'
>
>
> I just re-read the example page at http://wiki.apache.org/solr/SolrCloud and I see that there is no difference between starting a shard or a
> replicate. I must be missing something:
>
> From exampleA (two shards):
>
> cd example2
>
> java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar
>
> Fomr exampleB (two shards with replicates):
>
> cd exampleB
>
> java -Djetty.port=8900 -DzkHost=localhost:9983 -jar start.jar
>
> Thanks.
> Thierry
>
>
>
>
>
>
>
>
>
>
> On Mon, Aug 12, 2013 at 5:04 PM, Shawn Heisey <so...@elyograg.org> wrote:
>
>> On 8/12/2013 4:50 PM, Thierry Thelliez wrote:
>>
>>> Hello, I am trying to set a four shard system for the first time. I do
>>> not understand why all the shards data are growing at about the same rate
>>> when I push the documents to only one shard.
>>>
>>> The four shards represent four calendar years. And for now, on a
>>> development machine, these four shards run on four different ports.
>>>
>>> The first shard is started with Zookeeper.
>>>
>>> The log of the other shards is filed with something like:
>>>
>>> 7882051 [qtp1154079020-1245] INFO
>>> org.apache.solr.update.**processor.LogUpdateProcessor – [collection1]
>>> webapp=/solr path=/update params={distrib.from=
>>> http://x.y.z.4:50121/solr/**collection1/&update.distrib=**
>>> TOLEADER&wt=javabin&version=2<http://x.y.z.4:50121/solr/collection1/&update.distrib=TOLEADER&wt=javabin&version=2>
>>> }
>>> {add=[14939-96467-304 (1443204912169091072), 14939-96467-308
>>> (1443204912179576832), 14939-96467-310 (1443204912185868288),
>>> 14939-96467-311 (1443204912192159744), 14939-96467-313
>>> (1443204912204742656), 14939-96467-314 (1443204912220471296),
>>> 14939-96467-318 (1443204912239345664), 14939-96467-319
>>> (1443204912250880000), 14939-96467-322 (1443204912257171456),
>>> 14939-96467-324 (1443204912263462912)]} 0 282
>>>
>>> What is getting written to the other shards? Is a separate index computed
>>> on all four shards? I thought that when pushing a document to one shard,
>>> only that shard would update its index.
>>>
>>
>> There are two possibilities.
>>
>> 1) You don't have four shards, you have four replicas of one shard. If
>> this is happening, then they all will receive all documents.
>>
>> 2) You are using a router like compositeId instead of implicit. This
>> will calculate the hash of the id field and evenly divide the documents
>> among all the shards in the collection according to the hash value. If you
>> create the collection with the implicit router, then documents should be
>> indexed by the shard that received them.
>>
>> To see what router you have, click on Cloud in the admin UI, then click
>> on Tree. Click the arrow to the left of '/collections' to open it. Click
>> on collection1 (or whichever you are actually using) -- the actual name,
>> not the arrow. Underneath the table that appears to the right will be
>> "router" and its value.
>>
>> Thanks,
>> Shawn
>>
>>
>
Re: What gets written to the other shards?
Posted by Thierry Thelliez <th...@gmail.com>.
Thanks Shawn for the detailed instructions.
About the router: it is implicit.
About the replicas: I followed the example at
http://wiki.apache.org/solr/SolrCloud
I start the shards with the following (paths and ports simplified):
cd /.../solr/shard1/
/usr/bin/java -Djetty.port=1 -Dbootstrap_confdir=./solr/collection1/conf
-Dcollection.configName=myconf -DzkRun=localhost:0 -DnumShards=4 -jar
start.jar > /.../log/shard_1.log
cd /.../solr/shard2/
/usr/bin/java -Djetty.port=2 -DzkHost=localhost:0 -jar start.jar >
/.../log/shard_2.log
and same thing for the two other shards on their own ports.
To post a document (CSV file), I use:
curl http://localhost:shardport/solr/update --data-binary file.csv
-H 'Content-type:text/csv; charset=ISO-8859-1'
I just re-read the example page at http://wiki.apache.org/solr/SolrCloud
and I see that there is no difference between starting a shard or a
replicate. I must be missing something:
>From exampleA (two shards):
cd example2
java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar
Fomr exampleB (two shards with replicates):
cd exampleB
java -Djetty.port=8900 -DzkHost=localhost:9983 -jar start.jar
Thanks.
Thierry
On Mon, Aug 12, 2013 at 5:04 PM, Shawn Heisey <so...@elyograg.org> wrote:
> On 8/12/2013 4:50 PM, Thierry Thelliez wrote:
>
>> Hello, I am trying to set a four shard system for the first time. I do
>> not understand why all the shards data are growing at about the same rate
>> when I push the documents to only one shard.
>>
>> The four shards represent four calendar years. And for now, on a
>> development machine, these four shards run on four different ports.
>>
>> The first shard is started with Zookeeper.
>>
>> The log of the other shards is filed with something like:
>>
>> 7882051 [qtp1154079020-1245] INFO
>> org.apache.solr.update.**processor.LogUpdateProcessor – [collection1]
>> webapp=/solr path=/update params={distrib.from=
>> http://x.y.z.4:50121/solr/**collection1/&update.distrib=**
>> TOLEADER&wt=javabin&version=2<http://x.y.z.4:50121/solr/collection1/&update.distrib=TOLEADER&wt=javabin&version=2>
>> }
>> {add=[14939-96467-304 (1443204912169091072), 14939-96467-308
>> (1443204912179576832), 14939-96467-310 (1443204912185868288),
>> 14939-96467-311 (1443204912192159744), 14939-96467-313
>> (1443204912204742656), 14939-96467-314 (1443204912220471296),
>> 14939-96467-318 (1443204912239345664), 14939-96467-319
>> (1443204912250880000), 14939-96467-322 (1443204912257171456),
>> 14939-96467-324 (1443204912263462912)]} 0 282
>>
>> What is getting written to the other shards? Is a separate index computed
>> on all four shards? I thought that when pushing a document to one shard,
>> only that shard would update its index.
>>
>
> There are two possibilities.
>
> 1) You don't have four shards, you have four replicas of one shard. If
> this is happening, then they all will receive all documents.
>
> 2) You are using a router like compositeId instead of implicit. This will
> calculate the hash of the id field and evenly divide the documents among
> all the shards in the collection according to the hash value. If you
> create the collection with the implicit router, then documents should be
> indexed by the shard that received them.
>
> To see what router you have, click on Cloud in the admin UI, then click on
> Tree. Click the arrow to the left of '/collections' to open it. Click on
> collection1 (or whichever you are actually using) -- the actual name, not
> the arrow. Underneath the table that appears to the right will be "router"
> and its value.
>
> Thanks,
> Shawn
>
>
Re: What gets written to the other shards?
Posted by Shawn Heisey <so...@elyograg.org>.
On 8/12/2013 4:50 PM, Thierry Thelliez wrote:
> Hello, I am trying to set a four shard system for the first time. I do
> not understand why all the shards data are growing at about the same rate
> when I push the documents to only one shard.
>
> The four shards represent four calendar years. And for now, on a
> development machine, these four shards run on four different ports.
>
> The first shard is started with Zookeeper.
>
> The log of the other shards is filed with something like:
>
> 7882051 [qtp1154079020-1245] INFO
> org.apache.solr.update.processor.LogUpdateProcessor – [collection1]
> webapp=/solr path=/update params={distrib.from=
> http://x.y.z.4:50121/solr/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
> {add=[14939-96467-304 (1443204912169091072), 14939-96467-308
> (1443204912179576832), 14939-96467-310 (1443204912185868288),
> 14939-96467-311 (1443204912192159744), 14939-96467-313
> (1443204912204742656), 14939-96467-314 (1443204912220471296),
> 14939-96467-318 (1443204912239345664), 14939-96467-319
> (1443204912250880000), 14939-96467-322 (1443204912257171456),
> 14939-96467-324 (1443204912263462912)]} 0 282
>
> What is getting written to the other shards? Is a separate index computed
> on all four shards? I thought that when pushing a document to one shard,
> only that shard would update its index.
There are two possibilities.
1) You don't have four shards, you have four replicas of one shard. If
this is happening, then they all will receive all documents.
2) You are using a router like compositeId instead of implicit. This
will calculate the hash of the id field and evenly divide the documents
among all the shards in the collection according to the hash value. If
you create the collection with the implicit router, then documents
should be indexed by the shard that received them.
To see what router you have, click on Cloud in the admin UI, then click
on Tree. Click the arrow to the left of '/collections' to open it.
Click on collection1 (or whichever you are actually using) -- the actual
name, not the arrow. Underneath the table that appears to the right
will be "router" and its value.
Thanks,
Shawn