Posted to solr-user@lucene.apache.org by Timothy Potter <th...@gmail.com> on 2012/08/02 17:08:41 UTC

SolrCloud MatchAllDocsQuery returning different number of docs each request

Just starting to get into SolrCloud using 4.0.0-ALPHA and am very
impressed so far ...

I have a 12-shard index with ~104M docs, with each shard having
1 replica (so 24 Solr servers running).

Using the Query form on the Admin panel, I issue the MatchAllDocsQuery
(*:*) and each time I send the request the value for numFound in the
result is different. It's always close, but not exactly the same as I
would expect. Can anyone shed some light on this issue? I also tried a
real query, such as "#olympics lochte", and got the same thing: a
different numFound each time. The first page of actual docs returned is
the same, so maybe I should just ignore the numFound issue?

Note that while experiencing this behavior, I am not adding any docs
to the index, and all docs have been committed with waitFlush=true and
waitSearcher=true on the commit. Also, I am not doing soft commits at
this point. In addition, after having committed all 104M docs, I hit the
optimize button on the panel so I have only 1 segment. In other words,
the index is not being updated and has been optimized at this point.

Re: SolrCloud MatchAllDocsQuery returning different number of docs each request

Posted by Timothy Potter <th...@gmail.com>.

Sorry, I didn't answer your other questions about shards being
in sync. Yes, all are green and happy according to the Cloud admin
panel.

Tim

On Thu, Aug 2, 2012 at 12:16 PM, Timothy Potter <th...@gmail.com> wrote:
> Thanks Mark.
>
> I'm actually using SolrJ 3.4.0, so using CommonsHttpSolrServer:
>
> Collection<SolrInputDocument> batch = ...
> ... build up batch ...
> solrServer.add( batch );
>
> Basically, I have a custom Pig StoreFunc that sends docs to Solr from
> our Hadoop analytics nodes. The reason I'm not using SolrJ 4.0.0-ALPHA
> is that I couldn't get it to run in my Hadoop environment. There's
> some classpath conflict with the Apache HttpClient. SolrJ 4 depends on
> 4.1.3 but when I run it in my env, I get the following:
>
> Caused by: java.lang.NoSuchMethodError:
> org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager: method
> <init>()V not found
>         at org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:94)
>         at org.apache.solr.client.solrj.impl.CloudSolrServer.<init>(CloudSolrServer.java:70)
>         ... 16 more
>
> I spent hours trying to resolve the classpath issue and finally had to
> bail and just used the 3.4 SolrJ client as I'm just at the evaluation
> stage at this point. So it sounds like this could be the cause of my
> problems.
>
> One other thing ... I do have the _version_ field defined in my
> schema.xml but am not setting it on the client side when indexing.
> Should I be doing that?
>
> Cheers,
> Tim

Re: SolrCloud MatchAllDocsQuery returning different number of docs each request

Posted by Mark Miller <ma...@gmail.com>.

FYI: I've committed the rest of the work I was doing on trunk in this area.

On Aug 2, 2012, at 4:42 PM, Timothy Potter <th...@gmail.com> wrote:

> Yes, I can but won't get to it today unfortunately. I had my eval
> environment running on some very expensive EC2 instances and shut it
> down for the time being until I can focus on it again. Will try to get
> back to this either tomorrow or over the weekend. Sorry for the delay.
> 
> Tim

- Mark Miller
lucidimagination.com

Re: SolrCloud MatchAllDocsQuery returning different number of docs each request

Posted by Timothy Potter <th...@gmail.com>.

Yes, I can but won't get to it today unfortunately. I had my eval
environment running on some very expensive EC2 instances and shut it
down for the time being until I can focus on it again. Will try to get
back to this either tomorrow or over the weekend. Sorry for the delay.

Tim

On Thu, Aug 2, 2012 at 1:35 PM, Mark Miller <ma...@gmail.com> wrote:
> Can you do me a favor and try not using the batch add for a run?
>
> Just do the add one doc at a time. (solrServer.add(doc) rather than solrServer.add(collection))
>
> I just fixed one issue with it this morning on trunk - it may be the cause of this oddity.
>
> I'm also working on some performance issues around that method too (good performance without starting thousands of threads).
>
> Until I get all that straightened out (hopefully very soon), I think you will have better luck not using the bulk, collection add method.
>
> - Mark Miller
> lucidimagination.com

Re: SolrCloud MatchAllDocsQuery returning different number of docs each request

Posted by Mark Miller <ma...@gmail.com>.

Can you do me a favor and try not using the batch add for a run?

Just add one doc at a time (solrServer.add(doc) rather than solrServer.add(collection)).

I just fixed one issue with it this morning on trunk - it may be the cause of this oddity.

I'm also working on some performance issues around that method (getting good performance without starting thousands of threads).

Until I get all that straightened out (hopefully very soon), I think you will have better luck not using the bulk, collection add method.
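
A minimal sketch of that workaround, not taken from the thread: it sends each document in its own call and collects the ones that fail so they can be retried. The `DocAdder` interface and the class name are hypothetical stand-ins so the example compiles without SolrJ on the classpath; in the setup described here the adder would wrap solrServer.add(doc).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

public class OneAtATime {

    /** Hypothetical stand-in for a single-document add such as solrServer.add(doc). */
    public interface DocAdder<T> {
        void add(T doc) throws Exception;
    }

    /** Sends each document in its own call and returns the documents that failed. */
    public static <T> List<T> addIndividually(Collection<T> batch, DocAdder<T> adder) {
        List<T> failed = new ArrayList<>();
        for (T doc : batch) {
            try {
                adder.add(doc);      // one request per document, no collection add
            } catch (Exception e) {
                failed.add(doc);     // remember the document so it can be retried
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Strings stand in for SolrInputDocument instances in this demo.
        List<String> sent = new ArrayList<>();
        List<String> failed =
                addIndividually(Arrays.asList("doc-1", "doc-2", "doc-3"), sent::add);
        System.out.println("sent=" + sent.size() + " failed=" + failed.size());
    }
}
```

One request per document is slower than a batch add, so this is only meant as a way to rule the collection add in or out as the cause.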

On Aug 2, 2012, at 2:16 PM, Timothy Potter <th...@gmail.com> wrote:

> Thanks Mark.
> 
> I'm actually using SolrJ 3.4.0, so using CommonsHttpSolrServer:
> 
> Collection<SolrInputDocument> batch = ...
> ... build up batch ...
> solrServer.add( batch );
> 
> Basically, I have a custom Pig StoreFunc that sends docs to Solr from
> our Hadoop analytics nodes. The reason I'm not using SolrJ 4.0.0-ALPHA
> is that I couldn't get it to run in my Hadoop environment. There's
> some classpath conflict with the Apache HttpClient. SolrJ 4 depends on
> 4.1.3 but when I run it in my env, I get the following:
> 
> Caused by: java.lang.NoSuchMethodError:
> org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager: method
> <init>()V not found
> 	at org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:94)
> 	at org.apache.solr.client.solrj.impl.CloudSolrServer.<init>(CloudSolrServer.java:70)
> 	... 16 more
> 
> I spent hours trying to resolve the classpath issue and finally had to
> bail and just used the 3.4 SolrJ client as I'm just at the evaluation
> stage at this point. So it sounds like this could be the cause of my
> problems.
> 
> One other thing ... I do have the _version_ field defined in my
> schema.xml but am not setting it on the client side when indexing.
> Should I be doing that?
> 
> Cheers,
> Tim

- Mark Miller
lucidimagination.com

Re: SolrCloud MatchAllDocsQuery returning different number of docs each request

Posted by Timothy Potter <th...@gmail.com>.

Thanks Mark.

I'm actually using SolrJ 3.4.0, so using CommonsHttpSolrServer:

Collection<SolrInputDocument> batch = ...
... build up batch ...
solrServer.add( batch );

Basically, I have a custom Pig StoreFunc that sends docs to Solr from
our Hadoop analytics nodes. The reason I'm not using SolrJ 4.0.0-ALPHA
is that I couldn't get it to run in my Hadoop environment. There's
some classpath conflict with the Apache HttpClient: SolrJ 4 depends on
HttpClient 4.1.3, but when I run it in my env, I get the following:

Caused by: java.lang.NoSuchMethodError:
org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager: method
<init>()V not found
	at org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:94)
	at org.apache.solr.client.solrj.impl.CloudSolrServer.<init>(CloudSolrServer.java:70)
	... 16 more

I spent hours trying to resolve the classpath issue and finally had to
bail and just used the 3.4 SolrJ client as I'm just at the evaluation
stage at this point. So it sounds like this could be the cause of my
problems.
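
A NoSuchMethodError like the one above generally means the JVM loaded a different version of the class than the one the calling code was compiled against. As a hedged diagnostic sketch (the `WhichJar` class is hypothetical and not part of SolrJ or Hadoop), the following prints where a class is actually loaded from when run with the same classpath as the failing job:

```java
import java.security.CodeSource;

public class WhichJar {

    /** Describes where the named class is loaded from on the current classpath. */
    public static String whereLoaded(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers.
            Class<?> c = Class.forName(className, false, WhichJar.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                return "(no code source, e.g. a JDK class)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Defaults to the class named in the stack trace above.
        String name = args.length > 0
                ? args[0]
                : "org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager";
        System.out.println(name + " -> " + whereLoaded(name));
    }
}
```

If the reported location is an HttpClient jar other than the one SolrJ 4 expects, that jar is the likely source of the conflict.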

One other thing ... I do have the _version_ field defined in my
schema.xml but am not setting it on the client side when indexing.
Should I be doing that?

Cheers,
Tim


On Thu, Aug 2, 2012 at 11:27 AM, Mark Miller <ma...@gmail.com> wrote:
>
> How are you adding docs? Eg what client and what method in particular (what is your line of code that actually adds the doc).
>
> You can find the numFound result for each node by passing the param distrib=false. What does this tell you? Are your replicas in sync with the leader? What does the count for each shard add up to?
>
> I would not ignore the issue - something must be off. It may somehow be user error, it may be a bug that has been fixed since the alpha, or it may be something new.
>
> Are you sure every shard you are issuing the query *from* is active and live according to ZooKeeper? Eg when you look at the cloud admin view and look at the cluster visualization, are all the nodes green?
>
> - Mark Miller
> lucidimagination.com

Re: SolrCloud MatchAllDocsQuery returning different number of docs each request

Posted by Mark Miller <ma...@gmail.com>.

On Aug 2, 2012, at 11:08 AM, Timothy Potter <th...@gmail.com> wrote:

> Just starting to get into SolrCloud using 4.0.0-ALPHA and am very
> impressed so far ...
> 
> I have a 12-shard index with ~104M docs with each shard having
> 1-replica (so 24 Solr servers running)
> 
> Using the Query form on the Admin panel, I issue the MatchAllDocsQuery
> (*:*) and each time I send the request the value for numFound in the
> result is different. It's always close but not exactly the same as I
> would expect? Can anyone shed some light on this issue? I also tried a
> real query, such as "#olympics lochte" and same thing - different
> numFound each time. The first page of actual docs returned is the same
> so maybe I should just ignore the numFound issue?
> 
> Note that while experiencing this behavior, I am not adding any docs
> to the index and all docs have been committed with waitFlush=true and
> waitSearcher=true on the commit. Also, not doing soft commits at this
> point. In addition, after having committed all 104M docs, I hit the
> optimize button the panel so I have only 1 segment. In other words,
> the index is not being updated and has been optimized at this point.

How are you adding docs? E.g., which client and which method in particular (what is the line of code that actually adds the doc)?

You can find the numFound result for each node by passing the param distrib=false. What does this tell you? Are your replicas in sync with the leader? What does the count for each shard add up to?
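
A minimal sketch of that per-node check, using only the JDK (Java 11+) rather than SolrJ. The class name and the core base URLs passed on the command line are assumptions for illustration; the only parameters taken from the thread are the *:* query and distrib=false. It asks each core for its local numFound so that a leader and its replica can be compared directly.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PerCoreCounts {

    // Matches "numFound":12345 in a wt=json response body.
    private static final Pattern NUM_FOUND =
            Pattern.compile("\"numFound\"\\s*:\\s*(\\d+)");

    /** Builds a match-all query URL that is answered by the addressed core only. */
    public static String localCountUrl(String coreBaseUrl) {
        // q=*:* (URL-encoded), rows=0 because only the count matters,
        // distrib=false so the request is not fanned out to other shards.
        return coreBaseUrl + "/select?q=*%3A*&rows=0&wt=json&distrib=false";
    }

    /** Extracts numFound from a JSON response body, or -1 if it is absent. */
    public static long parseNumFound(String body) {
        Matcher m = NUM_FOUND.matcher(body);
        return m.find() ? Long.parseLong(m.group(1)) : -1L;
    }

    /** True when all given counts (e.g. a leader and its replica) are identical. */
    public static boolean allEqual(long... counts) {
        for (long c : counts) {
            if (c != counts[0]) {
                return false;
            }
        }
        return true;
    }

    static long fetchLocalCount(HttpClient http, String coreBaseUrl)
            throws IOException, InterruptedException {
        HttpRequest req =
                HttpRequest.newBuilder(URI.create(localCountUrl(coreBaseUrl))).GET().build();
        HttpResponse<String> resp = http.send(req, HttpResponse.BodyHandlers.ofString());
        return parseNumFound(resp.body());
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: PerCoreCounts <coreBaseUrl> [<coreBaseUrl> ...]");
            return;
        }
        HttpClient http = HttpClient.newHttpClient();
        for (String coreBaseUrl : args) {
            System.out.println(coreBaseUrl + " numFound=" + fetchLocalCount(http, coreBaseUrl));
        }
    }
}
```

Running it against both copies of every shard shows whether each leader and its replica report the same count. Summing one count per shard should then match the numFound of a normal distributed query.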

I would not ignore the issue - something must be off. It may somehow be user error, it may be a bug that has been fixed since the alpha, or it may be something new.

Are you sure every shard you are issuing the query *from* is active and live according to ZooKeeper? E.g., when you look at the cluster visualization in the cloud admin view, are all the nodes green?

- Mark Miller
lucidimagination.com