Posted to solr-user@lucene.apache.org by Yashveer Rana <ca...@gmail.com> on 2015/01/02 15:43:34 UTC

Inconsistent document addition

I have a SolrCloud setup with two collections, A and B, with different schemas
(although the majority of fields are identical).
Collection A has ~3.6 million documents.
Using *solrj 4.7.0*

As per a requirement, my application
- reads documents from collection A in batches of 10k
- creates docs of type B, populating their fields from the A-type docs
- calls addBeans() on collection B in batches of 500 and invokes commit

However, this operation does not add all documents to collection B and falls
short by about 80-90k. On re-executing the operation, the doc count
increases, but it still does not reach the expected number. After
multiple executions, the count eventually reaches the 3.6 million figure.

Just wondering if anyone has encountered such behaviour before. I haven't
seen any errors in the Solr logs either.





--
View this message in context: http://lucene.472066.n3.nabble.com/Inconsistent-document-addition-tp4177013.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Inconsistent document addition

Posted by Erick Erickson <er...@gmail.com>.
It's really impossible to say. Assuming you're generating correctly-formed
documents, I don't see how this would fail. So, here's how I'd approach it:

You're assuming that
1> you're getting all the docs back from server A that you have in there
and
2> you're correctly sending them all to server B

So my guess is that one of these assumptions is somehow wrong, which
leaves checking them as "an exercise for the reader".

I should think you'd need to put some instrumentation in your SolrJ program.

> Simplest is to just record the number of docs you read from server A. Is it
the correct number?

> Record the number of docs you send to server B. Does it match the number
read from server A?

> Record all the IDs (whatever uniqueKey is) in a Set and report the size of
the Set at the end. Does it match the count of docs you read from server A?
If not, somehow you're getting duplicate docs.
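
A minimal sketch of the three checks above, in plain Java with no SolrJ
dependency. CopyAudit and its method names are hypothetical, invented for
illustration, and the IDs are assumed to be the uniqueKey values as Strings.
You would call recordRead() with the IDs of each page read from A and
recordSent() after each successful add to B.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: tallies docs read, docs sent, and distinct uniqueKey values. */
class CopyAudit {
    private long read = 0;
    private long sent = 0;
    private final Set<String> uniqueIds = new HashSet<>();

    /** Call once per page read from the source, passing the uniqueKey of each doc. */
    void recordRead(List<String> ids) {
        read += ids.size();
        uniqueIds.addAll(ids);   // a Set silently drops IDs it has already seen
    }

    /** Call after each batch has been handed to the target. */
    void recordSent(int batchSize) {
        sent += batchSize;
    }

    long read() { return read; }
    long sent() { return sent; }
    long distinct() { return uniqueIds.size(); }

    /** Docs that were read more than once; non-zero means the reads overlap. */
    long duplicates() { return read - uniqueIds.size(); }
}
```

If read() is below the count in A, the reading side is dropping docs. If
sent() is below read(), the sending side is. If duplicates() is non-zero, the
same docs are being read more than once, and since an add with an existing
uniqueKey overwrites the earlier doc, the count in B will come up short.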


Something I've done repeatedly is a silly mistake like

while (more docs) {
   add the doc to the doc list
   if (doclist.size() >= 500) {   // send as soon as a full batch of 500 is buffered
      server.add(doclist);
      doclist.clear();
   }
}

then fail to do the following outside the while loop to catch
the docs I've added to the doclist but not sent because size < 500.

if (doclist.size() > 0) {
   server.add(doclist);
}

That said, an error like this is hard to reconcile with the fact that
running your SolrJ program repeatedly lets server B "catch up".
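
For what it's worth, here is a sketch of the batching loop with the final
flush made hard to forget. It is plain Java rather than real SolrJ code:
BatchSender is a hypothetical name, and the Consumer stands in for whatever
actually ships a batch (for example a call to add() or addBeans() on your
server object).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: buffers docs and hands them to a sink in fixed-size batches. */
class BatchSender<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;   // stand-in for the call that sends a batch
    private final List<T> buffer = new ArrayList<>();
    private long sent = 0;

    BatchSender(int batchSize, Consumer<List<T>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Buffers one doc and sends the buffer once it holds a full batch. */
    void add(T doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is buffered; must be called once after the read loop ends. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer));   // copy, so clearing the buffer cannot affect the sink
        sent += buffer.size();
        buffer.clear();
    }

    long sent() { return sent; }
}
```

Usage would be a loop calling sender.add(doc) for every doc, followed by one
sender.flush() after the loop. Comparing sender.sent() with the number of docs
read shows right away whether a tail batch was left behind.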


This isn't germane to your problem, but generally it's poor practice to have
the SolrJ program commit after sending each batch of docs to the server. Let
your autocommit settings handle that, with (possibly) a single commit at the
very end of the run before you exit.
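
As a rough sketch of that shape, again in plain Java: DocSink and CopyJob are
hypothetical names, and DocSink only stands in for the "add a batch" and
"commit" calls you would make through SolrJ.

```java
import java.util.List;

/** Hypothetical stand-in for the two calls that matter here: add a batch, commit. */
interface DocSink {
    void add(List<String> batch);
    void commit();
}

class CopyJob {
    /** Sends every batch, then issues one commit after the last batch. */
    static void run(Iterable<List<String>> batches, DocSink sink) {
        for (List<String> batch : batches) {
            sink.add(batch);   // no commit per batch; autocommit is assumed to be configured
        }
        sink.commit();         // single explicit commit at the very end of the run
    }
}
```

The point is only where the commit sits: once, after the loop, rather than
inside it.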

Best,
Erick
