Posted to solr-user@lucene.apache.org by Timothy Potter <th...@gmail.com> on 2015/03/10 17:09:26 UTC

Num docs, block join, and dupes?

Before I open a JIRA, I wanted to put this out to solicit feedback on what
I'm seeing and what Solr should be doing. So I've indexed the following 8
docs into a 2-shard collection (Solr 4.8'ish - internal custom branch
roughly based on 4.8) ... notice that the 3 grand-children of 2-1 have
dup'd keys:

[
  {
    "id":"1",
    "name":"parent",
    "_childDocuments_":[
      {
        "id":"1-1",
        "name":"child"
      },
      {
        "id":"1-2",
        "name":"child"
      }
    ]
  },
  {
    "id":"2",
    "name":"parent",
    "_childDocuments_":[
      {
        "id":"2-1",
        "name":"child",
        "_childDocuments_":[
          {
            "id":"2-1-1",
            "name":"grandchild"
          },
          {
            "id":"2-1-1",
            "name":"grandchild2"
          },
          {
            "id":"2-1-1",
            "name":"grandchild3"
          }
        ]
      }
    ]
  }
]

When I query this collection, using:

http://localhost:8984/solr/blockjoin2_shard2_replica1/select?q=*%3A*&wt=json&indent=true&shards.info=true&rows=10

I get:

{
  "responseHeader":{
    "status":0,
    "QTime":9,
    "params":{
      "indent":"true",
      "q":"*:*",
      "shards.info":"true",
      "wt":"json",
      "rows":"10"}},
  "shards.info":{
    "http://localhost:8984/solr/blockjoin2_shard1_replica1/|http://localhost:8985/solr/blockjoin2_shard1_replica2/":{
      "numFound":3,
      "maxScore":1.0,
      "shardAddress":"http://localhost:8984/solr/blockjoin2_shard1_replica1",
      "time":4},
    "http://localhost:8984/solr/blockjoin2_shard2_replica1/|http://localhost:8985/solr/blockjoin2_shard2_replica2/":{
      "numFound":5,
      "maxScore":1.0,
      "shardAddress":"http://localhost:8985/solr/blockjoin2_shard2_replica2",
      "time":4}},
  "response":{"numFound":6,"start":0,"maxScore":1.0,"docs":[
      {
        "id":"1-1",
        "name":"child"},
      {
        "id":"1-2",
        "name":"child"},
      {
        "id":"1",
        "name":"parent",
        "_version_":1495272401329455104},
      {
        "id":"2-1-1",
        "name":"grandchild"},
      {
        "id":"2-1",
        "name":"child"},
      {
        "id":"2",
        "name":"parent",
        "_version_":1495272401361960960}]
  }}


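So Solr has de-duped the results: the two shards report 3 + 5 = 8 hits,
but the merged response has numFound=6 and only one 2-1-1 doc.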

If I execute this query against the shard that has the dupes (distrib=false):

http://localhost:8984/solr/blockjoin2_shard2_replica1/select?q=*%3A*&wt=json&indent=true&shards.info=true&rows=10&distrib=false

Then the dupes are returned:

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "indent":"true",
      "q":"*:*",
      "shards.info":"true",
      "distrib":"false",
      "wt":"json",
      "rows":"10"}},
  "response":{"numFound":5,"start":0,"docs":[
      {
        "id":"2-1-1",
        "name":"grandchild"},
      {
        "id":"2-1-1",
        "name":"grandchild2"},
      {
        "id":"2-1-1",
        "name":"grandchild3"},
      {
        "id":"2-1",
        "name":"child"},
      {
        "id":"2",
        "name":"parent",
        "_version_":1495272401361960960}]
  }}

So I guess my question is why doesn't the non-distrib query do
de-duping? Mainly confirming this is how it's supposed to work and
this behavior doesn't strike anyone else as odd ;-)

Cheers,

Tim

Re: Num docs, block join, and dupes?

Posted by Jessica Mallet <me...@gmail.com>.

We've seen this as well. Before we understood the cause, it seemed very
bizarre that hitting different nodes would yield a different numFound, as
would using a different rows=N (since the proxying node only de-dupes the
documents that are returned in the response).
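
To make the rows=N effect concrete, here is a toy model in Python. It is
not Solr's merge code: the function name `merge` and the data layout are
made up for illustration, and it assumes each shard sends back at most
`rows` ids and that the proxying node drops a repeated key and lowers
numFound by one each time. The ids are the ones from Tim's example.

```python
from typing import List, Tuple

# Ids per shard, taken from the example in this thread.
SHARD1 = ["1-1", "1-2", "1"]
SHARD2 = ["2-1-1", "2-1-1", "2-1-1", "2-1", "2"]


def merge(shards: List[List[str]], rows: int) -> Tuple[int, List[str]]:
    """Toy proxy-node merge (hypothetical helper, not a Solr API)."""
    # Each shard reports its full hit count, dupes included.
    num_found = sum(len(ids) for ids in shards)
    seen = set()
    merged = []
    for ids in shards:
        # Assumption: the proxy only ever sees the top `rows` ids per shard.
        for doc_id in ids[:rows]:
            if doc_id in seen:
                # Duplicate key among the returned ids: drop it, adjust total.
                num_found -= 1
                continue
            seen.add(doc_id)
            merged.append(doc_id)
    return num_found, merged[:rows]


for rows in (1, 2, 3, 10):
    print(rows, merge([SHARD1, SHARD2], rows)[0])
# prints:
# 1 8
# 2 7
# 3 6
# 10 6
```

Under these assumptions numFound only reaches 6 once all three copies of
2-1-1 are among the ids the proxy sees, which matches the "different
rows=N, different numFound" behavior described above.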

I think "consistency" and "correctness" should be clearly delineated. Of
course we'd rather have consistently correct results, but failing that, I'd
rather have consistently incorrect results than inconsistent results,
because otherwise it's even harder to debug, as was the case here.

I think either the node hosting the shard should also do the de-duping, or
no one should. It's strange that the proxying node decides to do some
sketchy limited result set de-dupe.

Re: Num docs, block join, and dupes?

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

On Tue, Mar 10, 2015 at 7:09 PM, Timothy Potter <th...@gmail.com>
wrote:

> So I guess my question is why doesn't the non-distrib query do
> de-duping?
>

Tim,
that's by-design behavior. The special _root_ field is used as the delete
term when a block update is applied, i.e. in the case of a block,
<uniqueKey> is not used. See
https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/update/DirectUpdateHandler2.java#L224
I agree that's one of the issues with the current block update
implementation, but frankly speaking, I didn't consider it an oddity. Do
you? What do you want to achieve?
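
A rough Python sketch of the difference described above, under loud
assumptions: `ToyIndex`, `add_single` and `add_block` are hypothetical
names, and this is a simplified model of "delete by id" versus "delete by
_root_", not the code in DirectUpdateHandler2.

```python
from typing import Dict, List

Doc = Dict[str, str]


class ToyIndex:
    """Toy in-memory index; a simplified model, not Solr's implementation."""

    def __init__(self) -> None:
        self.docs: List[Doc] = []

    def add_single(self, doc: Doc) -> None:
        # Plain update: delete any earlier doc with the same id, then add.
        self.docs = [d for d in self.docs if d["id"] != doc["id"]]
        self.docs.append(dict(doc))

    def add_block(self, root_id: str, block: List[Doc]) -> None:
        # Block update: delete the previous block by its _root_ value, then
        # add every doc in the block. Ids inside the block are never compared.
        self.docs = [d for d in self.docs if d.get("_root_") != root_id]
        self.docs.extend({**d, "_root_": root_id} for d in block)


# The block for parent "2" from the original post.
block = [
    {"id": "2-1-1", "name": "grandchild"},
    {"id": "2-1-1", "name": "grandchild2"},
    {"id": "2-1-1", "name": "grandchild3"},
    {"id": "2-1", "name": "child"},
    {"id": "2", "name": "parent"},
]

index = ToyIndex()
index.add_block("2", block)
index.add_block("2", block)  # re-indexing replaces the whole block via _root_

dupes = [d for d in index.docs if d["id"] == "2-1-1"]
print(len(index.docs), len(dupes))  # prints: 5 3
```

In this model the three docs sharing id 2-1-1 all survive, because only
the _root_ value is used to find what to delete.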

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mk...@griddynamics.com>