Posted to dev@lucene.apache.org by "Yonik Seeley (JIRA)" <ji...@apache.org> on 2015/04/06 21:17:12 UTC

[jira] [Updated] (SOLR-7353) Duplicated child/grand-child docs in a block-join structure should be removed by the shard hosting the docs not by the query controller

     [ https://issues.apache.org/jira/browse/SOLR-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-7353:
-------------------------------
    Description: 
I've indexed the following 8 docs into a 2-shard collection (Solr 4.8-ish, an internal custom branch roughly based on 4.8). Notice that the 3 grandchildren of 2-1 have duplicate keys:

{code}
[
  {
    "id":"1",
    "name":"parent",
    "_childDocuments_":[
      {
        "id":"1-1",
        "name":"child"
      },
      {
        "id":"1-2",
        "name":"child"
      }
    ]
  },
  {
    "id":"2",
    "name":"parent",
    "_childDocuments_":[
      {
        "id":"2-1",
        "name":"child",
        "_childDocuments_":[
          {
            "id":"2-1-1",
            "name":"grandchild"
          },
          {
            "id":"2-1-1",
            "name":"grandchild2"
          },
          {
            "id":"2-1-1",
            "name":"grandchild3"
          }
        ]
      }
    ]
  }
]
{code}
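The duplicate keys in the block above can be caught before indexing with a small client-side check. The following is only a sketch: the helper names `iter_ids` and `duplicate_ids` are hypothetical, and it assumes the unique key field is `id` and that nesting uses `_childDocuments_` as in the JSON shown.

```python
from collections import Counter


def iter_ids(docs):
    """Yield the "id" of every doc, descending into "_childDocuments_"."""
    for doc in docs:
        yield doc["id"]
        yield from iter_ids(doc.get("_childDocuments_", []))


def duplicate_ids(docs):
    """Return {id: count} for every id that occurs more than once."""
    counts = Counter(iter_ids(docs))
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


# The same block as in the report (assumed unique key field: "id").
docs = [
    {"id": "1", "name": "parent", "_childDocuments_": [
        {"id": "1-1", "name": "child"},
        {"id": "1-2", "name": "child"},
    ]},
    {"id": "2", "name": "parent", "_childDocuments_": [
        {"id": "2-1", "name": "child", "_childDocuments_": [
            {"id": "2-1-1", "name": "grandchild"},
            {"id": "2-1-1", "name": "grandchild2"},
            {"id": "2-1-1", "name": "grandchild3"},
        ]},
    ]},
]

print(len(list(iter_ids(docs))))  # 8 docs in total
print(duplicate_ids(docs))        # {'2-1-1': 3}
```
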
When I query this collection, using:

http://localhost:8984/solr/blockjoin2_shard2_replica1/select?q=*%3A*&wt=json&indent=true&shards.info=true&rows=10

I get:
{code}
{
  "responseHeader":{
    "status":0,
    "QTime":9,
    "params":{
      "indent":"true",
      "q":"*:*",
      "shards.info":"true",
      "wt":"json",
      "rows":"10"}},
  "shards.info":{
    "http://localhost:8984/solr/blockjoin2_shard1_replica1/|http://localhost:8985/solr/blockjoin2_shard1_replica2/":{
      "numFound":3,
      "maxScore":1.0,
      "shardAddress":"http://localhost:8984/solr/blockjoin2_shard1_replica1",
      "time":4},
    "http://localhost:8984/solr/blockjoin2_shard2_replica1/|http://localhost:8985/solr/blockjoin2_shard2_replica2/":{
      "numFound":5,
      "maxScore":1.0,
      "shardAddress":"http://localhost:8985/solr/blockjoin2_shard2_replica2",
      "time":4}},
  "response":{"numFound":6,"start":0,"maxScore":1.0,"docs":[
      {
        "id":"1-1",
        "name":"child"},
      {
        "id":"1-2",
        "name":"child"},
      {
        "id":"1",
        "name":"parent",
        "_version_":1495272401329455104},
      {
        "id":"2-1-1",
        "name":"grandchild"},
      {
        "id":"2-1",
        "name":"child"},
      {
        "id":"2",
        "name":"parent",
        "_version_":1495272401361960960}]
  }}
{code}

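So Solr has de-duped the results: the two shards report numFound of 3 and 5 (8 in total), but the merged response reports numFound 6 and contains only one of the three 2-1-1 docs.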

If I execute this query against the shard that has the dupes (distrib=false):

http://localhost:8984/solr/blockjoin2_shard2_replica1/select?q=*%3A*&wt=json&indent=true&shards.info=true&rows=10&distrib=false

Then the dupes are returned:
{code}
{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "indent":"true",
      "q":"*:*",
      "shards.info":"true",
      "distrib":"false",
      "wt":"json",
      "rows":"10"}},
  "response":{"numFound":5,"start":0,"docs":[
      {
        "id":"2-1-1",
        "name":"grandchild"},
      {
        "id":"2-1-1",
        "name":"grandchild2"},
      {
        "id":"2-1-1",
        "name":"grandchild3"},
      {
        "id":"2-1",
        "name":"child"},
      {
        "id":"2",
        "name":"parent",
        "_version_":1495272401361960960}]
  }}
{code}
Shouldn't the distrib and non-distrib (direct to shard) queries produce consistent results with respect to this block?

Of course we shouldn't index dupes, but I don't think it's the query controller's job to de-dupe and change numFound, especially based on the value of the rows parameter. Other users have reported this problem on the mailing list:

>>>
We've seen this as well. Before we understood the cause, it seemed very
bizarre that hitting different nodes would yield different numFound, as
well as using different rows=N (since the proxying node only de-dupes the
documents that are returned in the response).

I think "consistency" and "correctness" should be clearly delineated. Of
course we'd rather have consistently correct results, but failing that, I'd
rather have consistently incorrect results than inconsistent results,
because otherwise it's even harder to debug, as was the case here.

I think either the node hosting the shard should also do the de-duping, or
no one should. It's strange that the proxying node decides to do some
sketchy limited result set de-dupe.
<<<

I'm opening this ticket to investigate how to address this issue.
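
To make the rows dependence concrete, here is a toy model of the behavior described above. It is not Solr's actual merge code: `shard_query`, `merge`, and `distributed_num_found` are hypothetical names, and the model simply assumes that each shard returns its top `rows` docs and that the controller decrements numFound once for every duplicate key it sees among the docs it receives.

```python
def shard_query(shard_docs, rows):
    """Model a direct (distrib=false) query: numFound counts every doc."""
    return {"numFound": len(shard_docs), "docs": shard_docs[:rows]}


def merge(shard_responses, rows):
    """Model a controller that de-dupes only the ids it actually receives."""
    num_found = sum(resp["numFound"] for resp in shard_responses)
    seen = set()
    merged = []
    for resp in shard_responses:
        for doc in resp["docs"]:
            if doc["id"] in seen:
                num_found -= 1  # duplicate key: drop the doc, adjust the count
                continue
            seen.add(doc["id"])
            merged.append(doc)
    return {"numFound": num_found, "docs": merged[:rows]}


# Doc ids per shard, in the order the report shows them being returned.
shard1 = [{"id": "1-1"}, {"id": "1-2"}, {"id": "1"}]
shard2 = [{"id": "2-1-1"}, {"id": "2-1-1"}, {"id": "2-1-1"},
          {"id": "2-1"}, {"id": "2"}]


def distributed_num_found(rows):
    responses = [shard_query(shard1, rows), shard_query(shard2, rows)]
    return merge(responses, rows)["numFound"]


for rows in (1, 2, 10):
    print(rows, distributed_num_found(rows))
# 1 8
# 2 7
# 10 6
```

Under these assumptions the direct query to shard2 always reports numFound 5, while the merged numFound moves between 8 and 6 depending on rows, which matches the inconsistency reported here.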


> Duplicated child/grand-child docs in a block-join structure should be removed by the shard hosting the docs not by the query controller
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-7353
>                 URL: https://issues.apache.org/jira/browse/SOLR-7353
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Timothy Potter
>            Priority: Minor


