Posted to solr-user@lucene.apache.org by Jordi Cabré <je...@gmail.com> on 2020/11/13 12:15:46 UTC

Solr DIH: empty child document transformer

I will try to explain this in as much detail as possible, while isolating it as
much as possible from its context.

In short, I'm trying to create a `DIH` configuration that imports some documents as
nested documents. That is, I need to import a `one-to-many` relation and index it as
nested documents.

My `parents` data is:


    +----+---------------+-------------+
    | id |    name_s     | node_type_s |
    +====+===============+=============+
    |  1 | parent-name-1 | parent      |
    |  2 | parent-name-2 | parent      |
    |  3 | parent-name-3 | parent      |
    +----+---------------+-------------+

And `children` data is:


    +-----+-------------+--------------+-------------+
    | id  | parent_id_s |    name_s    | node_type_s |
    +=====+=============+==============+=============+
    | 1-1 |           1 | child-name-1 | child       |
    | 2-1 |           1 | child-name-2 | child       |
    | 3-2 |           2 | child-name-3 | child       |
    | 4-3 |           3 | child-name-4 | child       |
    +-----+-------------+--------------+-------------+
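For illustration, the nested shape this import is meant to produce can be sketched in a few lines of Python. This is only a model of the intended result, not part of the Solr setup; the function name `nest` and the `children` key are assumptions chosen for the example.

```python
from collections import defaultdict

# Rows copied from the two tables above.
parents = [
    {"id": "1", "name_s": "parent-name-1", "node_type_s": "parent"},
    {"id": "2", "name_s": "parent-name-2", "node_type_s": "parent"},
    {"id": "3", "name_s": "parent-name-3", "node_type_s": "parent"},
]
children = [
    {"id": "1-1", "parent_id_s": "1", "name_s": "child-name-1", "node_type_s": "child"},
    {"id": "2-1", "parent_id_s": "1", "name_s": "child-name-2", "node_type_s": "child"},
    {"id": "3-2", "parent_id_s": "2", "name_s": "child-name-3", "node_type_s": "child"},
    {"id": "4-3", "parent_id_s": "3", "name_s": "child-name-4", "node_type_s": "child"},
]


def nest(parents, children, key="children"):
    """Attach each child row to the parent whose id equals the child's parent_id_s."""
    by_parent = defaultdict(list)
    for child in children:
        by_parent[child["parent_id_s"]].append(child)
    # Copy each parent and add its (possibly empty) list of children under `key`.
    return [{**p, key: by_parent.get(p["id"], [])} for p in parents]


nested = nest(parents, children)
```

Each element of `nested` is one parent document carrying its own child rows, which is the one-to-many relation expressed as nested documents.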


Here is my `DIH` configuration:

    <dataConfig>
      <dataSource
        driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
        url="jdbc:sqlserver://${dataimporter.request.host};databaseName=${dataimporter.request.database}"
        user="${dataimporter.request.user}"
        password="${dataimporter.request.password}"
      />

      <document>
        <entity
          name="parent"
          query="select '1' as id, 'parent-name-1' as name_s, 'parent' as node_type_s
                 union select '2' as id, 'parent-name-2' as name_s, 'parent' as node_type_s
                 union select '3' as id, 'parent-name-3' as name_s, 'parent' as node_type_s">

          <field column="node_type_s"/>
          <field column="id"/>
          <field column="name_s"/>

          <entity
            name="children"
            child="true"
            query="select '1-1' as id, '1' as parent_id_s, 'child-name-1' as name_s, 'child' as node_type_s
                   union select '2-1' as id, '1' as parent_id_s, 'child-name-2' as name_s, 'child' as node_type_s
                   union select '3-2' as id, '2' as parent_id_s, 'child-name-3' as name_s, 'child' as node_type_s
                   union select '4-3' as id, '3' as parent_id_s, 'child-name-4' as name_s, 'child' as node_type_s"
            cacheKey="parent_id_s"
            cacheLookup="parent.id"
            cacheImpl="SortedMapBackedCache">

            <field column="node_type_s"/>
            <field column="id"/>
            <field column="parent_id_s"/>
            <field column="name_s"/>

          </entity>

        </entity>

      </document>
    </dataConfig>

As you can see, the nested entity is declared with `child="true"`.

After running the import, the status response is:

    {
      "responseHeader": {
        "status": 0,
        "QTime": 0
      },
      "initArgs": [
        "defaults",
        [
          "config",
          "parent-children-config.xml"
        ]
      ],
      "command": "status",
      "status": "idle",
      "importResponse": "",
      "statusMessages": {
        "Total Requests made to DataSource": "2",
        "Total Rows Fetched": "7",
        "Total Documents Processed": "3",
        "Total Documents Skipped": "0",
        "Full Dump Started": "2020-11-12 08:02:25",
        "": "Indexing completed. Added/Updated: 3 documents. Deleted 0 documents.",
        "Committed": "2020-11-12 08:02:25",
        "Time taken": "0:0:0.304"
      }
    }

So the import seems to have worked well.
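The counts in the status response above (2 requests, 7 rows, 3 documents) seem consistent with reading the two entities as a cached join: each query runs once, the child rows are indexed by `cacheKey`, and each parent row looks up its children via `cacheLookup`. The following Python sketch is only a rough mental model of that behaviour, not DIH source code; `KeyedRowCache` is a hypothetical name.

```python
# Conceptual model only: NOT how DIH is implemented internally.
parent_rows = [{"id": "1"}, {"id": "2"}, {"id": "3"}]
child_rows = [
    {"id": "1-1", "parent_id_s": "1"},
    {"id": "2-1", "parent_id_s": "1"},
    {"id": "3-2", "parent_id_s": "2"},
    {"id": "4-3", "parent_id_s": "3"},
]


class KeyedRowCache:
    """Hypothetical stand-in for a cached child entity.

    All rows are fetched once and indexed by the column named in cacheKey.
    """

    def __init__(self, rows, cache_key):
        self.rows_fetched = len(rows)
        self._index = {}
        for row in rows:
            self._index.setdefault(row[cache_key], []).append(row)

    def lookup(self, value):
        """Return the cached rows whose key column equals `value` (cacheLookup)."""
        return self._index.get(value, [])


cache = KeyedRowCache(child_rows, cache_key="parent_id_s")

# For every parent row, resolve its children through the cache (parent.id).
joined = {p["id"]: [c["id"] for c in cache.lookup(p["id"])] for p in parent_rows}

# 3 parent rows + 4 child rows, fetched by one query per entity.
total_rows = len(parent_rows) + cache.rows_fetched
```

In this model `total_rows` is 7 and there are 3 top-level documents, which matches "Total Rows Fetched" and "Total Documents Processed" in the status response.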

After that, I tested how to get only parents with `q={!parent
which=node_type_s:parent}`:

    {
       "responseHeader":{
          "status":0,
          "QTime":1,
          "params":{
             "q":"{!parent which=node_type_s:parent}",
             "_":"1605166879678"
          }
       },
       "response":{
          "numFound":3,
          "start":0,
          "numFoundExact":true,
          "docs":[
             {
                "name_s":"parent-name-1",
                "node_type_s":"parent",
                "id":"1",
                "_version_":1683140793502531584
             },
             {
                "name_s":"parent-name-2",
                "node_type_s":"parent",
                "id":"2",
                "_version_":1683140793504628736
             },
             {
                "name_s":"parent-name-3",
                "node_type_s":"parent",
                "id":"3",
                "_version_":1683140793505677312
             }
          ]
       }
    }

As you can see, only `parents` are returned.

When I ask for only `children`:

    {
       "responseHeader":{
          "status":0,
          "QTime":3,
          "params":{
             "q":"{!child of=\"node_type_s:parent\"}",
             "_":"1605166879678"
          }
       },
       "response":{
          "numFound":4,
          "start":0,
          "numFoundExact":true,
          "docs":[
             {
                "name_s":"child-name-1",
                "node_type_s":"child",
                "parent_id_s":"1",
                "id":"1-1",
                "_version_":1683140793502531584
             },
             {
                "name_s":"child-name-2",
                "node_type_s":"child",
                "parent_id_s":"1",
                "id":"2-1",
                "_version_":1683140793502531584
             },
             {
                "name_s":"child-name-3",
                "node_type_s":"child",
                "parent_id_s":"2",
                "id":"3-2",
                "_version_":1683140793504628736
             },
             {
                "name_s":"child-name-4",
                "node_type_s":"child",
                "parent_id_s":"3",
                "id":"4-3",
                "_version_":1683140793505677312
             }
          ]
       }
    }

All right, only child documents are returned.

Then, I also tried to get only the `children of parent 1`:

    {
       "responseHeader":{
          "status":0,
          "QTime":0,
          "params":{
             "q":"{!child of=\"node_type_s:parent\"}id:1",
             "_":"1605166879678"
          }
       },
       "response":{
          "numFound":2,
          "start":0,
          "numFoundExact":true,
          "docs":[
             {
                "name_s":"child-name-1",
                "node_type_s":"child",
                "parent_id_s":"1",
                "id":"1-1",
                "_version_":1683140793502531584
             },
             {
                "name_s":"child-name-2",
                "node_type_s":"child",
                "parent_id_s":"1",
                "id":"2-1",
                "_version_":1683140793502531584
             }
          ]
       }
    }

So, everything seems to work correctly.
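For anyone who wants to reproduce these three block join requests outside the admin UI, here is a minimal Python sketch that only builds the request URLs. The host, port and core name in `SELECT_URL` are placeholders (assumptions, not taken from this setup), and `select_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumption: host, port and core name are placeholders, not taken from the post.
SELECT_URL = "http://localhost:8983/solr/mycore/select"

# The three q values used in the requests above.
QUERIES = {
    "parents_only": "{!parent which=node_type_s:parent}",
    "children_only": '{!child of="node_type_s:parent"}',
    "children_of_parent_1": '{!child of="node_type_s:parent"}id:1',
}


def select_url(q, **extra):
    """Return a /select URL with q (and any extra params) percent-encoded."""
    return SELECT_URL + "?" + urlencode({"q": q, **extra})


urls = {name: select_url(q) for name, q in QUERIES.items()}
```

Extra request parameters can be passed as keyword arguments, for example `select_url("id:1", fl="*,[child]")`.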

The problem arises when I try to get `parents with their children` using
`fl=*,[child]` and `q=id:1`:

    {
       "responseHeader":{
          "status":0,
          "QTime":1,
          "params":{
             "q":"id:1",
             "fl":"*,[child]",
             "_":"1605166879678"
          }
       },
       "response":{
          "numFound":1,
          "start":0,
          "numFoundExact":true,
          "docs":[
             {
                "name_s":"parent-name-1",
                "node_type_s":"parent",
                "id":"1",
                "_version_":1683140793502531584
             }
          ]
       }
    }

As you can see, only the parent is returned, without its children.

I've been struggling with this a lot, and I've put a lot of effort into
figuring out what I'm doing wrong.

After spending days trying to solve this, I tried indexing the same data as a
JSON document instead. Here is the JSON data:

    {
        "id":"1",
        "name_s":"parent-name-1",
        "node_type_s":"parent",
        "children":[
            {
            "id":"1-1",
            "parent_id_s":"1",
            "name_s":"child-name-1",
            "node_type_s":"child"
            },
            {
            "id":"2-1",
            "parent_id_s":"1",
            "name_s":"child-name-2",
            "node_type_s":"child"
            }
        ]
    },
    {
        "id":"2",
        "name_s":"parent-name-2",
        "node_type_s":"parent",
        "children":[
            {
            "id":"3-2",
            "parent_id_s":"2",
            "name_s":"child-name-3",
            "node_type_s":"child"
            }
        ]
    },
    {
        "id":"3",
        "name_s":"parent-name-3",
        "node_type_s":"parent",
        "children":[
            {
            "id":"4-3",
            "parent_id_s":"3",
            "name_s":"child-name-4",
            "node_type_s":"child"
            }
        ]
    }

After indexing this JSON, which contains exactly the same data as my
`parents` and `children` tables, I ran the same queries as above.

All of them work fine, including the last one, that is, `get parent with
id:1 with its children`:

    {
       "responseHeader":{
          "status":0,
          "QTime":1,
          "params":{
             "q":"id:1",
             "fl":"*,[child]",
             "_":"1605166879678"
          }
       },
       "response":{
          "numFound":1,
          "start":0,
          "numFoundExact":true,
          "docs":[
             {
                "id":"1",
                "name_s":"parent-name-1",
                "node_type_s":"parent",
                "_version_":1683142824077295616,
                "children":[
                   {
                      "id":"1-1",
                      "parent_id_s":"1",
                      "name_s":"child-name-1",
                      "node_type_s":"child",
                      "_version_":1683142824077295616
                   },
                   {
                      "id":"2-1",
                      "parent_id_s":"1",
                      "name_s":"child-name-2",
                      "node_type_s":"child",
                      "_version_":1683142824077295616
                   }
                ]
             }
          ]
       }
    }

Here, as you can see, all children are returned.

After that, I compared the schemas that result from importing the data with
`JSON` and with `DIH`.

With `JSON`, the schema contains two elements that the schema produced with `DIH`
doesn't contain:

    {
        "name":"children",
        "type":"text_general"
    },
    ...
    "copyFields":[{
        "source":"children",
        "dest":"children_str",
        "maxChars":256
    }]
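A comparison like this can be done mechanically. The sketch below is a minimal Python example under stated assumptions: the two dictionaries are trimmed, hand-written stand-ins for the real schemas (only the `children` field and its copyField come from this post; the `id` entries are assumed), and the helper names are hypothetical.

```python
# Trimmed stand-ins for the two schemas; only "children" and its copyField
# come from the post, the "id" field entries are assumed for the example.
json_schema = {
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "children", "type": "text_general"},
    ],
    "copyFields": [
        {"source": "children", "dest": "children_str", "maxChars": 256},
    ],
}
dih_schema = {
    "fields": [{"name": "id", "type": "string"}],
    "copyFields": [],
}


def field_names(schema):
    """Set of explicitly declared field names."""
    return {f["name"] for f in schema.get("fields", [])}


def copy_field_pairs(schema):
    """Set of (source, dest) pairs for the declared copyFields."""
    return {(c["source"], c["dest"]) for c in schema.get("copyFields", [])}


def only_in_first(first, second):
    """Fields and copyFields that appear in `first` but not in `second`."""
    return {
        "fields": field_names(first) - field_names(second),
        "copyFields": copy_field_pairs(first) - copy_field_pairs(second),
    }


diff = only_in_first(json_schema, dih_schema)
```

With these inputs, `diff` contains the `children` field and the `children` to `children_str` copyField, the same two elements listed above.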

What am I doing wrong in my `DIH`?

I hope I've explained this clearly enough.