Posted to solr-user@lucene.apache.org by Mike O'Leary <tm...@uw.edu> on 2012/02/18 02:05:23 UTC

Using nested entities in FileDataSource import of xml file contents

Can anybody help me understand the right way to define a data-config.xml file with nested entities for indexing the contents of an XML file?

I used this data-config.xml file to index a database containing sample patient records:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/bioscope" user="db_user" password=""/>
  <document name="bioscope">
    <entity name="docs" pk="doc_id" query="SELECT doc_id, type FROM bioscope.docs">
      <field column="doc_id" name="doc_id"/>
      <field column="type" name="doc_type"/>
      <entity name="codes" query="SELECT id, origin, type, code FROM bioscope.codes WHERE doc_id='${docs.doc_id}'">
        <field column="origin" name="code_origin"/>
        <field column="type" name="code_type"/>
        <field column="code" name="code_value"/>
      </entity>
      <entity name="notes" query="SELECT id, origin, type, text FROM bioscope.texts WHERE doc_id='${docs.doc_id}'">
        <field column="origin" name="note_origin"/>
        <field column="type" name="note_type"/>
        <field column="text" name="note_text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

I would like to do the same thing with an XML file containing the same data as is in the database. That XML file looks like this:

<docs>
  <doc id="97634811" type="RADIOLOGY_REPORT">
    <codes>
      <code origin="CMC_MAJORITY" type="ICD-9-CM">786.2</code>
      <code origin="COMPANY3" type="ICD-9-CM">786.2</code>
      <code origin="COMPANY1" type="ICD-9-CM">786.2</code>
      <code origin="COMPANY2" type="ICD-9-CM">786.2</code>
    </codes>
    <texts>
      <text origin="CCHMC_RADIOLOGY" type="CLINICAL_HISTORY">Seventeen year old with cough.</text>
      <text origin="CCHMC_RADIOLOGY" type="IMPRESSION">Normal.</text>
    </texts>
  </doc>
  ....
</docs>

I tried using this data-config.xml file, in order to preserve the nested entity structure used with the database case:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document name="bioscope">
    <entity name="doc" processor="XPathEntityProcessor" stream="true" forEach="/docs/doc" url="C:/data/bioscope.xml">
      <field column="doc_id" xpath="/docs/doc/@id"/>
      <field column="doc_type" xpath="/docs/doc/@type"/>
      <entity name="code" processor="XPathEntityProcessor" stream="true" forEach="/docs/doc[@id='${doc.doc_id}']/codes/code" url="C:/data/bioscope.xml">
        <field column="code_origin" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code/@origin"/>
        <field column="code_type" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code/@type"/>
        <field column="code_value" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code"/>
      </entity>
      <entity name="note" processor="XPathEntityProcessor" stream="true" forEach="/docs/doc[@id='${doc.doc_id}']/texts/text" url="C:/data/bioscope.xml">
        <field column="note_origin" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text/@origin"/>
        <field column="note_type" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text/@type"/>
        <field column="note_text" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

This is wrong: it fails to index any of the <codes> and <texts> blocks in the XML file. I'm sure that part of the problem must be that XPath expressions such as "/docs/doc[@id='${doc.doc_id}']/texts/text/@origin" fail to match anything in the XML file, because when I try the same import without nested entities, using the data-config.xml file below, the <codes> and <texts> blocks are also not indexed:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document name="bioscope">
    <entity name="doc" processor="XPathEntityProcessor" stream="true" forEach="/docs/doc" url="C:/data/bioscope.xml">
      <field column="doc_id" xpath="/docs/doc/@id"/>
      <field column="doc_type" xpath="/docs/doc/@type"/>
      <field column="code_origin" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code/@origin"/>
      <field column="code_type" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code/@type"/>
      <field column="code_value" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code"/>
      <field column="note_origin" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text/@origin"/>
      <field column="note_type" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text/@type"/>
      <field column="note_text" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text"/>
    </entity>
  </document>
</dataConfig>

However, when I use this data-config.xml file, which doesn't use nested entities, all of the fields are included in the index:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document name="bioscope">
    <entity name="doc" processor="XPathEntityProcessor" stream="true" forEach="/docs/doc" url="C:/data/bioscope.xml">
      <field column="doc_id" xpath="/docs/doc/@id"/>
      <field column="doc_type" xpath="/docs/doc/@type"/>
      <field column="code_origin" xpath="/docs/doc/codes/code/@origin"/>
      <field column="code_type" xpath="/docs/doc/codes/code/@type"/>
      <field column="code_value" xpath="/docs/doc/codes/code"/>
      <field column="note_origin" xpath="/docs/doc/texts/text/@origin"/>
      <field column="note_type" xpath="/docs/doc/texts/text/@type"/>
      <field column="note_text" xpath="/docs/doc/texts/text"/>
    </entity>
  </document>
</dataConfig>

With this configuration, though, I don't think any correspondence is maintained among the values that are grouped together in the input XML file: the code_origin, code_type and code_value of a single <code> element, or the note_origin, note_type and note_text of a single <text> element.
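
To illustrate the concern: assuming the six code and note fields are declared multiValued and stored in schema.xml (an assumption, since the schema is not shown here), I would expect the sample record above to come back from a query looking roughly like the sketch below. The <str> types are also assumed. Each field is an independent list, so a code's origin, type and value are related only by their position in three separate lists:

```xml
<!-- Sketch of the expected flattened result; field types and multiValued settings are assumptions -->
<doc>
  <str name="doc_id">97634811</str>
  <str name="doc_type">RADIOLOGY_REPORT</str>
  <arr name="code_origin">
    <str>CMC_MAJORITY</str>
    <str>COMPANY3</str>
    <str>COMPANY1</str>
    <str>COMPANY2</str>
  </arr>
  <arr name="code_type">
    <str>ICD-9-CM</str>
    <str>ICD-9-CM</str>
    <str>ICD-9-CM</str>
    <str>ICD-9-CM</str>
  </arr>
  <arr name="code_value">
    <str>786.2</str>
    <str>786.2</str>
    <str>786.2</str>
    <str>786.2</str>
  </arr>
  <arr name="note_origin">
    <str>CCHMC_RADIOLOGY</str>
    <str>CCHMC_RADIOLOGY</str>
  </arr>
  <arr name="note_type">
    <str>CLINICAL_HISTORY</str>
    <str>IMPRESSION</str>
  </arr>
  <arr name="note_text">
    <str>Seventeen year old with cough.</str>
    <str>Normal.</str>
  </arr>
</doc>
```

If a <code> or <text> element were missing one of its attributes, I assume the lists would no longer line up by position at all.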

It has taken me a while to get this far, and obviously I don't have it right yet. Can anybody help me define a data-config.xml file with nested entities for indexing an XML file?
Thanks,
Mike