Posted to solr-user@lucene.apache.org by Mike O'Leary <tm...@uw.edu> on 2012/02/18 02:05:23 UTC
Using nested entities in FileDataSource import of xml file contents
Can anybody help me understand the right way to define a data-config.xml file with nested entities for indexing the contents of an XML file?
I used this data-config.xml file to index a database containing sample patient records:
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/bioscope" user="db_user" password=""/>
  <document name="bioscope">
    <entity name="docs" pk="doc_id" query="SELECT doc_id, type FROM bioscope.docs">
      <field column="doc_id" name="doc_id"/>
      <field column="type" name="doc_type"/>
      <entity name="codes" query="SELECT id, origin, type, code FROM bioscope.codes WHERE doc_id='${docs.doc_id}'">
        <field column="origin" name="code_origin"/>
        <field column="type" name="code_type"/>
        <field column="code" name="code_value"/>
      </entity>
      <entity name="notes" query="SELECT id, origin, type, text FROM bioscope.texts WHERE doc_id='${docs.doc_id}'">
        <field column="origin" name="note_origin"/>
        <field column="type" name="note_type"/>
        <field column="text" name="note_text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
I would like to do the same thing with an XML file that contains the same data as the database. That XML file looks like this:
<docs>
  <doc id="97634811" type="RADIOLOGY_REPORT">
    <codes>
      <code origin="CMC_MAJORITY" type="ICD-9-CM">786.2</code>
      <code origin="COMPANY3" type="ICD-9-CM">786.2</code>
      <code origin="COMPANY1" type="ICD-9-CM">786.2</code>
      <code origin="COMPANY2" type="ICD-9-CM">786.2</code>
    </codes>
    <texts>
      <text origin="CCHMC_RADIOLOGY" type="CLINICAL_HISTORY">Seventeen year old with cough.</text>
      <text origin="CCHMC_RADIOLOGY" type="IMPRESSION">Normal.</text>
    </texts>
  </doc>
  ....
</docs>
I tried using this data-config.xml file, in order to preserve the nested entity structure used with the database case:
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document name="bioscope">
    <entity name="doc" processor="XPathEntityProcessor" stream="true" forEach="/docs/doc" url="C:/data/bioscope.xml">
      <field column="doc_id" xpath="/docs/doc/@id"/>
      <field column="doc_type" xpath="/docs/doc/@type"/>
      <entity name="code" processor="XPathEntityProcessor" stream="true" forEach="/docs/doc[@id='${doc.doc_id}']/codes/code" url="C:/data/bioscope.xml">
        <field column="code_origin" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code/@origin"/>
        <field column="code_type" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code/@type"/>
        <field column="code_value" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code"/>
      </entity>
      <entity name="note" processor="XPathEntityProcessor" stream="true" forEach="/docs/doc[@id='${doc.doc_id}']/texts/text" url="C:/data/bioscope.xml">
        <field column="note_origin" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text/@origin"/>
        <field column="note_type" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text/@type"/>
        <field column="note_text" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
This is wrong: it fails to index any of the <codes> and <texts> blocks in the XML file. I suspect that part of the problem is that xpath expressions such as "/docs/doc[@id='${doc.doc_id}']/texts/text/@origin" fail to match anything in the XML file, because when I try the same import without nested entities, using this data-config.xml file, the <codes> and <texts> blocks are also not indexed:
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document name="bioscope">
    <entity name="doc" processor="XPathEntityProcessor" stream="true" forEach="/docs/doc" url="C:/data/bioscope.xml">
      <field column="doc_id" xpath="/docs/doc/@id"/>
      <field column="doc_type" xpath="/docs/doc/@type"/>
      <field column="code_origin" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code/@origin"/>
      <field column="code_type" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code/@type"/>
      <field column="code_value" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code"/>
      <field column="note_origin" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text/@origin"/>
      <field column="note_type" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text/@type"/>
      <field column="note_text" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text"/>
    </entity>
  </document>
</dataConfig>
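One way to check the suspicion about the predicate is to try the same expression outside Solr. The sketch below uses Python's standard xml.etree.ElementTree, not the XPathEntityProcessor, so it says nothing about which XPath subset the import handler supports or when it substitutes ${doc.doc_id}. It only shows that the predicate form selects the expected <text> elements once the id is filled in, and selects nothing if the placeholder ends up as an empty string. The function name select_with_id and the trimmed sample are hypothetical, for illustration only.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample file shown earlier (assumption: one <doc> only).
SAMPLE = """<docs>
<doc id="97634811" type="RADIOLOGY_REPORT">
<codes>
<code origin="CMC_MAJORITY" type="ICD-9-CM">786.2</code>
<code origin="COMPANY3" type="ICD-9-CM">786.2</code>
</codes>
<texts>
<text origin="CCHMC_RADIOLOGY" type="CLINICAL_HISTORY">Seventeen year old with cough.</text>
<text origin="CCHMC_RADIOLOGY" type="IMPRESSION">Normal.</text>
</texts>
</doc>
</docs>"""


def select_with_id(xml_text, doc_id):
    """Return (origin, type, text) for each <text> under the <doc> with this id."""
    root = ET.fromstring(xml_text)  # root is the <docs> element
    # Same shape as /docs/doc[@id='...']/texts/text, relative to the root.
    path = "./doc[@id='%s']/texts/text" % doc_id
    return [(t.get("origin"), t.get("type"), t.text) for t in root.findall(path)]


# Placeholder replaced by a real id: both notes are selected.
resolved = select_with_id(SAMPLE, "97634811")

# Placeholder replaced by an empty string: nothing is selected.
unresolved = select_with_id(SAMPLE, "")
```

If the second case is what happens inside the import, it would be consistent with the <codes> and <texts> blocks being skipped, but that remains a guess rather than a confirmed diagnosis.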
However, when I use this data-config.xml file, which also avoids nested entities but drops the [@id='${doc.doc_id}'] predicates from the xpath expressions, all of the fields are included in the index:
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document name="bioscope">
    <entity name="doc" processor="XPathEntityProcessor" stream="true" forEach="/docs/doc" url="C:/data/bioscope.xml">
      <field column="doc_id" xpath="/docs/doc/@id"/>
      <field column="doc_type" xpath="/docs/doc/@type"/>
      <field column="code_origin" xpath="/docs/doc/codes/code/@origin"/>
      <field column="code_type" xpath="/docs/doc/codes/code/@type"/>
      <field column="code_value" xpath="/docs/doc/codes/code"/>
      <field column="note_origin" xpath="/docs/doc/texts/text/@origin"/>
      <field column="note_type" xpath="/docs/doc/texts/text/@type"/>
      <field column="note_text" xpath="/docs/doc/texts/text"/>
    </entity>
  </document>
</dataConfig>
but I don't think any correspondence is maintained among the code_origin, code_type and code_value values that come from the same <code> element, or among the note_origin, note_type and note_text values that come from the same <text> element, the way they are grouped together in the input XML file.
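To make the concern concrete, here is a minimal Python sketch of the two shapes the data can take. It uses xml.etree.ElementTree on a trimmed copy of the sample and does not call Solr at all; the function names flat_fields and grouped_fields are hypothetical. The flat shape mimics six independent multi-valued fields, where the only link between a note's origin, type and text is list position. The grouped shape keeps each <code> and <text> as its own record, which is what the nested entities are meant to preserve.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample file shown earlier (assumption: one <doc> only).
SAMPLE = """<docs>
<doc id="97634811" type="RADIOLOGY_REPORT">
<codes>
<code origin="CMC_MAJORITY" type="ICD-9-CM">786.2</code>
<code origin="COMPANY3" type="ICD-9-CM">786.2</code>
</codes>
<texts>
<text origin="CCHMC_RADIOLOGY" type="CLINICAL_HISTORY">Seventeen year old with cough.</text>
<text origin="CCHMC_RADIOLOGY" type="IMPRESSION">Normal.</text>
</texts>
</doc>
</docs>"""


def flat_fields(doc):
    """Collect every field separately, like six independent multi-valued fields."""
    codes = doc.findall("./codes/code")
    notes = doc.findall("./texts/text")
    return {
        "code_origin": [c.get("origin") for c in codes],
        "code_type": [c.get("type") for c in codes],
        "code_value": [c.text for c in codes],
        "note_origin": [n.get("origin") for n in notes],
        "note_type": [n.get("type") for n in notes],
        "note_text": [n.text for n in notes],
    }


def grouped_fields(doc):
    """Keep the values of each <code> and each <text> together as one record."""
    return {
        "codes": [
            {"origin": c.get("origin"), "type": c.get("type"), "value": c.text}
            for c in doc.findall("./codes/code")
        ],
        "notes": [
            {"origin": n.get("origin"), "type": n.get("type"), "text": n.text}
            for n in doc.findall("./texts/text")
        ],
    }


doc = ET.fromstring(SAMPLE).find("./doc")
flat = flat_fields(doc)        # parallel lists, tied together only by position
grouped = grouped_fields(doc)  # one dict per <code> and per <text>
```

In the flat shape, a search that matches note_type "IMPRESSION" and note_text "Seventeen year old with cough." would both hit the same document even though those values come from different <text> elements.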
It has taken me a while to get this far, and obviously I don't have it right yet. Can anybody help me define a data-config.xml file with nested entities for indexing an XML file?
Thanks,
Mike