Posted to solr-user@lucene.apache.org by Mike O'Leary <tm...@uw.edu> on 2012/03/03 00:29:48 UTC

Including an attribute value from a higher level entity when using DIH to index an XML file

I have an XML file that I would like to index. It has a structure similar to this:

<data>
  <user id="[id-num]">
    <message date="[date]">[message text]</message>
    ...
  </user>
  ...
</data>

I would like the documents in the index to correspond to the messages in the XML file, with the user's [id-num] value stored as a field in each of that user's documents. I think this means that I have to define an entity for message that looks like this:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="message"
            processor="XPathEntityProcessor"
            stream="true"
            forEach="/data/user/message/"
            url="message-data.xml">
      <field column="date" xpath="/data/user/message/@date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss"/>
      <field column="text" xpath="/data/user/message" />
    </entity>
  </document>
</dataConfig>

but I don't know where to put the field definition for the user id. It would look like

<field column="id" xpath="/data/user/@id" />

I can't put it within the message entity, because the entity is defined with forEach="/data/user/message/" and the id field's xpath value is outside of the entity's scope. Putting the id field definition there causes a null pointer exception. I don't think I want to create a "user" entity with the "message" entity nested inside it. Or is there a way to do that and still have the index documents correspond to messages from the file? Are there any attributes or attribute values that I haven't run across in my searching that would let me do this?
Thanks,
Mike



RE: Including an attribute value from a higher level entity when using DIH to index an XML file

Posted by Mike O'Leary <tm...@uw.edu>.
I found an answer to my question, but it comes with a cost. With an XML file like this (simplified to remove extraneous elements and attributes):

<data>
  <user id="[id-num]">
    <message date="[date]">[message text]</message>
    ...
  </user>
  ...
</data>

I can index the user id as a field in documents that represent each of the user's messages with this data-config expression:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="message"
            processor="XPathEntityProcessor"
            stream="true"
            forEach="/data/user/message | /data/user"
            url="message-data.xml">
      <field column="id" xpath="/data/user/@id" commonField="true"/>
      <field column="date" xpath="/data/user/message/@date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss"/>
      <field column="text" xpath="/data/user/message" />
    </entity>
  </document>
</dataConfig>

I didn't realize that commonField would work for cases in which the previously encountered field is in an element that encompasses the other elements, but it does. The forEach value has to be "/data/user/message | /data/user" in order for the user id to be located, since it is not under /data/user/message.

By specifying forEach="/data/user/message | /data/user" I am saying that each /data/user and each /data/user/message element is a document in the index, but I don't really want /data/user elements to be treated this way. As luck would have it, those documents are filtered out, but only because date and text are required fields: they have not been assigned values when a document is created for a /data/user element, so an exception is thrown and the document is not added. I could live with this, but it's kind of ugly.
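
One possible way to drop the /data/user rows explicitly, rather than relying on the required-field exception, might be a ScriptTransformer that sets DIH's $skipDoc special command. This is only a sketch and has not been tested against this data. The function name skipUserRows is made up for illustration, and it assumes, as described above, that rows produced for /data/user elements reach the transformer with no text value:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <script><![CDATA[
    // Hypothetical helper: mark rows without message text (assumed to be
    // the rows created for /data/user elements) so they are not indexed.
    function skipUserRows(row) {
      if (row.get('text') == null) {
        row.put('$skipDoc', 'true');
      }
      return row;
    }
  ]]></script>
  <document>
    <entity name="message"
            processor="XPathEntityProcessor"
            transformer="script:skipUserRows"
            stream="true"
            forEach="/data/user/message | /data/user"
            url="message-data.xml">
      <field column="id" xpath="/data/user/@id" commonField="true"/>
      <field column="date" xpath="/data/user/message/@date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss"/>
      <field column="text" xpath="/data/user/message" />
    </entity>
  </document>
</dataConfig>

The rest of the entity is unchanged from the configuration above. Only the script element and the transformer attribute are new.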

I don't see any other way of doing what I need to do with nested XML elements, though. I tried creating nested entities in the data-config file, but each of them is required to have a url attribute, and I think that caused the input file to be read twice.

The only other possibility I could see from reading the DataImportHandler documentation was to specify an XSL file and change the XML file's structure so that the user id attribute is moved down to be an attribute of the message element. I'm not sure it's worth it to do something like that for what seems like a small problem, and I wonder how much it would slow down the importing of a large XML file.
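
For reference, a minimal sketch of what such a stylesheet could look like. The attribute name userId and the file name add-user-id.xsl are assumptions chosen for illustration, not anything taken from the DataImportHandler documentation:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy every node and attribute unchanged by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- For each message, add the enclosing user's id as a new attribute
       (userId is a hypothetical name), then copy the existing
       attributes and the message text -->
  <xsl:template match="/data/user/message">
    <xsl:copy>
      <xsl:attribute name="userId">
        <xsl:value-of select="../@id"/>
      </xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

The entity could then reference the stylesheet through the xsl attribute of XPathEntityProcessor and read the id from the message element itself. In this sketch stream="true" is left out on the assumption that the transformation needs the whole file at once, which would need to be checked, as would the location the xsl path is resolved against:

<entity name="message"
        processor="XPathEntityProcessor"
        forEach="/data/user/message"
        url="message-data.xml"
        xsl="add-user-id.xsl">
  <field column="id" xpath="/data/user/message/@userId" />
  <field column="date" xpath="/data/user/message/@date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss"/>
  <field column="text" xpath="/data/user/message" />
</entity>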

Are there any other ways of handling cases like this, where an attribute of an outer element is to be included in an index document that corresponds to an element nested inside it?
Thanks,
Mike
