Posted to solr-user@lucene.apache.org by "Weiss, Eric" <we...@llnl.gov> on 2011/05/13 01:23:05 UTC

DIH help request: nested xml entities and xpath

Apologies in advance if this question has already been answered. I have scoured the docs, mail archives, and the web looking for an answer, with no luck. I am sure I am just being dense or missing something obvious, so please point out my stupidity; my head hurts from trying to get this working.

Solr 3.1
Java 1.6
Eclipse/Tomcat 7/Maven 2.x

Goal: to extract manufacturer names from a repeating list of keywords, each tagged with a Category (one of which is "Manufacturer"), and load them into a MsgKeywordMF field (see XML below).

I have XML files that I am loading via DIH. This is an abbreviated example of the XML data (each file has repeating "Report" items, and each report has repeating MsgSet, Msg, MsgList, etc. items). Notice the nested repeating groups, namely the MsgItems, within each document (Report):


<Report>
  <ReportMeta>
    <ReportDate>02/22/2011</ReportDate>
    …
  </ReportMeta>
  <MsgSet>
    <Msg>
      <SourceDocID>http://someurl.com/path/to/doc</SourceDocID>
      …
      <DocumentText>........blah blah</DocumentText>
      <MsgList>
        <MsgItem>
          <MsgType>SomeType</MsgType>
          <Category>Location</Category>
          <Keyword>USA</Keyword>
        </MsgItem>
        <MsgItem>
          <MsgType>AnotherType</MsgType>
          <Category>Manufacturer</Category>
          <Keyword>Apple</Keyword>
        </MsgItem>
        …
      </MsgList>
    </Msg>
  </MsgSet>
</Report>
<Report>
…
</Report>
<Report>
…
</Report>
…

Here is my data-config.xml:


<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />

  <document>
    <entity name="fileload" rootEntity="false"
            processor="FileListEntityProcessor" fileName="^.*\.xml$" recursive="false" baseDir="/files/xml/">
      <entity name="report"
              rootEntity="true" pk="id"
              url="${fileload.fileAbsolutePath}" processor="XPathEntityProcessor"
              forEach="/Report/MsgSet/Msg" onError="skip"
              transformer="DateFormatTransformer,RegexTransformer">
        <field column="DocumentText" xpath="/Report/MsgSet/Msg/DocumentText"/>
        <field column="id" xpath="/Report/MsgSet/Msg/SourceDocID"/>
        <field column="MsgCategory" xpath="/Report/MsgSet/Msg/MsgList/MsgItem/Category" />
        <field column="MsgKeyword" xpath="/Report/MsgSet/Msg/MsgList/MsgItem/Keyword" />
        <field column="MsgKeywordMF" xpath="/Report/MsgSet/Msg/MsgList/MsgItem[Category='Manufacturer']/Keyword" />
        …
      </entity>
    </entity>
  </document>
</dataConfig>


As seen in my config and sample data above, I am extracting the repeating "Keyword" values into the MsgKeyword field. In addition, and this is the part that does NOT work, I am trying to extract into a separate field just the keywords whose "Category" is "Manufacturer":
<field column="MsgKeywordMF" xpath="/Report/MsgSet/Msg/MsgList/MsgItem[Category='Manufacturer']/Keyword" />

I have also tried:
<field column="MsgKeywordMF" xpath="/Report/MsgSet/Msg/MsgList/MsgItem[@Category='Manufacturer']/Keyword" />
after changing "Category" to an attribute of MsgItem (<MsgItem Category="Location">), but it too fails to match.

I have tested my xpath notation against my xml data file using various xpath evaluator tools, like within Eclipse, and it matches perfectly…but I can't get it to match/work during import.
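
For comparison, here is a minimal sketch of that kind of check done outside Solr, using Python's standard xml.etree.ElementTree, which accepts a child-element predicate such as [Category='Manufacturer']. The sample is the abbreviated XML from this post with the elided parts left out, and the function name manufacturer_keywords is made up for illustration. It only shows that the expression selects the expected node in a general-purpose evaluator; it says nothing about how DIH's XPathEntityProcessor handles the same expression.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample from the post; the elided parts are simply omitted.
SAMPLE = """
<Report>
  <MsgSet>
    <Msg>
      <SourceDocID>http://someurl.com/path/to/doc</SourceDocID>
      <MsgList>
        <MsgItem>
          <MsgType>SomeType</MsgType>
          <Category>Location</Category>
          <Keyword>USA</Keyword>
        </MsgItem>
        <MsgItem>
          <MsgType>AnotherType</MsgType>
          <Category>Manufacturer</Category>
          <Keyword>Apple</Keyword>
        </MsgItem>
      </MsgList>
    </Msg>
  </MsgSet>
</Report>
"""


def manufacturer_keywords(xml_text):
    """Return the Keyword text of every MsgItem whose Category is 'Manufacturer'."""
    root = ET.fromstring(xml_text.strip())  # root is the <Report> element
    # ElementTree paths are relative to the element they are called on,
    # so the leading /Report step is dropped here.
    path = "MsgSet/Msg/MsgList/MsgItem[Category='Manufacturer']/Keyword"
    return [keyword.text for keyword in root.findall(path)]


print(manufacturer_keywords(SAMPLE))  # → ['Apple']
```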

As far as I understand, DIH does not support nested/correlated entities, at least not for XML data sources using nested entity tags. I have tried to nest entities, without success, because I can't "correlate" the nested entity with its parent. I think the approach I'm trying should work, but no luck so far.

BTW, I can't easily change the xml format, although it is possible with some pain…

Any ideas?

TIA,
-- Eric


Re: DIH help request: nested xml entities and xpath

Posted by "Weiss, Eric" <we...@llnl.gov>.
I think my original question/thread was accidentally hijacked.  Let me take
this opportunity to refocus the thread on my original question about DIH,
nested entities, and xpath.  I'll try to ask a very simple question
instead:

Why doesn't this field xpath work?  By "not working" I mean that the
MsgKeywordMF field is not populated in the index...unless I remove the
xpath filter.

<field column="MsgKeywordMF"
       xpath="/Report/MsgSet/Msg/MsgList/MsgItem[Category='Manufacturer']/Keyword" />

OR

<field column="MsgKeywordMF"
       xpath="/Report/MsgSet/Msg/MsgList/MsgItem[@Category='Manufacturer']/Keyword" />

For the second variant, I modified the original XML so that Category is an
attribute of MsgItem instead.  It still does not work, even though it
matches in other tools and the attribute-predicate syntax is explicitly
documented on the DIH wiki page.




Full details are in my original post.

Thx,
-- Eric


On 5/13/11 1:58 AM, "Gora Mohanty" <go...@mimirtech.com> wrote:

>On Fri, May 13, 2011 at 10:18 AM, Ashique <as...@gmail.com> wrote:
>> Hi All,
>>
>> I am a Java/J2ee programmer and very new to SOLR. I would  like to index a
>> table in a postgresSql database to SOLR. Then searching the records from a
>> GUI (Jsp Page) and showing the results in tabular form. Could any one help
>> me out with a simple sample code.
>[...]
>
>This is too broad a question. Please start out by looking
>at the extensive Solr documentation:
>* Complete list: http://wiki.apache.org/solr/FrontPage
>* Initial tutorial: http://lucene.apache.org/solr/tutorial.html
>  It is a good idea to first ensure that you are able to get
>  this working.
>* If you are using Java, this should be of interest:
>  http://wiki.apache.org/solr/SolJava
>* For easy data import from a database, you could consider
>  using the DataImportHandler:
>  http://wiki.apache.org/solr/DataImportHandler
>
>You can ask here if you run into issues while trying these out.
>
>Regards,
>Gora


Re: DIH help request: nested xml entities and xpath

Posted by Gora Mohanty <go...@mimirtech.com>.
On Fri, May 13, 2011 at 10:18 AM, Ashique <as...@gmail.com> wrote:
> Hi All,
>
> I am a Java/J2ee programmer and very new to SOLR. I would  like to index a
> table in a postgresSql database to SOLR. Then searching the records from a
> GUI (Jsp Page) and showing the results in tabular form. Could any one help
> me out with a simple sample code.
[...]

This is too broad a question. Please start out by looking
at the extensive Solr documentation:
* Complete list: http://wiki.apache.org/solr/FrontPage
* Initial tutorial: http://lucene.apache.org/solr/tutorial.html
  It is a good idea to first ensure that you are able to get
  this working.
* If you are using Java, this should be of interest:
  http://wiki.apache.org/solr/SolJava
* For easy data import from a database, you could consider
  using the DataImportHandler:
  http://wiki.apache.org/solr/DataImportHandler

You can ask here if you run into issues while trying these out.

Regards,
Gora

Re: DIH help request: nested xml entities and xpath

Posted by Ashique <as...@gmail.com>.
Hi All,

I am a Java/J2EE programmer and very new to Solr. I would like to index a
table in a PostgreSQL database into Solr, then search the records from a
GUI (JSP page) and show the results in tabular form. Could anyone help
me out with some simple sample code?

Thank you.

Regards,
Ashique


Re: DIH help request: nested xml entities and xpath

Posted by "Weiss, Eric" <we...@llnl.gov>.
Thx kbootz for the reply.  I ended up writing a custom transformer that
seems to do what I need now.  I think I could make the script work, as you
suggested, too.  The script might even be preferable since I could
add/change/mod without recompiling.
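
The transformer itself is not shown in this thread. One way such a row-level step could work is sketched below in Python, purely to illustrate the logic; a real DIH transformer would be written in Java, or in JavaScript under ScriptTransformer. The sketch assumes the unfiltered MsgCategory and MsgKeyword xpaths deliver parallel lists in document order, with every MsgItem carrying exactly one Category and one Keyword. That is an assumption, not something confirmed here, and the function name add_manufacturer_keywords is hypothetical.

```python
def add_manufacturer_keywords(row):
    """Copy the keywords whose category is 'Manufacturer' into MsgKeywordMF.

    `row` is a plain dict standing in for one imported document.
    Assumes MsgCategory and MsgKeyword are parallel lists (same order,
    same length); this is an illustration, not the DIH API.
    """
    categories = row.get("MsgCategory") or []
    keywords = row.get("MsgKeyword") or []
    # zip() pairs values by position and stops at the shorter list.
    row["MsgKeywordMF"] = [
        keyword
        for category, keyword in zip(categories, keywords)
        if category == "Manufacturer"
    ]
    return row


row = {
    "id": "http://someurl.com/path/to/doc",
    "MsgCategory": ["Location", "Manufacturer"],
    "MsgKeyword": ["USA", "Apple"],
}
print(add_manufacturer_keywords(row)["MsgKeywordMF"])  # → ['Apple']
```

If a MsgItem can lack a Category or a Keyword, the two lists drift out of step and positional pairing would attach keywords to the wrong category, so that case would need different handling.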

Thx again,

-- Eric






On 5/14/11 1:12 PM, "kbootz" <kb...@caci.com> wrote:

>Have you tried using a scripttransformer per the wiki:
>http://wiki.apache.org/solr/DataImportHandler.
>
>
>
>--
>View this message in context:
>http://lucene.472066.n3.nabble.com/DIH-help-request-nested-xml-entities-and-xpath-tp2937919p2941151.html
>Sent from the Solr - User mailing list archive at Nabble.com.


Re: DIH help request: nested xml entities and xpath

Posted by kbootz <kb...@caci.com>.
Have you tried using a ScriptTransformer, per the wiki?
http://wiki.apache.org/solr/DataImportHandler



--
View this message in context: http://lucene.472066.n3.nabble.com/DIH-help-request-nested-xml-entities-and-xpath-tp2937919p2941151.html
Sent from the Solr - User mailing list archive at Nabble.com.