Posted to solr-user@lucene.apache.org by Farhan Ali <fa...@gmail.com> on 2014/03/06 01:14:14 UTC

Problem with indexing xml using DataImportHandler and XPath

Hi,
I am new to Solr and am trying to index some XML documents using DIH
and XPath, but I cannot get it to work. The import responds with a
success message, but no documents are added to the index. I do not know
what I am doing wrong.

This is my data config XML file:


<dataConfig>
  <dataSource type="FileDataSource"/>
  <document>
    <entity name="nytxmldir"
            rootEntity="false"
            datasource="null"
            processor="FileListEntityProcessor"
            fileName=".*\.xml"
            recursive="true"
            baseDir="/home/farhan/Downloads/nytxml">

      <entity name="nytxml"
              pk="id"
              datasource="nytxmldir"
              url="${nytxmldir.fileAbsolutePath}"
              processor="XPathEntityProcessor"
              forEach="/ntif"
              transformer="RegexTransformer">

        <field column="id" xpath="/ntif/head/docdata/doc-id/@id-string"/>
        <field column="title" xpath="/ntif/head/title"/>
        <field column="paragraph" xpath="/ntif/body/body.content/block[@class='full_text']/p"/>

      </entity>
    </entity>
  </document>
</dataConfig>





This is my XML document:


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE nitf SYSTEM "http://www.nitf.org/IPTC/NITF/3.3/specification/dtd/nitf-3-3.dtd">
<nitf change.date="June 10, 2005" change.time="19:30" version="-//IPTC//DTD NITF 3.3//EN">
  <head>
    <title>Paid Notice: Deaths   BRADLEY, CAROL L.</title>
    <meta content="dn010107" name="slug"/>
    <meta content="1" name="publication_day_of_month"/>
    <meta content="1" name="publication_month"/>
    <meta content="2007" name="publication_year"/>
    <meta content="Monday" name="publication_day_of_week"/>
    <meta content="Classified" name="dsk"/>
    <meta content="7" name="print_page_number"/>
    <meta content="B" name="print_section"/>
    <meta content="3" name="print_column"/>
    <meta content="Paid Death Notices" name="online_sections"/>
    <docdata>
      <doc-id id-string="1815719"/>
      <doc.copyright holder="The New York Times" year="2007"/>
      <identified-content>
        <person class="indexing_service">BRADLEY, CAROL L.</person>
        <classifier class="online_producer" type="types_of_material">Paid Death Notice</classifier>
        <classifier class="online_producer" type="taxonomic_classifier">Top/Classifieds/Paid Death Notices</classifier>
      </identified-content>
    </docdata>
    <pubdata date.publication="20070101T000000"
             ex-ref="http://query.nytimes.com/gst/fullpage.html?res=9B06E1DE1E3AF932A35752C0A9619C8B63"
             item-length="49" name="The New York Times" unit-of-measure="word"/>
  </head>
  <body>
    <body.head>
      <hedline>
        <hl1>Paid Notice: Deaths   BRADLEY, CAROL L.</hl1>
      </hedline>
    </body.head>
    <body.content>
      <block class="lead_paragraph">
        <p>BRADLEY--Carol L., 84, of Tinton Falls, NJ died peacefully at
Seabrook Village on December 27. Beloved wife of Floyd (Pete) Bradley, Jr.;
loving mother of Steven, Floyd and Lynette Bradley; adored grandmother of
Victoria Kent and Camilla, William and Melissa Bradley; caring
stepgrandmother of Matthew and Charlton Field.</p>
      </block>
      <block class="full_text">
        <p>BRADLEY--Carol L., 84, of Tinton Falls, NJ died peacefully at
Seabrook Village on December 27. Beloved wife of Floyd (Pete) Bradley, Jr.;
loving mother of Steven, Floyd and Lynette Bradley; adored grandmother of
Victoria Kent and Camilla, William and Melissa Bradley; caring
stepgrandmother of Matthew and Charlton Field.</p>
      </block>
    </body.content>
  </body>
</nitf>


I am really stumped as to why it is not working. I know DIH does not
support the full XPath syntax, but according to the wiki it supports the
limited XPath subset that I am using. I have also read various internet
forums where people suggest using Groovy and XSLT, which I am unfamiliar with.
I hope someone can help me.

Thanks
Farhan

Re: Problem with indexing xml using DataImportHandler and XPath

Posted by Erick Erickson <er...@gmail.com>.
NP, Been there, done that, got the t-shirt :)...


On Wed, Mar 5, 2014 at 9:51 PM, Farhan Ali <fa...@gmail.com> wrote:
> Sorry, I figured out my problem. It was a stupid mistake on my part. Once
> again, sorry for that.
>
> Thanks
> Farhan

Re: Problem with indexing xml using DataImportHandler and XPath

Posted by Farhan Ali <fa...@gmail.com>.
Sorry, I figured out my problem. It was a stupid mistake on my part. Once
again, sorry for that.

Thanks
Farhan
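
The thread never states what the mistake was. One mismatch is visible in the
posted material, though: the config's forEach and every xpath start at /ntif,
while the sample document's root element is <nitf>. A forEach that matches
nothing would leave XPathEntityProcessor with no rows to emit, which fits the
symptom of a "successful" import that adds no documents. The sketch below is a
guess at a corrected config under that assumption, not the poster's confirmed
fix. It also spells the entity attribute as dataSource (DIH attribute names
appear to be case-sensitive) and drops the inner entity's data source
reference so that it falls back to the single unnamed FileDataSource. Both of
those changes are assumptions as well.

```xml
<dataConfig>
  <dataSource type="FileDataSource"/>
  <document>
    <!-- Outer entity only lists files; it produces no Solr documents itself. -->
    <entity name="nytxmldir"
            rootEntity="false"
            dataSource="null"
            processor="FileListEntityProcessor"
            fileName=".*\.xml"
            recursive="true"
            baseDir="/home/farhan/Downloads/nytxml">

      <!-- Assumed fix: forEach and xpaths use /nitf to match the root element. -->
      <entity name="nytxml"
              pk="id"
              url="${nytxmldir.fileAbsolutePath}"
              processor="XPathEntityProcessor"
              forEach="/nitf"
              transformer="RegexTransformer">

        <field column="id" xpath="/nitf/head/docdata/doc-id/@id-string"/>
        <field column="title" xpath="/nitf/head/title"/>
        <field column="paragraph" xpath="/nitf/body/body.content/block[@class='full_text']/p"/>

      </entity>
    </entity>
  </document>
</dataConfig>
```

The id, title, and paragraph columns are assumed to exist as fields in the
schema. If they do not, the import can still report success without storing
those values.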

