Posted to solr-user@lucene.apache.org by Farhan Ali <fa...@gmail.com> on 2014/03/06 01:14:14 UTC
Problem with indexing xml using DataImportHandler and XPath
Hi,
I am a newbie to Solr and I am trying to index some XML documents using DIH
and XPath, but I am unable to do it. I get a response message of successful
indexing, but no document is added to the index. I do not know what I am
doing wrong.
This is my data config XML file:
<dataConfig>
  <dataSource type="FileDataSource"/>
  <document>
    <entity name="nytxmldir" rootEntity="false"
            datasource="null"
            processor="FileListEntityProcessor"
            fileName=".*\.xml"
            recursive="true"
            baseDir="/home/farhan/Downloads/nytxml">
      <entity name="nytxml"
              pk="id"
              datasource="nytxmldir"
              url="${nytxmldir.fileAbsolutePath}"
              processor="XPathEntityProcessor"
              forEach="/ntif"
              transformer="RegexTransformer">
        <field column="id"
               xpath="/ntif/head/docdata/doc-id/@id-string"/>
        <field column="title"
               xpath="/ntif/head/title"/>
        <field column="paragraph"
               xpath="/ntif/body/body.content/block[@class='full_text']/p"/>
      </entity>
    </entity>
  </document>
</dataConfig>
This is my XML document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE nitf SYSTEM "http://www.nitf.org/IPTC/NITF/3.3/specification/dtd/nitf-3-3.dtd">
<nitf change.date="June 10, 2005" change.time="19:30" version="-//IPTC//DTD NITF 3.3//EN">
<head>
<title>Paid Notice: Deaths BRADLEY, CAROL L.</title>
<meta content="dn010107" name="slug"/>
<meta content="1" name="publication_day_of_month"/>
<meta content="1" name="publication_month"/>
<meta content="2007" name="publication_year"/>
<meta content="Monday" name="publication_day_of_week"/>
<meta content="Classified" name="dsk"/>
<meta content="7" name="print_page_number"/>
<meta content="B" name="print_section"/>
<meta content="3" name="print_column"/>
<meta content="Paid Death Notices" name="online_sections"/>
<docdata>
<doc-id id-string="1815719"/>
<doc.copyright holder="The New York Times" year="2007"/>
<identified-content>
<person class="indexing_service">BRADLEY, CAROL L.</person>
<classifier class="online_producer" type="types_of_material">Paid Death Notice</classifier>
<classifier class="online_producer"
type="taxonomic_classifier">Top/Classifieds/Paid Death Notices</classifier>
</identified-content>
</docdata>
<pubdata date.publication="20070101T000000" ex-ref="http://query.nytimes.com/gst/fullpage.html?res=9B06E1DE1E3AF932A35752C0A9619C8B63"
item-length="49" name="The New York Times" unit-of-measure="word"/>
</head>
<body>
<body.head>
<hedline>
<hl1>Paid Notice: Deaths BRADLEY, CAROL L.</hl1>
</hedline>
</body.head>
<body.content>
<block class="lead_paragraph">
<p>BRADLEY--Carol L., 84, of Tinton Falls, NJ died peacefully at
Seabrook Village on December 27. Beloved wife of Floyd (Pete) Bradley, Jr.;
loving mother of Steven, Floyd and Lynette Bradley; adored grandmother of
Victoria Kent and Camilla, William and Melissa Bradley; caring
stepgrandmother of Matthew and Charlton Field.</p>
</block>
<block class="full_text">
<p>BRADLEY--Carol L., 84, of Tinton Falls, NJ died peacefully at
Seabrook Village on December 27. Beloved wife of Floyd (Pete) Bradley, Jr.;
loving mother of Steven, Floyd and Lynette Bradley; adored grandmother of
Victoria Kent and Camilla, William and Melissa Bradley; caring
stepgrandmother of Matthew and Charlton Field.</p>
</block>
</body.content>
</body>
</nitf>
I am really stumped as to why it is not working. I know DIH does not
support full XPath syntax, but according to the wiki it supports the limited
XPath syntax that I am using. I have also read various internet forums where
people suggested using Groovy and XSLT, which I am unfamiliar with.
I hope someone can help me.
Thanks
Farhan
Re: Problem with indexing xml using DataImportHandler and XPath
Posted by Erick Erickson <er...@gmail.com>.
NP. Been there, done that, got the t-shirt :)
On Wed, Mar 5, 2014 at 9:51 PM, Farhan Ali <fa...@gmail.com> wrote:
> Sorry, I figured out my problem. It was a stupid mistake on my part. Once
> again, sorry for that.
>
> Thanks
> Farhan
Re: Problem with indexing xml using DataImportHandler and XPath
Posted by Farhan Ali <fa...@gmail.com>.
Sorry, I figured out my problem. It was a stupid mistake on my part. Once
again, sorry for that.
Thanks
Farhan
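
Note for readers who find this thread later: the follow-up does not say what
the mistake was. One mismatch is visible in the original post itself: the
forEach and xpath values in the config all start with /ntif, while the root
element of the sample document is nitf. A forEach that matches no element
would produce no documents, which fits the symptom described, but that is an
inference from the posted text and not something the thread confirms. A
minimal sketch of the inner entity with only that spelling changed and
everything else left as posted:

```xml
<!-- Sketch only: "ntif" changed to "nitf" so the paths match the sample
     document's root element. All other attributes are copied unchanged from
     the original post and have not been verified. -->
<entity name="nytxml"
        pk="id"
        datasource="nytxmldir"
        url="${nytxmldir.fileAbsolutePath}"
        processor="XPathEntityProcessor"
        forEach="/nitf"
        transformer="RegexTransformer">
  <field column="id"
         xpath="/nitf/head/docdata/doc-id/@id-string"/>
  <field column="title"
         xpath="/nitf/head/title"/>
  <field column="paragraph"
         xpath="/nitf/body/body.content/block[@class='full_text']/p"/>
</entity>
```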