Posted to user@nutch.apache.org by Iain Lopata <il...@hotmail.com> on 2014/03/05 13:18:50 UTC
Tika Parsing XML Incorrect Outlink Extraction
I am attempting to parse a page of mime-type application/xml using Tika.
The debug log shows that it is being parsed by
org.apache.tika.parser.xml.DcXMLParser.
However, if the document is structured as follows:
<urlset>
<url><loc>http://www.example.com/index.html</loc><lastmod>2014-02-15T01:30Z</lastmod><changefreq>monthly</changefreq></url>
</urlset>
I receive a single outlink which incorrectly concatenates the loc and
lastmod elements so that the outlink reads as:
http://www.example.com/index.html2014-02-15T01:30Z
If I reformat with carriage returns and line feeds, but in no other way
change the xml document so that it is now:
<urlset>
<url>
<loc>http://www.example.com/index.html</loc>
<lastmod>2014-02-15T01:30Z</lastmod>
<changefreq>monthly</changefreq>
</url>
</urlset>
I then receive two outlinks, the first being correct and the second being an
erroneous extraction from the lastmod element:
http://www.example.com/index.html and T01:30Z
If I then remove the colon from the time/datestamp in the lastmod element I
receive the single outlink http://www.example.com/index.html that I would
originally have expected.
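The thread does not confirm the cause, but all three observations are consistent with links being picked out of the flattened text content by a permissive "scheme:rest" pattern. The sketch below assumes exactly that. The pattern is a hypothetical stand-in, not the one Nutch or Tika actually uses, and only the loc and lastmod text is modelled.

```python
import re

# Hypothetical permissive pattern: a letter, optional scheme characters,
# a colon, then everything up to the next whitespace or angle bracket.
# This is an assumption for illustration, not Nutch's real pattern.
LINK_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:[^\s<>]+")

def find_links(text):
    """Return every substring of text that looks like scheme:rest."""
    return LINK_PATTERN.findall(text)

# Case 1: elements on one line, so the text nodes are adjacent
# with no whitespace between them.
compact = "http://www.example.com/index.html" + "2014-02-15T01:30Z"

# Case 2: elements on separate lines, so a newline separates the text nodes.
spaced = "http://www.example.com/index.html\n2014-02-15T01:30Z\n"

# Case 3: as case 2, but with the colon removed from the timestamp.
no_colon = "http://www.example.com/index.html\n2014-02-15T0130Z\n"

print(find_links(compact))   # ['http://www.example.com/index.html2014-02-15T01:30Z']
print(find_links(spaced))    # ['http://www.example.com/index.html', 'T01:30Z']
print(find_links(no_colon))  # ['http://www.example.com/index.html']
```

Under this assumption, "T01" is read as a scheme name because it starts with a letter and is followed by a colon, which would explain why removing the colon makes the spurious outlink disappear.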
Any ideas as to what might be going on and how I can correctly parse the
original document? If Tika cannot parse this correctly, shouldn't Nutch at
least validate the format of the returned outlinks and discard those
that are invalid?
Thanks!
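As an illustration of the kind of format validation asked about above, here is a minimal sketch of a filter applied to extracted outlinks. The function name and the accepted schemes are assumptions made for this example and are not part of Nutch.

```python
from urllib.parse import urlparse

# Schemes accepted by this sketch; an assumption, adjust as needed.
ALLOWED_SCHEMES = {"http", "https"}

def is_plausible_outlink(candidate):
    """Accept only candidates with an allowed scheme and a host part."""
    parts = urlparse(candidate)
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.netloc)

outlinks = [
    "http://www.example.com/index.html",
    "T01:30Z",
    "http://www.example.com/index.html2014-02-15T01:30Z",
]

# Keep only the candidates that pass the check.
kept = [link for link in outlinks if is_plausible_outlink(link)]
print(kept)
```

A check like this would discard "T01:30Z", but it cannot catch the concatenated link from the first case, because that string is still a syntactically valid URL. Validation alone therefore does not replace parsing the loc element on its own.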
RE: Tika Parsing XML Incorrect Outlink Extraction
Posted by Iain Lopata <il...@hotmail.com>.
Sebastian,
Thank you. It is indeed a sitemap.
I had reviewed NUTCH-1465.
Interestingly, however, I have successfully parsed dozens of other sitemaps using Tika.
Perhaps I had just been lucky so far?
Thanks
Re: Tika Parsing XML Incorrect Outlink Extraction
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Iain,
the document looks like a sitemap (sitemaps.org).
Support for sitemaps is ongoing work, see NUTCH-1465.
The point is: you cannot expect every XML-based
format to be parsed properly by Tika. In the case of sitemaps,
it's not only the outlinks but also re-fetch intervals
and last-modified times which have to be transferred
into Nutch's data structures.
Sebastian
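To make the point about sitemap fields concrete, here is a minimal sketch that reads loc, lastmod and changefreq per url element with an XML parser instead of extracting links from flattened text. It is not the NUTCH-1465 implementation. The function names and the dictionary layout are assumptions for this example.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Drop an XML namespace prefix of the form {uri} if one is present."""
    return tag.rsplit("}", 1)[-1]

def parse_sitemap(xml_text):
    """Return one dict per <url> element with loc, lastmod and changefreq."""
    root = ET.fromstring(xml_text)
    entries = []
    for url_el in root:
        if local_name(url_el.tag) != "url":
            continue
        # Map each child element name to its trimmed text content.
        fields = {local_name(c.tag): (c.text or "").strip() for c in url_el}
        if fields.get("loc"):
            entries.append({
                "loc": fields["loc"],
                "lastmod": fields.get("lastmod"),
                "changefreq": fields.get("changefreq"),
            })
    return entries

# The single-line document from the original question.
SAMPLE = (
    "<urlset><url><loc>http://www.example.com/index.html</loc>"
    "<lastmod>2014-02-15T01:30Z</lastmod>"
    "<changefreq>monthly</changefreq></url></urlset>"
)

for entry in parse_sitemap(SAMPLE):
    print(entry["loc"], entry["lastmod"], entry["changefreq"])
```

Because each field is read from its own element, the result does not depend on whether the elements share a line, and lastmod and changefreq stay available for scheduling instead of leaking into the outlinks.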
RE: Tika Parsing XML Incorrect Outlink Extraction
Posted by Iain Lopata <il...@hotmail.com>.
My apologies, but I realized that the formatting of my XML was not preserved
in the email. Hopefully it is clear enough that in the first case the <loc>
<lastmod> and <changefreq> elements are all on the same line and in the
second case they have been moved to separate lines.