
Posted to user@nutch.apache.org by Iain Lopata <il...@hotmail.com> on 2014/03/05 13:18:50 UTC

Tika Parsing XML Incorrect Outlink Extraction

I am attempting to parse a page of mime-type application/xml using Tika.
The debug log shows that it is being parsed by
org.apache.tika.parser.xml.DcXMLParser.

 

However, if the document is structured as follows:

 

<urlset>
<url><loc>http://www.example.com/index.html</loc><lastmod>2014-02-15T01:30Z</lastmod><changefreq>monthly</changefreq></url>
</urlset>

 

I receive a single outlink which incorrectly concatenates the loc and
lastmod elements so that the outlink reads as:
http://www.example.com/index.html2014-02-15T01:30Z

 

If I reformat with carriage returns and line feeds, but in no other way
change the xml document so that it is now:

 

<urlset>
        <url>
                <loc>http://www.example.com/index.html</loc>
                <lastmod>2014-02-15T01:30Z</lastmod>
                <changefreq>monthly</changefreq>
        </url>
</urlset>

 

I then receive two outlinks, the first being correct and the second being an
erroneous extraction from the lastmod element:
http://www.example.com/index.html and T01:30Z

 

If I then remove the colon from the time/datestamp in the lastmod element I
receive the single outlink http://www.example.com/index.html that I would
originally have expected.

 

Any ideas as to what might be going on and how I can correctly parse the
original document?  If Tika cannot parse this correctly shouldn't Nutch at
least perform a format validation on the returned outlinks and discard those
that are invalid?
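
To make the question concrete, here is a minimal sketch of the kind of format validation meant above. It is not Nutch code: the class OutlinkFilter and its methods are hypothetical names, and it uses only java.net.URI from the JDK. It assumes an outlink is acceptable only if it is an absolute http or https URL with a host.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Nutch or Tika.
public class OutlinkFilter {

    // Accept only absolute http/https URLs that carry a host component.
    static boolean isPlausibleOutlink(String candidate) {
        if (candidate == null) {
            return false;
        }
        try {
            URI uri = new URI(candidate.trim());
            String scheme = uri.getScheme();
            if (scheme == null) {
                return false;
            }
            scheme = scheme.toLowerCase(Locale.ROOT);
            boolean webScheme = scheme.equals("http") || scheme.equals("https");
            return webScheme && uri.getHost() != null;
        } catch (URISyntaxException e) {
            // Not parseable as a URI at all.
            return false;
        }
    }

    // Keep only the candidates that pass the check above, preserving order.
    static List<String> filter(List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (isPlausibleOutlink(candidate)) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> extracted = new ArrayList<>();
        extracted.add("http://www.example.com/index.html");
        extracted.add("T01:30Z");
        extracted.add("http://www.example.com/index.html2014-02-15T01:30Z");
        for (String url : filter(extracted)) {
            System.out.println(url);
        }
    }
}
```

A check like this would discard T01:30Z, because "T01" is read as an unknown scheme. It would not discard the concatenated http://www.example.com/index.html2014-02-15T01:30Z, which is still a syntactically valid URL, so validation alone cannot fix the first case.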

 

Thanks!

RE: Tika Parsing XML Incorrect Outlink Extraction

Posted by Iain Lopata <il...@hotmail.com>.

Sebastian,

Thank you.  It is indeed a sitemap.

I had reviewed NUTCH-1465.

Interestingly however, I have successfully parsed literally dozens of other sitemaps using Tika.

Perhaps I had just been lucky so far?

Thanks


Re: Tika Parsing XML Incorrect Outlink Extraction

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Iain,

the document looks like a sitemap (sitemaps.org).
Support for sitemaps is ongoing work, see NUTCH-1465.
The point is: you cannot expect an arbitrary XML-based
format to be parsed properly by Tika. In the case of sitemaps,
it's not only the outlinks but also the re-fetch intervals
and last-modified times which have to be transferred
into Nutch's data structures.
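
As a rough illustration of what format-aware handling means here, the sketch below reads loc, lastmod and changefreq as separate fields using only the JDK's DOM parser. It is not the NUTCH-1465 implementation: the class SitemapSketch and its Entry type are hypothetical, and mapping the values into Nutch's data structures is left out entirely.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch, not the Nutch sitemap support.
public class SitemapSketch {

    // One <url> entry; a field is null when the element is absent.
    static final class Entry {
        final String loc;
        final String lastmod;
        final String changefreq;

        Entry(String loc, String lastmod, String changefreq) {
            this.loc = loc;
            this.lastmod = lastmod;
            this.changefreq = changefreq;
        }
    }

    // Trimmed text of the first descendant element with the given tag name.
    static String childText(Element parent, String name) {
        NodeList nodes = parent.getElementsByTagName(name);
        if (nodes.getLength() == 0) {
            return null;
        }
        return nodes.item(0).getTextContent().trim();
    }

    // Parse a sitemap string into entries, keeping each field separate.
    static List<Entry> parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList urls = doc.getElementsByTagName("url");
            List<Entry> entries = new ArrayList<>();
            for (int i = 0; i < urls.getLength(); i++) {
                Element url = (Element) urls.item(i);
                entries.add(new Entry(
                    childText(url, "loc"),
                    childText(url, "lastmod"),
                    childText(url, "changefreq")));
            }
            return entries;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a parseable sitemap", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<urlset><url><loc>http://www.example.com/index.html</loc>"
            + "<lastmod>2014-02-15T01:30Z</lastmod>"
            + "<changefreq>monthly</changefreq></url></urlset>";
        for (Entry e : parse(xml)) {
            System.out.println(e.loc + " | " + e.lastmod + " | " + e.changefreq);
        }
    }
}
```

Because each element is read by name, the result does not depend on whether the elements share a line, and only loc is ever treated as a link.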

Sebastian



RE: Tika Parsing XML Incorrect Outlink Extraction

Posted by Iain Lopata <il...@hotmail.com>.

My apologies, but I realized that the formatting of my XML was not preserved
in the email.  Hopefully it is clear enough that in the first case the <loc>
<lastmod> and <changefreq> elements are all on the same line and in the
second case they have been moved to separate lines.
