Posted to user@nutch.apache.org by th...@wellsfargo.com on 2012/03/22 20:32:26 UTC

canonical tag support

Ran across a posting for the Nutch roadmap mentioning support for the canonical tag.
http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/nutch/Nutch2Roadmap
Is there any update as to when this support will be added to Nutch?






Re: canonical tag support

Posted by Julien Nioche <li...@gmail.com>.
Hi Sebastian

You can use

<property>
  <name>db.parsemeta.to.crawldb</name>
  <value></value>
  <description>Comma-separated list of parse metadata keys to transfer to
  the crawldb (NUTCH-779). Assuming for instance that the languageidentifier
  plugin is enabled, setting the value to 'lang' will copy both the key
  'lang' and its value to the corresponding entry in the crawldb.
  </description>
</property>

to copy the value from the parse metadata to the crawl metadata.
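
As a rough illustration of what this property describes (not Nutch code, just a sketch in Python): the configured keys are looked up in the parse metadata and copied, with their values, into the metadata of the crawldb entry. The function name, the dict-based metadata and the 'canonical' key are assumptions made for the example; a parse filter would first have to store the canonical target under some such key.

```python
def transfer_parse_meta(parse_meta, crawl_meta, keys_property):
    """Copy the configured parse-metadata keys into a crawldb entry's metadata.

    keys_property mimics the comma-separated value of db.parsemeta.to.crawldb.
    Plain dicts stand in for Nutch's metadata containers (an assumption).
    """
    keys = [k.strip() for k in keys_property.split(",") if k.strip()]
    updated = dict(crawl_meta)  # leave the caller's dict untouched
    for key in keys:
        if key in parse_meta:
            updated[key] = parse_meta[key]
    return updated


# 'canonical' is a hypothetical key written by a custom parse filter.
parse_meta = {"lang": "en", "canonical": "http://example.com/page", "title": "Page"}
crawl_meta = {"note": "existing crawldb metadata"}
print(transfer_parse_meta(parse_meta, crawl_meta, "lang, canonical"))
```

Keys that are not listed (here 'title') stay in the parse metadata only, and an empty property value transfers nothing.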

I expect the gora branch to make the resolution of the chains simpler.

Thanks

Julien

On 23 March 2012 20:18, Sebastian Nagel <wa...@googlemail.com> wrote:

> Hi,
>
> there is already an issue open:
> https://issues.apache.org/jira/browse/NUTCH-710
>
> I've just been struggling with the rel=canonical tag.
> About 70% of the documents of the crawled site had this tag set.
> The quick solution was to write a parse filter that extracts the
> tag and an indexing filter that skips all documents with this tag.
> As Julien mentioned in the issue, this has the drawback that
> some content may get lost: docs whose canonical tag points to gone
> documents, to redirects, or to docs that themselves have canonical tags.
> However, in my case that is far better than so many duplicates.
> A real solution would be somewhat difficult, esp. for Nutch 1.x,
> because resolving chains of canonical tags and/or redirects
> would mean iterating several times over the data / CrawlDb.
>
> As a first step, what about writing the target of the canonical tag
> to the CrawlDatum's metadata? That would make a solution that iterates
> over the CrawlDb possible. And an indexing filter that skips those
> URLs/documents would be trivial to implement. Any suggestions?
>
> Sebastian
>
>
> On 03/22/2012 08:40 PM, Markus Jelsma wrote:
>
>> This is not supported by Nutch and there's no issue ticket yet. Feel free
>> to open one.
>>
>> On Thu, 22 Mar 2012 14:32:26 -0500, <th...@wellsfargo.com> wrote:
>>
>>> Ran across a posting for the Nutch roadmap mentioning support for the
>>> canonical tag.
>>>
>>> http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/nutch/Nutch2Roadmap
>>> Is there any update as to when this support will be added to Nutch?
>>>
>>
>>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: canonical tag support

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

there is already an issue open:
https://issues.apache.org/jira/browse/NUTCH-710

I've just been struggling with the rel=canonical tag.
About 70% of the documents of the crawled site had this tag set.
The quick solution was to write a parse filter that extracts the
tag and an indexing filter that skips all documents with this tag.
As Julien mentioned in the issue, this has the drawback that
some content may get lost: docs whose canonical tag points to gone
documents, to redirects, or to docs that themselves have canonical tags.
However, in my case that is far better than so many duplicates.
A real solution would be somewhat difficult, esp. for Nutch 1.x,
because resolving chains of canonical tags and/or redirects
would mean iterating several times over the data / CrawlDb.
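
The quick solution described above (extract the tag, then skip every document that carries it) can be sketched roughly as follows. This is plain Python using only the standard library, not a Nutch parse or indexing filter; the class and function names are made up for the illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalLinkParser(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated values, in any letter case
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_values and attr_map.get("href"):
            self.canonical = attr_map["href"]


def extract_canonical(html, base_url):
    """Return the absolute canonical URL of a page, or None if it has no tag."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    if parser.canonical is None:
        return None
    return urljoin(base_url, parser.canonical)


def should_index(url, html):
    """Quick-solution policy: index only documents without a canonical tag."""
    return extract_canonical(html, url) is None


page = '<html><head><link rel="canonical" href="/a"></head><body>x</body></html>'
print(extract_canonical(page, "http://example.com/a?session=1"))
print(should_index("http://example.com/a?session=1", page))
```

Note that this literal policy also drops pages whose canonical tag points to themselves, and it does nothing about targets that are gone or redirected, which is exactly the content-loss drawback mentioned above.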

As a first step, what about writing the target of the canonical tag
to the CrawlDatum's metadata? That would make a solution that iterates
over the CrawlDb possible. And an indexing filter that skips those
URLs/documents would be trivial to implement. Any suggestions?
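
To make the chain problem concrete, here is a minimal sketch in Python of what resolving canonical targets would involve once each entry carries its target. It assumes, purely for illustration, that the whole url -> target mapping fits into an in-memory dict; in Nutch 1.x each hop would instead cost another pass over the CrawlDb. All names (resolve_canonical, urls_to_skip, max_hops) are hypothetical, and keeping documents with broken chains is just one possible policy.

```python
def resolve_canonical(url, canonical_of, known_urls, max_hops=10):
    """Follow canonical (or redirect) targets starting at url.

    Returns the final URL of the chain, or None if the chain ends at an
    unknown/gone document, loops, or is longer than max_hops.
    """
    seen = {url}
    current = url
    for _ in range(max_hops):
        target = canonical_of.get(current)
        if target is None or target == current:
            # end of chain: usable only if the document actually exists
            return current if current in known_urls else None
        if target in seen:
            return None  # loop, e.g. p -> q -> p
        seen.add(target)
        current = target
    return None  # chain too long, give up


def urls_to_skip(canonical_of, known_urls):
    """URLs that resolve to a different, existing document.

    Documents whose chain is broken are kept, so no content is lost.
    """
    skip = set()
    for url in known_urls:
        final = resolve_canonical(url, canonical_of, known_urls)
        if final is not None and final != url:
            skip.add(url)
    return skip


# a -> b -> c is a chain, x points to a gone document, p and q form a loop
canonical_of = {"a": "b", "b": "c", "x": "gone", "p": "q", "q": "p"}
known_urls = {"a", "b", "c", "x", "p", "q"}
print(sorted(urls_to_skip(canonical_of, known_urls)))
```

Here only "a" and "b" are skipped in favour of "c"; "x", "p" and "q" stay indexable because their canonical targets cannot be resolved.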

Sebastian

On 03/22/2012 08:40 PM, Markus Jelsma wrote:
> This is not supported by Nutch and there's no issue ticket yet. Feel free to open one.
>
> On Thu, 22 Mar 2012 14:32:26 -0500, <th...@wellsfargo.com> wrote:
>> Ran across a posting for the Nutch roadmap mentioning support for the
>> canonical tag.
>>
>> http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/nutch/Nutch2Roadmap
>> Is there any update as to when this support will be added to Nutch?
>


Re: canonical tag support

Posted by Markus Jelsma <ma...@openindex.io>.
This is not supported by Nutch and there's no issue ticket yet. Feel free to open one.

On Thu, 22 Mar 2012 14:32:26 -0500, <th...@wellsfargo.com> wrote:
> Ran across a posting for the Nutch roadmap mentioning support for the
> canonical tag.
> 
> http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/nutch/Nutch2Roadmap
> Is there any update as to when this support will be added to Nutch?