Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2014/04/06 12:28:15 UTC

[jira] [Commented] (NUTCH-1615) Implementing A Feature for Fetching From Websites Dump

    [ https://issues.apache.org/jira/browse/NUTCH-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961376#comment-13961376 ] 

Sebastian Nagel commented on NUTCH-1615:
----------------------------------------

No question, reading an entire [Wikimedia dump|http://dumps.wikimedia.org/backup-index.html] into the web table would provide a nice playground for testing content extraction, link rank algorithms, etc. Crawling Wikipedia is not an alternative, both because of its size and because you are asked [not to do so|http://en.wikipedia.org/wiki/Wikipedia:Download#Please_do_not_use_a_web_crawler]. There are already tools to process Wikipedia dumps via Hadoop (e.g., search for "[hadoop process wikipedia dump|https://www.google.com/search?q=hadoop%20process%20wikipedia%20dump]"). But wiki markup is quite complex, and to convert it properly to HTML there is hardly any other choice than to set up your own MediaWiki server and import the Wikipedia dumps. The situation for other content management systems isn't any better: dumps can usually be generated, but the format isn't standardized. Consequently, there will probably be no way to implement a generalized tool that allows one to "fetch from website dumps".
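
To illustrate the markup problem, here is a minimal Python sketch (not Nutch code) that streams over a MediaWiki XML export and yields the title and text of each page. The sample document is hand-written and only imitates the page/title/revision/text layout of a real export; the helper names local_name and iter_pages and the namespace version in the sample are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET

# Hand-written sample imitating the layout of a MediaWiki XML export.
# The namespace version is an assumption; real dumps may declare another one.
SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Apache Nutch</title>
    <revision>
      <text>'''Apache Nutch''' is a [[web crawler]] written in [[Java (programming language)|Java]].</text>
    </revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs without loading the whole dump into memory."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # release the finished page's subtree


pages = list(iter_pages(io.BytesIO(SAMPLE_DUMP.encode("utf-8"))))
for page_title, wikitext in pages:
    print(page_title)
    print(wikitext)
```

What comes out is raw wiki markup (bold quotes, [[...]] links), not HTML, so the result would still have to be rendered before it resembles a fetched page.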

> Implementing A Feature for Fetching From Websites Dump
> ------------------------------------------------------
>
>                 Key: NUTCH-1615
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1615
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 2.1
>            Reporter: cihad güzel
>            Priority: Minor
>
> Some web sites provide dumps (for example, http://dumps.wikimedia.org/enwiki/ for wikipedia.org). We should fetch from the dumps for such web sites, so that fetching will be quicker.



--
This message was sent by Atlassian JIRA
(v6.2#6252)