
Posted to user@nutch.apache.org by Eyeris Rodriguez Rueda <er...@uci.cu> on 2016/11/04 16:16:29 UTC

How to insert outlinks from RSS into crawldb?

Hi.
I am using Nutch 1.12 and Solr 4.10.3.

I know that RSS feeds are parsed by Tika by default, and that the feed plugin can also be enabled to parse feed URLs.
RSS is an important way to discover new URLs, and it matters a lot for my use case.
In my case I have enabled only the Tika parser, because when both are enabled (Tika and feed) the content and outlinks fields are empty in Solr.
For some reason the outlinks are not extracted correctly with both enabled, but with Tika alone they are extracted fine.
My problem is that the outlinks detected inside a feed are indexed correctly in the outlinks field of the feed URL's document, but they are not added to the crawldb as URLs. See below:

http://www.cubadebate.cu/feed/

"url": "http://www.cubadebate.cu/feed/",
        "content": "..........",
        "tstamp": "2016-11-04T14:33:21.561Z",
        "segment": "20161104100311",
        "domain": "cubadebate.cu",
        "digest": "86af35325f0dcb671d24587ccda4ab64",
        "host": "www.cubadebate.cu",
        "boost": 1,
        "contentLength": 4085,
        "outlinks": [
          "http://www.cubadebate.cu/noticias/2016/11/04/fifa-cristiano-griezmann-y-messi-entre-los-23-nominados-al-trofeo-the-best/",
          "http://www.cubadebate.cu/noticias/2016/11/04/unicef-600-000-ninos-en-haiti-afectado-por-huracan-necesitan-ayuda/",
          "http://www.cubadebate.cu/noticias/2016/11/04/juan-pablo-escobar-el-dinero-de-la-droga-nunca-abandona-estados-unidos-video/",
          "http://www.cubadebate.cu/noticias/2016/11/04/que-trae-la-prensa-cubana-viernes-4-de-noviembre-de-2016/",
          "http://www.cubadebate.cu/noticias/2016/11/04/la-participacion-a-las-elecciones-de-eeuu-es-una-de-las-mas-bajas-del-mundo-desde-1980/",
          "http://www.cubadebate.cu/noticias/2016/11/04/acuerdo-de-paris-sobre-cambio-climatico-entra-en-vigor-este-viernes/",
          "http://www.cubadebate.cu/opinion/2016/11/04/brasil-detras-del-show-la-despolitizacion/",
          "http://www.cubadebate.cu/noticias/2016/11/03/inicia-jetblue-vuelos-regulares-a-camaguey/",
          "http://www.cubadebate.cu/noticias/2016/11/03/beisbol-tarde-de-lechadas-y-otra-noche-de-saavedra/",
          "http://www.cubadebate.cu/noticias/2016/11/03/que-traen-las-empresas-cubanas-a-fihav-2016/"
        ],
        "id": "http://www.cubadebate.cu/feed/",


After the crawl process finishes, only 1 URL is in the crawldb:
bin/nutch readdb crawl/crawldb/ -stats

CrawlDb statistics start: crawl/crawldb/
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls:	1
retry 0:	1
min score:	1.0
avg score:	1.0
max score:	1.0
status 2 (db_fetched):	1
CrawlDb statistics: done
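
To make the gap between the indexed outlinks and the crawldb concrete, the two can be compared with a small script. This is only a minimal sketch, not part of Nutch: the function names `total_urls` and `missing_outlinks` are hypothetical, the stats text is assumed to have been captured from the `readdb -stats` output above, and the crawldb URL list is assumed to come from a separate export (for example via `readdb -dump`). Only two of the ten outlinks are included, to keep it short.

```python
import re


def total_urls(stats_text):
    """Return the 'TOTAL urls' count from `readdb -stats` output, or None if absent."""
    match = re.search(r"^TOTAL urls:\s*(\d+)", stats_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def missing_outlinks(outlinks, crawldb_urls):
    """Return the outlinks (original order kept) that are not present in the crawldb."""
    known = set(crawldb_urls)
    return [url for url in outlinks if url not in known]


# Stats text as printed by `bin/nutch readdb crawl/crawldb/ -stats` (abridged).
stats = """CrawlDb statistics start: crawl/crawldb/
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls:\t1
retry 0:\t1
status 2 (db_fetched):\t1
CrawlDb statistics: done
"""

# Two of the outlinks indexed in Solr for the feed document (assumed sample).
outlinks = [
    "http://www.cubadebate.cu/noticias/2016/11/03/inicia-jetblue-vuelos-regulares-a-camaguey/",
    "http://www.cubadebate.cu/opinion/2016/11/04/brasil-detras-del-show-la-despolitizacion/",
]

# URLs assumed to be in the crawldb: only the seed feed URL.
crawldb_urls = ["http://www.cubadebate.cu/feed/"]

print(total_urls(stats))
print(len(missing_outlinks(outlinks, crawldb_urls)))
```

With the values above, every sampled outlink is reported as missing, which matches the single-URL crawldb shown in the statistics.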

I have read the CrawlDb, LinkDb and LinkDbMerger classes, but I can't find how outlinks from a feed get inserted into the crawldb.
Can anybody help me, or point me in the right direction, to insert the outlinks from a feed into the crawldb so that they are visited in the next round?










How to insert outlinks from RSS into crawldb?

Posted by Eyeris Rodriguez Rueda <er...@uci.cu>.

Hi.
I am using Nutch 1.12 and Solr 4.10.3.

RSS is an important way to discover new URLs to fetch.

None of the links detected in an RSS feed are inserted into the crawldb as new URLs.
Can anybody tell me why?
Can anybody help me, or point me in the right direction, to insert the outlinks from a feed into the crawldb so that they are visited in the next iteration?

I have enabled only the Tika parser, because when both are enabled (Tika and feed) the content and outlinks fields are empty in Solr.









