Posted to user@nutch.apache.org by Luis Magaña <lu...@euphorica.com> on 2016/03/09 22:07:46 UTC

Large seed Inject Slow to Accumulo

Hello,

I've set up a small sample Hadoop cluster of 6 servers running HDFS,
ZooKeeper, Solr and Accumulo.

I am running Nutch on top of the Hadoop cluster and injecting 10,000
URLs from the seed.txt file.
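
For reference, I kick off the inject step from the deploy runtime
roughly like this (the HDFS directory and the crawl id below are
placeholders from my setup):

  $ bin/nutch inject /user/nutch/urls -crawlId webcrawl

where /user/nutch/urls contains the seed.txt with the 10,000 URLs.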

Everything works as it should, nothing breaks, everything indexes, etc.,
and the crawl job finishes OK. However, the inject stage for those
10,000 URLs takes up to 50 minutes.

I wonder whether that is a normal time for an inject, whether I should
be looking for a problem (maybe in the gora-accumulo module?), or
whether I am simply being naive and my seed.txt should not be so large
to begin with.
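
In case it is relevant, my gora.properties points Gora at Accumulo
roughly along these lines (instance name, ZooKeeper hosts and
credentials are anonymized placeholders, and the property names are
the ones I took from the Gora documentation, so please double-check
them against your version):

  gora.datastore.default=org.apache.gora.accumulo.store.AccumuloStore
  gora.datastore.accumulo.mock=false
  gora.datastore.accumulo.instance=my-instance
  gora.datastore.accumulo.zookeepers=zk1:2181,zk2:2181,zk3:2181
  gora.datastore.accumulo.user=nutch
  gora.datastore.accumulo.password=******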

A bit more information about my setup:

Hadoop 2.7.2
Accumulo 1.5.1
Solr 4.10.3
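
Nutch is pointed at the Accumulo store in nutch-site.xml in what I
understand to be the usual way:

  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.accumulo.store.AccumuloStore</value>
  </property>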

Currently Accumulo has about 500 tables with some 200 million entries
(not sure if that matters). The Accumulo logs show no major errors,
warnings or Java exceptions, and neither do the MapReduce logs in Hadoop.
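
(I have been pulling the job logs with the standard YARN command, e.g.

  $ yarn logs -applicationId <application id of the inject job>

with the application id taken from the ResourceManager UI.)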

Thank you very much for your help and your excellent crawler.


-- 
Luis Magaña
www.euphorica.com