Posted to user@nutch.apache.org by Manikandan Saravanan <ma...@thesocialpeople.net> on 2014/05/27 05:18:55 UTC

Total fetched URLs is 0.

Hi,

I’m running Nutch 2 on a 2-node Hadoop cluster. I’m also running Solr 4 on a separate machine accessible by private IP. I run the crawl command by doing the following.

bin/crawl urls/seed.txt TestCrawl <solrUrl> 2
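
For reference, the arguments map onto the stock Nutch 2.x bin/crawl script roughly as follows (a sketch of my understanding, not the script's exact usage text):

bin/crawl <seedDir> <crawlId> <solrUrl> <numberOfRounds>
#   seedDir         - path to the seed URLs (typically a directory of seed files, one URL per line)
#   crawlId         - identifier used to name the storage table for this crawl (TestCrawl above)
#   solrUrl         - the Solr endpoint used by the indexing step
#   numberOfRounds  - how many generate/fetch/parse/update rounds to run (2 above)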

My problem is that no URLs are fetched, and thus nothing is indexed. When I run the stats command, this is what I get:

{db_stats-job_201405261214_0043=
	{
		jobID=job_201405261214_0043,
		jobName=db_stats,
		counters=
			{File Input Format Counters ={BYTES_READ=0},
			Job Counters ={TOTAL_LAUNCHED_REDUCES=1, SLOTS_MILLIS_MAPS=7990, FALLOW_SLOTS_MILLIS_REDUCES=0, FALLOW_SLOTS_MILLIS_MAPS=0, TOTAL_LAUNCHED_MAPS=1, SLOTS_MILLIS_REDUCES=9980},
			Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6, MAP_INPUT_RECORDS=0, REDUCE_SHUFFLE_BYTES=6, SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0, COMMITTED_HEAP_BYTES=218103808, CPU_MILLISECONDS=1950, SPLIT_RAW_BYTES=1017, COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0, REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0, PHYSICAL_MEMORY_BYTES=296411136, REDUCE_OUTPUT_RECORDS=0, VIRTUAL_MEMORY_BYTES=2251104256, MAP_OUTPUT_RECORDS=0}, FileSystemCounters={FILE_BYTES_READ=6, HDFS_BYTES_READ=1017, FILE_BYTES_WRITTEN=156962, HDFS_BYTES_WRITTEN=86}, File Output Format Counters ={BYTES_WRITTEN=86}}}}
14/05/26 23:12:34 INFO crawl.WebTableReader: TOTAL urls:	0
14/05/26 23:12:34 INFO crawl.WebTableReader: WebTable statistics: done
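
For reference, the stats above come from the WebTableReader tool; I ran something along these lines (the -crawlId flag is my assumption about how the crawl id is picked up):

bin/nutch readdb -stats -crawlId TestCrawl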

What am I missing? My regex and URL normalizer filters allow all URL patterns. I’m trying to do a whole-web crawl.
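
By "allowing all URL patterns" I mean that my filter files boil down to the catch-all rule, roughly like this (a sketch, assuming the stock regex-urlfilter plugin syntax):

# conf/regex-urlfilter.txt
# accept anything that reaches this point
+.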

-- 
Manikandan Saravanan
Architect - Technology
TheSocialPeople

Re: Total fetched URLs is 0.

Posted by Julien Nioche <li...@gmail.com>.
I don't think Hadoop 2 has been mentioned at all.


On 28 May 2014 09:43, Talat Uyarer <ta...@uyarer.com> wrote:

> Hi Manikandan,
>
> Did you check your datastore after the InjectorJob? Does it have any rows?
> By default, Gora does not support Hadoop 2.x, so you need to change Gora's
> dependencies. I will send my patch to GORA-144.
>
> Talat
> On 27 May 2014 at 06:19, "Manikandan Saravanan" <
> manikandan@thesocialpeople.net> wrote:
>
> > Hi,
> >
> > I’m running Nutch 2 on a 2-node Hadoop cluster. I’m also running Solr 4
> on
> > a separate machine accessible by private IP. I run the crawl command by
> > doing the following.
> >
> > bin/crawl urls/seed.txt TestCrawl <solrUrl> 2
> >
> > My problem is that no URLs are fetched. And thus, nothing is indexed.
> When
> > I run stats, this is what I get
> >
> > {db_stats-job_201405261214_0043=
> >         {
> >                 jobID=job_201405261214_0043,
> >                 jobName=db_stats,
> >                 counters=
> >                         {File Input Format Counters ={BYTES_READ=0},
> >                         Job Counters ={TOTAL_LAUNCHED_REDUCES=1,
> > SLOTS_MILLIS_MAPS=7990, FALLOW_SLOTS_MILLIS_REDUCES=0,
> > FALLOW_SLOTS_MILLIS_MAPS=0, TOTAL_LAUNCHED_MAPS=1,
> > SLOTS_MILLIS_REDUCES=9980},
> >                         Map-Reduce
> > Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6, MAP_INPUT_RECORDS=0,
> > REDUCE_SHUFFLE_BYTES=6, SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0,
> > COMMITTED_HEAP_BYTES=218103808, CPU_MILLISECONDS=1950,
> > SPLIT_RAW_BYTES=1017, COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0,
> > REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0,
> > PHYSICAL_MEMORY_BYTES=296411136, REDUCE_OUTPUT_RECORDS=0,
> > VIRTUAL_MEMORY_BYTES=2251104256, MAP_OUTPUT_RECORDS=0},
> > FileSystemCounters={FILE_BYTES_READ=6, HDFS_BYTES_READ=1017,
> > FILE_BYTES_WRITTEN=156962, HDFS_BYTES_WRITTEN=86}, File Output Format
> > Counters ={BYTES_WRITTEN=86}}}}
> > 14/05/26 23:12:34 INFO crawl.WebTableReader: TOTAL urls:        0
> > 14/05/26 23:12:34 INFO crawl.WebTableReader: WebTable statistics: done
> >
> > What am I missing? My regex and normalise filters are allowing all URL
> > patterns. I’m trying to do a whole web crawl.
> >
> > --
> > Manikandan Saravanan
> > Architect - Technology
> > TheSocialPeople
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Total fetched URLs is 0.

Posted by Talat Uyarer <ta...@uyarer.com>.
Hi Manikandan,

Did you check your datastore after the InjectorJob? Does it have any rows?
By default, Gora does not support Hadoop 2.x, so you need to change Gora's
dependencies. I will send my patch to GORA-144.
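
If you are on the default HBase store, something like the following should show whether the InjectorJob wrote any rows (the table name <crawlId>_webpage, e.g. TestCrawl_webpage, is my assumption based on your crawl id):

hbase shell
count 'TestCrawl_webpage'
scan 'TestCrawl_webpage', {LIMIT => 2}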

Talat
On 27 May 2014 at 06:19, "Manikandan Saravanan" <
manikandan@thesocialpeople.net> wrote:

> Hi,
>
> I’m running Nutch 2 on a 2-node Hadoop cluster. I’m also running Solr 4 on
> a separate machine accessible by private IP. I run the crawl command by
> doing the following.
>
> bin/crawl urls/seed.txt TestCrawl <solrUrl> 2
>
> My problem is that no URLs are fetched. And thus, nothing is indexed. When
> I run stats, this is what I get
>
> {db_stats-job_201405261214_0043=
>         {
>                 jobID=job_201405261214_0043,
>                 jobName=db_stats,
>                 counters=
>                         {File Input Format Counters ={BYTES_READ=0},
>                         Job Counters ={TOTAL_LAUNCHED_REDUCES=1,
> SLOTS_MILLIS_MAPS=7990, FALLOW_SLOTS_MILLIS_REDUCES=0,
> FALLOW_SLOTS_MILLIS_MAPS=0, TOTAL_LAUNCHED_MAPS=1,
> SLOTS_MILLIS_REDUCES=9980},
>                         Map-Reduce
> Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6, MAP_INPUT_RECORDS=0,
> REDUCE_SHUFFLE_BYTES=6, SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0,
> COMMITTED_HEAP_BYTES=218103808, CPU_MILLISECONDS=1950,
> SPLIT_RAW_BYTES=1017, COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0,
> REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0,
> PHYSICAL_MEMORY_BYTES=296411136, REDUCE_OUTPUT_RECORDS=0,
> VIRTUAL_MEMORY_BYTES=2251104256, MAP_OUTPUT_RECORDS=0},
> FileSystemCounters={FILE_BYTES_READ=6, HDFS_BYTES_READ=1017,
> FILE_BYTES_WRITTEN=156962, HDFS_BYTES_WRITTEN=86}, File Output Format
> Counters ={BYTES_WRITTEN=86}}}}
> 14/05/26 23:12:34 INFO crawl.WebTableReader: TOTAL urls:        0
> 14/05/26 23:12:34 INFO crawl.WebTableReader: WebTable statistics: done
>
> What am I missing? My regex and normalise filters are allowing all URL
> patterns. I’m trying to do a whole web crawl.
>
> --
> Manikandan Saravanan
> Architect - Technology
> TheSocialPeople