Posted to user@nutch.apache.org by Bai Shen <ba...@gmail.com> on 2013/07/15 13:53:33 UTC

Re: Unfetched urls not being generated for fetching.

I got this to work by changing the query to the following.

scan 'webpage', {COLUMNS=>['f:bas', 'f:st'],
FILTER=>SingleColumnValueFilter.new(Bytes.toBytes('f'),
Bytes.toBytes('st'), CompareFilter::CompareOp.valueOf('EQUAL'),
Bytes.toBytes("\x00\x00\x00\x02"))}


On Fri, May 24, 2013 at 10:03 AM, Bai Shen <ba...@gmail.com> wrote:

> I'm trying to check HBase for urls that have unfetched status, but my query
> isn't working correctly.  No matter what I try, I don't get a match.
>
> scan 'webpage', {COLUMNS=>['f:bas', 'f:st'],
> FILTER=>SingleColumnValueFilter.new(Bytes.toBytes('f'),
> Bytes.toBytes('st'), CompareFilter::CompareOp.valueOf('EQUAL'),
> Bytes.toBytes('1'))}
>
>
> I did manage to find one entry with an unfetched status.  It apparently
> has no base url, so I'm assuming that's why it's not fetched.  I'm not sure
> how that happened.  It also says protocolStatus is NOTFOUND.
>
>
> On Fri, May 24, 2013 at 9:48 AM, kiran chitturi <chitturikiran15@gmail.com> wrote:
>
>> I have seen this happen in Nutch 2.x.
>>
>> I would suggest checking your regex file to see the conditions, and using
>> HBase to get the urls that have unfetched status.
>>
>> Also, try to check the protocol status of each unfetched url in HBase.
>> Most probably it is either 404 or some status other than 200.
>>
>> Hope this helps.
>>
>> On Fri, May 24, 2013 at 8:13 AM, Bai Shen <ba...@gmail.com>
>> wrote:
>>
>> > I'm running Nutch 2.1 using HBase.
>> >
>> > When I run readdb -stats I show that there are 15k unfetched urls.
>> > However, when I run generate -topN 1000 I get no urls to be fetched.  Up
>> > until now it's been pulling a full thousand urls for each cycle.
>> >
>> > Any ideas?  I'm not sure what to check.
>> >
>> > Thanks.
>> >
>>
>>
>>
>> --
>> Kiran Chitturi
>>
>> <http://www.linkedin.com/in/kiranchitturi>
>>
>
>