Posted to user@nutch.apache.org by Bai Shen <ba...@gmail.com> on 2013/07/15 13:53:33 UTC
Re: Unfetched urls not being generated for fetching.
I got this to work by changing the query to the following.
scan 'webpage', {COLUMNS=>['f:bas', 'f:st'],
FILTER=>SingleColumnValueFilter.new(Bytes.toBytes('f'),
Bytes.toBytes('st'), CompareFilter::CompareOp.valueOf('EQUAL'),
Bytes.toBytes("\x00\x00\x00\x02"))}
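
A likely reason the original filter never matched, assuming the f:st value is
stored as a 4-byte big-endian integer (the working query suggests this, but
nothing in this thread confirms it): Bytes.toBytes('1') yields the single ASCII
byte 0x31 rather than a 4-byte integer. Below is a minimal plain-Ruby sketch of
the difference. The helper name int_to_bytes is hypothetical, and pack('N') is
used here only as a stand-in for the encoding HBase's Bytes.toBytes(int)
produces.

```ruby
# Hypothetical helper: encode an integer as 4 big-endian bytes,
# mirroring what HBase's Bytes.toBytes(int) produces.
def int_to_bytes(value)
  [value].pack('N')
end

# The string '1' is a single ASCII byte (0x31 = 49).
ascii_one = '1'

# The integer 2 as a 4-byte big-endian value, i.e. "\x00\x00\x00\x02".
encoded_two = int_to_bytes(2)

puts ascii_one.bytes.inspect    # [49]
puts encoded_two.bytes.inspect  # [0, 0, 0, 2]

# A byte-for-byte EQUAL comparison between a 4-byte stored value and
# the 1-byte string can never succeed, whatever integer is stored.
puts int_to_bytes(1) == ascii_one  # false
```

In the hbase shell the same bytes can be written as an escaped string, as in the
query above, so no helper is needed there.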
On Fri, May 24, 2013 at 10:03 AM, Bai Shen <ba...@gmail.com> wrote:
> I'm trying to check HBase for urls that have unfetched status, but my query
> isn't working correctly. No matter what I try, I don't get a match.
>
> scan 'webpage', {COLUMNS=>['f:bas', 'f:st'],
> FILTER=>SingleColumnValueFilter.new(Bytes.toBytes('f'),
> Bytes.toBytes('st'), CompareFilter::CompareOp.valueOf('EQUAL'),
> Bytes.toBytes('1'))}
>
>
> I did manage to find one entry with an unfetched status. It apparently
> has no base url, so I'm assuming that's why it's not fetched. I'm not sure
> how that happened. It also says protocolStatus is NOTFOUND.
>
>
> On Fri, May 24, 2013 at 9:48 AM, kiran chitturi <chitturikiran15@gmail.com> wrote:
>
>> I have seen this happen in Nutch 2.x.
>>
>> I would suggest checking the conditions in your regex file and using
>> HBase to get the urls that have unfetched status.
>>
>> Also, try to check the protocol status of each unfetched url in HBase;
>> most probably it is either 404 or some status other than 200.
>>
>> Hope this helps.
>>
>> On Fri, May 24, 2013 at 8:13 AM, Bai Shen <ba...@gmail.com> wrote:
>>
>> > I'm running Nutch 2.1 using HBase.
>> >
>> > When I run readdb -stats, it shows that there are 15k unfetched urls.
>> > However, when I run generate -topN 1000, I get no urls to be fetched.
>> > Up until now it's been pulling a full thousand urls for each cycle.
>> >
>> > Any ideas? I'm not sure what to check.
>> >
>> > Thanks.
>> >
>>
>>
>>
>> --
>> Kiran Chitturi
>>
>> <http://www.linkedin.com/in/kiranchitturi>
>>
>
>