Posted to user@nutch.apache.org by kiran chitturi <ch...@gmail.com> on 2013/02/04 17:57:51 UTC

2.x : Links with 404 status are not being updated from db_unfetched to db_gone

Hi!

I ran a crawl from a single seed for 30 rounds, and it has crawled around 16k
URLs. I checked with (readdb -stats), which showed 2116 URLs as
unfetched. I ran the fetcher again with the option 'all', but it does not fetch
anything and the unfetched list remains the same.

I have dumped only the fields (baseURL, status, protocolStatus), and the dump
can be found at (
https://raw.github.com/salvager/NutchDev/master/runtime/local/table_fields/part-r-00000
).

The file clearly shows that URLs with status 1 have the protocolStatus (NOT
FOUND). Those URLs are never moved to status (db_gone), which is status 3 if
I am correct.

Has anyone had a similar problem? Any ideas on how to fix it?

PS: I have made a patch which dumps only particular fields through the command
line (Example: ./bin/nutch readdb -dump table_fields -fields
"status,protocolStatus"). baseUrl is dumped by default along with the other
fields requested. I can upload it if anyone is interested.


Thanks,

-- 
Kiran Chitturi

Re: 2.x : Links with 404 status are not being updated from db_unfetched to db_gone

Posted by kiran chitturi <ch...@gmail.com>.
On Mon, Feb 4, 2013 at 7:18 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Kiran,
>
> You are using 2.x still?

Yes, I am using the 2.x version of Nutch.

> HttpBase [0] suggests that upon receipt of a 404 response code the
> ProtocolStatus is marked to ProtocolStatusCodes.NOTFOUND which appears
> to be 14! [1].
> What are you expecting to happen here?

Yes, the ProtocolStatus is changed to NOTFOUND, but I am talking about the
fetch status, which is still 1 (db_unfetched) rather than being set to
3 (db_gone).
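
To make the expectation concrete, here is a minimal Python sketch of the
transition being described. It is not Nutch code: the constants only mirror
the numbers mentioned in this thread (1 for db_unfetched, 3 for db_gone, 14
for ProtocolStatusCodes.NOTFOUND), and the function names are hypothetical.

```python
# Values taken from this thread; treat them as assumptions, not as the
# authoritative Nutch constants.
DB_UNFETCHED = 1         # fetch status reported as db_unfetched
DB_GONE = 3              # fetch status reported as db_gone
PROTOCOL_NOTFOUND = 14   # ProtocolStatusCodes.NOTFOUND as quoted above


def expected_db_status(current_status, protocol_code):
    """Return the fetch status one would expect after a fetch attempt.

    Hypothetical helper: a NOTFOUND protocol status should move the row
    to db_gone; any other protocol status leaves the status unchanged here.
    """
    if protocol_code == PROTOCOL_NOTFOUND:
        return DB_GONE
    return current_status


def is_inconsistent(current_status, protocol_code):
    """True for rows like the ones reported: still unfetched, yet NOTFOUND."""
    return current_status == DB_UNFETCHED and protocol_code == PROTOCOL_NOTFOUND


if __name__ == "__main__":
    print(expected_db_status(DB_UNFETCHED, PROTOCOL_NOTFOUND))
    print(is_inconsistent(DB_UNFETCHED, PROTOCOL_NOTFOUND))
```

The rows in the dump are the ones for which is_inconsistent would return
True, which is why they keep being counted as unfetched.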

We can see in this dump file (
https://raw.github.com/salvager/NutchDev/master/runtime/local/table_fields/part-r-00000)
that URLs with protocolStatus NOTFOUND have a fetch status of 1
(db_unfetched). Shouldn't they be changed from status 1 to status 3? The
second column in the file is fetchStatus and the third column is
protocolStatus.
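
For anyone who wants to check their own dump for the same symptom, a rough
Python sketch follows. It assumes the column layout described above (URL,
then fetch status, then protocolStatus, separated by whitespace), that the
fetch status is written as a plain number, and that the protocolStatus text
contains "NOTFOUND" or "NOT FOUND". The function name and the sample rows are
made up for illustration; adjust the parsing to the real dump format.

```python
def find_stuck_not_found(lines):
    """Return URLs whose fetch status is 1 although protocolStatus is NOTFOUND.

    Assumed layout per line: <url> <fetchStatus> <protocolStatus...>
    """
    stuck = []
    for line in lines:
        # Split into at most three fields so that a protocolStatus
        # containing spaces (e.g. "NOT FOUND") stays in one piece.
        parts = line.strip().split(None, 2)
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        url, status, protocol = parts
        normalized = protocol.upper().replace(" ", "").replace("_", "")
        if status == "1" and "NOTFOUND" in normalized:
            stuck.append(url)
    return stuck


if __name__ == "__main__":
    # Hypothetical sample rows, not taken from the real dump.
    sample = [
        "http://example.com/a\t1\tNOTFOUND",
        "http://example.com/b\t2\tSUCCESS",
        "http://example.com/c\t3\tNOT FOUND",
    ]
    for url in find_stuck_not_found(sample):
        print(url)
```

To run it against a real dump, pass an open file object instead of the
sample list, since the function only iterates over lines.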

Because of this, the output of (readdb -stats) is inconsistent.

I am not sure whether this is a problem only for me or for others as well. I
have redone the crawl from scratch 3-4 times.

>
> > PS : I have made patch which dumps only particular fields through command
> > line (Example: ./bin/nutch readdb -dump table_fields -fields
> > "status,protocolStatus"). baseUrl is dumped by default along with other
> > fields requested. I can upload if anyone is interested.
>
> Please file an issue and attach your patch. Any potential addition to
> the codebase is welcomed.
>
Sure. Will do!

>



> [0]
> http://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
> [1]
> http://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/protocol/ProtocolStatusCodes.java
>
> --
> Lewis
>



-- 
Kiran Chitturi

Re: 2.x : Links with 404 status are not being updated from db_unfetched to db_gone

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Kiran,

You are using 2.x still?

On Mon, Feb 4, 2013 at 8:57 AM, kiran chitturi
<ch...@gmail.com> wrote:

>
> The file clearly shows that urls with status 1 have the protocolStatus(NOT
> FOUND). Those seeds are never moved to status (db_gone) that is status 3 if
> i am correct.
>
> Did anyone had a similar problem ? Any ideas on how to fix it ?

HttpBase [0] suggests that upon receipt of a 404 response code the
ProtocolStatus is marked to ProtocolStatusCodes.NOTFOUND which appears
to be 14! [1].
What are you expecting to happen here?


> PS : I have made patch which dumps only particular fields through command
> line (Example: ./bin/nutch readdb -dump table_fields -fields
> "status,protocolStatus"). baseUrl is dumped by default along with other
> fields requested. I can upload if anyone is interested.

Please file an issue and attach your patch. Any potential addition to
the codebase is welcome.
Thanks.

[0] http://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
[1] http://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/protocol/ProtocolStatusCodes.java

-- 
Lewis