Posted to user@nutch.apache.org by kiran chitturi <ch...@gmail.com> on 2013/02/04 17:57:51 UTC
2.x : Links with 404 status are not being updated from db_unfetched
to db_gone
Hi!
I ran a crawl from a single seed for 30 rounds and it has crawled around 16k
URLs. I checked with (readdb -stats) and it showed 2116 URLs as
unfetched. I ran the fetcher again with the option 'all', but it does not fetch
anything and the unfetched list remains the same.
I have dumped only the fields (baseUrl, status, protocolStatus); the dump can
be found at
https://raw.github.com/salvager/NutchDev/master/runtime/local/table_fields/part-r-00000
The file clearly shows that URLs with status 1 have the protocolStatus
NOTFOUND. Those URLs are never moved to db_gone, which is status 3 if
I am correct.
Has anyone had a similar problem? Any ideas on how to fix it?
PS: I have made a patch which dumps only particular fields through the command
line (Example: ./bin/nutch readdb -dump table_fields -fields
"status,protocolStatus"). baseUrl is dumped by default along with the other
fields requested. I can upload it if anyone is interested.
Thanks,
--
Kiran Chitturi
Re: 2.x : Links with 404 status are not being updated from
db_unfetched to db_gone
Posted by kiran chitturi <ch...@gmail.com>.
On Mon, Feb 4, 2013 at 7:18 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:
> Hi Kiran,
>
> You are using 2.x still?
>
Yes, I am using the 2.x version of Nutch.
> HttpBase [0] suggests that upon receipt of a 404 response code the
> ProtocolStatus is set to ProtocolStatusCodes.NOTFOUND, which appears
> to be 14! [1].
> What are you expecting to happen here?
>
Yes, the ProtocolStatus is changed to NOTFOUND, but I am talking about the
fetch status, which is still 1 (db_unfetched) rather than being set to
3 (db_gone).
We can see in this file (
https://raw.github.com/salvager/NutchDev/master/runtime/local/table_fields/part-r-00000)
that URLs with protocolStatus NOTFOUND have a fetch status of 1
(db_unfetched). Shouldn't they be changed from status 1 to status 3? The
second column in the file is fetchStatus and the third column is
protocolStatus.
For this reason, the output of (readdb -stats) is inconsistent.
I am not sure whether this is a problem only for me or for others as well. I
have redone the crawl from scratch 3-4 times.
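To list the affected rows in a dump like the one linked above, a small script along these lines may help. It is only a sketch under assumptions: it assumes the dump is tab-separated with the URL in the first column, the fetch status in the second and the protocolStatus in the third, and that the protocolStatus text contains NOTFOUND (or NOT FOUND). The names find_stuck_rows, UNFETCHED and NOT_FOUND are made up for illustration and are not part of Nutch.

```python
import sys
from typing import Iterable, List, Tuple

UNFETCHED = "1"         # db_unfetched, as described in this thread
NOT_FOUND = "NOTFOUND"  # protocolStatus text, as described in this thread


def find_stuck_rows(lines: Iterable[str],
                    delimiter: str = "\t") -> List[Tuple[str, str, str]]:
    """Return (url, status, protocolStatus) rows that are still unfetched
    even though their protocolStatus says the page was not found.

    Assumption: each line has at least three delimiter-separated columns.
    """
    stuck = []
    for line in lines:
        fields = [f.strip() for f in line.rstrip("\n").split(delimiter)]
        if len(fields) < 3:
            continue  # skip lines that do not match the assumed layout
        url, status, protocol = fields[:3]
        tokens = status.split()
        if not tokens or tokens[0] != UNFETCHED:
            continue  # only rows whose status column starts with "1"
        # Tolerate both "NOTFOUND" and "NOT FOUND" spellings.
        if NOT_FOUND in protocol.replace(" ", "").upper():
            stuck.append((url, status, protocol))
    return stuck


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage (hypothetical file name): python find_stuck.py part-r-00000
    with open(sys.argv[1], encoding="utf-8") as handle:
        rows = find_stuck_rows(handle)
    for url, status, protocol in rows:
        print(url, status, protocol, sep="\t")
    print(f"{len(rows)} rows are unfetched with a NOTFOUND protocolStatus",
          file=sys.stderr)
```

If the real dump uses a different separator or column order, the delimiter argument and the column unpacking would need to be adjusted accordingly.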
>
> > PS: I have made a patch which dumps only particular fields through the command
> > line (Example: ./bin/nutch readdb -dump table_fields -fields
> > "status,protocolStatus"). baseUrl is dumped by default along with the other
> > fields requested. I can upload it if anyone is interested.
>
> Please file an issue and attach your patch. Any potential addition to
> the codebase is welcome.
>
Sure. Will do!
>
> [0]
> http://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
> [1]
> http://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/protocol/ProtocolStatusCodes.java
>
> --
> Lewis
>
--
Kiran Chitturi
Re: 2.x : Links with 404 status are not being updated from
db_unfetched to db_gone
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Kiran,
You are using 2.x still?
On Mon, Feb 4, 2013 at 8:57 AM, kiran chitturi
<ch...@gmail.com> wrote:
>
> The file clearly shows that URLs with status 1 have the protocolStatus
> NOTFOUND. Those URLs are never moved to db_gone, which is status 3 if
> I am correct.
>
> Has anyone had a similar problem? Any ideas on how to fix it?
HttpBase [0] suggests that upon receipt of a 404 response code the
ProtocolStatus is set to ProtocolStatusCodes.NOTFOUND, which appears
to be 14! [1].
What are you expecting to happen here?
> PS: I have made a patch which dumps only particular fields through the command
> line (Example: ./bin/nutch readdb -dump table_fields -fields
> "status,protocolStatus"). baseUrl is dumped by default along with the other
> fields requested. I can upload it if anyone is interested.
Please file an issue and attach your patch. Any potential addition to
the codebase is welcome.
Thanks.
[0] http://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
[1] http://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/protocol/ProtocolStatusCodes.java
--
Lewis