Posted to user@nutch.apache.org by adfel70 <ad...@gmail.com> on 2013/02/27 09:06:47 UTC

why is nutch2.1 trying to parse the same documents again and again?

Hi,
I'm using Nutch 2.1 with HBase.
I ran my first crawl and see that Nutch tries to parse the same
files in different cycles.
After the first time I always get "different batch id (null)" for the already
parsed files, so I assume that parsing is not actually performed.
But the question is: why does Nutch try to parse these files at all?

Is this because it's the only place where the test of whether the file has
already been parsed is performed?




--
View this message in context: http://lucene.472066.n3.nabble.com/why-is-nutch2-1-trying-to-parse-the-same-documnets-again-and-again-tp4043317.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: why is nutch2.1 trying to parse the same documents again and again?

Posted by adfel70 <ad...@gmail.com>.

I've just started looking into the Nutch 2.* code, after a year working with
Nutch 1.*.
I was very enthusiastic about the Gora integration.
Of course, it's going to take time until I'm as familiar with the 2.* code
as I am with the 1.* code.

Anyway, I'll be glad to get on board.
Regarding the shouldProcess() issue, it annoys me too. It seems odd that in a
simple test case I get so many "different batch id (null)" messages.

Regarding the second issue, it seems from my test that it does loop until it
reaches the defined depth.
The Fetcher just finished with nothing new fetched, and the same happened
with the Parser, but there I get all those "different batch id (null)" messages.
Also, I didn't see where shouldStop is set to true. But I've made just a
brief review; maybe I missed something.




Re: why is nutch2.1 trying to parse the same documents again and again?

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi

On Wednesday, February 27, 2013, adfel70 <ad...@gmail.com> wrote:
> Yes I looked at the code.
Great

> I saw that the shouldProcess() check is performed on each file in the mapper.
> In Nutch 1.* I got used to an approach in which only a set of URLs is
> processed in each cycle.
> Is Nutch 2.* processing all the URLs in each cycle, so that this
> shouldProcess() check is required to make sure the same file isn't parsed
> twice?
Nothing is static in the Nutch 2.x code. I say this to make clear that if
you have an itch and want to scratch it, then come on board and we can work
on making sure that shouldProcess() prevents multiple/unnecessary parsing.
We do not need this behaviour, and even if it is not a bug (which it might
be) it is still a pain, and it annoys me too.

> Also, I see that there is a loop on depth parameter. So if the defined
> depth is greater than the actual depth of the site I'm crawling, the loop
> will just go on until it reaches the defined depth

I would think not. We cannot force fetching of content which simply does
not exist. That said, we need to ensure that Nutch does not misinterpret
our intentions.
I am at ApacheCon and I am looking at Nutch code. It is 1am so I can try
and look at this tomorrow.

-- 
*Lewis*

Re: why is nutch2.1 trying to parse the same documents again and again?

Posted by adfel70 <ad...@gmail.com>.

Yes, I looked at the code.
I saw that the shouldProcess() check is performed on each file in the mapper.
In Nutch 1.* I got used to an approach in which only a set of URLs is
processed in each cycle.
Is Nutch 2.* processing all the URLs in each cycle, so that this
shouldProcess() check is required to make sure the same file isn't parsed twice?
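
As a rough illustration of the per-row check described above, here is a
minimal Java sketch. It is not the actual Nutch code: the class name,
parseCycle(), the "-all" wildcard value, and the clearing of the mark after
parsing are all assumptions made for illustration. It only shows how a mapper
that is handed every row could skip rows whose mark does not match the current
batch, which would explain a "different batch id (null)" message for rows
whose mark is null.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; NOT the real Nutch ParserJob/NutchJob code.
class BatchIdCheckSketch {

    // Assumed convention: this value means "process every marked row".
    static final String ALL_BATCHES = "-all";

    // True only when the row carries a mark that matches the current batch.
    static boolean shouldProcess(String mark, String batchId) {
        if (mark == null) {
            return false; // no mark: the row is not waiting for this step
        }
        return ALL_BATCHES.equals(batchId) || mark.equals(batchId);
    }

    // Mimics a mapper that sees every row of the table and filters per row.
    // Keys are URLs, values are the (possibly null) marks of each row.
    static List<String> parseCycle(Map<String, String> marks, String batchId) {
        List<String> parsed = new ArrayList<>();
        for (Map.Entry<String, String> row : marks.entrySet()) {
            if (!shouldProcess(row.getValue(), batchId)) {
                System.out.println("Skipping " + row.getKey()
                        + "; different batch id (" + row.getValue() + ")");
                continue;
            }
            parsed.add(row.getKey());
            row.setValue(null); // assumption: the mark is cleared once handled
        }
        return parsed;
    }

    public static void main(String[] args) {
        Map<String, String> marks = new LinkedHashMap<>();
        marks.put("http://example.com/a", "batch-1");
        marks.put("http://example.com/b", null);
        System.out.println("parsed: " + parseCycle(marks, "batch-1"));
        // A second cycle still visits every row, but skips all of them.
        System.out.println("parsed: " + parseCycle(marks, "batch-1"));
    }
}
```

Under these assumptions the second cycle still iterates over every row, but
each one is skipped by the check, so nothing is actually parsed twice.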


Also, I see that there is a loop on the depth parameter. So if the defined depth
is greater than the actual depth of the site I'm crawling, will the loop
just go on until it reaches the defined depth?
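
To make the depth question concrete, here is a minimal Java sketch of a
depth-bounded loop with an early exit. It is only an assumption about what
such a loop could look like, not the Nutch crawler itself: CrawlLoopSketch,
crawl(), and runCycle (a stand-in for one generate/fetch/parse/update cycle
that returns the number of newly fetched URLs) are hypothetical names.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical sketch; NOT the real Nutch crawl loop.
class CrawlLoopSketch {

    // Runs at most maxDepth cycles and returns how many cycles were executed.
    // runCycle takes the cycle index and returns the count of new URLs fetched.
    static int crawl(int maxDepth, IntUnaryOperator runCycle) {
        int cyclesRun = 0;
        for (int depth = 0; depth < maxDepth; depth++) {
            int fetched = runCycle.applyAsInt(depth);
            cyclesRun++;
            if (fetched == 0) {
                break; // nothing new was fetched, so deeper cycles are pointless
            }
        }
        return cyclesRun;
    }

    public static void main(String[] args) {
        // A pretend site that yields new URLs for three cycles only.
        int[] newUrlsPerCycle = {5, 3, 1};
        IntUnaryOperator site =
                d -> d < newUrlsPerCycle.length ? newUrlsPerCycle[d] : 0;
        System.out.println("cycles run: " + crawl(10, site));
    }
}
```

Without the early exit, the loop would run all maxDepth cycles even when
every later cycle fetches nothing, which matches the behaviour asked about.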




Re: why is nutch2.1 trying to parse the same documents again and again?

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Have you looked at the Java code?
I am curious (and confused) about this "different batch id (null)" logging
and want to either get rid of it or, better, make it more informative,
which would address both of our concerns.
I would like to document this not only in the Java code but also on the
Nutch wiki.


-- 
*Lewis*