
Posted to user@nutch.apache.org by anupamk <an...@usc.edu> on 2014/03/14 19:37:57 UTC

IOException while parsing

Hi, 

I am fetching around 3000 links. I am able to fetch them successfully. But
when I try to parse them I get IOException as follows:





This is not at all helpful for troubleshooting the problem.

Has anyone else run into such a problem while parsing?


I am using Nutch-1.7


I am guessing it's because Nutch tries to parse a truncated mp3 file and
fails. Am I right?



--
View this message in context: http://lucene.472066.n3.nabble.com/IOException-while-parsing-tp4123696.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: IOException while parsing

Posted by anupamk <an...@usc.edu>.

I got around the problem a while back. I just wanted to update the forum
with my workaround, in case anyone else is looking for a solution.

Apparently memory was the root of the issue. I don't know the internals of
parse yet, and I have not looked at the code, but it seems to me that the
parser tries to spawn threads in proportion to the number of documents in
the parser's queue. Again, I am not sure this is correct. I am just
guessing based on the error message in hadoop.log.

The way I got it to work was to split the segments into smaller segments and
fetch and parse each smaller segment one by one. 

I split the segments using the  option. 
(source: http://wiki.apache.org/nutch/bin/nutch_mergesegs)
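
For anyone wanting to script this, here is a minimal sketch of the split-then-parse workaround. It only builds the command lines and does not run them. Assumptions: the option name is missing from the message above, so the sketch assumes the -slice NNNN option described on the mergesegs wiki page; bin/nutch parse <segment> is the standard Nutch 1.x parse command; the function names and directory paths are hypothetical.

```python
import os
from typing import List


def mergesegs_command(output_dir: str, segments_dir: str, slice_size: int) -> List[str]:
    """Build a `bin/nutch mergesegs` call that re-slices every segment under segments_dir.

    Assumption: -slice NNNN caps the number of URLs per output segment.
    """
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    return ["bin/nutch", "mergesegs", output_dir,
            "-dir", segments_dir, "-slice", str(slice_size)]


def parse_commands(sliced_dir: str) -> List[List[str]]:
    """Build one `bin/nutch parse` call per segment directory, in sorted order."""
    segments = sorted(
        name for name in os.listdir(sliced_dir)
        if os.path.isdir(os.path.join(sliced_dir, name))
    )
    return [["bin/nutch", "parse", os.path.join(sliced_dir, name)] for name in segments]
```

Each returned list can be passed to subprocess.run(cmd, check=True) from the Nutch install directory, one segment at a time, so that only one small segment is being parsed at any moment.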

I would love to take a closer look at the parser soon and come back with a
better answer. But for now, this works and gets the job done.




Re: IOException while parsing

Posted by anupamk <an...@usc.edu>.

Hi John,

Thanks for the hint. I checked hadoop.log and upon further investigation the
only suspicious entry I found was the following warning --



Can this be the cause of the IOException?

If so, what might the remedy be?






Re: IOException while parsing

Posted by John Lafitte <jl...@brandextract.com>.

Hi,

That looks like the console output, but have you looked in logs/hadoop.log?
Usually you will get more detail on your error there, including a stack
trace.
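
If the log is large, a small script can pull out just the errors. This is a rough sketch, assuming the usual Java layout where a line naming an Exception or Error is followed by indented "at ..." stack frames; the function name is hypothetical and the matching is deliberately loose, so expect some false positives.

```python
from typing import Iterable, List


def exception_blocks(lines: Iterable[str]) -> List[List[str]]:
    """Group each line mentioning an Exception or Error with the stack frames after it."""
    blocks: List[List[str]] = []
    current: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        stripped = line.strip()
        # Lines that continue a Java stack trace.
        is_frame = (stripped.startswith("at ")
                    or stripped.startswith("Caused by:")
                    or stripped.startswith("..."))
        if current and is_frame:
            current.append(line)
            continue
        # Any other line ends the trace being collected.
        if current:
            blocks.append(current)
            current = []
        if "Exception" in line or "Error" in line:
            current = [line]
    if current:
        blocks.append(current)
    return blocks
```

Usage would look like opening logs/hadoop.log, passing the file object to exception_blocks, and printing each block separated by a blank line.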

