Posted to user@nutch.apache.org by hadi <md...@gmail.com> on 2012/02/18 14:05:45 UTC

IOExeption when crawling with nutch in Fetching process

After one day of crawling with Nutch (version 1.4), I finally got the
exception below:

.
.
.

-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1204)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1213)
.
.
.
I crawl 20 news sites, and the input arguments to Nutch are depth 3 and topN -1.
I have enough space in the root directory of my Linux machine, and about 4 GB
of RAM. My nutch-site.xml is configured as below:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<configuration>
    <property>
        <name>http.content.limit</name>
        <value>-1</value>     
    </property>

    <property>
        <name>file.content.limit</name>
        <value>-1</value>
    </property>

    <property>
        <name>file.content.ignored</name>
        <value>false</value>     
    </property>

    <property>
        <name>file.crawl.parent</name>
        <value>false</value>     
    </property>    

    <property>
        <name>http.agent.name</name>
        <value>My Nutch Spider</value>        
    </property> 

    <property>
        <name>encodingdetector.charset.min.confidence</name>
        <value>-1</value>      
    </property>      

    <property>
        <name>parser.timeout</name>
        <value>30</value>      
    </property>

    <property>
        <name>db.fetch.interval.default</name>
        <value>36000</value>       
    </property>

    <property>
        <name>db.fetch.schedule.class</name>
        <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>       
    </property>

    <property>
        <name>db.signature.class</name>
        <value>org.apache.nutch.crawl.TextProfileSignature</value>      
    </property>

    <property>
        <name>fetcher.verbose</name>
        <value>true</value>        
    </property>

    <property>
        <name>db.ignore.external.links</name>
        <value>false</value>      
    </property>

    <property>
        <name>http.timeout</name>
        <value>60000</value>
    </property>

    <property>
        <name>db.max.outlinks.per.page</name>
        <value>-1</value>
    </property>
    

    <property>
        <name>http.redirect.max</name>
        <value>5</value>       
    </property>

    <property>
        <name>db.fetch.interval.max</name>
        <value>7776000</value>      
    </property>

    <property>
        <name>db.max.anchor.length</name>
        <value>20000</value>
    </property>

<property>
  <name>hadoop.job.history.user.location</name>
  <value>/data/data_solr_site/hadoop-history-user</value> 
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <value>5</value> 
</property>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/data_solr_site/hadoop</value>  
</property>

</configuration>




How can I solve this issue? Thanks.


--
View this message in context: http://lucene.472066.n3.nabble.com/IOExeption-when-crawling-with-nutch-in-Fetching-process-tp3756272p3756272.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: IOExeption when crawling with nutch in Fetching process

Posted by remi tassing <ta...@gmail.com>.
Hi,

in my case, I had this issue when I inadvertently tampered with the segment
files.

I had another, similar issue, but clearly different from yours because it
happened right before or after "inject". I realized my regex-urlfilter was
missing a "-" sign (a syntax error).
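For anyone hitting the same syntax problem: every non-comment line in
regex-urlfilter.txt must start with "+" (accept) or "-" (reject) followed by a
regex. A minimal sketch, based on the default file shipped with Nutch:

```
# reject common binary/image suffixes
-\.(gif|jpg|png|ico|css|zip|gz|exe|jpeg|bmp)$
# accept anything else
+.
```

A line missing its leading sign makes the filter plugin fail to parse the file.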

The reasons are probably different, and you'll need to debug this patiently.
As I said, try running a smaller crawl and see what happens.

Remi

On Sat, Feb 25, 2012 at 10:40 AM, hadi <md...@gmail.com> wrote:

> Hi remi
>
> Would you please tell me when this exception occur? is it depends on
> the type of urls or nutch configuration?
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/IOExeption-when-crawling-with-nutch-in-Fetching-process-tp3756272p3774618.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>

Re: IOExeption when crawling with nutch in Fetching process

Posted by remi tassing <ta...@gmail.com>.
Another possibility might be the "tmp" disk space[1]:

"The answer we find addressed the situation is that you're most likely out
of disk space in /tmp. Consider using another location, or possibly another
partition for hadoop.tmp.dir (which can be set in nutch-site.xml) with
plenty of room for large transient files or using a Hadoop cluster."

Remi

[1]:http://wiki.apache.org/nutch/NutchGotchas
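A quick way to act on that advice is to check the free space on the partition
backing hadoop.tmp.dir before a long crawl. This is a sketch; the path and the
10 GB threshold are assumptions you should adjust for your setup:

```shell
# Substitute the value of hadoop.tmp.dir from your nutch-site.xml.
HADOOP_TMP="${HADOOP_TMP:-/tmp}"

# df -Pk gives POSIX output in KB; row 2, column 4 is the available space.
avail_kb=$(df -Pk "$HADOOP_TMP" | awk 'NR==2 {print $4}')
echo "Available on $HADOOP_TMP: ${avail_kb} KB"

# Warn if under ~10 GB; large transient map/reduce files land here.
if [ "$avail_kb" -lt 10485760 ]; then
  echo "WARNING: low free space; Hadoop jobs may die with 'Job failed!'"
fi
```

If the partition fills up mid-job, the Fetcher's reduce phase fails with
exactly the generic "java.io.IOException: Job failed!" seen above.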

On Sat, Feb 25, 2012 at 10:40 AM, hadi <md...@gmail.com> wrote:

> Hi remi
>
> Would you please tell me when this exception occur? is it depends on
> the type of urls or nutch configuration?
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/IOExeption-when-crawling-with-nutch-in-Fetching-process-tp3756272p3774618.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>

Re: IOExeption when crawling with nutch in Fetching process

Posted by hadi <md...@gmail.com>.
Hi remi

Would you please tell me when this exception occurs? Does it depend on
the type of URLs or on the Nutch configuration?


--
View this message in context: http://lucene.472066.n3.nabble.com/IOExeption-when-crawling-with-nutch-in-Fetching-process-tp3756272p3774618.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: IOExeption when crawling with nutch in Fetching process

Posted by remi tassing <ta...@gmail.com>.
Hey Hadi,

I have had this error message several times, for different reasons, but never
because of disk space.

I would suggest you run smaller crawls just to narrow down the issue. Start
with topN 1, then 10, ...
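That bisection can be scripted. A sketch, assuming a Nutch 1.4 checkout with a
'urls' seed directory; the commands are echoed as a dry run, so drop the 'echo'
to actually execute them:

```shell
# Run progressively larger crawls at depth 1 until the failure reproduces;
# each crawl writes to its own -dir so the crawldbs don't interfere.
for topn in 1 10 100; do
  echo bin/nutch crawl urls -dir "crawl-topN$topn" -depth 1 -topN "$topn"
done
```

If topN 1 already fails, suspect configuration; if only the large runs fail,
suspect resources (disk, memory) or a specific bad URL batch.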

Remi

On Sunday, February 19, 2012, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:
> Can you please paste how you have specified your hadoop temp dir. This
> seems to be the cause of such stack trace error's
>
> Thanks
>
> On Sun, Feb 19, 2012 at 7:04 AM, hadi <md...@gmail.com> wrote:
>
>> yes,there is a hadoop log :
>>
>>
>>
>> i search this error but everyone says this error is about low space but i
>> specify a large one
>>
>> --
>> View this message in context:
>>
http://lucene.472066.n3.nabble.com/IOExeption-when-crawling-with-nutch-in-Fetching-process-tp3756272p3757564.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>
>
>
> --
> *Lewis*
>

Re: IOExeption when crawling with nutch in Fetching process

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Can you please paste how you have specified your Hadoop temp dir. This
seems to be the cause of such stack trace errors.

Thanks

On Sun, Feb 19, 2012 at 7:04 AM, hadi <md...@gmail.com> wrote:

> yes,there is a hadoop log :
>
>
>
> i search this error but everyone says this error is about low space but i
> specify a large one
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/IOExeption-when-crawling-with-nutch-in-Fetching-process-tp3756272p3757564.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
*Lewis*

Re: IOExeption when crawling with nutch in Fetching process

Posted by hadi <md...@gmail.com>.
Yes, there is a Hadoop log:



I searched for this error, and everyone says it is caused by low disk space,
but I have already specified a partition with plenty of room.

--
View this message in context: http://lucene.472066.n3.nabble.com/IOExeption-when-crawling-with-nutch-in-Fetching-process-tp3756272p3757564.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: IOExeption when crawling with nutch in Fetching process

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Hadi,

On Sat, Feb 18, 2012 at 1:05 PM, hadi <md...@gmail.com> wrote:

> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: java.io.IOException: Job failed!
>    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
>    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1204)
>    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1240)
>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1213)
>

Is this the only log output you have? How are you running your crawls,
local or distributed?

Lewis