Posted to user@nutch.apache.org by Iwan Cornelius <iw...@pixolut.com> on 2008/01/09 00:50:48 UTC

Problem running latest nutch release

Hi there,

I'm having problems running the latest release of nutch. I get the following
error when I try to crawl:

Fetcher: segment: crawl/segments/20080109183955
Fetcher: java.io.IOException: Target
/tmp/hadoop-me/mapred/local/localRunner/job_local_1.xml already exists
        at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:246)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:125)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:116)
        at org.apache.hadoop.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:55)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:834)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:86)
        at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:281)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:558)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:526)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:561)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:54)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:533)

If I manually remove the offending directory it works... sometimes.
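
For the record, the manual cleanup is roughly the following command; the temp
path is the one from the error above:

rm -rf /tmp/hadoop-me/mapred/local/localRunner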

Any help is appreciated.

Regards,
Iwan

Re: Problem running latest nutch release

Posted by Iwan Cornelius <iw...@pixolut.com>.
There is a bug in hadoop 0.15.0 which has been fixed in 0.16; see this
issue:
https://issues.apache.org/jira/browse/HADOOP-1642

Any chance of updating the version of hadoop that is used by nutch?
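
Until the upgrade, the workaround is clearing the stale file before each run.
Here is a minimal sketch of doing that programmatically (illustrative only:
the class name is made up and this is not the actual LocalJobRunner code; the
path is the one from my error, so adjust it if your hadoop.tmp.dir differs):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanLocalRunner {
    public static void main(String[] args) throws Exception {
        // Leftover job config from a killed or crashed local run.
        Path stale = new Path("/tmp/hadoop-me/mapred/local/localRunner/job_local_1.xml");
        FileSystem localFs = FileSystem.getLocal(new Configuration());
        if (localFs.exists(stale)) {
            // Clears the target that FileUtil.checkDest() reports as existing.
            localFs.delete(stale);
        }
    }
}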



On 1/10/08, Iwan Cornelius <iw...@pixolut.com> wrote:
>
> I have included the property with a value of 'false' in hadoop-site.xml, so
> it should be off.
>
>
> On 1/9/08, Dennis Kubes <kubes@apache.org> wrote:
> >
> > Are you running with speculative execution on?
> >
> > Dennis
> >
> > Iwan Cornelius wrote:
> > > Hi Susam,
> > >
> > > I get this error for both cases 1 and 2.
> > >
> > > I think it's due to running hadoop in local mode (i.e. single machine). It
> > > seems it's always giving a jobid of 1. I've been using only a single thread
> > > so I'm not sure why this is; then again I don't really understand how the
> > > whole nutch/hadoop system works ...
> > >
> > > The weird thing is, sometimes the script (both yours and bin/nutch) will run
> > > all the way through, sometimes for 1 or 2 "depths" of a crawl, sometimes
> > > for the injecting of urls. It's seemingly random.
> > >
> > > I've found nothing online to help out.
> > >
> > > Thanks Susam!
> > >
> > > On 1/9/08, Susam Pal <su...@gmail.com> wrote:
> > >> I haven't really worked with the latest trunk. But I am wondering if ...
> > >>
> > >> 1. you get this error when you kill a crawl while it is running, i.e.
> > >> the unfinished crawl is killed and then you start a new crawl
> > >>
> > >> 2. you get this error when you crawl using 'bin/nutch crawl' command
> > >> as well as the crawl script?
> > >>
> > >> Regards,
> > >> Susam Pal
> > >>
> > >>> Hi there,
> > >>>
> > >>> I'm having problems running the latest release of nutch. I get the
> > >>> following error when I try to crawl:
> > >>>
> > >>> Fetcher: segment: crawl/segments/20080109183955
> > >>> Fetcher: java.io.IOException: Target
> > >>> /tmp/hadoop-me/mapred/local/localRunner/job_local_1.xml already exists
> > >>>         at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:246)
> > >>>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:125)
> > >>>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:116)
> > >>>         at org.apache.hadoop.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:55)
> > >>>         at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:834)
> > >>>         at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:86)
> > >>>         at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:281)
> > >>>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:558)
> > >>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
> > >>>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:526)
> > >>>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:561)
> > >>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >>>         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:54)
> > >>>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:533)
> > >>>
> > >>> If I manually remove the offending directory it works... sometimes.
> > >>>
> > >>> Any help is appreciated.
> > >>>
> > >>> Regards,
> > >>> Iwan
> > >>>
> > >
> >
>
>

Re: Problem running latest nutch release

Posted by Iwan Cornelius <iw...@pixolut.com>.
I have included the property with a value of 'false' in hadoop-site.xml, so it
should be off.
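
For completeness, the relevant entry in my hadoop-site.xml is:

<property>
<name>mapred.speculative.execution</name>
<value>false</value>
<description>If true, then multiple instances of some map tasks may be
executed in parallel.</description>
</property>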


On 1/9/08, Dennis Kubes <ku...@apache.org> wrote:
>
> Are you running with speculative execution on?
>
> Dennis
>
> Iwan Cornelius wrote:
> > Hi Susam,
> >
> > I get this error for both cases 1 and 2.
> >
> > I think it's due to running hadoop in local mode (i.e. single machine). It
> > seems it's always giving a jobid of 1. I've been using only a single thread
> > so I'm not sure why this is; then again I don't really understand how the
> > whole nutch/hadoop system works ...
> >
> > The weird thing is, sometimes the script (both yours and bin/nutch) will run
> > all the way through, sometimes for 1 or 2 "depths" of a crawl, sometimes
> > for the injecting of urls. It's seemingly random.
> >
> > I've found nothing online to help out.
> >
> > Thanks Susam!
> >
> > On 1/9/08, Susam Pal <su...@gmail.com> wrote:
> >> I haven't really worked with the latest trunk. But I am wondering if ...
> >>
> >> 1. you get this error when you kill a crawl while it is running, i.e.
> >> the unfinished crawl is killed and then you start a new crawl
> >>
> >> 2. you get this error when you crawl using 'bin/nutch crawl' command
> >> as well as the crawl script?
> >>
> >> Regards,
> >> Susam Pal
> >>
> >>> Hi there,
> >>>
> >>> I'm having problems running the latest release of nutch. I get the
> >>> following error when I try to crawl:
> >>>
> >>> Fetcher: segment: crawl/segments/20080109183955
> >>> Fetcher: java.io.IOException: Target
> >>> /tmp/hadoop-me/mapred/local/localRunner/job_local_1.xml already exists
> >>>         at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:246)
> >>>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:125)
> >>>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:116)
> >>>         at org.apache.hadoop.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:55)
> >>>         at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:834)
> >>>         at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:86)
> >>>         at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:281)
> >>>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:558)
> >>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
> >>>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:526)
> >>>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:561)
> >>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>>         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:54)
> >>>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:533)
> >>>
> >>> If I manually remove the offending directory it works... sometimes.
> >>>
> >>> Any help is appreciated.
> >>>
> >>> Regards,
> >>> Iwan
> >>>
> >
>

Re: Problem running latest nutch release

Posted by Iwan Cornelius <iw...@pixolut.com>.
My bad, the directory is now set correctly, but I STILL have the following
error. Any ideas?

Exception in thread "main" java.io.IOException: Target
/home/usrname/tmp/hadoop-usrname/mapred/local/localRunner/job_local_1.xml
already exists


On 1/14/08, Andrzej Bialecki <ab...@getopt.org> wrote:
>
> Iwan Cornelius wrote:
> > OK I now have a related problem:
> >
> > I don't think the file conf/hadoop-site.xml is being read at all! I've
> > altered the hadoop.tmp.dir property below, but the tmp files are still going
> > in the default location.  I suspect the other property is not being set
> > either; hence it's still running with speculative execution on.
> >
> > <?xml version="1.0"?>
> > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> >
> > <!-- Put site-specific property overrides in this file. -->
> >
> > <configuration>
> >
> > <property>
> > <name>hadoop.temp.dir</name>
>
> This should be "hadoop.tmp.dir" - without the middle "e".
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

Re: Problem running latest nutch release

Posted by Andrzej Bialecki <ab...@getopt.org>.
Iwan Cornelius wrote:
> OK I now have a related problem:
> 
> I don't think the file conf/hadoop-site.xml is being read at all! I've
> altered the hadoop.tmp.dir property below, but the tmp files are still going
> in the default location.  I suspect the other property is not being set
> either; hence it's still running with speculative execution on.
> 
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> 
> <!-- Put site-specific property overrides in this file. -->
> 
> <configuration>
> 
> <property>
> <name>hadoop.temp.dir</name>

This should be "hadoop.tmp.dir" - without the middle "e".
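
That is, the property block should read:

<property>
<name>hadoop.tmp.dir</name>
<value>/home/username/tmp/hadoop-username/</value>
</property>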


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Problem running latest nutch release

Posted by Iwan Cornelius <iw...@pixolut.com>.
OK I now have a related problem:

I don't think the file conf/hadoop-site.xml is being read at all! I've
altered the hadoop.tmp.dir property below, but the tmp files are still going
in the default location.  I suspect the other property is not being set
either; hence it's still running with speculative execution on.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
<name>hadoop.temp.dir</name>
<value>/home/username/tmp/hadoop-username/</value>
<description></description>
</property>

<property>
<name>mapred.speculative.execution</name>
<value>false</value>
<description>If true, then multiple instances of some map tasks may be
executed in parallel.</description>
</property>

</configuration>



On 1/9/08, Dennis Kubes <ku...@apache.org> wrote:
>
> Are you running with speculative execution on?
>
> Dennis
>
> Iwan Cornelius wrote:
> > Hi Susam,
> >
> > I get this error for both cases 1 and 2.
> >
> > I think it's due to running hadoop in local mode (i.e. single machine). It
> > seems it's always giving a jobid of 1. I've been using only a single thread
> > so I'm not sure why this is; then again I don't really understand how the
> > whole nutch/hadoop system works ...
> >
> > The weird thing is, sometimes the script (both yours and bin/nutch) will run
> > all the way through, sometimes for 1 or 2 "depths" of a crawl, sometimes
> > for the injecting of urls. It's seemingly random.
> >
> > I've found nothing online to help out.
> >
> > Thanks Susam!
> >
> > On 1/9/08, Susam Pal <su...@gmail.com> wrote:
> >> I haven't really worked with the latest trunk. But I am wondering if ...
> >>
> >> 1. you get this error when you kill a crawl while it is running, i.e.
> >> the unfinished crawl is killed and then you start a new crawl
> >>
> >> 2. you get this error when you crawl using 'bin/nutch crawl' command
> >> as well as the crawl script?
> >>
> >> Regards,
> >> Susam Pal
> >>
> >>> Hi there,
> >>>
> >>> I'm having problems running the latest release of nutch. I get the
> >>> following error when I try to crawl:
> >>>
> >>> Fetcher: segment: crawl/segments/20080109183955
> >>> Fetcher: java.io.IOException: Target
> >>> /tmp/hadoop-me/mapred/local/localRunner/job_local_1.xml already exists
> >>>         at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:246)
> >>>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:125)
> >>>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:116)
> >>>         at org.apache.hadoop.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:55)
> >>>         at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:834)
> >>>         at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:86)
> >>>         at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:281)
> >>>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:558)
> >>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
> >>>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:526)
> >>>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:561)
> >>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>>         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:54)
> >>>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:533)
> >>>
> >>> If I manually remove the offending directory it works... sometimes.
> >>>
> >>> Any help is appreciated.
> >>>
> >>> Regards,
> >>> Iwan
> >>>
> >
>

Re: Problem running latest nutch release

Posted by Dennis Kubes <ku...@apache.org>.
Are you running with speculative execution on?

Dennis

Iwan Cornelius wrote:
> Hi Susam,
> 
> I get this error for both cases 1 and 2.
> 
> I think it's due to running hadoop in local mode (i.e. single machine). It
> seems it's always giving a jobid of 1. I've been using only a single thread
> so I'm not sure why this is; then again I don't really understand how the
> whole nutch/hadoop system works ...
>
> The weird thing is, sometimes the script (both yours and bin/nutch) will run
> all the way through, sometimes for 1 or 2 "depths" of a crawl, sometimes
> for the injecting of urls. It's seemingly random.
> 
> I've found nothing online to help out.
> 
> Thanks Susam!
> 
> On 1/9/08, Susam Pal <su...@gmail.com> wrote:
>> I haven't really worked with the latest trunk. But I am wondering if ...
>>
>> 1. you get this error when you kill a crawl while it is running, i.e.
>> the unfinished crawl is killed and then you start a new crawl
>>
>> 2. you get this error when you crawl using 'bin/nutch crawl' command
>> as well as the crawl script?
>>
>> Regards,
>> Susam Pal
>>
>>> Hi there,
>>>
>>> I'm having problems running the latest release of nutch. I get the
>>> following error when I try to crawl:
>>>
>>> Fetcher: segment: crawl/segments/20080109183955
>>> Fetcher: java.io.IOException: Target
>>> /tmp/hadoop-me/mapred/local/localRunner/job_local_1.xml already exists
>>>         at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:246)
>>>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:125)
>>>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:116)
>>>         at org.apache.hadoop.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:55)
>>>         at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:834)
>>>         at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:86)
>>>         at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:281)
>>>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:558)
>>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
>>>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:526)
>>>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:561)
>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:54)
>>>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:533)
>>>
>>> If I manually remove the offending directory it works... sometimes.
>>>
>>> Any help is appreciated.
>>>
>>> Regards,
>>> Iwan
>>>
> 

Re: Problem running latest nutch release

Posted by Iwan Cornelius <iw...@pixolut.com>.
Hi Susam,

I get this error for both cases 1 and 2.

I think it's due to running hadoop in local mode (i.e. single machine). It
seems it's always giving a jobid of 1. I've been using only a single thread
so I'm not sure why this is; then again I don't really understand how the
whole nutch/hadoop system works ...

The weird thing is, sometimes the script (both yours and bin/nutch) will run
all the way through, sometimes for 1 or 2 "depths" of a crawl, sometimes
for the injecting of urls. It's seemingly random.

I've found nothing online to help out.

Thanks Susam!

On 1/9/08, Susam Pal <su...@gmail.com> wrote:
>
> I haven't really worked with the latest trunk. But I am wondering if ...
>
> 1. you get this error when you kill a crawl while it is running, i.e.
> the unfinished crawl is killed and then you start a new crawl
>
> 2. you get this error when you crawl using 'bin/nutch crawl' command
> as well as the crawl script?
>
> Regards,
> Susam Pal
>
> > Hi there,
> >
> > I'm having problems running the latest release of nutch. I get the
> > following error when I try to crawl:
> >
> > Fetcher: segment: crawl/segments/20080109183955
> > Fetcher: java.io.IOException: Target
> > /tmp/hadoop-me/mapred/local/localRunner/job_local_1.xml already exists
> >         at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:246)
> >         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:125)
> >         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:116)
> >         at org.apache.hadoop.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:55)
> >         at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:834)
> >         at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:86)
> >         at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:281)
> >         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:558)
> >         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
> >         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:526)
> >         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:561)
> >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:54)
> >         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:533)
> >
> > If I manually remove the offending directory it works... sometimes.
> >
> > Any help is appreciated.
> >
> > Regards,
> > Iwan
> >
>

Re: Problem running latest nutch release

Posted by Susam Pal <su...@gmail.com>.
I haven't really worked with the latest trunk. But I am wondering if ...

1. you get this error when you kill a crawl while it is running, i.e.
the unfinished crawl is killed and then you start a new crawl

2. you get this error when you crawl using 'bin/nutch crawl' command
as well as the crawl script?

Regards,
Susam Pal

On Jan 9, 2008 5:20 AM, Iwan Cornelius <iw...@pixolut.com> wrote:
> Hi there,
>
> I'm having problems running the latest release of nutch. I get the following
> error when I try to crawl:
>
> Fetcher: segment: crawl/segments/20080109183955
> Fetcher: java.io.IOException: Target
> /tmp/hadoop-me/mapred/local/localRunner/job_local_1.xml already exists
>         at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:246)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:125)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:116)
>         at org.apache.hadoop.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:55)
>         at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:834)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:86)
>         at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:281)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:558)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:526)
>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:561)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:54)
>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:533)
>
> If I manually remove the offending directory it works... sometimes.
>
> Any help is appreciated.
>
> Regards,
> Iwan
>