Posted to user@nutch.apache.org by brad <br...@bcs-mail.net> on 2010/09/29 21:08:07 UTC

Error with Hadoop when moving from Local to HDFS Pseudo-Distributed Mode...

I have tried to move from a local instance of Nutch to running Nutch on
Hadoop in pseudo-distributed mode on a single machine.  I set everything up
using the How to Setup Nutch (V1.1) and Hadoop instructions located here:
http://wiki.apache.org/nutch/NutchHadoopTutorial 

Then I moved all my relevant files to HDFS using:

bin/hadoop dfs -put crawl_www/crawldb /crawl_www/crawldb

I then double-checked that the files moved over OK using

bin/hadoop dfs -ls /crawl_www/crawldb

And that worked fine
Found 1 items
drwxr-xr-x   - root supergroup          0 2010-09-28 13:14
/crawl_www/crawldb/current

I went all the way down to the file level and it appears the files exist
bin/hadoop dfs -ls /crawl_www/crawldb/current/part-00000

Found 2 items
-rw-r--r--   1 root supergroup 2375690617 2010-09-28 13:13
/crawl_www/crawldb/current/part-00000/data
-rw-r--r--   1 root supergroup   23784625 2010-09-28 13:14
/crawl_www/crawldb/current/part-00000/index

Also, when I use Firefox to browse the HDFS filesystem at
localhost:50070, everything appears to work perfectly and I can see
everything.

But, when I try a basic test run of Nutch, I get the following:
bin/nutch generate crawl_www/crawldb crawl_www/segments -topN 1000


INFO  crawl.Generator - Generator: starting at 2010-09-29 11:54:15
INFO  crawl.Generator - Generator: Selecting best-scoring urls due for
fetch.
INFO  crawl.Generator - Generator: filtering: true
INFO  crawl.Generator - Generator: normalizing: true
INFO  crawl.Generator - Generator: topN: 1000
ERROR crawl.Generator - Generator:
org.apache.hadoop.mapred.InvalidInputException: 
Input path does not exist:
hdfs://localhost:9000/user/root/crawl_www/crawldb/current


Did I miss a configuration step?  I believe I have checked and
double-checked everything, and it appears correct.

Any ideas?

Note: this is Nutch 1.2 on CentOS 5.5.

Thanks
Brad

RE: Error with Hadoop when moving from Local to HDFS Pseudo-Distributed Mode...

Posted by brad <br...@bcs-mail.net>.
Thanks.  I'll change it when I reconfigure the box. 

-----Original Message-----
From: Andrzej Bialecki [mailto:ab@getopt.org] 
Sent: Wednesday, September 29, 2010 2:01 PM
To: user@nutch.apache.org
Subject: Re: Error with Hadoop when moving from Local to HDFS
Pseudo-Distributed Mode...

On 2010-09-29 21:50, brad wrote:
> Thanks Andrzej.  It did not occur to me that the path would need to 
> change in my scripts.
>
> As for root, is it a risk if I'm just using the box for testing?

No, but IMHO it's a bad habit. Later on you will want to move this to a
production env. and then a few hidden assumptions about being root and being
able to do stuff as you please may bite you... Examples: ulimit, ssh keys,
ports < 1024, file permissions, etc...

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Error with Hadoop when moving from Local to HDFS Pseudo-Distributed Mode...

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-09-29 21:50, brad wrote:
> Thanks Andrzej.  It did not occur to me that the path would need to change
> in my scripts.
>
> As for root, is it a risk if I'm just using the box for testing?

No, but IMHO it's a bad habit. Later on you will want to move this to a 
production env. and then a few hidden assumptions about being root and 
being able to do stuff as you please may bite you... Examples: ulimit, 
ssh keys, ports < 1024, file permissions, etc...
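One of those pitfalls is easy to illustrate: only root may bind ports below
1024, so a service configured while testing as root can break when redeployed
under an ordinary account. A minimal sketch (the helper name is made up):

```shell
#!/bin/sh
# Ports below 1024 are "privileged" on Unix: binding them requires root.
# A root-only test setup hides this; an unprivileged production account
# would fail to bind such a port.
is_privileged_port() { [ "$1" -lt 1024 ]; }

is_privileged_port 80    && echo "port 80: root required"
is_privileged_port 50070 || echo "port 50070: any user may bind"
```

Hadoop's default ports here (50070, 9000) are all unprivileged, which is one
reason a dedicated non-root user works fine.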

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


RE: Error with Hadoop when moving from Local to HDFS Pseudo-Distributed Mode...

Posted by brad <br...@bcs-mail.net>.
Thanks Andrzej.  It did not occur to me that the path would need to change
in my scripts.

As for root, is it a risk if I'm just using the box for testing?

-----Original Message-----
From: Andrzej Bialecki [mailto:ab@getopt.org] 
Sent: Wednesday, September 29, 2010 12:31 PM
To: user@nutch.apache.org
Subject: Re: Error with Hadoop when moving from Local to HDFS
Pseudo-Distributed Mode...

On 2010-09-29 21:08, brad wrote:
> I have tried to move from a local instance of Nutch to a 
> Pseudo-Distributed Mode Hadoop Nutch on a single machine.  I set 
> everything up using the How to Setup Nutch (V1.1) and Hadoop instructions
located here:
> http://wiki.apache.org/nutch/NutchHadoopTutorial
>
> Then I moved all my relevant files to the HDFS using:
>
> bin/hadoop dfs -put crawl_www/crawldb /crawl_www/crawldb
>
> I then double checked the files moved ok using
>
> bin/hadoop dfs -ls /crawl_www/crawldb
>
> And that worked fine
> Found 1 items
> drwxr-xr-x   - root supergroup          0 2010-09-28 13:14
> /crawl_www/crawldb/current
>
> I went all the way down to the file level and it appears the files exist
> bin/hadoop dfs -ls /crawl_www/crawldb/current/part-00000
>
> Found 2 items
> -rw-r--r--   1 root supergroup 2375690617 2010-09-28 13:13
> /crawl_www/crawldb/current/part-00000/data
> -rw-r--r--   1 root supergroup   23784625 2010-09-28 13:14
> /crawl_www/crawldb/current/part-00000/index
>
> Also, when I use firefox to browse the hdfs filesystem using 
> localhost:50070, everything appears to work perfectly and I can see 
> everything.
>
> But, when I try a basic test run of Nutch, I get the following:
> bin/nutch generate crawl_www/crawldb crawl_www/segments -topN 1000
>
>
> INFO  crawl.Generator - Generator: starting at 2010-09-29 11:54:15 
> INFO  crawl.Generator - Generator: Selecting best-scoring urls due for 
> fetch.
> INFO  crawl.Generator - Generator: filtering: true
> INFO  crawl.Generator - Generator: normalizing: true
> INFO  crawl.Generator - Generator: topN: 1000
> ERROR crawl.Generator - Generator:
> org.apache.hadoop.mapred.InvalidInputException:
> Input path does not exist:
> hdfs://localhost:9000/user/root/crawl_www/crawldb/current
>
>
> Did I miss a configuration step?  I believe I have checked and
> double-checked everything, and it appears correct.
>
> Any ideas?

Yes - you missed the leading slash in your path. The command lines you
quote above use a relative path (no leading slash), and Hadoop resolves
relative paths against your HDFS home directory, which is /user/${whoami}

By the way, I would strongly advise against running Hadoop as root.


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Error with Hadoop when moving from Local to HDFS Pseudo-Distributed Mode...

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-09-29 21:08, brad wrote:
> I have tried to move from a local instance of Nutch to a Pseudo-Distributed
> Mode Hadoop Nutch on a single machine.  I set everything up using the How to
> Setup Nutch (V1.1) and Hadoop instructions located here:
> http://wiki.apache.org/nutch/NutchHadoopTutorial
>
> Then I moved all my relevant files to the HDFS using:
>
> bin/hadoop dfs -put crawl_www/crawldb /crawl_www/crawldb
>
> I then double checked the files moved ok using
>
> bin/hadoop dfs -ls /crawl_www/crawldb
>
> And that worked fine
> Found 1 items
> drwxr-xr-x   - root supergroup          0 2010-09-28 13:14
> /crawl_www/crawldb/current
>
> I went all the way down to the file level and it appears the files exist
> bin/hadoop dfs -ls /crawl_www/crawldb/current/part-00000
>
> Found 2 items
> -rw-r--r--   1 root supergroup 2375690617 2010-09-28 13:13
> /crawl_www/crawldb/current/part-00000/data
> -rw-r--r--   1 root supergroup   23784625 2010-09-28 13:14
> /crawl_www/crawldb/current/part-00000/index
>
> Also, when I use firefox to browse the hdfs filesystem using
> localhost:50070, everything appears to work perfectly and I can see
> everything.
>
> But, when I try a basic test run of Nutch, I get the following:
> bin/nutch generate crawl_www/crawldb crawl_www/segments -topN 1000
>
>
> INFO  crawl.Generator - Generator: starting at 2010-09-29 11:54:15
> INFO  crawl.Generator - Generator: Selecting best-scoring urls due for
> fetch.
> INFO  crawl.Generator - Generator: filtering: true
> INFO  crawl.Generator - Generator: normalizing: true
> INFO  crawl.Generator - Generator: topN: 1000
> ERROR crawl.Generator - Generator:
> org.apache.hadoop.mapred.InvalidInputException:
> Input path does not exist:
> hdfs://localhost:9000/user/root/crawl_www/crawldb/current
>
>
> Did I miss a configuration step?  I believe I have checked and
> double-checked everything, and it appears correct.
>
> Any ideas?

Yes - you missed the leading slash in your path. The command lines you
quote above use a relative path (no leading slash), and Hadoop resolves
relative paths against your HDFS home directory, which is /user/${whoami}
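That resolution rule can be sketched in plain shell (illustrative only; the
function name is made up, but the rule matches how HDFS treats path
arguments):

```shell
#!/bin/sh
# Sketch of how HDFS resolves a path argument: absolute paths (leading
# slash) are used as-is; relative paths are prefixed with the user's
# HDFS home directory, /user/<username>. Running as root, the relative
# path from the original command therefore lands under /user/root.
resolve_hdfs_path() {
  user="$1"; path="$2"
  case "$path" in
    /*) echo "$path" ;;                 # absolute: used verbatim
    *)  echo "/user/$user/$path" ;;     # relative: resolved against home
  esac
}

resolve_hdfs_path root crawl_www/crawldb    # -> /user/root/crawl_www/crawldb
resolve_hdfs_path root /crawl_www/crawldb   # -> /crawl_www/crawldb
```

So the data uploaded to /crawl_www/crawldb and the job's input path
/user/root/crawl_www/crawldb never coincide; adding the leading slash (or
uploading under /user/root instead) makes them match.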

By the way, I would strongly advise against running Hadoop as root.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


RE: Error with Hadoop when moving from Local to HDFS Pseudo-Distributed Mode...

Posted by brad <br...@bcs-mail.net>.
If you mean did I run bin/start-all.sh, then yes.  If you mean something
else, then no.

I believe the hadoop daemon is running, since I can browse the hadoop
NameNode filesystem...
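Browsing the NameNode UI is a reasonable check; another common one is looking
for the HDFS daemons in `jps` output. A sketch of that check, factored into a
function so it can run here against sample output (the PIDs are invented; in
real use, feed it "$(jps)" from the box):

```shell
#!/bin/sh
# Check whether a daemon name appears in jps-style "PID Name" output.
has_daemon() { printf '%s\n' "$2" | grep -qw "$1"; }

sample='12345 NameNode
12360 DataNode
12392 SecondaryNameNode
12400 Jps'

has_daemon NameNode "$sample" && echo "NameNode is up"
has_daemon DataNode "$sample" && echo "DataNode is up"
```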

-----Original Message-----
From: Steve Cohen [mailto:mail4steve@gmail.com] 
Sent: Wednesday, September 29, 2010 12:17 PM
To: user@nutch.apache.org
Subject: Re: Error with Hadoop when moving from Local to HDFS
Pseudo-Distributed Mode...

Did you start up the hadoop daemon?

On Wed, Sep 29, 2010 at 3:08 PM, brad <br...@bcs-mail.net> wrote:

> I have tried to move from a local instance of Nutch to a 
> Pseudo-Distributed Mode Hadoop Nutch on a single machine.  I set 
> everything up using the How to Setup Nutch (V1.1) and Hadoop 
> instructions located here:
> http://wiki.apache.org/nutch/NutchHadoopTutorial
>
> Then I moved all my relevant files to the HDFS using:
>
> bin/hadoop dfs -put crawl_www/crawldb /crawl_www/crawldb
>
> I then double checked the files moved ok using
>
> bin/hadoop dfs -ls /crawl_www/crawldb
>
> And that worked fine
> Found 1 items
> drwxr-xr-x   - root supergroup          0 2010-09-28 13:14
> /crawl_www/crawldb/current
>
> I went all the way down to the file level and it appears the files exist
> bin/hadoop dfs -ls /crawl_www/crawldb/current/part-00000
>
> Found 2 items
> -rw-r--r--   1 root supergroup 2375690617 2010-09-28 13:13
> /crawl_www/crawldb/current/part-00000/data
> -rw-r--r--   1 root supergroup   23784625 2010-09-28 13:14
> /crawl_www/crawldb/current/part-00000/index
>
> Also, when I use firefox to browse the hdfs filesystem using 
> localhost:50070, everything appears to work perfectly and I can see 
> everything.
>
> But, when I try a basic test run of Nutch, I get the following:
> bin/nutch generate crawl_www/crawldb crawl_www/segments -topN 1000
>
>
> INFO  crawl.Generator - Generator: starting at 2010-09-29 11:54:15 
> INFO  crawl.Generator - Generator: Selecting best-scoring urls due for 
> fetch.
> INFO  crawl.Generator - Generator: filtering: true
> INFO  crawl.Generator - Generator: normalizing: true
> INFO  crawl.Generator - Generator: topN: 1000
> ERROR crawl.Generator - Generator:
> org.apache.hadoop.mapred.InvalidInputException:
> Input path does not exist:
> hdfs://localhost:9000/user/root/crawl_www/crawldb/current
>
>
> Did I miss a configuration step?  I believe I have checked and
> double-checked everything, and it appears correct.
>
> Any ideas?
>
> Note: this is Nutch 1.2 on CentOS 5.5.
>
> Thanks
> Brad
>


Re: Error with Hadoop when moving from Local to HDFS Pseudo-Distributed Mode...

Posted by Steve Cohen <ma...@gmail.com>.
Did you start up the hadoop daemon?

On Wed, Sep 29, 2010 at 3:08 PM, brad <br...@bcs-mail.net> wrote:

> I have tried to move from a local instance of Nutch to a Pseudo-Distributed
> Mode Hadoop Nutch on a single machine.  I set everything up using the How
> to
> Setup Nutch (V1.1) and Hadoop instructions located here:
> http://wiki.apache.org/nutch/NutchHadoopTutorial
>
> Then I moved all my relevant files to the HDFS using:
>
> bin/hadoop dfs -put crawl_www/crawldb /crawl_www/crawldb
>
> I then double checked the files moved ok using
>
> bin/hadoop dfs -ls /crawl_www/crawldb
>
> And that worked fine
> Found 1 items
> drwxr-xr-x   - root supergroup          0 2010-09-28 13:14
> /crawl_www/crawldb/current
>
> I went all the way down to the file level and it appears the files exist
> bin/hadoop dfs -ls /crawl_www/crawldb/current/part-00000
>
> Found 2 items
> -rw-r--r--   1 root supergroup 2375690617 2010-09-28 13:13
> /crawl_www/crawldb/current/part-00000/data
> -rw-r--r--   1 root supergroup   23784625 2010-09-28 13:14
> /crawl_www/crawldb/current/part-00000/index
>
> Also, when I use firefox to browse the hdfs filesystem using
> localhost:50070, everything appears to work perfectly and I can see
> everything.
>
> But, when I try a basic test run of Nutch, I get the following:
> bin/nutch generate crawl_www/crawldb crawl_www/segments -topN 1000
>
>
> INFO  crawl.Generator - Generator: starting at 2010-09-29 11:54:15
> INFO  crawl.Generator - Generator: Selecting best-scoring urls due for
> fetch.
> INFO  crawl.Generator - Generator: filtering: true
> INFO  crawl.Generator - Generator: normalizing: true
> INFO  crawl.Generator - Generator: topN: 1000
> ERROR crawl.Generator - Generator:
> org.apache.hadoop.mapred.InvalidInputException:
> Input path does not exist:
> hdfs://localhost:9000/user/root/crawl_www/crawldb/current
>
>
> Did I miss a configuration step?  I believe I have checked and
> double-checked everything, and it appears correct.
>
> Any ideas?
>
> Note: this is Nutch 1.2 on CentOS 5.5.
>
> Thanks
> Brad
>