Posted to user@pig.apache.org by "James R. Leek" <le...@llnl.gov> on 2009/11/24 08:47:38 UTC

Pig isn't reading my HDFS?

Hi, I seem to be having an odd problem with pig.  At least I haven't 
found any documentation on it.  I've been using hadoop 0.20.1 to do some 
parsing of my data, and I thought pig might be a good tool to process 
what comes out.  I got pig 0.5.0, which seemed to be working OK until I 
tried to read from my HDFS.  Pig only seems to be reading from my local 
file system.  (Well, actually it's NFS.)

Anyway, pig starts up and says:

2009-11-23 23:12:28,799 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - 
Connecting to hadoop file system at: file:///

but this is a lie.  When I try to access a file from it (my hdfs is 
mounted at /data/pig/dfs.  /data is a local drive on each node in the 
cluster) I get:

grunt> virus5 = load '/data/pig/dfs/virus_output/part-r-00000';
grunt> dump virus5;
2009-11-23 23:15:11,377 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer 
- MR plan size before optimization: 1
2009-11-23 23:15:11,377 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer 
- MR plan size after optimization: 1
2009-11-23 23:15:14,356 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler 
- Setting up single store job
2009-11-23 23:15:14,386 [main] INFO  
org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics 
with processName=JobTracker, sessionId= - already initialized
2009-11-23 23:15:14,402 [Thread-5] WARN  
org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for 
parsing the arguments. Applications should implement Tool for the same.
2009-11-23 23:15:14,890 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 0% complete
2009-11-23 23:15:14,891 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 100% complete
2009-11-23 23:15:14,891 [main] ERROR 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 1 map reduce job(s) failed!
2009-11-23 23:15:14,917 [main] ERROR 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- Failed to produce result in: "file:/tmp/temp1663096198/tmp782307025"
2009-11-23 23:15:14,917 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- Failed!
2009-11-23 23:15:14,923 [main] ERROR org.apache.pig.tools.grunt.Grunt - 
ERROR 2100: file:/data/pig/dfs/virus_output/part-r-00000 does not exist.
Details at logfile: /home/leek2/pig_1259046747727.log

This does work with hadoop, however:

hadoop dfs -ls /data/pig/dfs/virus_output/part-r-00000
Found 1 items
-rw-r--r--   1 leek2 supergroup 1151360535 2009-11-23 15:58 
/data/pig/dfs/virus_output/part-r-00000


I can read from my local file system just fine, though.  Is Pig even 
connecting to the hadoop cluster?  Does anyone know what I'm doing wrong?

Thanks,
Jim

Re: Pig isn't reading my HDFS?

Posted by James Leek <le...@llnl.gov>.
> Anyway, I'm getting this error from pig:
> 2009-11-24 07:11:46,585 [main] ERROR 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
> - Failed to produce result in: 
> "hdfs://tuson118:9100/tmp/temp2106131810/tmp-1474283808"
>
> This makes sense because there is no /tmp in my hdfs mount.  
> Everything starts from /data/pig/dfs, even inside hadoop.  I assume 
> this is some sort of configuration issue (shouldn't the hdfs 
> internally start from '/'?).  The documentation all seems to refer to 
> earlier versions of hadoop.
Whoops.  Never mind on this.  It turns out the problem was that my log 
directory was full.  I cleaned it out and moved it to a more reasonable 
spot, and that solved the problem.

Thanks,
Jim

Re: Pig isn't reading my HDFS?

Posted by Alan Gates <ga...@yahoo-inc.com>.
Rather than using Pig's wiki, the best documentation these days is in 
the Documents tab on the left side of the page, with a section for each 
release.  If you find things in the documentation section confusing or 
inadequate, please file JIRAs so we can make it better.  Thanks.

Alan.

On Nov 24, 2009, at 8:17 AM, James R. Leek wrote:

> However, I am a bit confused about configuration files.  I can't 
> find anything about it in the pig wiki.  (I'm having trouble 
> navigating it, really.)  What documentation there is always 
> mentions $PIG_HOME/conf, which doesn't seem to exist in Pig 0.5.0.  
> Neither does $HADOOP_HOME/conf/hadoop-site.xml seem to exist in 
> hadoop 0.20.1, which the hadoop wiki specifically mentions.
>
> Jim


Re: Pig isn't reading my HDFS?

Posted by "James R. Leek" <le...@llnl.gov>.
> It seems like Pig is not finding the Hadoop configuration files on 
> its classpath when it starts up.
> Assuming you have installed Hadoop 0.20 somewhere on your local 
> filesystem, say <hadoop-20 installation directory>, please add the 
> following to your classpath:
> export HADOOPDIR=<hadoop installation directory>/conf
> export PIG_CLASSPATH=$PIG_HOME/pig.jar:$HADOOPDIR
Thanks, that seems to have worked.  It never occurred to me that the 
configuration directory should go in the CLASSPATH.
> This should solve your problem if all your configuration has been 
> done as per the instructions in the pig wiki.

However, I am a bit confused about configuration files.  I can't find 
anything about it in the pig wiki.  (I'm having trouble navigating it, 
really.)  What documentation there is always mentions $PIG_HOME/conf, 
which doesn't seem to exist in Pig 0.5.0.  Neither does 
$HADOOP_HOME/conf/hadoop-site.xml seem to exist in hadoop 0.20.1, which 
the hadoop wiki specifically mentions.
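
(As far as I can tell, hadoop 0.20 split the old conf/hadoop-site.xml 
into three separate files, which may be why the older instructions no 
longer match.  A stock conf directory looks roughly like this; the 
comments are my guesses at which settings live where:

$HADOOP_HOME/conf/core-site.xml    # fs.default.name, e.g. hdfs://tuson118:9100
$HADOOP_HOME/conf/hdfs-site.xml    # dfs.* settings
$HADOOP_HOME/conf/mapred-site.xml  # mapred.job.tracker and other job settings

Presumably these are the files Pig needs to see on its classpath.)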

Anyway, I'm getting this error from pig:

2009-11-24 07:11:46,585 [main] ERROR 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- Failed to produce result in: 
"hdfs://tuson118:9100/tmp/temp2106131810/tmp-1474283808"


This makes sense because there is no /tmp in my hdfs mount.  Everything 
starts from /data/pig/dfs, even inside hadoop.  I assume this is some 
sort of configuration issue (shouldn't the hdfs internally start from 
'/'?).  The documentation all seems to refer to earlier versions of hadoop.
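
(Presumably one could check what HDFS itself thinks lives at its root 
by handing the dfs shell the full URI from the error message above, e.g.:

hadoop dfs -ls hdfs://tuson118:9100/
hadoop dfs -ls hdfs://tuson118:9100/tmp

to see whether /tmp exists in the HDFS namespace at all.)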

Jim

Re: Pig isn't reading my HDFS?

Posted by Pratyush Banerjee <pr...@aol.com>.
Hi James,

It seems like Pig is not finding the Hadoop configuration files on its 
classpath when it starts up.
Assuming you have installed Hadoop 0.20 somewhere on your local 
filesystem, say <hadoop-20 installation directory>, please add the 
following to your classpath:
export HADOOPDIR=<hadoop installation directory>/conf
export PIG_CLASSPATH=$PIG_HOME/pig.jar:$HADOOPDIR

This should solve your problem if all your configuration has been done 
as per the instructions in the pig wiki.
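
For example, a complete session might look like this (the installation 
paths below are hypothetical; substitute wherever hadoop and pig 
actually live on your machine):

export PIG_HOME=/usr/local/pig-0.5.0               # hypothetical install path
export HADOOPDIR=/usr/local/hadoop-0.20.1/conf     # hypothetical install path
export PIG_CLASSPATH=$PIG_HOME/pig.jar:$HADOOPDIR  # pig.jar plus the hadoop conf dir
pig

On startup, grunt should then report something like 
"Connecting to hadoop file system at: hdfs://<namenode>:<port>" 
instead of "file:///".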

Thanks and regards,
Pratyush
