Posted to user@pig.apache.org by Jonathan Holloway <jo...@gmail.com> on 2011/06/16 03:57:14 UTC

PigStorage and ElephantBird's JsonLoader - InputFormat

Hi all,

I was wondering whether somebody could explain how Pig deals with nested
directories of log files, something like:

/logs/2011-01-01/a.log
/logs/2011-01-01/b.log
/logs/2011-01-01/c.log

I'm pretty sure that if I give a Pig script the /logs directory as input, it
will successfully process all the logs (a.log, b.log, c.log) within that
structure.

However, I'm seeing a discrepancy with JsonLoader in Elephant Bird: if I do
the same thing there, it fails with the following error:

Backend error message
---------------------
java.io.IOException: Cannot open filename /logs/2011-01-01
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1497)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1488)
        at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:376)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:178)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:356)
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:67)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.initialize(PigRecordReader.java:176)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:418)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:620)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

Pig Stack Trace
---------------
ERROR 2998: Unhandled internal error. Cannot open filename /logs/2011-01-01

java.io.IOException: Cannot open filename /logs/2011-01-01
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1497)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1488)
        at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:376)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:178)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:356)
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:67)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.initialize(PigRecordReader.java:176)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:418)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:620)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
================================================================================
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.PigMain], exit code [2]

I think JsonLoader currently returns a TextInputFormat, whereas PigStorage
can handle nested directories because it returns a PigTextInputFormat, which
uses the MapRedUtil.getAllFileRecursively() workaround for MAPREDUCE-1577.
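
For illustration, the idea behind that workaround (expanding each input
directory until only plain files remain, so that no record reader is ever
asked to open a directory) can be sketched in plain Java. This is not Pig's
MapRedUtil code: it assumes a local filesystem accessed through java.nio.file
rather than Hadoop's FileSystem API, and the names RecursiveInputListing and
listFilesRecursively are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Illustration only (hypothetical class): expands input paths down to plain files. */
public class RecursiveInputListing {

    /** Returns every regular file under the given paths, descending into directories. */
    public static List<Path> listFilesRecursively(List<Path> inputs) {
        List<Path> files = new ArrayList<>();
        for (Path input : inputs) {
            collect(input, files);
        }
        return files;
    }

    private static void collect(Path path, List<Path> out) {
        if (Files.isDirectory(path)) {
            // A directory is never returned as an input itself; recurse into its children.
            try (DirectoryStream<Path> children = Files.newDirectoryStream(path)) {
                for (Path child : children) {
                    collect(child, out);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        } else if (Files.isRegularFile(path)) {
            // Only plain files end up in the result, so each one can be opened and read.
            out.add(path);
        }
    }
}
```

Given /logs as the only input, a listing like this would yield the three
.log files under /logs/2011-01-01 rather than the 2011-01-01 directory, which
is the path the stack trace above shows failing to open.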

Can anybody confirm this is actually the case, and whether there's some sort
of workaround for it?

I'm using Pig 0.8.0, Apache Hadoop 0.20.2 and Oozie 3.0.0.

Many thanks in advance,
Jon.

Re: PigStorage and ElephantBird's JsonLoader - InputFormat

Posted by Jonathan Holloway <jo...@gmail.com>.
Thanks Dmitriy. I extended JsonLoader and overrode getInputFormat() to return
a PigTextInputFormat; all is fine for now as a workaround.

Cheers,
Jon.

On 16 June 2011 05:45, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> Yep, that's the problem. I will make it use the pigtextinputformat instead.
> Did the same thing for Lzo but not the uncompressed version.

Re: PigStorage and ElephantBird's JsonLoader - InputFormat

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Yep, that's the problem. I will make it use PigTextInputFormat instead. I did the same thing for the LZO version but not for the uncompressed one.
