Posted to user@pig.apache.org by Mohnish Kodnani <mo...@gmail.com> on 2012/09/25 21:43:57 UTC

HAR file and path globbing

Hi,
I am trying to pass multiple paths to a Pig script using path globbing with
HAR files, and it does not seem to work. I wanted to know whether this is
expected behavior, a bug, or something that needs a feature request.

Command:
x = LOAD 'har:///a/b/c/{d.har,e.har}/z/ab/*' using PigStorage('\t');

This gives an error due to the curly braces being encoded to %7B and %7D.
I am trying this on Pig 0.8.0.

ERROR 2017: Internal error creating job configuration.

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias blah
        at org.apache.pig.PigServer.openIterator(PigServer.java:765)
        at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:615)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:76)
        at org.apache.pig.Main.run(Main.java:455)
        at org.apache.pig.Main.main(Main.java:107)
Caused by: org.apache.pig.PigException: ERROR 1002: Unable to store alias blah
        at org.apache.pig.PigServer.storeEx(PigServer.java:889)
        at org.apache.pig.PigServer.store(PigServer.java:827)
        at org.apache.pig.PigServer.openIterator(PigServer.java:739)
        ... 7 more
Caused by: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCreationException: ERROR 2017: Internal error creating job configuration.
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:679)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:256)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:147)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:382)
        at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1209)
        at org.apache.pig.PigServer.storeEx(PigServer.java:885)
        ... 9 more
Caused by: java.io.IOException: Invalid path for the Har Filesystem. har:///user/cronusapp/cassini_downsample_logs/prod/2012/09/%7B22.har,23.har%7D/00/*
        at org.apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:100)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1563)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:225)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:183)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:348)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:317)
        at org.apache.pig.builtin.PigStorage.setLocation(PigStorage.java:219)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:369)
        ... 14 more
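
For reference, %7B and %7D in the exception message are the percent-encoded
forms of { and }. The sketch below uses Python's urllib.parse, not the Java
URI class that Hadoop relies on, so it only illustrates the encoding itself.
The set of "safe" characters is an assumption, chosen so that the round trip
reproduces the string from the log.

```python
from urllib.parse import quote, unquote

# Path exactly as printed in the "Invalid path for the Har Filesystem" message.
encoded = ("har:///user/cronusapp/cassini_downsample_logs/prod/2012/09/"
           "%7B22.har,23.har%7D/00/*")

# Decoding restores the original brace glob.
decoded = unquote(encoded)
print(decoded)

# Re-encoding everything except the characters listed in `safe` (an assumed
# set for this illustration) turns the braces back into %7B and %7D.
reencoded = quote(decoded, safe="/:,*")
print(reencoded)
```

The glob survives the round trip unchanged, so the encoded form carries the
same information as the path that was typed in the LOAD statement.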

Re: HAR file and path globbing

Posted by Mohnish Kodnani <mo...@gmail.com>.

It seems that when the last element of the file path is a wildcard and the
har protocol is used, the number of combined input paths comes out as 0.
I get the following output when trying the example given below.
2012-09-27 09:22:28,074 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
2012-09-27 09:22:28,074 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
2012-09-27 09:22:28,147 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: x: Store(hdfs://nn/tmp/temp1300843291/tmp-1282091819:org.apache.pig.impl.io.InterStorage) - scope-2 Operator Key: scope-2)
2012-09-27 09:22:28,155 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-09-27 09:22:28,189 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2012-09-27 09:22:28,189 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2012-09-27 09:22:28,268 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-09-27 09:22:28,280 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-09-27 09:22:30,055 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-09-27 09:22:30,096 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-09-27 09:22:30,597 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2012-09-27 09:22:46,428 [Thread-6] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : *21667*
2012-09-27 09:22:46,431 [Thread-6] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : *21667*
2012-09-27 09:22:46,440 [Thread-6] INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - Loaded native gpl library
2012-09-27 09:22:46,443 [Thread-6] INFO com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev 335fea4fecb385745e9a6f2de174a5b26fbc6cae]
2012-09-27 09:24:04,257 [Thread-6] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - *Total input paths (combined) to process : 0*

It seems that MapRedUtil returns 0 paths to process after it tries to find
proper splits.


On Thu, Sep 27, 2012 at 9:30 AM, Mohnish Kodnani <mo...@gmail.com> wrote:

> Any ideas on how I can figure out where things are not working or is this
> expected behavior.
> new observation :
>
> 1. It seems Path Globbing does not work with HAR Files with Pig, is this
> intentional ? For example :
> hadoop fs -ls har:///x/y/z/{a.har,b.har}/* works and lists all the files
> in both har files. If I give the same path as input path to a pig script
> it does not seem to work.
>
> 2. Wildcards in HAR path.
>     Like the above example if I do the following on hadoop fs it works
> hadoop fs -ls har://x/y/*/a.har/*
> This lists all files from all folders that have a.har
>
> If I give the path input path to pig it does not work. I have tried these
> 2 things on pig 0.8
> Also, for the second use case. If I remove the last wild card where files
> should be, then it works.
> For example input path to pig :
> har://x/y/*/a.har/logFile
>
> then pig can read the file and give me records back, but wild card at the
> last location does not work.
>
> Any insights would be great around if this should or should not work. I
> have 30000 files in one folder inside the har, I cannot list each one and
> want to use wildcard as the last element in the path and use path globbing
> to provide multiple har files.
>
>
> thanks
> mohnish

Re: HAR file and path globbing

Posted by Mohnish Kodnani <mo...@gmail.com>.

Any ideas on how I can figure out where things are going wrong, or is this
expected behavior?
New observations:

1. Path globbing does not seem to work with HAR files in Pig. Is this
intentional? For example:
hadoop fs -ls har:///x/y/z/{a.har,b.har}/* works and lists all the files in
both HAR files. If I give the same path as the input path to a Pig script, it
does not seem to work.

2. Wildcards in a HAR path.
    As in the example above, the following works with hadoop fs:
hadoop fs -ls har://x/y/*/a.har/*
This lists all files from all folders that contain a.har.

If I give the same path as the input path to Pig, it does not work. I have
tried both of these on Pig 0.8.
Also, for the second use case, if I remove the last wildcard (where the files
would be), it works.
For example, with this input path to Pig:
har://x/y/*/a.har/logFile

Pig can read the file and give me records back, but a wildcard in the last
position does not work.

Any insight into whether this should or should not work would be great. I
have 30000 files in one folder inside the HAR, so I cannot list each one. I
want to use a wildcard as the last element of the path and use path globbing
to provide multiple HAR files.


thanks
mohnish

On Wed, Sep 26, 2012 at 10:44 AM, Mohnish Kodnani <mohnish.kodnani@gmail.com> wrote:

> I think its pig related because if i do hadoop fs -ls on the har file path
> with input globbing it works fine.

Re: HAR file and path globbing

Posted by Mohnish Kodnani <mo...@gmail.com>.

I think it's Pig-related, because if I run hadoop fs -ls on the HAR file path
with globbing, it works fine.


On Tue, Sep 25, 2012 at 7:45 PM, Cheolsoo Park <ch...@cloudera.com> wrote:

> Sounds like I was wrong. ;-)
>
> You might get a better answer from hadoop user group since this is more
> related to HarFileSystem than Pig I think.
>
> Thanks,
> Cheolsoo

Re: HAR file and path globbing

Posted by Cheolsoo Park <ch...@cloudera.com>.

Sounds like I was wrong. ;-)

You might get a better answer from the Hadoop user group, since I think this
is more related to HarFileSystem than to Pig.

Thanks,
Cheolsoo
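
One possible reading of the "Invalid path for the Har Filesystem" error,
offered as an assumption and not confirmed anywhere in this thread: the HAR
filesystem has to identify a single archive in the path, for example by
looking for a path component that ends in .har. If so, an unexpanded glob
component such as {22.har,23.har} ends in } and would not qualify. The helper
archive_path below is hypothetical and written in Python for illustration; it
is not Hadoop code.

```python
def archive_path(path):
    """Return the prefix of `path` up to its deepest component ending in
    '.har', or None if no component qualifies.

    Hypothetical helper: this models an assumed lookup rule, it is not the
    actual HarFileSystem implementation.
    """
    components = path.strip("/").split("/")
    # Walk from the deepest component towards the root.
    for end in range(len(components), 0, -1):
        if components[end - 1].endswith(".har"):
            return "/" + "/".join(components[:end])
    return None


# A plain archive path: the archive is found.
print(archive_path("/a/b/b/22.har/00/*"))

# The brace glob from the stack trace: no component ends in ".har".
print(archive_path(
    "/user/cronusapp/cassini_downsample_logs/prod/2012/09/{22.har,23.har}/00/*"))
```

Under that assumption, the glob would need to be expanded into concrete
archive paths before the filesystem is initialized for each one.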

On Tue, Sep 25, 2012 at 6:20 PM, Mohnish Kodnani <mo...@gmail.com> wrote:

> Hi Chelsoo,
> thanks for replying. On the same system the following works :
>
> x = load 'har:///a/b/b/22.har/00/*,har:///a/b/c/d/23.har/00/*' using
> PigStorage('\t');
>
> Two separate file paths with har protocol work.
>
> A single path works but if I do the following I get an error.
> x = LOAD 'har:///a/b/c/{d.har,e.har}/z/ab/*' using PigStorage('\t');
>
> Thanks
> Mohnish

Re: HAR file and path globbing

Posted by Mohnish Kodnani <mo...@gmail.com>.

Hi Cheolsoo,
Thanks for replying. On the same system, the following works:

x = load 'har:///a/b/b/22.har/00/*,har:///a/b/c/d/23.har/00/*' using PigStorage('\t');

So two separate, comma-separated paths with the har scheme work.

A single path also works, but the following brace glob gives an error:
x = LOAD 'har:///a/b/c/{d.har,e.har}/z/ab/*' using PigStorage('\t');
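
Since the comma-separated form is reported to work, one possible workaround is to expand the brace group before the path reaches the LOAD statement. The following is only a sketch: the class name HarGlobExpander and the method expand are hypothetical helpers, not part of Pig or Hadoop, and the code handles a single, non-nested {a,b,...} group only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Pig or Hadoop API): rewrites one brace group
// into the comma-separated path list that is reported to work with har paths.
class HarGlobExpander {

    // Expands the first {a,b,...} group into one path per alternative.
    // Paths without a complete brace group are returned unchanged.
    static List<String> expand(String path) {
        List<String> out = new ArrayList<>();
        int open = path.indexOf('{');
        int close = path.indexOf('}', open + 1);
        if (open < 0 || close < 0) {
            out.add(path);
            return out;
        }
        String prefix = path.substring(0, open);
        String suffix = path.substring(close + 1);
        for (String alternative : path.substring(open + 1, close).split(",")) {
            out.add(prefix + alternative + suffix);
        }
        return out;
    }

    public static void main(String[] args) {
        String glob = "har:///a/b/c/{d.har,e.har}/z/ab/*";
        // prints har:///a/b/c/d.har/z/ab/*,har:///a/b/c/e.har/z/ab/*
        System.out.println(String.join(",", expand(glob)));
    }
}
```

The joined string could then be passed to the script as the LOAD location, for example through a Pig parameter. Whether that fits depends on how the script is launched.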

Thanks
Mohnish


Re: HAR file and path globbing

Posted by Cheolsoo Park <ch...@cloudera.com>.

Hi Mohnish,

I am not very familiar with har files, so I might be wrong here.

Looking at the call stack, the exception is thrown from initialize(URI
name, Configuration conf) in HarFileSystem.java. In the source code, the
comment on this method says the following:

> Initialize a Har filesystem per har archive. The
> archive home directory is the top level directory
> in the filesystem that contains the HAR archive.


This suggests to me that HarFileSystem expects a path to a single archive.


> This gives error due to the curly braces being encoded to %7B and %7D.


The encoded curly braces should be fine, though. In fact, leaving them
unencoded would be a problem, because the Java URI class would then throw a
URISyntaxException.
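
The point about java.net.URI can be checked with a small standalone program. This is a minimal sketch using only the JDK; the class name BraceUriCheck and the method parses are made up for illustration, and the program says nothing about how Pig or Hadoop build their URIs internally.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustrative check: does java.net.URI accept the string as given?
class BraceUriCheck {

    // Returns true if the single-argument URI constructor parses the string.
    static boolean parses(String candidate) {
        try {
            new URI(candidate);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Raw curly braces are illegal URI characters: prints false
        System.out.println(parses("har:///a/b/c/{d.har,e.har}/z/ab/*"));
        // Percent-encoded braces are accepted: prints true
        System.out.println(parses("har:///a/b/c/%7Bd.har,e.har%7D/z/ab/*"));
    }
}
```

So the %7B and %7D in the error message are what a valid URI has to contain, which fits the reading that the failure comes from HarFileSystem receiving a glob instead of a single archive path.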

Hope that this helps,
Cheolsoo

