Posted to user@pig.apache.org by Something Something <ma...@gmail.com> on 2012/02/03 07:07:20 UTC

Pig/Avro Question

In my Pig script I have something like this...

%default MY_SCHEMA '/user/xyz/my-schema.json';

%default MY_AVRO 'org.apache.pig.piggybank.storage.avro.AvroStorage(\'$MY_SCHEMA\')';

my_files = LOAD '$MY_FILES' USING $MY_AVRO;



What I have noticed is that when MY_FILES contains only one file, it works
fine.

%default MY_FILES '/user/xyz/file1.avro'


But when I use a comma separated list it doesn't work. e.g.

%default MY_FILES '/user/xyz/file1.avro, /user/xyz/file2.avro'

Basically, I get a message saying something like 'Schema cannot be found'.

Is there a way to make it work with multiple files?  Please let me know.
Thanks.
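
[Editor's sketch] Note the space after the comma in the list above. This thread does not say whether AvroStorage trims it, so the following is only an illustration of how a loader could turn such a location string into clean paths before resolving a schema. It is not AvroStorage code; the class LocationSplitter and the method splitLocations are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

public class LocationSplitter {
    // Hypothetical helper (not part of Pig or piggybank): splits a
    // comma separated location string and trims stray whitespace,
    // skipping empty entries.
    static List<String> splitLocations(String location) {
        List<String> paths = new ArrayList<>();
        for (String part : location.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                paths.add(trimmed);
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        // The same value as MY_FILES in the failing case above.
        System.out.println(
            splitLocations("/user/xyz/file1.avro, /user/xyz/file2.avro"));
    }
}
```

Without the trim, the second entry would keep its leading space, and a path that starts with a space would not be the same path as /user/xyz/file2.avro.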

Re: Pig/Avro Question

Posted by Russell Jurney <ru...@gmail.com>.
Wow, thanks a ton!

On Fri, Feb 3, 2012 at 1:17 PM, Stan Rosenberg <srosenberg@proclivitysystems.com> wrote:

> Check the code in PigAvroInputFormat; it overrides 'listStatus' from
> FileInputFormat so that files not ending
> in .avro are filtered.
>
> stan
>
> On Fri, Feb 3, 2012 at 1:58 PM, Russell Jurney <ru...@gmail.com> wrote:
> > btw - the weird thing is... I've read the code.  There isn't a filter for
> > .avro in there.  Does Hadoop, or Avro itself (not that I can see it is
> > involved) do so?
> >
> > On Fri, Feb 3, 2012 at 10:55 AM, Russell Jurney <russell.jurney@gmail.com> wrote:
> >
> >> Hmmm I applied it, but I still can't open files that don't end in .avro
> >>
> >> On Fri, Feb 3, 2012 at 2:23 AM, Philipp <ph...@metrigo.de> wrote:
> >>
> >>> This patch fixes this issue:
> >>>
> >>> https://issues.apache.org/jira/browse/PIG-2492
> >>>
> >>>
> >>>
> >>> On 02/03/2012 07:22 AM, Russell Jurney wrote:
> >>>
> >>>> I have the same bug. I read the code... there is no obvious fix.  Arg.
> >>>>
> >>>> On Feb 2, 2012, at 10:07 PM, Something Something <ma...@gmail.com> wrote:
> >>>>
> >>>>  In my Pig script I have something like this...
> >>>>>
> >>>>> %default MY_SCHEMA '/user/xyz/my-schema.json';
> >>>>>
> >>>>> %default MY_AVRO 'org.apache.pig.piggybank.storage.avro.AvroStorage(\'$MY_SCHEMA\')';
> >>>>>
> >>>>> my_files = LOAD '$MY_FILES' USING $MY_AVRO;
> >>>>>
> >>>>>
> >>>>>
> >>>>> What I have noticed is that when MY_FILES contains only one file, it
> >>>>> works fine.
> >>>>>
> >>>>> %default MY_FILES '/user/xyz/file1.avro'
> >>>>>
> >>>>>
> >>>>> But when I use a comma separated list it doesn't work. e.g.
> >>>>>
> >>>>> %default MY_FILES '/user/xyz/file1.avro, /user/xyz/file2.avro'
> >>>>>
> >>>>> Basically, I get a message saying something like 'Schema cannot be
> >>>>> found'.
> >>>>>
> >>>>> Is there a way to make it work with multiple files?  Please let me
> >>>>> know.  Thanks.
> >>>>>
> >>>>>
> >>>
> >>
> >>
> >> --
> >> Russell Jurney
> >> twitter.com/rjurney
> >> russell.jurney@gmail.com
> >> datasyndrome.com
> >>
> >
> >
> >
> > --
> > Russell Jurney
> > twitter.com/rjurney
> > russell.jurney@gmail.com
> > datasyndrome.com
>



-- 
Russell Jurney
twitter.com/rjurney
russell.jurney@gmail.com
datasyndrome.com

Re: Pig/Avro Question

Posted by Stan Rosenberg <sr...@proclivitysystems.com>.
Check the code in PigAvroInputFormat; it overrides 'listStatus' from
FileInputFormat so that files not ending
in .avro are filtered.
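
[Editor's sketch] For readers who have not seen this pattern, here is a minimal illustration of the kind of suffix filtering described above. It is not the PigAvroInputFormat source. It works on plain path strings instead of Hadoop's file status objects so that it runs standalone, and the names AvroSuffixFilter and keepAvroFiles are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class AvroSuffixFilter {
    // Hypothetical helper: keeps only the paths that end in ".avro",
    // similar in spirit to what an overridden listStatus could do
    // with the files it finds in an input directory.
    static List<String> keepAvroFiles(List<String> paths) {
        List<String> kept = new ArrayList<>();
        for (String path : paths) {
            if (path.endsWith(".avro")) {
                kept.add(path);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> listed = List.of(
            "/user/xyz/file1.avro",
            "/user/xyz/part-00000",
            "/user/xyz/file2.avro");
        // Only the two .avro paths survive the filter.
        System.out.println(keepAvroFiles(listed));
    }
}
```

A filter of this shape would explain why files without the .avro suffix are silently skipped rather than reported as an error.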

stan

On Fri, Feb 3, 2012 at 1:58 PM, Russell Jurney <ru...@gmail.com> wrote:
> btw - the weird thing is... I've read the code.  There isn't a filter for
> .avro in there.  Does Hadoop, or Avro itself (not that I can see it is
> involved) do so?
>
> On Fri, Feb 3, 2012 at 10:55 AM, Russell Jurney <ru...@gmail.com> wrote:
>
>> Hmmm I applied it, but I still can't open files that don't end in .avro
>>
>> On Fri, Feb 3, 2012 at 2:23 AM, Philipp <ph...@metrigo.de> wrote:
>>
>>> This patch fixes this issue:
>>>
>>> https://issues.apache.org/jira/browse/PIG-2492
>>>
>>>
>>>
>>> On 02/03/2012 07:22 AM, Russell Jurney wrote:
>>>
>>>> I have the same bug. I read the code... there is no obvious fix.  Arg.
>>>>
>>>> On Feb 2, 2012, at 10:07 PM, Something Something <ma...@gmail.com> wrote:
>>>>
>>>>  In my Pig script I have something like this...
>>>>>
>>>>> %default MY_SCHEMA '/user/xyz/my-schema.json';
>>>>>
>>>>> %default MY_AVRO 'org.apache.pig.piggybank.storage.avro.AvroStorage(\'$MY_SCHEMA\')';
>>>>>
>>>>> my_files = LOAD '$MY_FILES' USING $MY_AVRO;
>>>>>
>>>>>
>>>>>
>>>>> What I have noticed is that when MY_FILES contains only one file, it
>>>>> works fine.
>>>>>
>>>>> %default MY_FILES '/user/xyz/file1.avro'
>>>>>
>>>>>
>>>>> But when I use a comma separated list it doesn't work. e.g.
>>>>>
>>>>> %default MY_FILES '/user/xyz/file1.avro, /user/xyz/file2.avro'
>>>>>
>>>>> Basically, I get a message saying something like 'Schema cannot be
>>>>> found'.
>>>>>
>>>>> Is there a way to make it work with multiple files?  Please let me
>>>>> know.  Thanks.
>>>>>
>>>>>
>>>
>>
>>
>> --
>> Russell Jurney
>> twitter.com/rjurney
>> russell.jurney@gmail.com
>> datasyndrome.com
>>
>
>
>
> --
> Russell Jurney
> twitter.com/rjurney
> russell.jurney@gmail.com
> datasyndrome.com

Re: Pig/Avro Question

Posted by Russell Jurney <ru...@gmail.com>.
btw - the weird thing is... I've read the code.  There isn't a filter for
.avro in there.  Does Hadoop, or Avro itself (not that I can see it is
involved) do so?

On Fri, Feb 3, 2012 at 10:55 AM, Russell Jurney <ru...@gmail.com> wrote:

> Hmmm I applied it, but I still can't open files that don't end in .avro
>
> On Fri, Feb 3, 2012 at 2:23 AM, Philipp <ph...@metrigo.de> wrote:
>
>> This patch fixes this issue:
>>
>> https://issues.apache.org/jira/browse/PIG-2492
>>
>>
>>
>> On 02/03/2012 07:22 AM, Russell Jurney wrote:
>>
>>> I have the same bug. I read the code... there is no obvious fix.  Arg.
>>>
>>> On Feb 2, 2012, at 10:07 PM, Something Something <ma...@gmail.com> wrote:
>>>
>>>  In my Pig script I have something like this...
>>>>
>>>> %default MY_SCHEMA '/user/xyz/my-schema.json';
>>>>
>>>> %default MY_AVRO 'org.apache.pig.piggybank.storage.avro.AvroStorage(\'$MY_SCHEMA\')';
>>>>
>>>> my_files = LOAD '$MY_FILES' USING $MY_AVRO;
>>>>
>>>>
>>>>
>>>> What I have noticed is that when MY_FILES contains only one file, it
>>>> works fine.
>>>>
>>>> %default MY_FILES '/user/xyz/file1.avro'
>>>>
>>>>
>>>> But when I use a comma separated list it doesn't work. e.g.
>>>>
>>>> %default MY_FILES '/user/xyz/file1.avro, /user/xyz/file2.avro'
>>>>
>>>> Basically, I get a message saying something like 'Schema cannot be
>>>> found'.
>>>>
>>>> Is there a way to make it work with multiple files?  Please let me
>>>> know.  Thanks.
>>>>
>>>>
>>
>
>
> --
> Russell Jurney
> twitter.com/rjurney
> russell.jurney@gmail.com
> datasyndrome.com
>



-- 
Russell Jurney
twitter.com/rjurney
russell.jurney@gmail.com
datasyndrome.com

Re: Pig/Avro Question

Posted by Russell Jurney <ru...@gmail.com>.
Hmmm I applied it, but I still can't open files that don't end in .avro

On Fri, Feb 3, 2012 at 2:23 AM, Philipp <ph...@metrigo.de> wrote:

> This patch fixes this issue:
>
> https://issues.apache.org/jira/browse/PIG-2492
>
>
>
> On 02/03/2012 07:22 AM, Russell Jurney wrote:
>
>> I have the same bug. I read the code... there is no obvious fix.  Arg.
>>
>> On Feb 2, 2012, at 10:07 PM, Something Something <ma...@gmail.com> wrote:
>>
>>  In my Pig script I have something like this...
>>>
>>> %default MY_SCHEMA '/user/xyz/my-schema.json';
>>>
>>> %default MY_AVRO 'org.apache.pig.piggybank.storage.avro.AvroStorage(\'$MY_SCHEMA\')';
>>>
>>> my_files = LOAD '$MY_FILES' USING $MY_AVRO;
>>>
>>>
>>>
>>> What I have noticed is that when MY_FILES contains only one file, it
>>> works fine.
>>>
>>> %default MY_FILES '/user/xyz/file1.avro'
>>>
>>>
>>> But when I use a comma separated list it doesn't work. e.g.
>>>
>>> %default MY_FILES '/user/xyz/file1.avro, /user/xyz/file2.avro'
>>>
>>> Basically, I get a message saying something like 'Schema cannot be
>>> found'.
>>>
>>> Is there a way to make it work with multiple files?  Please let me know.
>>>  Thanks.
>>>
>>>
>


-- 
Russell Jurney
twitter.com/rjurney
russell.jurney@gmail.com
datasyndrome.com

Re: Pig/Avro Question

Posted by Philipp <ph...@metrigo.de>.
This patch fixes this issue:

https://issues.apache.org/jira/browse/PIG-2492


On 02/03/2012 07:22 AM, Russell Jurney wrote:
> I have the same bug. I read the code... there is no obvious fix.  Arg.
>
> On Feb 2, 2012, at 10:07 PM, Something Something<ma...@gmail.com>  wrote:
>
>> In my Pig script I have something like this...
>>
>> %default MY_SCHEMA '/user/xyz/my-schema.json';
>>
>> %default MY_AVRO 'org.apache.pig.piggybank.storage.avro.AvroStorage(\'$MY_SCHEMA\')';
>>
>> my_files = LOAD '$MY_FILES' USING $MY_AVRO;
>>
>>
>>
>> What I have noticed is that when MY_FILES contains only one file, it works fine.
>>
>> %default MY_FILES '/user/xyz/file1.avro'
>>
>>
>> But when I use a comma separated list it doesn't work. e.g.
>>
>> %default MY_FILES '/user/xyz/file1.avro, /user/xyz/file2.avro'
>>
>> Basically, I get a message saying something like 'Schema cannot be found'.
>>
>> Is there a way to make it work with multiple files?  Please let me know.  Thanks.
>>


Re: Pig/Avro Question

Posted by Russell Jurney <ru...@gmail.com>.
I have the same bug. I read the code... there is no obvious fix.  Arg.

On Feb 2, 2012, at 10:07 PM, Something Something <ma...@gmail.com> wrote:

> In my Pig script I have something like this...
> 
> %default MY_SCHEMA '/user/xyz/my-schema.json';
> 
> %default MY_AVRO 'org.apache.pig.piggybank.storage.avro.AvroStorage(\'$MY_SCHEMA\')';
> 
> my_files = LOAD '$MY_FILES' USING $MY_AVRO;
> 
> 
> 
> What I have noticed is that when MY_FILES contains only one file, it works fine.
> 
> %default MY_FILES '/user/xyz/file1.avro'
> 
> 
> But when I use a comma separated list it doesn't work. e.g.
> 
> %default MY_FILES '/user/xyz/file1.avro, /user/xyz/file2.avro'
> 
> Basically, I get a message saying something like 'Schema cannot be found'.
> 
> Is there a way to make it work with multiple files?  Please let me know.  Thanks.
> 
