Posted to user@pig.apache.org by Tim Sell <tr...@gmail.com> on 2013/01/07 20:56:30 UTC

JsonLoader schema field order shouldn't matter

When using JsonLoader with Pig 0.10.0

if I have an input.json file that looks like this:

{"date": "2007-08-25", "id": 16}
{"date": "2007-09-08", "id": 17}
{"date": "2007-09-15", "id": 18}

And I use

a = LOAD 'input.json' USING JsonLoader('id:int,date:chararray');
DUMP a;

I get errors when it tries to force the date fields into an integer.

Shouldn't this work independent of the ordering of the schema fields?
JSON writers generally don't make guarantees about the ordering.

One alternative (though annoying) would be to use elephant-bird
instead, but I can't get that to compile against Hadoop 2.0.0 and Pig
0.10.0.

~Tim

Re: JsonLoader schema field order shouldn't matter

Posted by Tim Sell <tr...@gmail.com>.

Hmm,
I was using pretty much the same setup and got errors complaining
about Counter being an interface when it expected a class.
I'll try again with the jars straight out of maven tomorrow. Thanks.

~T

On 7 January 2013 21:32, meghana narasimhan
<me...@gmail.com> wrote:
> Hi Tim,
>
> We are using elephant-bird 3.0.2 with hadoop-2.0.0-mr1-cdh4.1.1
> and pig-0.10.0-cdh4.1.1. We are using the jar available in the maven repo.
> Didn't have to build it ourselves.
>
> - Meg

Re: JsonLoader schema field order shouldn't matter

Posted by meghana narasimhan <me...@gmail.com>.

Hi Tim,

We are using elephant-bird 3.0.2 with hadoop-2.0.0-mr1-cdh4.1.1
and pig-0.10.0-cdh4.1.1. We are using the jar available in the maven repo.
Didn't have to build it ourselves.

- Meg


Re: JsonLoader schema field order shouldn't matter

Posted by Ruslan Al-Fakikh <me...@gmail.com>.

Tim,

have you resolved the issue of using the elephant-bird with pig 0.10?

meghana,

I am using just the same configuration:
pig -version
Apache Pig version 0.10.0-cdh4.1.1 (rexported)
hadoop version
Hadoop 2.0.0-cdh4.1.1
and getting just the same error as Tim explained:
java.lang.IncompatibleClassChangeError: Found interface
org.apache.hadoop.mapreduce.Counter, but class was expected

Can you please give an example of your Pig script? I am running it with the
following commands:
REGISTER elephant-bird-pig-3.0.2.jar;
inputData = LOAD 'sample_simple.json' USING
com.twitter.elephantbird.pig.load.JsonLoader() as (json:map[]);
DUMP inputData;

Thanks in advance


On Fri, Jan 11, 2013 at 7:35 AM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> Tim, can you open a github issue with EB about compiling against 0.10?
> I think this is an easy fix.
>
>
> On Tue, Jan 8, 2013 at 9:38 AM, Alan Gates <ga...@hortonworks.com> wrote:
>
> > I would open a new JIRA, since 1914 is focussed on building an alternative
> > that discovers schema, while you are wanting to improve the existing one.
> >
> > Alan.
> >
> > On Jan 7, 2013, at 5:02 PM, Tim Sell wrote:
> >
> > > This seems like a bug to me. It makes it risky to work with JSON data
> > > generated by something other than Pig since the ordering might change.
> > > What do you think?
> > >
> > > I didn't see a bug for it in Jira, so would this (still open) one be
> > > the place to mention it? Or should I make a new one?
> > > https://issues.apache.org/jira/browse/PIG-1914
> > >
> > > ~T
> > >
> > >
> > > On 7 January 2013 20:24, Alan Gates <ga...@hortonworks.com> wrote:
> > >> Currently the JsonLoader does assume ordering of the fields.  It does
> > >> not do any name matching against the given schema to find the right field.
> > >>
> > >> Alan.

Re: JsonLoader schema field order shouldn't matter

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Tim, can you open a github issue with EB about compiling against 0.10?
I think this is an easy fix.


On Tue, Jan 8, 2013 at 9:38 AM, Alan Gates <ga...@hortonworks.com> wrote:

> I would open a new JIRA, since 1914 is focussed on building an alternative
> that discovers schema, while you are wanting to improve the existing one.
>
> Alan.
>
> On Jan 7, 2013, at 5:02 PM, Tim Sell wrote:
>
> > This seems like a bug to me. It makes it risky to work with JSON data
> > generated by something other than Pig since the ordering might change.
> > What do you think?
> >
> > I didn't see a bug for it in Jira, so would this (still open) one be
> > the place to mention it? Or should I make a new one?
> > https://issues.apache.org/jira/browse/PIG-1914
> >
> > ~T
> >
> >
> > On 7 January 2013 20:24, Alan Gates <ga...@hortonworks.com> wrote:
> >> Currently the JsonLoader does assume ordering of the fields.  It does
> >> not do any name matching against the given schema to find the right field.
> >>
> >> Alan.

Re: JsonLoader schema field order shouldn't matter

Posted by Alan Gates <ga...@hortonworks.com>.

I would open a new JIRA, since 1914 is focussed on building an alternative that discovers schema, while you are wanting to improve the existing one.

Alan.

On Jan 7, 2013, at 5:02 PM, Tim Sell wrote:

> This seems like a bug to me. It makes it risky to work with JSON data
> generated by something other than Pig since the ordering might change.
> What do you think?
> 
> I didn't see a bug for it in Jira, so would this (still open) one be
> the place to mention it? Or should I make a new one?
> https://issues.apache.org/jira/browse/PIG-1914
> 
> ~T
> 
> 
> On 7 January 2013 20:24, Alan Gates <ga...@hortonworks.com> wrote:
>> Currently the JsonLoader does assume ordering of the fields.  It does not do any name matching against the given schema to find the right field.
>> 
>> Alan.

Re: JsonLoader schema field order shouldn't matter

Posted by Tim Sell <tr...@gmail.com>.

This seems like a bug to me. It makes it risky to work with JSON data
generated by something other than Pig since the ordering might change.
What do you think?

I didn't see a bug for it in Jira, so would this (still open) one be
the place to mention it? Or should I make a new one?
https://issues.apache.org/jira/browse/PIG-1914

~T
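
One way to reduce the ordering risk described above is to normalize the
key order of each record before Pig reads the file. The following is a
minimal sketch in Python, not anything shipped with Pig. FIELD_ORDER,
reorder_line and reorder_lines are hypothetical names chosen for this
example, and it assumes one JSON object per line, as in input.json.

```python
import json

# Field order the loader's schema expects (an assumption for this example,
# mirroring JsonLoader('id:int,date:chararray')).
FIELD_ORDER = ["id", "date"]


def reorder_line(line, field_order=FIELD_ORDER):
    """Re-serialize one JSON object with its keys in field_order.

    Keys missing from the record are written as null; keys not listed in
    field_order are dropped.
    """
    record = json.loads(line)
    ordered = {name: record.get(name) for name in field_order}
    return json.dumps(ordered)


def reorder_lines(lines, field_order=FIELD_ORDER):
    """Yield reordered lines, skipping blank ones."""
    for line in lines:
        if line.strip():
            yield reorder_line(line, field_order)


sample = [
    '{"date": "2007-08-25", "id": 16}',
    '{"date": "2007-09-08", "id": 17}',
]
for out in reorder_lines(sample):
    print(out)
```

This relies on Python dicts keeping insertion order (Python 3.7 and
later) and on json.dumps writing keys in that order. It only moves the
problem to a preprocessing step; it does not change how JsonLoader
itself matches fields.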


On 7 January 2013 20:24, Alan Gates <ga...@hortonworks.com> wrote:
> Currently the JsonLoader does assume ordering of the fields.  It does not do any name matching against the given schema to find the right field.
>
> Alan.

Re: JsonLoader schema field order shouldn't matter

Posted by Alan Gates <ga...@hortonworks.com>.

Currently the JsonLoader does assume ordering of the fields.  It does not do any name matching against the given schema to find the right field.

Alan.
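
To illustrate the difference described above, here is a minimal sketch
in Python of positional matching versus name matching. It is a model of
the behavior only, not JsonLoader's actual code. SCHEMA,
load_by_position and load_by_name are hypothetical names, and the int
and str casts stand in for Pig's int and chararray types.

```python
import json

# Hypothetical schema, mirroring JsonLoader('id:int,date:chararray').
SCHEMA = [("id", int), ("date", str)]


def load_by_position(line, schema=SCHEMA):
    """Assign the i-th JSON value to the i-th schema field, ignoring keys."""
    values = list(json.loads(line).values())
    return tuple(cast(value) for (_, cast), value in zip(schema, values))


def load_by_name(line, schema=SCHEMA):
    """Look each schema field up by its key, so input order is irrelevant."""
    record = json.loads(line)
    return tuple(cast(record[name]) for name, cast in schema)


line = '{"date": "2007-08-25", "id": 16}'

# Name matching finds "id" and "date" wherever they appear in the record.
print(load_by_name(line))

# Positional matching tries to cast the first value, the date string, to int.
try:
    load_by_position(line)
except ValueError as err:
    print("positional load failed:", err)
```

With the record from input.json, the positional version fails on the
first field because the date string is handed to the int cast, which
matches the error reported in the original message. The name-based
version returns the fields in schema order regardless of how the writer
ordered them.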
