Posted to user@pig.apache.org by B M D Gill <bm...@gmail.com> on 2011/11/13 01:08:01 UTC

How to handle optional fields in schema

I'm a newbie running Pig 0.6 on Amazon Elastic MapReduce. I need to add
additional fields to the log files that my Pig jobs run on, and I am
wondering how to handle this schema change in Pig.

My current inputs are tab-separated fields that I load using the standard
PigStorage function:

logs = LOAD '$INPUT' USING PigStorage('\t') AS (f1, f2, f3);

However, some input files will now have additional fields (f4, f5, f6,
etc.) at the trailing edge of each line. How do I set up the load function
to handle these optional fields? Do I need to change my logic to deal with
these fields possibly being empty, or will Pig simply record their value as
null if they are absent?
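
For concreteness, a minimal sketch of the two row shapes involved (the
values are invented for illustration; "<TAB>" stands for the tab delimiter):

old row:  a1<TAB>b1<TAB>c1
new row:  a2<TAB>b2<TAB>c2<TAB>d2<TAB>e2<TAB>f2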

Thanks to anyone who can share some insight.

Re: How to handle optional fields in schema

Posted by Brendan Gill <bm...@gmail.com>.
In case anyone missed it: Amazon Elastic MapReduce has updated the versions
of Hadoop and Pig (now 0.9.1) that it supports. Here's their announcement:

We are pleased to announce a number of exciting new Amazon Elastic
MapReduce features. Starting today you can run your job flows using Hadoop
0.20.205 and Pig 0.9.1. To simplify the upgrade process, we have also
introduced the concept of AMI versions. You can now provide a specific AMI
version to use at job flow launch or specify that you would like to use our
“latest” AMI, ensuring that you are always using our most up-to-date
features. The following AMI versions are now available:

   - *Version 2.0:* Hadoop 0.20.205, Hive 0.7.1, Pig 0.9.1, Debian 6.0.2
   (Squeeze)
   - *Version 1.0:* Hadoop 0.18.3 and 0.20.2, Hive 0.5 and 0.7.1, Pig 0.3
   and 0.6, Debian 5.0 (Lenny)

You can specify an AMI version when launching a job flow in the Ruby CLI
using the *--ami-version* argument (note that you will have to download the
latest version of the Ruby CLI:
http://www.amazon.com/gp/r.html?R=3OH87L9Z4GU2L&C=15RT7ZEC4KT7J&H=WHYJGBFM9T7GGNTTQASFXD8AIEKA&T=C&U=http%3A%2F%2Fdeveloper.amazonwebservices.com%2Fconnect%2Fentry.jspa%3FexternalID%3D2264):

$ ./elastic-mapreduce --create --alive --name "Test AMI Versioning" \
    --ami-version 2.0 --num-instances 5 --instance-type m1.small

Please visit the AMI Versioning section in the Elastic MapReduce Developer
Guide for more information:
http://www.amazon.com/gp/r.html?R=3OH87L9Z4GU2L&C=15RT7ZEC4KT7J&H=42GNM6QV6AWRYIHA4H0W7CDKKWOA&T=C&U=http%3A%2F%2Fdocs.amazonwebservices.com%2FElasticMapReduce%2Flatest%2FDeveloperGuide%2Findex.html%3FEnvironmentConfig_AMIVersion.html

In addition, we are excited to announce support for running job flows in an
Amazon Virtual Private Cloud (Amazon VPC), making it easier for customers
to do the following:

   - *Process sensitive data* - Launching a job flow on Amazon VPC is
   similar to launching the job flow on a private network and provides
   additional tools, such as routing tables and Network ACLs, for defining who
   has access to the network. If you are processing sensitive data in your job
   flow, you may find these additional access control tools useful.
   - *Access resources on an internal network* - If your data is located on
   a private network, it may be impractical or undesirable to regularly upload
   that data into AWS for import into Amazon Elastic MapReduce, either
   because of the volume of data or because of its sensitive nature. Now you
   can launch your job flow on an Amazon VPC and connect to your data
   center directly through a VPN connection.

You can launch Amazon Elastic MapReduce job flows into your VPC through the
Ruby CLI by using the *--subnet* argument and specifying the subnet
identifier (note that you will have to download the latest version of the
Ruby CLI:
http://www.amazon.com/gp/r.html?R=3OH87L9Z4GU2L&C=15RT7ZEC4KT7J&H=WHYJGBFM9T7GGNTTQASFXD8AIEKA&T=C&U=http%3A%2F%2Fdeveloper.amazonwebservices.com%2Fconnect%2Fentry.jspa%3FexternalID%3D2264):

$ ./elastic-mapreduce --create --alive --subnet "subnet-identifier"

Please visit the Running Job Flows on an Amazon VPC section in the Elastic
MapReduce Developer Guide for more information:
http://www.amazon.com/gp/r.html?R=3OH87L9Z4GU2L&C=15RT7ZEC4KT7J&H=DWVA367HHTEN5SKNFA7JFZ5B9DUA&T=C&U=http%3A%2F%2Fdocs.amazonwebservices.com%2FElasticMapReduce%2Flatest%2FDeveloperGuide%2Findex.html%3FEnvironmentConfig_VPC.html

Sincerely,

The Amazon Elastic MapReduce Team


Re: How to handle optional fields in schema

Posted by B M D Gill <bm...@gmail.com>.
Damn! Spent all day on this and just independently realized it was the
illustrate command screwing things up for me.  You are both spot on, thanks.




Re: How to handle optional fields in schema

Posted by Ashutosh Chauhan <ha...@apache.org>.
I bet you are doing ILLUSTRATE in your Pig script; that may be the problem.
Just do DUMP or STORE instead, and your script should work fine.
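
For example, a minimal sketch (the relation name and the '$OUTPUT' path are
placeholders for illustration, not from the original script):

logs = LOAD '$INPUT' USING PigStorage('\t') AS (f1, f2, f3, f4, f5);
-- ILLUSTRATE logs;   -- this is the statement that trips 0.6's example generator
DUMP logs;            -- print results to the console instead, or:
STORE logs INTO '$OUTPUT' USING PigStorage('\t');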

Ashutosh

Re: How to handle optional fields in schema

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
"Illustrate" is more or less non-functional in 0.6 (it had a glorious
return in 0.8).

Like I said... :-)

D


Re: How to handle optional fields in schema

Posted by B M D Gill <bm...@gmail.com>.
Thanks Dmitriy, will mention it to Amazon for sure.

That was the first thing I tried, and it didn't seem to work. Not sure
what I could be doing wrong. I get an IndexOutOfBoundsException where the
index corresponds to the first instance of the optional field. Here is the
stack trace:

Pig Stack Trace
---------------
ERROR 2999: Unexpected internal error. Index: 29, Size: 29

java.lang.IndexOutOfBoundsException: Index: 29, Size: 29
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
    at org.apache.pig.pen.util.ExampleTuple.get(ExampleTuple.java:80)
    at org.apache.pig.pen.AugmentBaseDataVisitor.visit(AugmentBaseDataVisitor.java:427)
    at org.apache.pig.impl.logicalLayer.LOLoad.visit(LOLoad.java:210)
    at org.apache.pig.impl.logicalLayer.LOLoad.visit(LOLoad.java:52)
    at org.apache.pig.pen.util.PreOrderDepthFirstWalker.depthFirst(PreOrderDepthFirstWalker.java:70)
    at org.apache.pig.pen.util.PreOrderDepthFirstWalker.depthFirst(PreOrderDepthFirstWalker.java:72)
    at org.apache.pig.pen.util.PreOrderDepthFirstWalker.walk(PreOrderDepthFirstWalker.java:55)
    at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
    at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:121)
    at org.apache.pig.PigServer.getExamples(PigServer.java:731)
    at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:557)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:246)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
    at org.apache.pig.Main.main(Main.java:374)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
================================================================================




Re: How to handle optional fields in schema

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
If you change the load statement to "load '$input' as (f1, f2, f3, f4,
f5)", then f4 and f5 will be treated as null if they are absent in the raw
logs.
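
A minimal sketch of that change (the alias names, the chararray types, and
the null handling below are illustrative assumptions, not from the original
script):

logs = LOAD '$input' USING PigStorage('\t')
       AS (f1:chararray, f2:chararray, f3:chararray,
           f4:chararray, f5:chararray);
-- Rows from older files have no f4/f5 columns, so Pig binds them to null.
-- Downstream logic can test for that explicitly, e.g. keep only new rows:
new_rows = FILTER logs BY f4 IS NOT NULL;
-- or substitute a default value via bincond:
with_defaults = FOREACH logs GENERATE f1, f2, f3,
                ((f4 IS NULL) ? 'missing' : f4) AS f4, f5;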

If you start relying on Pig heavily, lobby Amazon to upgrade their
version of Pig (or at least provide both 0.6 and 0.9.1). At this
point, 0.6 is positively ancient. But the extra field behavior worked
that way then, too.

D
