Posted to user@pig.apache.org by Marcos Medrado Rubinelli <ma...@buscape-inc.com> on 2010/10/20 16:28:13 UTC

Defining schema only after first projection: what am I doing wrong?

Hi everybody,

I'm trying to use vanilla Pig 0.7.0 to generate monthly consolidations 
of log files with relatively long lines: 95 fields and growing, of which 
I'll be using just 7. Just so I didn't have to declare all the fields in 
the LOAD command, I tried to define the schema in my first 
FOREACH...GENERATE, so the first lines of my script look like this:

input = LOAD '/tmp/test.log';
A = FILTER input BY SIZE(*) >= 95;
B = FOREACH A GENERATE (long)$94, (chararray)$93, (long)$16, (long)$27,
     (long)$23, (int)$2, (int)$3
     AS publisher, associate, site, category,
     story, hits, comments;

As you can guess by now, Pig complains while still parsing:

ERROR 1000: Error during parsing. Invalid alias: category in null

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Invalid alias: associate in null
     at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1170)
     at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
     at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
     at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:73)

Am I overlooking anything? Should I give up and declare a 95-field 
schema? Write a LOAD UDF? Or is there a simpler way to do what I want?

Thank you!
Marcos Rubinelli

Re: Defining schema only after first projection: what am I doing wrong?

Posted by Bryce Poole <br...@tynt.com>.
I believe the format of the FOREACH statement should be:

> B = FOREACH A GENERATE (long)$94 AS publisher, (chararray)$93 AS associate , (long)$16 AS site, (long)$27 AS category,
>   (long)$23 AS story, (int)$2 AS hits, (int)$3 AS comments;
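
For context, here is a sketch of how the whole script might look with one AS per generated expression. In Pig Latin, AS attaches to the single expression immediately before it, so a trailing list of names is not read as a schema for the whole GENERATE. The relation name raw is only an illustrative choice; the path, the filter, and the field positions are taken from the original post as-is, and this has not been run against 0.7.0.

```pig
-- Load without a schema; fields are then addressed by position ($0, $1, ...).
-- "raw" is an illustrative name, not one used in the original script.
raw = LOAD '/tmp/test.log';

-- Filter copied unchanged from the original post.
A = FILTER raw BY SIZE(*) >= 95;

-- Each expression carries its own cast and its own AS alias.
B = FOREACH A GENERATE
    (long)$94      AS publisher,
    (chararray)$93 AS associate,
    (long)$16      AS site,
    (long)$27      AS category,
    (long)$23      AS story,
    (int)$2        AS hits,
    (int)$3        AS comments;

-- Print the schema of B to check that the aliases and types were applied.
DESCRIBE B;
```

Running DESCRIBE right after the projection is a cheap way to confirm the names before building the monthly consolidation on top of them.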


Hope that helps,
Bryce

On Oct 21, 2010, at 8:15 PM, Renato Marroquín Mogrovejo wrote:

> Hi Marcos, just a quick question: have you checked whether or not your data
> has all the fields in all the rows? Maybe you are dealing with sparse data,
> but due to the amount of data you are not noticing it.
> First, what does your data look like? My choice would be to first try with a
> subset of the whole data, and then write my own UDF to parse, and retrieve
> just the values I want.
> 
> 
> Renato M.


Re: Defining schema only after first projection: what am I doing wrong?

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Hi Marcos, just a quick question: have you checked whether or not your data
has all the fields in all the rows? Maybe you are dealing with sparse data,
but due to the amount of data you are not noticing it.
First, what does your data look like? My choice would be to first try with a
subset of the whole data, and then write my own UDF to parse, and retrieve
just the values I want.


Renato M.
