Posted to user@pig.apache.org by Marcos Medrado Rubinelli <ma...@buscape-inc.com> on 2010/10/20 16:28:13 UTC
Defining schema only after first projection: what am I doing wrong?
Hi everybody,
I'm trying to use vanilla Pig 0.7.0 to generate monthly consolidations
of log files with relatively long lines: 95 fields and growing, of which
I'll be using just 7. Just so I didn't have to declare all the fields in
the LOAD command, I tried to define the schema in my first
FOREACH...GENERATE, so the first lines of my script look like this:
input = LOAD '/tmp/test.log';
A = FILTER input BY SIZE(*) >= 95;
B = FOREACH A GENERATE (long)$94, (chararray)$93, (long)$16, (long)$27,
(long)$23, (int)$2, (int)$3
AS publisher, associate, site, category,
story, hits, comments;
As you can guess by now, Pig complains while still parsing:
ERROR 1000: Error during parsing. Invalid alias: category in null
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Invalid alias: associate in null
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1170)
    at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
    at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:73)
Am I overlooking anything? Should I give up and declare a 95-field
schema? Write a LOAD UDF? Or is there a simpler way to do what I want?
Thank you!
Marcos Rubinelli
Re: Defining schema only after first projection: what am I doing wrong?
Posted by Bryce Poole <br...@tynt.com>.
I believe each expression in the FOREACH statement needs its own AS clause, rather than a single AS list at the end:
> B = FOREACH A GENERATE (long)$94 AS publisher, (chararray)$93 AS associate,
> (long)$16 AS site, (long)$27 AS category,
> (long)$23 AS story, (int)$2 AS hits, (int)$3 AS comments;
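Put together with the rest of the original script, that might look like the sketch below. The relation name raw is a placeholder chosen here instead of input to stay clear of any keyword clash; the path, field positions, casts, and the FILTER line are carried over unchanged from the original post. It has not been run against Pig 0.7.0, so treat it as an illustration rather than a verified fix.

```pig
-- Sketch only: 'raw' is a placeholder relation name introduced for this example.
-- No schema is declared at LOAD time, so fields are addressed by position.
raw = LOAD '/tmp/test.log';

-- Carried over as-is from the original script.
A = FILTER raw BY SIZE(*) >= 95;

-- One cast and one AS clause per projected field.
B = FOREACH A GENERATE
        (long)$94      AS publisher,
        (chararray)$93 AS associate,
        (long)$16      AS site,
        (long)$27      AS category,
        (long)$23      AS story,
        (int)$2        AS hits,
        (int)$3        AS comments;

-- Print the schema Pig inferred for B to confirm the names and types took effect.
DESCRIBE B;
```

If the aliases are accepted, DESCRIBE B should list the seven named fields with the types given in the casts.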
Hope that helps,
Bryce
On Oct 21, 2010, at 8:15 PM, Renato Marroquín Mogrovejo wrote:
> Hi Marcos, just a quick question: have you checked whether every row of
> your data has all the fields? Maybe you are dealing with sparse data, but
> with this much data you are not noticing it.
> First, what does your data look like? My choice would be to try with a
> subset of the whole data first, and then write my own UDF to parse the
> lines and retrieve just the values I want.
>
>
> Renato M.
Re: Defining schema only after first projection: what am I doing wrong?
Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Hi Marcos, just a quick question: have you checked whether every row of
your data has all the fields? Maybe you are dealing with sparse data, but
with this much data you are not noticing it.
First, what does your data look like? My choice would be to try with a
subset of the whole data first, and then write my own UDF to parse the
lines and retrieve just the values I want.
Renato M.