Posted to user@pig.apache.org by Bennie Schut <bs...@ebuddy.com> on 2009/11/13 12:02:56 UTC

pig using zebra, ClassNotFoundException on TableOutputFormat

I'm looking into improving the performance of one of my pig jobs. I
figured storing the data I keep reusing in a binary/serialized format
could help a little with this, and thus stumbled upon zebra. It seems
like a nice abstraction and appears to do exactly what I want.

I started with something simple, but it doesn't work.

register zebra-0.6.0-dev.jar;
dim_calendar = load '/user/dwh/dim/calendar.csv' using PigStorage('\t')
as (cldr_id: long, iso_date: chararray);
outfile = order dim_calendar by iso_date parallel 1;
store outfile into '/user/dwh/calendar.zebra' using
org.apache.hadoop.zebra.pig.TableStorer('cldr_id: long, iso_date:string');

On running this I get:
---------------
ERROR 2117: Unexpected error when launching map reduce job.

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias 97
        at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1003)
        at org.apache.pig.PigServer.registerQuery(PigServer.java:385)
        at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:720)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
        at org.apache.pig.Main.main(Main.java:352)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2117: Unexpected error when launching map reduce job.
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:194)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:249)
        at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:780)
        at org.apache.pig.PigServer.execute(PigServer.java:773)
        at org.apache.pig.PigServer.access$100(PigServer.java:89)
        at org.apache.pig.PigServer$Graph.execute(PigServer.java:951)
        at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:998)
        ... 7 more
Caused by: java.lang.RuntimeException: Could not resolve error that occured when launching map reduce job: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.zebra.pig.TableOutputFormat
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:428)
        at java.lang.Thread.dispatchUncaughtException(Thread.java:1831)
-----

Any idea why?
TableOutputFormat is an inner class of TableStorer, so I'm a little
puzzled how it could find one but not the other.
FYI: I'm using hadoop-0.20.1 and pig/zebra from trunk, but I haven't
updated pig in a few weeks.

Thanks,
Bennie.

zebra adding records.

Posted by Bennie Schut <bs...@ebuddy.com>.
Anyone know of a way to add records to an existing zebra file?

I tried this:
new BasicTable.Writer(new Path(file), config);

But received this error:
java.io.IOException: ColumnGroup.Writer failed : Index meta file already
exists: newvalues/dim/users.csv.zebra/CG0/.meta

From the code this seems to be correct behavior: "If path exists and
contains what look like a complete Column Group, ColumnGroupExists
exception will be thrown."
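
The only workaround I see at the Pig level is to rewrite the whole
table: load the existing table and the new records, union them, and
store the result to a fresh path. A minimal sketch (untested; the
paths, projection, and schema here are made up for illustration, and
I believe an empty storage hint just means default storage):

register zebra-0.6.0-dev.jar;
existing = load '/user/dwh/users.zebra' using
org.apache.hadoop.zebra.pig.TableLoader('user_id, name');
additions = load '/user/dwh/new_users.csv' using PigStorage('\t')
as (user_id: long, name: chararray);
combined = union existing, additions;
store combined into '/user/dwh/users_v2.zebra' using
org.apache.hadoop.zebra.pig.TableStorer('');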



Re: zebra: unknown compressor none

Posted by Ashutosh Chauhan <as...@gmail.com>.
Hi Bennie,

So you are using Zebra for its out-of-the-box serialization and
compression support. Thanks for the explanation.

Ashutosh
On Wed, Nov 18, 2009 at 10:43, Bennie Schut <bs...@ebuddy.com> wrote:
> Hi Ashutosh,
>
> There are only 2 columns in both the original file and the zebra file,
> and this is how I use it:
>
> The screenname file contains 2 fields, a number and a string, and is 14G
> in size; after transforming it into zebra it is 10G, internally split
> into 80 files.
> The chatsession file contains many fields, both numeric and string, and
> is 155M in size.
>
> register zebra-0.6.0-dev.jar;
> dim1258375560540 = load '/user/dwh/screenname_lzo_80.zebra' using
> org.apache.hadoop.zebra.pig.TableLoader('screenname_id, code');
> fact1258375560540 = load
> '/user/dwh/chatsessions/chatsessionsmap/output/chatsessions_1258534806969_0.csv'
> using PigStorage('\t') as (session_hash: chararray, email: chararray,
> refer_url: chararray, version: chararray, protocol: chararray,
> logintype: chararray, frontendversion: chararray, remote_ip: chararray,
> country: chararray, server_id, login_date, login_time, success,
> end_date, end_time, msg_sent, avg_msg_sent_size, msg_rcv,
> avg_msg_rcv_size, num_contacts, num_groups, num_sessions, secure_login,
> timeout, has_picture, screenname: chararray, useragent :chararray,
> error_code, masterlogin: chararray, unused :int);
> tmp1258375560540 = cogroup fact1258375560540 by screenname inner,
> dim1258375560540 by code outer PARALLEL 10;
> tmp12583755605401 = filter tmp1258375560540 by IsEmpty( dim1258375560540);
> tmp12583755605402 = foreach tmp12583755605401 generate
> flatten(fact1258375560540.screenname);
> tmp12583755605403 = distinct tmp12583755605402 PARALLEL 4;
> dump tmp12583755605403;
>
>
> It's basically trying to see if there are new values for screenname in
> the chatsessions file which are not in the screenname file.
> In SQL it would be something like:
> select l.screenname
> from etl.chatsessions l
>  left join etl.screenname sn on (sn.screenname = l.screenname)
> where sn.screenname is null;
>
> In SQL the screenname_id field is numeric, so it's only a couple of
> bytes per record, but in the plain text file it takes many bytes per
> record. I guess that's what the whole types branch was trying to solve,
> at least internally; on HDFS the input and output are still many bytes.
> I was looking for a way to serialize the text file so these numbers
> would only be a few bytes, and then found zebra, which pretty much does
> this for you.
> My hunch was that reducing the size would gain a little performance
> simply because of copy speed.
> You would probably get similar results by manually combining
> serialization and compression, but that's a lot of work.
>
> I'm still going to try to produce a zebra file that gets the same number
> of mappers as the original text file, to make sure the speed difference
> isn't caused by more work being done per mapper.
>

Re: zebra: unknown compressor none

Posted by Bennie Schut <bs...@ebuddy.com>.
Hi Ashutosh,

There are only 2 columns in both the original file and the zebra file,
and this is how I use it:

The screenname file contains 2 fields, a number and a string, and is 14G
in size; after transforming it into zebra it is 10G, internally split
into 80 files.
The chatsession file contains many fields, both numeric and string, and
is 155M in size.

register zebra-0.6.0-dev.jar;
dim1258375560540 = load '/user/dwh/screenname_lzo_80.zebra' using
org.apache.hadoop.zebra.pig.TableLoader('screenname_id, code');
fact1258375560540 = load
'/user/dwh/chatsessions/chatsessionsmap/output/chatsessions_1258534806969_0.csv'
using PigStorage('\t') as (session_hash: chararray, email: chararray,
refer_url: chararray, version: chararray, protocol: chararray,
logintype: chararray, frontendversion: chararray, remote_ip: chararray,
country: chararray, server_id, login_date, login_time, success,
end_date, end_time, msg_sent, avg_msg_sent_size, msg_rcv,
avg_msg_rcv_size, num_contacts, num_groups, num_sessions, secure_login,
timeout, has_picture, screenname: chararray, useragent :chararray,
error_code, masterlogin: chararray, unused :int);
tmp1258375560540 = cogroup fact1258375560540 by screenname inner,
dim1258375560540 by code outer PARALLEL 10;
tmp12583755605401 = filter tmp1258375560540 by IsEmpty( dim1258375560540);
tmp12583755605402 = foreach tmp12583755605401 generate
flatten(fact1258375560540.screenname);
tmp12583755605403 = distinct tmp12583755605402 PARALLEL 4;
dump tmp12583755605403;


It's basically trying to see if there are new values for screenname in
the chatsessions file which are not in the screenname file.
In SQL it would be something like:
select l.screenname
from etl.chatsessions l
  left join etl.screenname sn on (sn.screenname = l.screenname)
where sn.screenname is null;

In SQL the screenname_id field is numeric, so it's only a couple of
bytes per record, but in the plain text file it takes many bytes per
record. I guess that's what the whole types branch was trying to solve,
at least internally; on HDFS the input and output are still many bytes.
I was looking for a way to serialize the text file so these numbers
would only be a few bytes, and then found zebra, which pretty much does
this for you.
My hunch was that reducing the size would gain a little performance
simply because of copy speed.
You would probably get similar results by manually combining
serialization and compression, but that's a lot of work.

I'm still going to try to produce a zebra file that gets the same number
of mappers as the original text file, to make sure the speed difference
isn't caused by more work being done per mapper.
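
For reference, the store that produced the 80-way lzo table was along
these lines (a sketch from memory; the input path is illustrative, the
rest matches what I described above):

register zebra-0.6.0-dev.jar;
raw = load '/user/dwh/screenname.csv' using PigStorage('\t')
as (screenname_id: long, code: chararray);
sorted = order raw by code parallel 80;
store sorted into '/user/dwh/screenname_lzo_80.zebra' using
org.apache.hadoop.zebra.pig.TableStorer('compress by lzo');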


Ashutosh Chauhan wrote:
> On Wed, Nov 18, 2009 at 08:32, Bennie Schut <bs...@ebuddy.com> wrote:
>
>   
>> Using zebra this way gives me a 27% speed improvement over using plain
>>     
>
> Interesting! Can you add a bit more detail here? A 27% speedup just
> because you are storing and loading your data through Zebra's table
> loader instead of using PigStorage? If so, is it because you have wide
> rows and you are only loading a couple of columns out of the many in
> your dataset?
>
> Thanks,
> Ashutosh
>   


Re: zebra: unknown compressor none

Posted by Ashutosh Chauhan <as...@gmail.com>.
On Wed, Nov 18, 2009 at 08:32, Bennie Schut <bs...@ebuddy.com> wrote:

> Using zebra this way gives me a 27% speed improvement over using plain

Interesting! Can you add a bit more detail here? A 27% speedup just
because you are storing and loading your data through Zebra's table
loader instead of using PigStorage? If so, is it because you have wide
rows and you are only loading a couple of columns out of the many in
your dataset?

Thanks,
Ashutosh

RE: zebra: unknown compressor none

Posted by Yan Zhou <ya...@yahoo-inc.com>.
For a Zebra unsorted table, the number of mappers is limited by the
number of tfiles per column group, because the input splits are
generated per tfile, not within tfiles. An improvement in the form of
block-based splits within a tfile is coming soon
(https://issues.apache.org/jira/browse/PIG-1077).

Regarding compression, yes, you were right that zebra does not support
"none" as a compression method.

Yan

------ Forwarded Message
From: Bennie Schut <bs...@ebuddy.com>
Reply-To: "pig-user@hadoop.apache.org" <pi...@hadoop.apache.org>
Date: Wed, 18 Nov 2009 14:32:33 +0100
To: <pi...@hadoop.apache.org>
Subject: Re: zebra: unknown compressor none

Just for anyone else who might run into this problem in the future:
I found a bit of a workaround. When you generate the zebra file/dir, just
make sure you use something like "order x by y parallel 20;" before you
do the store; the zebra structure will then contain 20 files, and any
jobs using it can use at least 20 mappers.
It's not perfect though, so if someone finds another way please let me
know.
Using zebra this way gives me a 27% speed improvement over plain text
tab-delimited files, so even with the hack I'm happy :)

Bennie Schut wrote:
> Hi all,
>
> I still can't get pig to use multiple mappers when using zebra. I tried
> using lzo hoping it would help, but sadly no. The file is 14G as
> tab-delimited plain text; with zebra it is 7G with gz and 10G with lzo.
> When I use the tab-delimited file I get 216 mappers, but with zebra
> just 2, of which 1 finishes almost instantly and the other runs for
> hours. Any idea why it's not using more mappers?
>
> As an example of what I'm trying to do:
> dim1258375560540 = load '/user/dwh/screenname2.zebra' using
> org.apache.hadoop.zebra.pig.TableLoader('screenname_id, code');
> fact1258375560540 = load
> '/user/bennies/newvalues//chatsessions_1238624404177_small.csv' using
> PigStorage('\t') as (session_hash: chararray, email: chararray,
> screenname: chararray);
> tmp1258375560540 = cogroup fact1258375560540 by screenname inner,
> dim1258375560540 by code outer PARALLEL 4;
> dump tmp1258375560540;
>
>
> Thanks,
> Bennie
>
> Bennie Schut wrote:
>   
>> Another zebra-related question.
>>
>> I couldn't find a lot of documentation on zebra, but I figured out that
>> you can change the compression codec with a syntax like this:
>> store outfile into '/user/dwh/screenname2.zebra' using
>> org.apache.hadoop.zebra.pig.TableStorer('compress by lzo');
>>
>> And, in theory, disable compression like this:
>> store outfile into '/user/dwh/screenname3.zebra' using
>> org.apache.hadoop.zebra.pig.TableStorer('compress by none');
>>
>> But it doesn't seem to understand "none" as a compressor:
>> java.io.IOException: ColumnGroup.Writer constructor failed : Partition
>> constructor failed :Encountered " <IDENTIFIER> "none "" at line 1,
>> column 13.
>> Was expecting:
>>     <COMPRESSOR> ...
>>
>>         at org.apache.hadoop.zebra.io.BasicTable$Writer.<init>(BasicTable.java:1116)
>>         at org.apache.hadoop.zebra.pig.TableOutputFormat.checkOutputSpecs(TableStorer.java:154)
>>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
>>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>         at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
>>         at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
>>         at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
>>         at java.lang.Thread.run(Thread.java:619)
>>
>>
>>
>> I actually tried this because when I use the zebra result in further
>> processing it only uses 2 mappers instead of the 230 mappers on the
>> original file. I remember hadoop cannot split gz files, so I figured
>> compression might be causing it to use so few mappers. Does anyone
>> perhaps know a different approach to this?
>>
>> Thanks,
>> Bennie.
>>
>>   
>>     
>
>   


------ End of Forwarded Message


Re: zebra: unknown compressor none

Posted by Bennie Schut <bs...@ebuddy.com>.
Just for anyone else who might run into this problem in the future:
I found a bit of a workaround. When you generate the zebra file/dir, just
make sure you use something like "order x by y parallel 20;" before you
do the store; the zebra structure will then contain 20 files, and any
jobs using it can use at least 20 mappers.
It's not perfect though, so if someone finds another way please let me
know.
Using zebra this way gives me a 27% speed improvement over plain text
tab-delimited files, so even with the hack I'm happy :)
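
Concretely, the pattern when writing the table is (a sketch; the input
path and schema are made up, and I believe the empty storage hint just
means default storage):

register zebra-0.6.0-dev.jar;
x = load '/user/dwh/some_input.csv' using PigStorage('\t')
as (id: long, value: chararray);
y = order x by value parallel 20; -- 20 reducers -> 20 files in the table
store y into '/user/dwh/some_table.zebra' using
org.apache.hadoop.zebra.pig.TableStorer('');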

Bennie Schut wrote:
> Hi all,
>
> I still can't get pig to use multiple mappers when using zebra. I tried
> using lzo hoping it would help, but sadly no. The file is 14G as
> tab-delimited plain text; with zebra it is 7G with gz and 10G with lzo.
> When I use the tab-delimited file I get 216 mappers, but with zebra
> just 2, of which 1 finishes almost instantly and the other runs for
> hours. Any idea why it's not using more mappers?
>
> As an example of what I'm trying to do:
> dim1258375560540 = load '/user/dwh/screenname2.zebra' using
> org.apache.hadoop.zebra.pig.TableLoader('screenname_id, code');
> fact1258375560540 = load
> '/user/bennies/newvalues//chatsessions_1238624404177_small.csv' using
> PigStorage('\t') as (session_hash: chararray, email: chararray,
> screenname: chararray);
> tmp1258375560540 = cogroup fact1258375560540 by screenname inner,
> dim1258375560540 by code outer PARALLEL 4;
> dump tmp1258375560540;
>
>
> Thanks,
> Bennie
>
> Bennie Schut wrote:
>   
>> Another zebra-related question.
>>
>> I couldn't find a lot of documentation on zebra, but I figured out that
>> you can change the compression codec with a syntax like this:
>> store outfile into '/user/dwh/screenname2.zebra' using
>> org.apache.hadoop.zebra.pig.TableStorer('compress by lzo');
>>
>> And, in theory, disable compression like this:
>> store outfile into '/user/dwh/screenname3.zebra' using
>> org.apache.hadoop.zebra.pig.TableStorer('compress by none');
>>
>> But it doesn't seem to understand "none" as a compressor:
>> java.io.IOException: ColumnGroup.Writer constructor failed : Partition
>> constructor failed :Encountered " <IDENTIFIER> "none "" at line 1,
>> column 13.
>> Was expecting:
>>     <COMPRESSOR> ...
>>
>>         at org.apache.hadoop.zebra.io.BasicTable$Writer.<init>(BasicTable.java:1116)
>>         at org.apache.hadoop.zebra.pig.TableOutputFormat.checkOutputSpecs(TableStorer.java:154)
>>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
>>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>         at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
>>         at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
>>         at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
>>         at java.lang.Thread.run(Thread.java:619)
>>
>>
>>
>> I actually tried this because when I use the zebra result in further
>> processing it only uses 2 mappers instead of the 230 mappers on the
>> original file. I remember hadoop cannot split gz files, so I figured
>> compression might be causing it to use so few mappers. Does anyone
>> perhaps know a different approach to this?
>>
>> Thanks,
>> Bennie.
>>
>>   
>>     
>
>   


Re: zebra: unknown compressor none

Posted by Bennie Schut <bs...@ebuddy.com>.
Hi all,

I still can't get pig to use multiple mappers when using zebra. I tried
using lzo hoping it would help, but sadly no. The file is 14G as
tab-delimited plain text; with zebra it is 7G with gz and 10G with lzo.
When I use the tab-delimited file I get 216 mappers, but with zebra
just 2, of which 1 finishes almost instantly and the other runs for
hours. Any idea why it's not using more mappers?

As an example of what I'm trying to do:
dim1258375560540 = load '/user/dwh/screenname2.zebra' using
org.apache.hadoop.zebra.pig.TableLoader('screenname_id, code');
fact1258375560540 = load
'/user/bennies/newvalues//chatsessions_1238624404177_small.csv' using
PigStorage('\t') as (session_hash: chararray, email: chararray,
screenname: chararray);
tmp1258375560540 = cogroup fact1258375560540 by screenname inner,
dim1258375560540 by code outer PARALLEL 4;
dump tmp1258375560540;


Thanks,
Bennie

Bennie Schut wrote:
> Another zebra-related question.
>
> I couldn't find a lot of documentation on zebra, but I figured out that
> you can change the compression codec with a syntax like this:
> store outfile into '/user/dwh/screenname2.zebra' using
> org.apache.hadoop.zebra.pig.TableStorer('compress by lzo');
>
> And, in theory, disable compression like this:
> store outfile into '/user/dwh/screenname3.zebra' using
> org.apache.hadoop.zebra.pig.TableStorer('compress by none');
>
> But it doesn't seem to understand "none" as a compressor:
> java.io.IOException: ColumnGroup.Writer constructor failed : Partition
> constructor failed :Encountered " <IDENTIFIER> "none "" at line 1,
> column 13.
> Was expecting:
>     <COMPRESSOR> ...
>
>         at org.apache.hadoop.zebra.io.BasicTable$Writer.<init>(BasicTable.java:1116)
>         at org.apache.hadoop.zebra.pig.TableOutputFormat.checkOutputSpecs(TableStorer.java:154)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>         at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
>         at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
>         at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
>         at java.lang.Thread.run(Thread.java:619)
>
>
>
> I actually tried this because when I use the zebra result in further
> processing it only uses 2 mappers instead of the 230 mappers on the
> original file. I remember hadoop cannot split gz files, so I figured
> compression might be causing it to use so few mappers. Does anyone
> perhaps know a different approach to this?
>
> Thanks,
> Bennie.
>
>   


zebra: unknown compressor none

Posted by Bennie Schut <bs...@ebuddy.com>.
Another zebra-related question.

I couldn't find a lot of documentation on zebra, but I figured out that
you can change the compression codec with a syntax like this:
store outfile into '/user/dwh/screenname2.zebra' using
org.apache.hadoop.zebra.pig.TableStorer('compress by lzo');

And, in theory, disable compression like this:
store outfile into '/user/dwh/screenname3.zebra' using
org.apache.hadoop.zebra.pig.TableStorer('compress by none');

But it doesn't seem to understand "none" as a compressor:
java.io.IOException: ColumnGroup.Writer constructor failed : Partition
constructor failed :Encountered " <IDENTIFIER> "none "" at line 1,
column 13.
Was expecting:
    <COMPRESSOR> ...

        at org.apache.hadoop.zebra.io.BasicTable$Writer.<init>(BasicTable.java:1116)
        at org.apache.hadoop.zebra.pig.TableOutputFormat.checkOutputSpecs(TableStorer.java:154)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
        at java.lang.Thread.run(Thread.java:619)



I actually tried this because when I use the zebra result in further
processing it only uses 2 mappers instead of the 230 mappers on the
original file. I remember hadoop cannot split gz files, so I figured
compression might be causing it to use so few mappers. Does anyone
perhaps know a different approach to this?

Thanks,
Bennie.


Re: pig using zebra, ClassNotFoundException on TableOutputFormat

Posted by Bennie Schut <bs...@ebuddy.com>.
Ah, thanks. Working like a charm now. Now I can play with the
TableInserter part.

Santhosh Srinivasan wrote:
> Bennie,
>
> Include zebra-0.6.0-dev.jar in your classpath and then relaunch pig.
>
> Santhosh 
>
> -----Original Message-----
> From: Bennie Schut [mailto:bschut@ebuddy.com] 
> Sent: Friday, November 13, 2009 3:03 AM
> To: pig-user@hadoop.apache.org
> Subject: pig using zebra, ClassNotFoundException on TableOutputFormat
>
> I'm looking into improving the performance of one of my pig jobs. I
> figured storing the data I keep reusing in a binary/serialized format
> could help a little with this, and thus stumbled upon zebra. It seems
> like a nice abstraction and appears to do exactly what I want.
>
> I started with something simple, but it doesn't work.
>
> register zebra-0.6.0-dev.jar;
> dim_calendar = load '/user/dwh/dim/calendar.csv' using PigStorage('\t')
> as (cldr_id: long, iso_date: chararray);
> outfile = order dim_calendar by iso_date parallel 1;
> store outfile into '/user/dwh/calendar.zebra' using
> org.apache.hadoop.zebra.pig.TableStorer('cldr_id: long, iso_date:string');
>
> On running this I get:
> ---------------
> ERROR 2117: Unexpected error when launching map reduce job.
>
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias 97
>         at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1003)
>         at org.apache.pig.PigServer.registerQuery(PigServer.java:385)
>         at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:720)
>         at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
>         at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
>         at org.apache.pig.Main.main(Main.java:352)
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2117: Unexpected error when launching map reduce job.
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:194)
>         at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:249)
>         at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:780)
>         at org.apache.pig.PigServer.execute(PigServer.java:773)
>         at org.apache.pig.PigServer.access$100(PigServer.java:89)
>         at org.apache.pig.PigServer$Graph.execute(PigServer.java:951)
>         at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:998)
>         ... 7 more
> Caused by: java.lang.RuntimeException: Could not resolve error that occured when launching map reduce job: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.zebra.pig.TableOutputFormat
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:428)
>         at java.lang.Thread.dispatchUncaughtException(Thread.java:1831)
> -----
>
> Any idea why?
> TableOutputFormat is an inner class of TableStorer, so I'm a little
> puzzled how it could find one but not the other.
> FYI: I'm using hadoop-0.20.1 and pig/zebra from trunk, but I haven't
> updated pig in a few weeks.
>
> Thanks,
> Bennie.
>   


RE: pig using zebra, ClassNotFoundException on TableOutputFormat

Posted by Santhosh Srinivasan <sm...@yahoo-inc.com>.
Bennie,

Include zebra-0.6.0-dev.jar in your classpath and then relaunch pig.
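
(A note on why register alone isn't enough: register makes the jar
available to the map-reduce tasks, but judging from your stack trace
the ClassNotFoundException is raised on the client side while the job
is being submitted, so the Pig client process itself presumably needs
the jar on its classpath as well.)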

Santhosh 

-----Original Message-----
From: Bennie Schut [mailto:bschut@ebuddy.com] 
Sent: Friday, November 13, 2009 3:03 AM
To: pig-user@hadoop.apache.org
Subject: pig using zebra, ClassNotFoundException on TableOutputFormat

I'm looking into improving the performance of one of my pig jobs. I
figured storing the data I keep reusing in a binary/serialized format
could help a little with this, and thus stumbled upon zebra. It seems
like a nice abstraction and appears to do exactly what I want.

I started with something simple, but it doesn't work.

register zebra-0.6.0-dev.jar;
dim_calendar = load '/user/dwh/dim/calendar.csv' using PigStorage('\t')
as (cldr_id: long, iso_date: chararray);
outfile = order dim_calendar by iso_date parallel 1;
store outfile into '/user/dwh/calendar.zebra' using
org.apache.hadoop.zebra.pig.TableStorer('cldr_id: long, iso_date:string');

On running this I get:
---------------
ERROR 2117: Unexpected error when launching map reduce job.

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias 97
        at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1003)
        at org.apache.pig.PigServer.registerQuery(PigServer.java:385)
        at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:720)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
        at org.apache.pig.Main.main(Main.java:352)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2117: Unexpected error when launching map reduce job.
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:194)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:249)
        at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:780)
        at org.apache.pig.PigServer.execute(PigServer.java:773)
        at org.apache.pig.PigServer.access$100(PigServer.java:89)
        at org.apache.pig.PigServer$Graph.execute(PigServer.java:951)
        at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:998)
        ... 7 more
Caused by: java.lang.RuntimeException: Could not resolve error that occured when launching map reduce job: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.zebra.pig.TableOutputFormat
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:428)
        at java.lang.Thread.dispatchUncaughtException(Thread.java:1831)
-----

Any idea why?
TableOutputFormat is an inner class of TableStorer, so I'm a little
puzzled how it could find one but not the other.
FYI: I'm using hadoop-0.20.1 and pig/zebra from trunk, but I haven't
updated pig in a few weeks.

Thanks,
Bennie.