Posted to user@pig.apache.org by Bennie Schut <bs...@ebuddy.com> on 2009/11/17 11:03:05 UTC

zebra: unknown compressor none

Another zebra related question.

I couldn't find a lot of documentation on zebra, but I figured you can
change the compression codec with syntax like this:
store outfile into '/user/dwh/screenname2.zebra' using
org.apache.hadoop.zebra.pig.TableStorer('compress by lzo');

And in theory disable compression like this:
store outfile into '/user/dwh/screenname3.zebra' using
org.apache.hadoop.zebra.pig.TableStorer('compress by none');

But it doesn't seem to understand "none" as a compressor:
java.io.IOException: ColumnGroup.Writer constructor failed : Partition
constructor failed :Encountered " <IDENTIFIER> "none "" at line 1,
column 13.
Was expecting:
    <COMPRESSOR> ...

        at org.apache.hadoop.zebra.io.BasicTable$Writer.<init>(BasicTable.java:1116)
        at org.apache.hadoop.zebra.pig.TableOutputFormat.checkOutputSpecs(TableStorer.java:154)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
        at java.lang.Thread.run(Thread.java:619)



I actually tried this because when I use the zebra result for further
processing it only uses 2 mappers instead of the 230 mappers on the
original file. I remember hadoop cannot split gz files, so I figured
using compression might be causing it to use so few mappers. Does anyone
perhaps know a different approach to this?

Thanks,
Bennie.


RE: zebra: unknown compressor none

Posted by Yan Zhou <ya...@yahoo-inc.com>.
For a Zebra unsorted table, the number of mappers is limited by the
number of tfiles per column group, because the input splits are generated
based upon tfiles, not within tfiles. But an improvement in the form of
block-based splits within a tfile is coming soon
(https://issues.apache.org/jira/browse/PIG-1077).

Regarding compression, yes, you were right that zebra does not support
"none" as a compression method.

Yan

------ Forwarded Message
From: Bennie Schut <bs...@ebuddy.com>
Reply-To: "pig-user@hadoop.apache.org" <pi...@hadoop.apache.org>
Date: Wed, 18 Nov 2009 14:32:33 +0100
To: <pi...@hadoop.apache.org>
Subject: Re: zebra: unknown compressor none

Just for anyone else who might have this problem in the future:
I found a bit of a workaround. When you generate the zebra file/dir, just
make sure you use something like "order x by y parallel 20;" before
you do the store, so the zebra structure will have 20 files and any
jobs using this file can then use at least 20 mappers.
It's not perfect though, so if someone finds another way please let me
know.
Using zebra this way gives me a 27% speed improvement over using plain
text tab delimited files, so even with the hack I'm happy :)
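
A minimal sketch of the workaround (the path, field names and the lzo
codec are placeholders, not the exact production script):

register zebra-0.6.0-dev.jar;
raw = load '/user/dwh/screenname.csv' using PigStorage('\t')
    as (screenname_id: long, code: chararray);
-- parallel 20 forces 20 reducers, so the store below writes 20 files
-- per column group, and later jobs can read them with 20 mappers
sorted = order raw by code parallel 20;
store sorted into '/user/dwh/screenname20.zebra' using
org.apache.hadoop.zebra.pig.TableStorer('compress by lzo');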

Bennie Schut wrote:
> Hi all,
>
> I still can't get pig to use multiple mappers when using zebra. I tried
> using lzo hoping it would help, but sadly no. The file is 14G as tab
> delimited plain text, 7G as zebra with gz, and 10G as zebra with lzo.
> When I use the tab delimited file I get 216 mappers, but with zebra just
> 2 mappers, of which 1 is done almost instantly and the other runs
> for hours. Any idea why it's not using more mappers?
>
> As an example of what I'm trying to do:
> dim1258375560540 = load '/user/dwh/screenname2.zebra' using
> org.apache.hadoop.zebra.pig.TableLoader('screenname_id, code');
> fact1258375560540 = load
> '/user/bennies/newvalues//chatsessions_1238624404177_small.csv' using
> PigStorage('\t') as (session_hash: chararray, email: chararray,
> screenname: chararray);
> tmp1258375560540 = cogroup fact1258375560540 by screenname inner,
> dim1258375560540 by code outer PARALLEL 4;
> dump tmp1258375560540;
>
>
> Thanks,
> Bennie
>
> Bennie Schut wrote:
>   
>> Another zebra related question.
>>
>> I couldn't find a lot of documentation on zebra, but I figured you can
>> change the compression codec with syntax like this:
>> store outfile into '/user/dwh/screenname2.zebra' using
>> org.apache.hadoop.zebra.pig.TableStorer('compress by lzo');
>>
>> And in theory disable compression like this:
>> store outfile into '/user/dwh/screenname3.zebra' using
>> org.apache.hadoop.zebra.pig.TableStorer('compress by none');
>>
>> But it doesn't seem to understand "none" as a compressor:
>> java.io.IOException: ColumnGroup.Writer constructor failed : Partition
>> constructor failed :Encountered " <IDENTIFIER> "none "" at line 1,
>> column 13.
>> Was expecting:
>>     <COMPRESSOR> ...
>>
>>         at org.apache.hadoop.zebra.io.BasicTable$Writer.<init>(BasicTable.java:1116)
>>         at org.apache.hadoop.zebra.pig.TableOutputFormat.checkOutputSpecs(TableStorer.java:154)
>>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
>>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>         at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
>>         at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
>>         at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
>>         at java.lang.Thread.run(Thread.java:619)
>>
>>
>>
>> I actually tried this because when I use the zebra result for further
>> processing it only uses 2 mappers instead of the 230 mappers on the
>> original file. I remember hadoop cannot split gz files, so I figured
>> using compression might be causing it to use so few mappers. Does anyone
>> perhaps know a different approach to this?
>>
>> Thanks,
>> Bennie.
>>
>>   
>>     
>
>   


------ End of Forwarded Message


zebra adding records.

Posted by Bennie Schut <bs...@ebuddy.com>.
Anyone know of a way to add records to an existing zebra file?

I tried this:
new BasicTable.Writer(new Path(file), config);

But received this error:
java.io.IOException: ColumnGroup.Writer failed : Index meta file already
exists: newvalues/dim/users.csv.zebra/CG0/.meta

From the code this seems to be correct behavior: "If path exists and
contains what look like a complete Column Group, ColumnGroupExists
exception will be thrown."
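
One possible workaround (an assumption on my part, not something the
Zebra API offers): write new records to a second table and union the two
at load time. A sketch with placeholder paths and columns:

register zebra-0.6.0-dev.jar;
base = load '/user/dwh/users.zebra' using
org.apache.hadoop.zebra.pig.TableLoader('user_id, name');
delta = load '/user/dwh/users_delta.zebra' using
org.apache.hadoop.zebra.pig.TableLoader('user_id, name');
combined = union base, delta;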



Re: zebra: unknown compressor none

Posted by Ashutosh Chauhan <as...@gmail.com>.
Hi Bennie,

So you are using Zebra for its out-of-the-box serialization and
compression support. Thanks for the explanation.

Ashutosh
On Wed, Nov 18, 2009 at 10:43, Bennie Schut <bs...@ebuddy.com> wrote:
> Hi Ashutosh,
>
> There are only 2 columns in the original file and in the zebra file, and
> this is how I use it:
>
> The screenname file contains 2 fields, a number and a string, and is 14G
> in size; after transforming it into zebra it is 10G, internally split
> into 80 files.
> The chatsession file contains many fields, both numeric and string, and
> is 155M in size.
>
> register zebra-0.6.0-dev.jar;
> dim1258375560540 = load '/user/dwh/screenname_lzo_80.zebra' using
> org.apache.hadoop.zebra.pig.TableLoader('screenname_id, code');
> fact1258375560540 = load
> '/user/dwh/chatsessions/chatsessionsmap/output/chatsessions_1258534806969_0.csv'
> using PigStorage('\t') as (session_hash: chararray, email: chararray,
> refer_url: chararray, version: chararray, protocol: chararray,
> logintype: chararray, frontendversion: chararray, remote_ip: chararray,
> country: chararray, server_id, login_date, login_time, success,
> end_date, end_time, msg_sent, avg_msg_sent_size, msg_rcv,
> avg_msg_rcv_size, num_contacts, num_groups, num_sessions, secure_login,
> timeout, has_picture, screenname: chararray, useragent :chararray,
> error_code, masterlogin: chararray, unused :int);
> tmp1258375560540 = cogroup fact1258375560540 by screenname inner,
> dim1258375560540 by code outer PARALLEL 10;
> tmp12583755605401 = filter tmp1258375560540 by IsEmpty( dim1258375560540);
> tmp12583755605402 = foreach tmp12583755605401 generate
> flatten(fact1258375560540.screenname);
> tmp12583755605403 = distinct tmp12583755605402 PARALLEL 4;
> dump tmp12583755605403;
>
>
> It's basically trying to see if there are new values for screenname in
> the chatsessions file which are not in the screenname file.
> In sql it would be something like:
> select l.screenname
> from etl.chatsessions l
>  left join etl.screenname sn on (sn.screenname = l.screenname)
> where sn.screenname is null;
>
> In sql the screenname_id field is numeric, so it's only a couple of
> bytes per record, but in the plain text file it's many bytes per record.
> I guess that's what the whole types branch was trying to solve, at least
> internally; on hdfs the input and output are still many bytes.
> I was looking for a way to serialize the text file so these numbers
> would only be a few bytes, and then found zebra, which pretty much does
> this for you.
> My hunch was that reducing the size would gain a little performance
> simply because of copy speed.
> You would probably get a similar result if you manually used
> serialization+compression, but that's a lot of work.
>
> I'm still going to try to produce a zebra file that results in the same
> number of mappers as the original text file, to make sure the speed
> difference isn't caused by more work being done per mapper.
>

Re: zebra: unknown compressor none

Posted by Bennie Schut <bs...@ebuddy.com>.
Hi Ashutosh,

There are only 2 columns in the original file and in the zebra file, and
this is how I use it:

The screenname file contains 2 fields, a number and a string, and is 14G
in size; after transforming it into zebra it is 10G, internally split
into 80 files.
The chatsession file contains many fields, both numeric and string, and
is 155M in size.

register zebra-0.6.0-dev.jar;
dim1258375560540 = load '/user/dwh/screenname_lzo_80.zebra' using
org.apache.hadoop.zebra.pig.TableLoader('screenname_id, code');
fact1258375560540 = load
'/user/dwh/chatsessions/chatsessionsmap/output/chatsessions_1258534806969_0.csv'
using PigStorage('\t') as (session_hash: chararray, email: chararray,
refer_url: chararray, version: chararray, protocol: chararray,
logintype: chararray, frontendversion: chararray, remote_ip: chararray,
country: chararray, server_id, login_date, login_time, success,
end_date, end_time, msg_sent, avg_msg_sent_size, msg_rcv,
avg_msg_rcv_size, num_contacts, num_groups, num_sessions, secure_login,
timeout, has_picture, screenname: chararray, useragent :chararray,
error_code, masterlogin: chararray, unused :int);
tmp1258375560540 = cogroup fact1258375560540 by screenname inner,
dim1258375560540 by code outer PARALLEL 10;
tmp12583755605401 = filter tmp1258375560540 by IsEmpty( dim1258375560540);
tmp12583755605402 = foreach tmp12583755605401 generate
flatten(fact1258375560540.screenname);
tmp12583755605403 = distinct tmp12583755605402 PARALLEL 4;
dump tmp12583755605403;


It's basically trying to see if there are new values for screenname in
the chatsessions file which are not in the screenname file.
In sql it would be something like:
select l.screenname
from etl.chatsessions l
  left join etl.screenname sn on (sn.screenname = l.screenname)
where sn.screenname is null;
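
For comparison, the same anti-join can also be written in Pig with an
outer join instead of the cogroup above (an untested sketch, assuming
your Pig build supports outer joins):

joined = join fact1258375560540 by screenname left outer,
    dim1258375560540 by code PARALLEL 10;
-- rows with no match on the dim side come back with nulls
missing = filter joined by dim1258375560540::code is null;
names = foreach missing generate fact1258375560540::screenname;
newnames = distinct names PARALLEL 4;
dump newnames;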

In sql the screenname_id field is numeric, so it's only a couple of
bytes per record, but in the plain text file it's many bytes per record.
I guess that's what the whole types branch was trying to solve, at least
internally; on hdfs the input and output are still many bytes.
I was looking for a way to serialize the text file so these numbers
would only be a few bytes, and then found zebra, which pretty much does
this for you.
My hunch was that reducing the size would gain a little performance
simply because of copy speed.
You would probably get a similar result if you manually used
serialization+compression, but that's a lot of work.

I'm still going to try to produce a zebra file that results in the same
number of mappers as the original text file, to make sure the speed
difference isn't caused by more work being done per mapper.


Ashutosh Chauhan wrote:
> On Wed, Nov 18, 2009 at 08:32, Bennie Schut <bs...@ebuddy.com> wrote:
>
>   
>> Using zebra this way gives me a 27% speed improvement over using plain
>>     
>
> Interesting! Can you add a bit more detail here? A 27% speedup just
> because you are storing and loading your data through Zebra's table
> loader instead of using PigStorage? If so, is it because you have wide
> rows and you are only loading a couple of columns out of the many
> columns in your dataset?
>
> Thanks,
> Ashutosh
>   


Re: zebra: unknown compressor none

Posted by Ashutosh Chauhan <as...@gmail.com>.
On Wed, Nov 18, 2009 at 08:32, Bennie Schut <bs...@ebuddy.com> wrote:

> Using zebra this way gives me a 27% speed improvement over using plain

Interesting! Can you add a bit more detail here? A 27% speedup just
because you are storing and loading your data through Zebra's table
loader instead of using PigStorage? If so, is it because you have wide
rows and you are only loading a couple of columns out of the many
columns in your dataset?

Thanks,
Ashutosh

Re: zebra: unknown compressor none

Posted by Bennie Schut <bs...@ebuddy.com>.
Just for anyone else who might have this problem in the future:
I found a bit of a workaround. When you generate the zebra file/dir, just
make sure you use something like "order x by y parallel 20;" before
you do the store, so the zebra structure will have 20 files and any
jobs using this file can then use at least 20 mappers.
It's not perfect though, so if someone finds another way please let me know.
Using zebra this way gives me a 27% speed improvement over using plain
text tab delimited files, so even with the hack I'm happy :)

Bennie Schut wrote:
> Hi all,
>
> I still can't get pig to use multiple mappers when using zebra. I tried
> using lzo hoping it would help, but sadly no. The file is 14G as tab
> delimited plain text, 7G as zebra with gz, and 10G as zebra with lzo.
> When I use the tab delimited file I get 216 mappers, but with zebra just
> 2 mappers, of which 1 is done almost instantly and the other runs
> for hours. Any idea why it's not using more mappers?
>
> As an example of what I'm trying to do:
> dim1258375560540 = load '/user/dwh/screenname2.zebra' using
> org.apache.hadoop.zebra.pig.TableLoader('screenname_id, code');
> fact1258375560540 = load
> '/user/bennies/newvalues//chatsessions_1238624404177_small.csv' using
> PigStorage('\t') as (session_hash: chararray, email: chararray,
> screenname: chararray);
> tmp1258375560540 = cogroup fact1258375560540 by screenname inner,
> dim1258375560540 by code outer PARALLEL 4;
> dump tmp1258375560540;
>
>
> Thanks,
> Bennie
>
> Bennie Schut wrote:
>   
>> Another zebra related question.
>>
>> I couldn't find a lot of documentation on zebra, but I figured you can
>> change the compression codec with syntax like this:
>> store outfile into '/user/dwh/screenname2.zebra' using
>> org.apache.hadoop.zebra.pig.TableStorer('compress by lzo');
>>
>> And in theory disable compression like this:
>> store outfile into '/user/dwh/screenname3.zebra' using
>> org.apache.hadoop.zebra.pig.TableStorer('compress by none');
>>
>> But it doesn't seem to understand "none" as a compressor:
>> java.io.IOException: ColumnGroup.Writer constructor failed : Partition
>> constructor failed :Encountered " <IDENTIFIER> "none "" at line 1,
>> column 13.
>> Was expecting:
>>     <COMPRESSOR> ...
>>
>>         at org.apache.hadoop.zebra.io.BasicTable$Writer.<init>(BasicTable.java:1116)
>>         at org.apache.hadoop.zebra.pig.TableOutputFormat.checkOutputSpecs(TableStorer.java:154)
>>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
>>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>         at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
>>         at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
>>         at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
>>         at java.lang.Thread.run(Thread.java:619)
>>
>>
>>
>> I actually tried this because when I use the zebra result for further
>> processing it only uses 2 mappers instead of the 230 mappers on the
>> original file. I remember hadoop cannot split gz files, so I figured
>> using compression might be causing it to use so few mappers. Does anyone
>> perhaps know a different approach to this?
>>
>> Thanks,
>> Bennie.
>>
>>   
>>     
>
>   


Re: zebra: unknown compressor none

Posted by Bennie Schut <bs...@ebuddy.com>.
Hi all,

I still can't get pig to use multiple mappers when using zebra. I tried
using lzo hoping it would help, but sadly no. The file is 14G as tab
delimited plain text, 7G as zebra with gz, and 10G as zebra with lzo.
When I use the tab delimited file I get 216 mappers, but with zebra just
2 mappers, of which 1 is done almost instantly and the other runs
for hours. Any idea why it's not using more mappers?

As an example of what I'm trying to do:
dim1258375560540 = load '/user/dwh/screenname2.zebra' using
org.apache.hadoop.zebra.pig.TableLoader('screenname_id, code');
fact1258375560540 = load
'/user/bennies/newvalues//chatsessions_1238624404177_small.csv' using
PigStorage('\t') as (session_hash: chararray, email: chararray,
screenname: chararray);
tmp1258375560540 = cogroup fact1258375560540 by screenname inner,
dim1258375560540 by code outer PARALLEL 4;
dump tmp1258375560540;


Thanks,
Bennie

Bennie Schut wrote:
> Another zebra related question.
>
> I couldn't find a lot of documentation on zebra, but I figured you can
> change the compression codec with syntax like this:
> store outfile into '/user/dwh/screenname2.zebra' using
> org.apache.hadoop.zebra.pig.TableStorer('compress by lzo');
>
> And in theory disable compression like this:
> store outfile into '/user/dwh/screenname3.zebra' using
> org.apache.hadoop.zebra.pig.TableStorer('compress by none');
>
> But it doesn't seem to understand "none" as a compressor:
> java.io.IOException: ColumnGroup.Writer constructor failed : Partition
> constructor failed :Encountered " <IDENTIFIER> "none "" at line 1,
> column 13.
> Was expecting:
>     <COMPRESSOR> ...
>
>         at org.apache.hadoop.zebra.io.BasicTable$Writer.<init>(BasicTable.java:1116)
>         at org.apache.hadoop.zebra.pig.TableOutputFormat.checkOutputSpecs(TableStorer.java:154)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>         at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
>         at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
>         at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
>         at java.lang.Thread.run(Thread.java:619)
>
>
>
> I actually tried this because when I use the zebra result for further
> processing it only uses 2 mappers instead of the 230 mappers on the
> original file. I remember hadoop cannot split gz files, so I figured
> using compression might be causing it to use so few mappers. Does anyone
> perhaps know a different approach to this?
>
> Thanks,
> Bennie.
>
>