You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@pig.apache.org by Benjamin Smedberg <be...@smedbergs.us> on 2012/08/13 18:05:39 UTC

Distributed accumulator functions

I'm a new-ish pig user querying data on an hbase cluster. I have a 
question about accumulator-style functions.

When writing an accumulator-style UDF, is all of the data shipped to a 
single machine before it is reduced/accumulated? For example, if I were 
doing to write re-implement SUM as a UDF, it seems to me that it would 
be more efficient to run SUM on each map node, and then do a sum-of-sums 
when reducing. Is there a way to write a UDF which supports this style 
of accumulation/aggregation?

Also, is PigStorage compatible with the quoting expected by excel 
tab-delimited files? AIUI that would require quoting the values with 
"value\tvalue" and escaping double quotes. If this isn't the native 
PigStorage format, is there a storage module already written which 
supports excel-tab output?

--BDS

Re: Distributed accumulator functions

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

For CSV excel, check out
http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html

D

>> Also, is PigStorage compatible with the quoting expected by excel tab-delimited files? AIUI that would require quoting the values with "value\tvalue" and escaping double quotes. If this isn't the native PigStorage format, is there a storage module already written which supports excel-tab output?
>
> PigStorage doesn't support escaping.  I am not aware of a storage function focussed on excel CSV format, but others may be.
>
> Alan.
>
>>
>> --BDS
>>
>

Re: Distributed accumulator functions

Posted by Alan Gates <ga...@hortonworks.com>.

On Aug 13, 2012, at 9:05 AM, Benjamin Smedberg wrote:

> I'm a new-ish pig user querying data on an hbase cluster. I have a question about accumulator-style functions.
> 
> When writing an accumulator-style UDF, is all of the data shipped to a single machine before it is reduced/accumulated? For example, if I were doing to write re-implement SUM as a UDF, it seems to me that it would be more efficient to run SUM on each map node, and then do a sum-of-sums when reducing. Is there a way to write a UDF which supports this style of accumulation/aggregation?

How many reducers are involved in an operation is independent of the type of UDF you use.  The number of reducers is determined by the parallelism you declare in your script (via the parallel clause in your group statement or via a set default parallelism statement in your script) or by the default Pig chooses.  

As to whether it is more efficient to do a sum of sums, it certainly is. For those types of operations you should use an algebraic UDF rather than an accumulative.  Algebraic UDFs have an initial (map), intermediate (combiner), and final (reducer) steps.  Accumulative UDFs are for operations that cannot be distributed but that only need to see the data stream once.  An example would be cumulative sums, where you want to return not just a final sum but a list of the sums as you went along.  This is order dependent and thus can't be done until you've collected all the values for a given key.

> 
> Also, is PigStorage compatible with the quoting expected by excel tab-delimited files? AIUI that would require quoting the values with "value\tvalue" and escaping double quotes. If this isn't the native PigStorage format, is there a storage module already written which supports excel-tab output?

PigStorage doesn't support escaping.  I am not aware of a storage function focussed on excel CSV format, but others may be.

Alan.

> 
> --BDS
>