Posted to dev@pig.apache.org by "Thejas M Nair (JIRA)" <ji...@apache.org> on 2009/10/29 22:02:00 UTC

[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

    [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771615#action_12771615 ] 

Thejas M Nair commented on PIG-1062:
------------------------------------

Skew join uses the total number of input tuples to calculate the number of reducers, in PartitionSkewedKeys.calculateReducers(..).
In the version in trunk, PoissonSampleLoader adds the size on disk of the sampled tuple as the last column of the tuple. PartitionSkewedKeys uses this to calculate the average tuple size on disk. The total number of tuples is then estimated as input-file-size / avg-size-of-tuple-on-disk.
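
For illustration, the estimate described above can be sketched roughly as follows. This is not the actual PartitionSkewedKeys code; the class and method names are hypothetical, and the sampled sizes are passed in as a plain list instead of being read from the last column of each tuple.

```java
import java.util.List;

// Hypothetical sketch of the trunk-style estimate; not the real Pig code.
final class TupleCountEstimateSketch {

    // Estimates the total number of tuples as
    // input-file-size / avg-size-of-tuple-on-disk.
    static long estimateTotalTuples(long inputFileSizeBytes,
                                    List<Long> sampledTupleSizesBytes) {
        if (sampledTupleSizesBytes.isEmpty()) {
            throw new IllegalArgumentException("need at least one sampled tuple");
        }
        long sum = 0;
        for (long size : sampledTupleSizesBytes) {
            sum += size;
        }
        double avgSize = (double) sum / sampledTupleSizesBytes.size();
        if (avgSize <= 0) {
            throw new IllegalArgumentException("average tuple size must be positive");
        }
        return Math.round(inputFileSizeBytes / avgSize);
    }
}
```

For example, a 1000-byte input with sampled tuple sizes of 10, 20 and 30 bytes has an average size of 20 bytes, giving an estimate of 50 tuples.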

But with the new interface, the size on disk of a tuple cannot be estimated (there is no getPosition). Also, the size of the input on disk cannot be determined if the input is not from a file, or if the load function is passed some metadata instead of a file name.

Ideally this information should be obtained through ResourceStatistics in the redesign proposal. Since that is not available right now, here is another proposal:

PoissonSampleLoader currently reads almost all the rows because it tries to sample evenly spaced tuples from the split. It will now read till the last tuple, and add an additional tuple that has the number of tuples in that split. This special tuple needs to be distinguished from the sampled tuples. I don't have a good way to do that other than giving it two columns: the first column holds a unique marker string, and the second holds the number of rows. Does anybody have a better suggestion?
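
To make the idea concrete, here is a rough sketch of the loader side. It does not use the real LoadFunc or PoissonSampleLoader classes: tuples are modelled as plain lists, the marker string is a made-up placeholder, and "take every n-th row" stands in for the actual evenly spaced sampling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; tuples are modelled as List<Object>, not Pig Tuples.
final class SplitSampleSketch {

    // Placeholder marker; a real one must not collide with any data value.
    static final String ROW_COUNT_MARKER = "__split_row_count__";

    // Reads the split to the end, keeps every sampleEvery-th row, and
    // appends one extra two-column tuple: (marker string, rows in split).
    static List<List<Object>> sampleWithRowCount(List<List<Object>> splitRows,
                                                 int sampleEvery) {
        if (sampleEvery <= 0) {
            throw new IllegalArgumentException("sampleEvery must be positive");
        }
        List<List<Object>> out = new ArrayList<>();
        long rowCount = 0;
        for (List<Object> row : splitRows) {
            if (rowCount % sampleEvery == 0) {
                out.add(row); // an ordinary sampled tuple
            }
            rowCount++;
        }
        // The special tuple comes last because the count is only known
        // after the last row of the split has been read.
        out.add(Arrays.<Object>asList(ROW_COUNT_MARKER, rowCount));
        return out;
    }
}
```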

PartitionSkewedKeys will look at all these special rows and add up their row counts to get the total number of rows.
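
The summing step could look roughly like this. Again this is only a sketch under assumptions: the class name, the marker string, and the list-based tuple representation are hypothetical and not taken from PartitionSkewedKeys.

```java
import java.util.List;

// Hypothetical sketch of the consumer side; not the real PartitionSkewedKeys.
final class RowCountTotalSketch {

    // Placeholder marker; must match whatever the sampling loader emits.
    static final String ROW_COUNT_MARKER = "__split_row_count__";

    // A special tuple has exactly two columns: the marker and a Long count.
    static boolean isRowCountTuple(List<Object> tuple) {
        return tuple.size() == 2
                && ROW_COUNT_MARKER.equals(tuple.get(0))
                && tuple.get(1) instanceof Long;
    }

    // Adds up the per-split row counts carried by the special tuples and
    // ignores the ordinary sampled tuples.
    static long totalRows(List<List<Object>> sampleOutput) {
        long total = 0;
        for (List<Object> tuple : sampleOutput) {
            if (isRowCountTuple(tuple)) {
                total += (Long) tuple.get(1);
            }
        }
        return total;
    }
}
```

With one special tuple per split, the result is the exact row count of the input, so it no longer depends on file sizes or tuple positions.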


> load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Task
>            Reporter: Thejas M Nair
>
> This is part of the effort to implement new load store interfaces as laid out in http://wiki.apache.org/pig/LoadStoreRedesignProposal .
> PigStorage and BinStorage are now working.
> SampleLoader and its subclasses (RandomSampleLoader, PoissonSampleLoader) need to be changed to work with the new LoadFunc interface.
> Fixing SampleLoader and RandomSampleLoader will get order-by queries working.
> PoissonSampleLoader is used by skew join. 
