Posted to dev@pig.apache.org by "Prasanth J (JIRA)" <ji...@apache.org> on 2012/09/04 04:38:07 UTC

[jira] [Commented] (PIG-2831) MR-Cube implementation (Distributed cubing for holistic measures)

    [ https://issues.apache.org/jira/browse/PIG-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447460#comment-13447460 ] 

Prasanth J commented on PIG-2831:
---------------------------------

Hi Dmitriy,

I have implemented the new inter storage with statistics gathering and the new sample loader, as per your idea on RB. The attached patch contains the following changes:
1) Added a new RichInterStorage which implements the StoreMetadata and LoadMetadata interfaces for storing and loading the statistics of intermediate data. RichInterStorage uses RichRecordReader and RichInputFormat for reading intermediate data, and RichRecordWriter and RichOutputFormat for storing intermediate data. RichRecordWriter and RichOutputFormat are the same as InterRecordWriter and InterOutputFormat. The main difference is in RichRecordReader and RichInputFormat: RichInputFormat wraps all the splits into one logical split so that only one mapper is used for loading the sample dataset.
2) CubeSampleLoader uses the underlying RichRecordReader to get random samples of the data. RichRecordReader opens at most 100 inner splits and chooses a random split when reading a tuple.
3) Changes to PigOutputCommitter for storing statistics. Statistics are stored at the end of every commitTask(), once for each output partition. RichInterStorage takes care of loading the statistics for the different partitions and aggregating them. The statistics store numberOfRows and avgInMemTupleSize for each partition (only these two values are required for holistic cubing).
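
To illustrate the aggregation mentioned in 3), here is a minimal sketch of how per-partition statistics could be combined. It is not the code from the patch: the class PartitionStats and its aggregate() method are hypothetical names, and the assumption is that row counts are summed and avgInMemTupleSize is averaged weighted by the row count of each partition.

```java
import java.util.List;

/**
 * Hypothetical holder for the two per-partition values named above.
 * The class and method names are illustrative, not taken from the patch.
 */
final class PartitionStats {
    final long numberOfRows;
    final double avgInMemTupleSize; // assumed to be in bytes

    PartitionStats(long numberOfRows, double avgInMemTupleSize) {
        this.numberOfRows = numberOfRows;
        this.avgInMemTupleSize = avgInMemTupleSize;
    }

    /**
     * Combine the statistics of several partitions (assumed scheme):
     * row counts are added, and the average tuple size is weighted by
     * the number of rows in each partition.
     */
    static PartitionStats aggregate(List<PartitionStats> partitions) {
        long totalRows = 0L;
        double totalSize = 0.0;
        for (PartitionStats p : partitions) {
            totalRows += p.numberOfRows;
            totalSize += p.numberOfRows * p.avgInMemTupleSize;
        }
        // Avoid dividing by zero when every partition is empty.
        double avg = (totalRows == 0) ? 0.0 : totalSize / totalRows;
        return new PartitionStats(totalRows, avg);
    }
}
```

Weighting by row count keeps a small partition with unusually large tuples from skewing the overall average, which matters if the average is later used to estimate how many tuples fit in memory.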

This patch is considerably larger, mainly because most of the changes (at the logical layer) come from an old formatting issue that I fixed in this patch. Sorry about that.

I have also updated the patch on RB. Please review it and let me know your feedback. I have left open some of the issues from your earlier review comments that need your input.

                
> MR-Cube implementation (Distributed cubing for holistic measures)
> -----------------------------------------------------------------
>
>                 Key: PIG-2831
>                 URL: https://issues.apache.org/jira/browse/PIG-2831
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>         Attachments: PIG-2831.1.git.patch, PIG-2831.2.git.patch, PIG-2831.3.git.patch, PIG-2831.4.git.patch, PIG-2831.5.git.patch
>
>
> Implementing distributed cube materialization for holistic measures based on the MR-Cube approach described in http://arnab.org/files/mrcube.pdf.
> Primary steps involved:
> 1) Identify if the measure is holistic or not
> 2) Determine the algebraic attribute (can be detected automatically in a few cases; if automatic detection fails, the user should hint the algebraic attribute)
> 3) Modify MRPlan to insert a sampling job which executes the naive cube algorithm and generates an annotated cube lattice (contains large-group partitioning information)
> 4) Modify the plan to distribute the annotated cube lattice to all mappers using the distributed cache
> 5) Execute the actual cube materialization on the full dataset
> 6) Modify MRPlan to insert a post-processing job for combining the results of the actual cube materialization job
> 7) OOM exception handling
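
As a rough illustration of the cube lattice referred to in step 3, the sketch below enumerates every region of the lattice (each subset of the cube dimensions) and computes a simple partition factor for a region. It is not the patch's implementation: CubeLattice, regions() and partitionFactor() are hypothetical names, and the ceiling-division rule is a simplifying assumption rather than the estimation procedure from the MR-Cube paper.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; names and rules are illustrative only. */
final class CubeLattice {

    /**
     * Enumerate all 2^n regions of the lattice for n dimensions.
     * The empty list stands for the grand-total ("all") region.
     * Intended for a small number of dimensions (n < 31).
     */
    static List<List<String>> regions(List<String> dimensions) {
        int n = dimensions.size();
        List<List<String>> out = new ArrayList<>();
        for (int mask = 0; mask < (1 << n); mask++) {
            List<String> region = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                // Bit i of the mask decides whether dimension i is grouped on.
                if ((mask & (1 << i)) != 0) {
                    region.add(dimensions.get(i));
                }
            }
            out.add(region);
        }
        return out;
    }

    /**
     * Assumed annotation rule: split a region's largest group into enough
     * parts that each part fits within one reducer's capacity (in rows).
     */
    static long partitionFactor(long estimatedMaxGroupRows, long reducerCapacityRows) {
        if (reducerCapacityRows <= 0) {
            throw new IllegalArgumentException("reducerCapacityRows must be positive");
        }
        long parts = (estimatedMaxGroupRows + reducerCapacityRows - 1) / reducerCapacityRows;
        return Math.max(1L, parts);
    }
}
```

For two dimensions a and b, regions() yields the four regions (), (a), (b) and (a, b). A region whose partition factor is 1 needs no extra partitioning under this assumed rule.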
