Posted to dev@pig.apache.org by "Adam Szita (JIRA)" <ji...@apache.org> on 2016/12/01 14:10:58 UTC

[jira] [Commented] (PIG-4901) To use Multistorage for each Group

    [ https://issues.apache.org/jira/browse/PIG-4901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15712053#comment-15712053 ] 

Adam Szita commented on PIG-4901:
---------------------------------

By default, MultiStorage is capable of multi-level storage, as your example shows.
I attached a patch to upgrade MultiStorage in piggybank to support any number of levels.

Once it goes in you can use it as:
{code}
register '../contrib/piggybank/java/piggybank.jar';
A = LOAD 'multiStorage.txt' USING PigStorage() AS (DateString:chararray, Name:chararray, Col3:chararray, Col4:chararray);
B = foreach A generate ToDate(DateString) AS Date, DateString, Name, Col3, Col4;
C = foreach B generate CONCAT((chararray)(GetYear(Date)),CONCAT('-',(chararray)(GetMonth(Date)))) AS Month, DateString, Name, Col3, Col4;
STORE C into 'multiOut' USING org.apache.pig.piggybank.storage.MultiStorage('multiOut','2,0', 'none', '\\t');
{code}
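
Note that the Month key built above is not zero-padded (February 2015 becomes 2015-2). If folder names like 201502 from the issue description are wanted, a variant along these lines might work. This is an untested sketch: it assumes the builtin ToString(datetime, format) with a 'yyyyMM' pattern, and that the patched MultiStorage takes the same arguments as in the script above. The alias YearMonth is only illustrative.
{code}
register '../contrib/piggybank/java/piggybank.jar';
A = LOAD 'multiStorage.txt' USING PigStorage() AS (DateString:chararray, Name:chararray, Col3:chararray, Col4:chararray);
-- Build a zero-padded year+month key, e.g. 2015-02-02 -> 201502 (assumes ISO-formatted dates in DateString)
B = foreach A generate ToString(ToDate(DateString), 'yyyyMM') AS YearMonth, DateString, Name, Col3, Col4;
-- '2,0' splits first by Name (field 2), then by YearMonth (field 0), as in the script above
STORE B into 'multiOut' USING org.apache.pig.piggybank.storage.MultiStorage('multiOut','2,0', 'none', '\\t');
{code}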

There is also an option to skip writing the values of key columns.
[~daijy] can you please take a look at the patch?

> To use Multistorage for each Group
> ----------------------------------
>
>                 Key: PIG-4901
>                 URL: https://issues.apache.org/jira/browse/PIG-4901
>             Project: Pig
>          Issue Type: Task
>          Components: piggybank
>    Affects Versions: 0.11.1, 0.16.0
>         Environment: Hadoop 1.2.0
>            Reporter: Divya
>            Assignee: Adam Szita
>            Priority: Minor
>             Fix For: 0.17.0
>
>         Attachments: PIG-4901.patch
>
>
> I am trying to group my data and store it in HDFS, with a folder for each 'name' and a subfolder for each 'YearMonth' under each name folder.
> Input:
> (Date)        (name)    (col3)    (col4)
> 2015-02-02    abc       y         z
> 2016-01-02    xyz       i         j
> 2015-03-02    abc       f         b
> 2015-02-06    abc       y         z
> 2016-03-02    xyz       a         q
>     
> Expected output in HDFS:
> abc folder
>     ->201502 subfolder
>            2015-02-02    abc       y         z
>            2015-02-06    abc       y         z
>     ->201503 subfolder
>            2015-03-02    abc       f         b
> xyz folder
>     ->201601 subfolder
>            2016-01-02    xyz       i         j
>     ->201603 subfolder
>            2016-03-02    xyz       a         q
> I am not sure how to use MultiStorage on the Name column after grouping the tuples by date.
> Any help is appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)