Posted to dev@pig.apache.org by "Adam Szita (JIRA)" <ji...@apache.org> on 2016/12/01 14:10:58 UTC
[jira] [Commented] (PIG-4901) To use Multistorage for each Group
[ https://issues.apache.org/jira/browse/PIG-4901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15712053#comment-15712053 ]
Adam Szita commented on PIG-4901:
---------------------------------
By default, MultiStorage is not capable of multi-level storage such as your example shows.
I attached a patch to upgrade MultiStorage in piggybank to support any number of levels.
Once it goes in you can use it as:
{code}
register '../contrib/piggybank/java/piggybank.jar';
A = LOAD 'multiStorage.txt' USING PigStorage() AS (DateString:chararray, Name:chararray, Col3:chararray, Col4:chararray);
B = foreach A generate ToDate(DateString) AS Date, DateString, Name, Col3, Col4;
C = foreach B generate CONCAT((chararray)(GetYear(Date)),CONCAT('-',(chararray)(GetMonth(Date)))) AS Month, DateString, Name, Col3, Col4;
STORE C into 'multiOut' USING org.apache.pig.piggybank.storage.MultiStorage('multiOut','2,0', 'none', '\\t');
{code}
There is also an option to skip writing the values of the key columns.
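Note that the Month key in the example above comes out as e.g. '2015-2', not the '201502' used in the issue description. If the subfolder names should match the requested YearMonth format, a minimal sketch along these lines might work. It assumes the built-in ToString(datetime, format) function with a Joda-style pattern, and the same patched MultiStorage arguments as above:
{code}
register '../contrib/piggybank/java/piggybank.jar';
A = LOAD 'multiStorage.txt' USING PigStorage() AS (DateString:chararray, Name:chararray, Col3:chararray, Col4:chararray);
-- Format the month key as yyyyMM (e.g. 201502) instead of concatenating GetYear and GetMonth.
B = FOREACH A GENERATE ToString(ToDate(DateString), 'yyyyMM') AS Month, DateString, Name, Col3, Col4;
-- Same key order as above: field 2 (Name) is the first level, field 0 (Month) the second.
STORE B INTO 'multiOut' USING org.apache.pig.piggybank.storage.MultiStorage('multiOut', '2,0', 'none', '\\t');
{code}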
[~daijy] can you please take a look at the patch?
> To use Multistorage for each Group
> ----------------------------------
>
> Key: PIG-4901
> URL: https://issues.apache.org/jira/browse/PIG-4901
> Project: Pig
> Issue Type: Task
> Components: piggybank
> Affects Versions: 0.11.1, 0.16.0
> Environment: Hadoop 1.2.0
> Reporter: Divya
> Assignee: Adam Szita
> Priority: Minor
> Fix For: 0.17.0
>
> Attachments: PIG-4901.patch
>
>
> I am trying to group my data and store it in HDFS, with a folder for each 'name' and a subfolder for each 'YearMonth' under each name folder.
> Input:
> (Date) (name) (col3) (col4)
> 2015-02-02 abc y z
> 2016-01-02 xyz i j
> 2015-03-02 abc f b
> 2015-02-06 abc y z
> 2016-03-02 xyz a q
>
> Expected output in HDFS:
> abc folder
>   ->201502 subfolder
>       2015-02-02 abc y z
>       2015-02-06 abc y z
>   ->201503 subfolder
>       2015-03-02 abc f b
> xyz folder
>   ->201601 subfolder
>       2016-01-02 xyz i j
>   ->201603 subfolder
>       2016-03-02 xyz a q
> I am not sure how to use the MultiStorage option on the Name column after grouping the tuples by date.
> Any help is appreciated.