Posted to user@pig.apache.org by Juan Martin Pampliega <jp...@gmail.com> on 2011/07/25 20:01:23 UTC

Merging multiple columns into 2 columns

I have data in an HBase table stored in the following format:

rowkey                            group_id:1  group_id:2  ...  group_id:n
2fcab50712467eab4004583eb8fb7f89  1           0                1
085125e8f7cdc99fd91dbd7280373c5b  0           1                0
dd53e23487da03fd02396306d248cda0  2           1                0

where the column family group_id contains one column for each data set, and
each value is the number of times the hash (the row key) is present in that
data set.

I need to reformat the data and obtain the output in the following format:

hash                                         group_id
2fcab50712467eab4004583eb8fb7f89             1
dd53e23487da03fd02396306d248cda0             1
dd53e23487da03fd02396306d248cda0             1
085125e8f7cdc99fd91dbd7280373c5b             2
dd53e23487da03fd02396306d248cda0             2
...
2fcab50712467eab4004583eb8fb7f89             n

Any ideas on how to achieve this? I'm really at a loss here.

Re: Merging multiple columns into 2 columns

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Sounds like you need a UDF that takes the map returned by Pig when you load
a column family, and returns a bag of column names, each column name
repeated as many times as indicated by the value of the column. You would
then flatten the result of this UDF.
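
A rough sketch of that logic in plain Python, just to make the idea concrete.
It is not tied to Pig's UDF API: the function names are made up for
illustration, and it assumes the column family arrives as a map keyed by the
bare column qualifier ("1", "2", "n") with values that parse as integers. The
real key format depends on how the table is loaded.

```python
def expand_counts(columns):
    """Return each column name repeated as many times as its count.

    `columns` is assumed to map column qualifier -> count (int or numeric string).
    """
    names = []
    for name, count in sorted(columns.items()):
        # A count of 0 contributes nothing; a count of 2 contributes two copies.
        names.extend([name] * int(count))
    return names


def reshape(rows):
    """Flatten (rowkey, columns) pairs into (hash, group_id) pairs.

    This mirrors what flattening the bag next to the row key would produce,
    then orders the result by group_id as in the desired output.
    """
    pairs = []
    for rowkey, columns in rows:
        for name in expand_counts(columns):
            pairs.append((rowkey, name))
    # sorted() is stable, so rows keep their input order within each group_id.
    return sorted(pairs, key=lambda pair: pair[1])


# The three sample rows from the question, with "n" standing in for the last column.
SAMPLE_ROWS = [
    ("2fcab50712467eab4004583eb8fb7f89", {"1": 1, "2": 0, "n": 1}),
    ("085125e8f7cdc99fd91dbd7280373c5b", {"1": 0, "2": 1, "n": 0}),
    ("dd53e23487da03fd02396306d248cda0", {"1": 2, "2": 1, "n": 0}),
]

if __name__ == "__main__":
    for rowkey, group_id in reshape(SAMPLE_ROWS):
        print(rowkey, group_id)
```

In Pig the first function would be the body of the UDF (returning a bag of
tuples rather than a list), and the second step would be a FOREACH with
FLATTEN applied to the UDF's result alongside the row key.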

D
