Posted to user@pig.apache.org by praveenesh kumar <pr...@gmail.com> on 2013/09/15 21:03:52 UTC

Splitting by unique values in a relation

Hi,

I have a relation A with the fields (customer_id, data).
I want to get the unique customer_ids and split the data by them into new
files/relations. What is the most efficient way to do that?

I can get the distinct customer_ids in a relation, but I don't
understand how I can use that to split the data by customer_id.

Regards
Praveenesh

Re: Splitting by unique values in a relation

Posted by Ruslan Al-Fakikh <me...@gmail.com>.
Sorry, I didn't realize that your values were not known in advance. Yes, in
your case MultiStorage is a good way to split the data according to the
values of a column. It worked for me in similar cases.

Thanks
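
For reference, a minimal sketch of the MultiStorage approach discussed in
this thread. The input and output paths, the piggybank.jar location, and
the chararray field types are assumptions made for illustration; only the
relation A and its fields (customer_id, data) come from the thread.

```pig
-- Assumed location of the piggybank jar; adjust for your installation.
REGISTER piggybank.jar;

-- Hypothetical input path and field types.
A = LOAD 'input/customers' USING PigStorage('\t')
    AS (customer_id:chararray, data:chararray);

-- MultiStorage(parentPath, splitFieldIndex): '0' selects the first field
-- (customer_id) as the key to split the output on.
STORE A INTO 'output/by_customer'
    USING org.apache.pig.piggybank.storage.MultiStorage('output/by_customer', '0');
```

This should write one subdirectory per distinct customer_id under the
output path, without the ids having to be listed in the script.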


Re: Splitting by unique values in a relation

Posted by praveenesh kumar <pr...@gmail.com>.
Okay, I may not have explained the scenario properly. Apologies if I was
not clear enough about my problem.

My scenario:

I have a relation A that contains an unknown number (N) of unique
customer_ids. I want to create N different outputs, one per customer_id. I
was thinking of finding the unique customer_ids first, but then I was
unsure how to proceed, which is why I posted the question.

After some further googling, I found piggybank's MultiStorage UDF, which
does this kind of operation and would do the job in my case.
Still, I was wondering: if I had to do some other operation, e.g.
filtering by unique customer_ids, how would you achieve that in Pig?

SPLIT needs criteria that are known in advance to split a relation. Please
correct me if I am wrong there. When the values are unknown, how can we
achieve the same thing?

Regards
Praveenesh
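
On the question of operating per customer_id when the ids are not known in
advance, one common option is GROUP ... BY with a nested FOREACH. The
sketch below is not something proposed by the participants in this thread;
the input path, the field types, and the particular filter condition are
assumptions chosen only for illustration.

```pig
-- Hypothetical input path and field types.
A = LOAD 'input/customers' USING PigStorage('\t')
    AS (customer_id:chararray, data:chararray);

-- One tuple per distinct customer_id; the bag named A holds that id's rows.
by_customer = GROUP A BY customer_id;

-- Nested FILTER inside FOREACH: runs once per customer_id, whatever the
-- ids turn out to be. The condition here is an arbitrary example.
per_customer = FOREACH by_customer {
    with_data = FILTER A BY data IS NOT NULL;
    GENERATE group AS customer_id, COUNT(with_data) AS n;
};
```

Unlike SPLIT, nothing in this script names a specific customer_id, so it
does not depend on knowing the values or how many of them exist.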


Re: Splitting by unique values in a relation

Posted by Shahab Yunus <sh...@gmail.com>.
A correction to my earlier comment. The following statement that I wrote
was wrong:
'Won't SPLIT always give you 2 relations?'

SPLIT basically gives what Praveenesh himself mentioned, i.e. a
pre-defined/known number of relations/splits.

Regards,
Shahab


Re: Splitting by unique values in a relation

Posted by praveenesh kumar <pr...@gmail.com>.
I can use SPLIT only when I know the values I need to split by. Here the
customer_ids are unknown to me, and I don't know how many of them exist in
my data. Hence SPLIT is not the answer to my problem.

Anyway, I have found piggybank's MultiStorage to be much closer to what I
am looking for. I was just wondering whether there is a better or
different way to do the same thing.

Regards
Praveenesh


Re: Splitting by unique values in a relation

Posted by Shahab Yunus <sh...@gmail.com>.
I thought about SPLIT too, as well as a nested FILTER within a FOREACH,
but the OP can have any number of distinct ids on which he wants to split
(thus getting the same number of splits/relations). Won't SPLIT always
give you 2 relations?

Regards,
Shahab


Re: Splitting by unique values in a relation

Posted by Ruslan Al-Fakikh <me...@gmail.com>.
Hi!

Have you tried the SPLIT operator?
http://pig.apache.org/docs/r0.11.1/basic.html#SPLIT
After splitting the relation into two separate relations you can STORE them
into different locations.

Best Regards,
Ruslan Al-Fakikh
https://www.odesk.com/users/~015b7b5f617eb89923

