Posted to user@pig.apache.org by Yves Roy <yv...@cossette.com> on 2010/11/30 18:16:47 UTC

LOAD data USING to parse data in order to obtain the AS as desired.

Hello:

I hope this is not double posting.

I want to do something simple:

I have a data file, mydata.log,  formatted like this:

a1 | b1 | c=foo&d=bar | e1
a2 | b2 | c=john&d=doe | e2
a3 | b3 | c=foo&d=doe | e3
...

and I want to LOAD the data USING <something> so that the AS schema is
(A, B, C, D, E), i.e. extract two fields from the third one.

For example :

data = LOAD 'mydata.log' USING <something> AS (A, B, C, D, E);

That is, I want the third field (the one formatted as 'cx=foox&dx=barx') to
be parsed to yield C and D in my AS list of fields,
so that later on I can do things like:

data_cfoo = FILTER data BY C == 'foo';
data_cfoo_ddoe = FILTER data_cfoo BY D == 'doe';


There has to be a simple way to do that?
Can I pass a regex, a Ruby script, or something else as a parameter to
PigStorage, or should I use something other than PigStorage?

Many thanks

Yves

   YVES ROY DÉVELOPPEUR LOGICIEL DE FJORD
2100, RUE DRUMMOND, MONTRÉAL, QUÉBEC H3G 1X1 CANADA
T 514 270 8782 #4572 / F 514 270 4162 / cossette.com

Re: LOAD data USING to parse data in order to obtain the AS as desired.

Posted by John Hui <jo...@gmail.com>.
What would be the exact command to use? It should be just one
line, right?
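
For reference, a minimal sketch of what such a command could look like. It is not a single line: it takes one LOAD plus one FOREACH. It uses REGEX_EXTRACT rather than STRSPLIT, and assumes a Pig release in which REGEX_EXTRACT and TRIM are available as built-ins. The lowercase field names (a, b, cd, c, d, e) are illustrative choices, and the patterns assume the keys are literally `c` and `d` as in the sample data.

```pig
-- Load the four pipe-delimited fields; 'cd' holds the raw ' c=...&d=... ' text.
raw = LOAD 'mydata.log' USING PigStorage('|')
      AS (a:chararray, b:chararray, cd:chararray, e:chararray);

-- Pull the values of c and d out of the third field.
-- [^& ]* stops at the next '&' or at the padding space around the '|'.
data = FOREACH raw GENERATE
           TRIM(a) AS a,
           TRIM(b) AS b,
           REGEX_EXTRACT(cd, 'c=([^& ]*)', 1) AS c,
           REGEX_EXTRACT(cd, 'd=([^& ]*)', 1) AS d,
           TRIM(e) AS e;

-- The filters from the original question then work on c and d.
data_cfoo      = FILTER data BY c == 'foo';
data_cfoo_ddoe = FILTER data_cfoo BY d == 'doe';
```

The TRIM calls are there because the sample lines pad each `|` with spaces, which PigStorage('|') keeps as part of the field values.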

On Tue, Nov 30, 2010 at 12:58 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> An easier approach would be to just use PigStorage('|') to get the
> pipe-delimited fields, and use STRSPLIT to break up the third column into
> multiple columns.
>
> -D

Re: LOAD data USING to parse data in order to obtain the AS as desired.

Posted by Daniel Dai <ji...@yahoo-inc.com>.
Try this:

table = LOAD 'stuff' AS (n1:chararray, n2:chararray); -- 'stuff' is a placeholder path; other irrelevant fields omitted
pared = foreach table generate n1, n2;
grouped = group pared by n1;
counted  = foreach grouped generate group, 
(double)COUNT(pared.n2)/COUNT_STAR(pared.n2) as ratio;
ordered = order counted by ratio desc;
limited = limit ordered 200;
dump limited;

Daniel



Re: LOAD data USING to parse data in order to obtain the AS as desired.

Posted by Yves Roy <yv...@cossette.com>.
Thanks Dmitriy.

Regarding your suggestion to use PigStorage('|') and STRSPLIT:

a) Yes, PigStorage('|') does work fine (I started from there), but how do I
make it work with the AS clause, which contains 5 fields (A, B, C, D, E) and not
only the 4 produced by splitting on the delimiter '|'?

b) As for STRSPLIT, when and where should it be used so that the result matches
the AS clause (A, B, C, D, E) and I can use those fields later, i.e. after
LOADing the data?

(will this work with 5 fields?)
data = LOAD 'mydata.log' USING PigStorage('|') AS (A, B, C, D, E);

(then where and how does STRSPLIT come in?)

Or should I start with only 4 fields:

data = LOAD 'mydata.log' USING PigStorage('|') AS (A, B, CD, E);

and then use STRSPLIT (how?) so that the following
commands work as expected:

data_cfoo = FILTER data BY C == 'foo';
data_cfoo_ddoe = FILTER data_cfoo BY D == 'doe';

Thanks
Yves


Re: LOAD data USING to parse data in order to obtain the AS as desired.

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
An easier approach would be to just use PigStorage('|') to get the
pipe-delimited fields, and use STRSPLIT to break up the third column into
multiple columns.
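
A minimal sketch of that approach, assuming a Pig release in which STRSPLIT and TRIM are available as built-ins. The relation and field names (raw, pairs, kv, c_pair, c_key and so on) are hypothetical, chosen only for illustration, and the sketch assumes every line has exactly the `c=...&d=...` shape shown in the sample data.

```pig
-- Load the four pipe-delimited fields; 'cd' holds the raw 'c=...&d=...' text.
raw = LOAD 'mydata.log' USING PigStorage('|')
      AS (a:chararray, b:chararray, cd:chararray, e:chararray);

-- Split 'c=foo&d=bar' on '&' into two key=value pairs.
-- TRIM removes the spaces that pad each '|' in the sample file.
pairs = FOREACH raw GENERATE
            TRIM(a) AS a,
            TRIM(b) AS b,
            FLATTEN(STRSPLIT(TRIM(cd), '&', 2)) AS (c_pair:chararray, d_pair:chararray),
            TRIM(e) AS e;

-- Split each pair on '=' into a key and a value.
kv = FOREACH pairs GENERATE
         a, b,
         FLATTEN(STRSPLIT(c_pair, '=', 2)) AS (c_key:chararray, c:chararray),
         FLATTEN(STRSPLIT(d_pair, '=', 2)) AS (d_key:chararray, d:chararray),
         e;

-- Keep only the five fields asked for: (a, b, c, d, e).
data = FOREACH kv GENERATE a, b, c, d, e;
```

STRSPLIT returns a tuple, so FLATTEN with an AS clause is what turns its pieces into named columns. A line whose third field lacks the `&` or an `=` would yield fewer pieces than the AS clause names, so real data may need a guard or a regex-based extraction instead.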

-D


Re: LOAD data USING to parse data in order to obtain the AS as desired.

Posted by John Hui <jo...@gmail.com>.
You can try using a custom storage parser.

You can see a bunch of examples here:

pig-0.7.0/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage

I wrote one for JSON.
