Posted to user@pig.apache.org by sonia gehlot <so...@gmail.com> on 2011/06/27 23:13:32 UTC
Tokenize in Pig
Hi All,
I want to tokenize a string and then do some processing on the tokens.
For example:
line = "Today is first day of the week and its Monday"
p = load 'file.txt' as (line: chararray);
t = foreach p generate TOKENIZE(line);
result will be ({Today, is, first, day, of, the, week, and, its, Monday})
Now I want to join t with some other data set using 'day'. How can I address
'day' by position, like $3?
I know I can store it, load it again with a defined schema, and then do the
processing.
But is it possible to do this without storing and reloading?
Any ideas?
Thanks,
Sonia
Re: Tokenize in Pig
Posted by sonia gehlot <so...@gmail.com>.
I will write a UDF or use STRSPLIT.
Thanks Guys,
Sonia
On Mon, Jun 27, 2011 at 3:37 PM, Ramesh, Amit <am...@amazon.com> wrote:
>
> You could use STRSPLIT instead, which returns a tuple.
Re: Tokenize in Pig
Posted by "Ramesh, Amit" <am...@amazon.com>.
You could use STRSPLIT instead, which returns a tuple.
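A minimal sketch of the STRSPLIT approach, assuming the words are separated by single spaces and every line has at least five tokens (shorter lines may need separate handling). The relation other, the file other.txt, and the field names w0 to w3, rest, and info are hypothetical, and this has not been checked against a specific Pig release.

```pig
p = LOAD 'file.txt' AS (line:chararray);

-- STRSPLIT returns a tuple, so token order is preserved.
-- A limit of 5 splits off the first four tokens and leaves the remainder in the fifth field.
s = FOREACH p GENERATE FLATTEN(STRSPLIT(line, ' ', 5))
        AS (w0:chararray, w1:chararray, w2:chararray, w3:chararray, rest:chararray);

-- Hypothetical second data set keyed by a word.
other = LOAD 'other.txt' AS (word:chararray, info:chararray);

-- w3 is the fourth token, which is "day" for the sample sentence.
j = JOIN s BY w3, other BY word;
```

Flattening the tuple into named fields avoids the store-and-reload step: the join key can then be addressed by name (w3) or by position ($3).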
On 6/27/11 3:33 PM, "Bill Graham" <bi...@gmail.com> wrote:
> The problem you'll run into is that TOKENIZE creates a bag of 1-tuple words
> and bags are unordered.
> http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#TOKENIZE
>
> I've gotten around this by writing a tokenize UDF that produces a tuple of
> words instead.
Re: Tokenize in Pig
Posted by Bill Graham <bi...@gmail.com>.
The problem you'll run into is that TOKENIZE creates a bag of single-field
tuples (one per word), and bags are unordered, so there is no stable position
for $3 to refer to.
http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#TOKENIZE
I've gotten around this by writing a tokenize UDF that produces a tuple of
words instead.
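To illustrate the point about bags, here is a rough sketch of what still works without positions. Flattening the bag yields one row per word, which is enough for joining on the words themselves but not for picking out "the fourth word". The relation other, the file other.txt, and the field names word and info are made up for the example.

```pig
p = LOAD 'file.txt' AS (line:chararray);

-- TOKENIZE yields a bag of single-field tuples, e.g. {(Today),(is),(first),(day),...}.
-- FLATTEN un-nests the bag into one row per token; the position within the line is lost.
w = FOREACH p GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Hypothetical second data set keyed by a word.
other = LOAD 'other.txt' AS (word:chararray, info:chararray);

-- Matches any token of any line against other.word.
j = JOIN w BY word, other BY word;
```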