Posted to user@pig.apache.org by sonia gehlot <so...@gmail.com> on 2011/06/27 23:13:32 UTC

Tokenize in Pig

Hi All,

I want to tokenize a string and then do some processing on the tokens.

For example:

line = "Today is first day of the week and its Monday"

p = load 'file.txt' as (line: chararray);
t = foreach p generate TOKENIZE(line);

The result will be ({(Today),(is),(first),(day),(of),(the),(week),(and),(its),(Monday)})

Now I want to join t with some other data set using 'day'. How can I address
'day' by position, like $3?

I know I can store it, load it again with a defined schema, and then do the
processing.

But is it possible to do it without storing or loading?

Any ideas?

Thanks,
Sonia

Re: Tokenize in Pig

Posted by sonia gehlot <so...@gmail.com>.

I will write a UDF or use STRSPLIT.

Thanks Guys,

Sonia

On Mon, Jun 27, 2011 at 3:37 PM, Ramesh, Amit <am...@amazon.com> wrote:

>
> You could use STRSPLIT instead, which returns a tuple.
>

Re: Tokenize in Pig

Posted by "Ramesh, Amit" <am...@amazon.com>.

You could use STRSPLIT instead, which returns a tuple.
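
A minimal, untested sketch of the STRSPLIT approach, assuming tokens are separated by single spaces. The relation names s, k, other and j, the file 'other.txt' and its schema are hypothetical and only there to illustrate the join:

```pig
p = LOAD 'file.txt' AS (line:chararray);

-- STRSPLIT returns a tuple, so token order is preserved (unlike TOKENIZE's bag).
-- FLATTEN promotes the tuple's fields to top-level fields $0, $1, ...
s = FOREACH p GENERATE FLATTEN(STRSPLIT(line, ' '));

-- $3 is the fourth token ("day" for the sample line).
k = FOREACH s GENERATE (chararray) $3 AS word;

-- 'other.txt' and its schema are hypothetical, for illustration only.
other = LOAD 'other.txt' AS (word:chararray, info:chararray);
j = JOIN k BY word, other BY word;
```

Lines with fewer than four tokens should yield a null for $3, and null keys do not match in an inner join, so filtering them out before the join may be worthwhile.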


On 6/27/11 3:33 PM, "Bill Graham" <bi...@gmail.com> wrote:

> The problem you'll run into is that TOKENIZE creates a bag of 1-tuple words
> and bags are unordered.
> http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#TOKENIZE
> 
> I've gotten around this by writing a tokenize UDF that produces a tuple of
> words instead.

Re: Tokenize in Pig

Posted by Bill Graham <bi...@gmail.com>.

The problem you'll run into is that TOKENIZE creates a bag of 1-tuple words
and bags are unordered.
http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#TOKENIZE

I've gotten around this by writing a tokenize UDF that produces a tuple of
words instead.
