Posted to user@pig.apache.org by abh not <ab...@gmail.com> on 2011/06/27 09:49:38 UTC

extract string in Pig

Hi All,

I have a few sample log lines, for example:

   139.12.0.2 - - [10/Apr/2007:10:40:54 +0300] "GET /favicon.ico HTTP/1.1"
200 766 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.3)
Gecko/20061201 Firefox/2.0.0.3 (Ubuntu-feisty)"

If I load this file as a single string:

a = load '/user/sample/log.txt' using PigStorage('\t') as (text: chararray);

then how can I extract part of the string? For example, can I extract the
date '10/Apr/2007:10:40:54' from it using a Pig script?

Any help or suggestions are welcome.

Thanks in advance.

Meenal

Re: extract string in Pig

Posted by abh not <ab...@gmail.com>.
Sure.

Thanks for the help!

Meenal

On Mon, Jun 27, 2011 at 1:37 PM, Jonathan Coveney <jc...@gmail.com> wrote:

> Meenal,
>
> If you have the chance, I highly recommend looking up regular expressions.
> For a programmer, the time invested in learning them is repaid a thousandfold.

Re: extract string in Pig

Posted by Jonathan Coveney <jc...@gmail.com>.
Meenal,

If you have the chance, I highly recommend looking up regular expressions.
For a programmer, the time invested in learning them is repaid a thousandfold.

2011/6/27 abh not <ab...@gmail.com>

> Hi Jon,
>
> Thanks for the reply. REGEX_EXTRACT looks pretty useful, but unfortunately
> I am not that good at regex.
>
> Could you please give an example of a regex that would extract the
> date/time part here?
>
> Thanks again.
>
> Meenal

Re: extract string in Pig

Posted by abh not <ab...@gmail.com>.
Hi Jon,

Thanks for the reply. REGEX_EXTRACT looks pretty useful, but unfortunately
I am not that good at regex.

Could you please give an example of a regex that would extract the
date/time part here?

Thanks again.

Meenal

On Mon, Jun 27, 2011 at 5:53 AM, Jonathan Holloway <
jonathan.holloway@gmail.com> wrote:

> Take a look at:
>
> REGEX_EXTRACT -
> http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#REGEX_EXTRACT
>
> and REGEX_EXTRACT_ALL:
>
> http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#REGEX_EXTRACT_ALL
>
> You could also use SUBSTRING, but I think a regex would be more applicable
> here for date/time extraction.
>
> Cheers,
> Jon.

Re: extract string in Pig

Posted by Jonathan Holloway <jo...@gmail.com>.
Take a look at:

REGEX_EXTRACT -
http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#REGEX_EXTRACT

and REGEX_EXTRACT_ALL:

http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#REGEX_EXTRACT_ALL

You could also use SUBSTRING, but I think a regex would be more applicable
here for date/time extraction.
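
For illustration, here is a minimal sketch of the REGEX_EXTRACT approach
applied to the sample log line from the question. It is not a tested,
definitive answer. The relation names a and b and the field name ts are
hypothetical, and the load statement assumes the log lines contain no tab
characters, so each whole line lands in the single field text. The pattern is
written with character classes only, because backslashes inside a Pig string
literal have to be escaped a second time.

```pig
-- Load each log line as one chararray field (assumes no tabs in the lines).
a = load '/user/sample/log.txt' using PigStorage('\t') as (text: chararray);

-- Group 1 captures dd/Mon/yyyy:HH:mm:ss, e.g. 10/Apr/2007:10:40:54.
-- The third argument is the index of the capture group to return.
b = foreach a generate
        REGEX_EXTRACT(text,
            '([0-9]{2}/[A-Za-z]{3}/[0-9]{4}:[0-9]{2}:[0-9]{2}:[0-9]{2})',
            1) as ts;

dump b;
```

For the sample line this should give the timestamp without the surrounding
brackets or the +0300 offset. Lines that do not contain a matching timestamp
should come back as null, so it may be worth filtering those out afterwards.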

Cheers,
Jon.

On 27 June 2011 08:49, abh not <ab...@gmail.com> wrote:

> Hi All,
>
> I have a few sample log lines, for example:
>
>   139.12.0.2 - - [10/Apr/2007:10:40:54 +0300] "GET /favicon.ico HTTP/1.1"
> 200 766 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.3)
> Gecko/20061201 Firefox/2.0.0.3 (Ubuntu-feisty)"
>
> If I load this file as a single string:
>
> a = load '/user/sample/log.txt' using PigStorage('\t') as (text: chararray);
>
> then how can I extract part of the string? For example, can I extract the
> date '10/Apr/2007:10:40:54' from it using a Pig script?
>
> Any help or suggestions are welcome.
>
> Thanks in advance.
>
> Meenal
>