Posted to user@pig.apache.org by kartik manocha <ko...@gmail.com> on 2014/05/12 09:16:46 UTC

Query : Filtering out string from a field

Hi,

I am new to Pig and am having trouble extracting a string from a field.
The scenario is as follows.

- I am loading data with several fields, one of which is named
'test_data'.
- This field contains a lot of content. I want to extract from it the string
that starts with B75 and ends with a semicolon.
- After extracting this string, I want to add it as a new field to the
existing bag that was loaded.

I tried the INDEXOF UDF, but as far as I can tell it works for a single
character only. Even when I tried it with a single character, it returned ()
instead of an index. As a test, I passed the indexes manually to the
SUBSTRING UDF, and that did produce the string.

So I am unable to get the position using the INDEXOF UDF, or maybe there is
a better way of doing this.

If you have any pointers / suggestions, please share.

Thanks in advance.


Best,
Kartik

Re: Query : Filtering out string from a field

Posted by kartik manocha <ko...@gmail.com>.
Hi folks,

I was able to extract that string using an alternative approach. I'm sharing
it because it might be useful for anyone who runs into a similar issue and
can't upgrade for some reason.

When I used the statement below as written, I got a mismatch error saying
that the end of the line was expected at the semicolon.

B=foreach D generate REGEX_EXTRACT(test,'(B75.*;)',1);

So instead I used a nested foreach, and it worked.

B = foreach D {
    test1 = REGEX_EXTRACT(test,'(B75.*;)',1);
    test2 = REPLACE(test1,'\\u003B','');  -- to remove the semicolon at the end
    GENERATE test2;
}
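
For readers who want to see what these two steps compute, here is a minimal Java sketch. It assumes that Pig's REGEX_EXTRACT and REPLACE behave like java.util.regex matching and String.replaceAll, which is my understanding but is not confirmed in this thread. The class name, the method name and the sample field values are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the nested foreach above; this is not Pig code.
public class NestedForeachSketch {

    // Mimics test1 = REGEX_EXTRACT(...) followed by test2 = REPLACE(...).
    static String extractWithoutSemicolons(String field) {
        // Greedy match: runs from "B75" to the LAST semicolon in the field.
        Matcher m = Pattern.compile("(B75.*;)").matcher(field);
        if (!m.find()) {
            return null; // no match (assumed to mirror Pig returning null)
        }
        String test1 = m.group(1);
        // The pattern is the six characters backslash-u-0-0-3-B, which the
        // regex engine reads as U+003B, the semicolon. Every semicolon in
        // test1 is removed, not only the last one.
        return test1.replaceAll("\\u003B", "");
    }

    public static void main(String[] args) {
        // Made-up sample values for the field.
        System.out.println(extractWithoutSemicolons("A10=x;B75=hello;C20=y"));
        System.out.println(extractWithoutSemicolons("A10=x;B75=hello;C20=y;"));
    }
}
```

The first call prints B75=hello. The second prints B75=helloC20=y, because the greedy .* runs to the last semicolon and the replace step then strips every semicolon, so this approach only gives the intended result when no further semicolon follows the B75 entry.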



Cheers,

Kartik



Re: Query : Filtering out string from a field

Posted by kartik manocha <ko...@gmail.com>.
Thanks, it could be due to this bug, as I'm using 0.11.1.

Upgrading isn't a feasible option for me at the moment.

I will look into writing UDFs. By the way, thanks for the quick response.


Thanks,
Kartik



Re: Query : Filtering out string from a field

Posted by Pradeep Gollakota <pr...@gmail.com>.
Kartik,

Looks like you're facing this issue:
https://issues.apache.org/jira/browse/PIG-2507
What version of Pig are you using? The issue is fixed in 0.11.2 and 0.12.
So if you upgrade to these versions, your problem should go away.

If you're unable to upgrade for some reason, your best bet is to write a
custom UDF. But the general idea remains the same: write a regex to extract
the appropriate substring and project that from the UDF.
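
As a rough idea of the logic such a UDF might carry, here is a sketch in plain Java so that it runs without Pig on the classpath. In a real UDF this would presumably be called from the exec method of a class extending org.apache.pig.EvalFunc. The class name, the method name, the sample value and the exact pattern are assumptions made for illustration, not something taken from Kartik's data.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; in a real Pig UDF this logic would be called from
// exec(Tuple) of a class extending org.apache.pig.EvalFunc<String>.
public class B75Extractor {

    // "B75" followed by any run of characters that are not a semicolon,
    // and then the semicolon itself. Group 1 excludes the semicolon.
    private static final Pattern B75_FIELD = Pattern.compile("(B75[^;]*);");

    static String extract(String field) {
        if (field == null) {
            return null; // pass nulls through instead of throwing
        }
        Matcher m = B75_FIELD.matcher(field);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up sample value for the field.
        System.out.println(extract("A10=x;B75=hello;C20=y;"));
    }
}
```

This prints B75=hello. Because the semicolon lives in the Java source rather than in the Pig script, the Pig parser never has to deal with it.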


Unmesha,

Start a new thread with your question so we don't pollute this thread for
Kartik. Can you give some samples as well? I'm not sure I understood your
question.



Re: Query : Filtering out string from a field

Posted by kartik manocha <ko...@gmail.com>.
Pradeep,

Thanks for the pointers, but as I mentioned, I need to extract the string
up to the semicolon, and that is where I'm running into trouble.

I need the text before the semicolon, and that is the painful part: when I
put a semicolon in the regex, Pig treats it as the end of the statement and
produces an error.

Without the semicolon it works fine, but it returns everything from B75
onward, e.g.:
B=foreach D generate REGEX_EXTRACT(test,'(B75.*)',1);
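
The behaviour of the statement above comes down to regex greediness, which can be shown outside Pig. The following is a minimal Java sketch, assuming REGEX_EXTRACT follows java.util.regex semantics. The class name, the helper method and the sample field value are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the sample field value is made up.
public class GreedyVsReluctant {

    // Returns capture group 1 of the first match, or null if none.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String field = "A10=x;B75=hello;C20=y;";
        // Greedy, no terminator: everything from B75 to the end of the line.
        System.out.println(firstGroup("(B75.*)", field));
        // Reluctant, with the semicolon outside the group: stops at the
        // first semicolon after B75 and leaves it out of the result.
        System.out.println(firstGroup("(B75.*?);", field));
    }
}
```

The first line printed is B75=hello;C20=y; and the second is B75=hello. In a Pig script affected by the parsing problem, the literal semicolon in the second pattern would presumably still have to be written in an escaped form.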

Is there any way to include the semicolon in the regex above, so that it
returns only the string before it?


Thanks,
Kartik




Re: Query : Filtering out string from a field

Posted by Pradeep Gollakota <pr...@gmail.com>.
Check out
http://archive.cloudera.com/cdh/3/pig/piglatin_ref2.html#REGEX_EXTRACT

This may suit your needs



Re: Query : Filtering out string from a field

Posted by unmesha sreeveni <un...@gmail.com>.
Hi Kartik,

I am having a similar issue. In my case I want to filter a column instead of
a field and join it to another file at a specific index.


I think someone here can suggest a better solution.







-- 
*Thanks & Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/