You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Andraz Tori <an...@zemanta.com> on 2009/07/27 20:35:23 UTC

counting different regexes in a single pass

Hi, 
I am a beginner at Hive's SQL so I am sorry if this question is answered
somewhere else. I tried to find the answer in Wiki, but no luck.

I have a dataset in which one of the columns is text. I need to count
number of records that match certain regex on that column. There are a
number of different regexes that I need to count records for and it
seems that there should be a way to do it in one pass through the data
with hive. 

So how do I do it?

This doesn't seem to work (on Hive 0.3):
select sum(textcolumn LIKE "%%sometext%%") as CountA, sum(textcolumn
LIKE "%%othertext%%") as CountB from sometable;


Thank you for your help in advance!


[unrelated: We've written a simple AWS S3 log format deserializer for
Hive. It needs some polishing - if anyone needs it or wants to polish it
up for inclusion, let me know]

-- 
Andraz Tori, CTO
Zemanta Ltd, New York, London, Ljubljana
www.zemanta.com
mail: andraz@zemanta.com
tel: +386 41 515 767
twitter: andraz, skype: minmax_test

Re: counting different regexes in a single pass

Posted by Saurabh Nanda <sa...@gmail.com>.

I think you can do that with regex replace and GROUP BY. Something
like this- select replaced_col, count(1) from (select
regex_replace(original_col, '.*(text1|text2|text3).*', '$1') as
replaced_col from table)

On 7/28/09, Andraz Tori <an...@zemanta.com> wrote:
> Hi,
> I am a beginner at Hive's SQL so I am sorry if this question is answered
> somewhere else. I tried to find the answer in Wiki, but no luck.
>
> I have a dataset in which one of the columns is text. I need to count
> number of records that match certain regex on that column. There are a
> number of different regexes that I need to count records for and it
> seems that there should be a way to do it in one pass through the data
> with hive.
>
> So how do I do it?
>
> This doesn't seem to work (on Hive 0.3):
> select sum(textcolumn LIKE "%%sometext%%") as CountA, sum(textcolumn
> LIKE "%%othertext%%") as CountB from sometable;
>
>
> Thank you for your help in advance!
>
>
> [unrelated: We've written a simple AWS S3 log format deserializer for
> Hive. It needs some polishing - if anyone needs it or wants to polish it
> up for inclusion, let me know]
>
> --
> Andraz Tori, CTO
> Zemanta Ltd, New York, London, Ljubljana
> www.zemanta.com
> mail: andraz@zemanta.com
> tel: +386 41 515 767
> twitter: andraz, skype: minmax_test
>
>
>
>


-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com

Re: counting different regexes in a single pass

Posted by David Lerman <dl...@videoegg.com>.

Saraubh, seems like this approach would fail if a particular line might
match more than one regex, for example if you're looking for lines that
match "dog" or "good" and one line contains "dogood".

Andraz, have you tried something like:

select sum(if(textcolumn like "...", 1, 0)) as CountA, sum(if(textcolumn
like "...", 1, 0)) as CountB from sometable;

On 7/27/09 3:00 PM, "Saurabh Nanda" <sa...@gmail.com> wrote:

> I think you can do that with regex replace and GROUP BY. Something
> like this- select replaced_col, count(1) from (select
> regex_replace(original_col, '.*(text1|text2|text3).*', '$1') as
> replaced_col from table)  a group by a.replaced_col
> 
> On 7/28/09, Andraz Tori <an...@zemanta.com> wrote:
>> Hi,
>> I am a beginner at Hive's SQL so I am sorry if this question is answered
>> somewhere else. I tried to find the answer in Wiki, but no luck.
>> 
>> I have a dataset in which one of the columns is text. I need to count
>> number of records that match certain regex on that column. There are a
>> number of different regexes that I need to count records for and it
>> seems that there should be a way to do it in one pass through the data
>> with hive.
>> 
>> So how do I do it?
>> 
>> This doesn't seem to work (on Hive 0.3):
>> select sum(textcolumn LIKE "%%sometext%%") as CountA, sum(textcolumn
>> LIKE "%%othertext%%") as CountB from sometable;
>> 
>> 
>> Thank you for your help in advance!
>> 
>> 
>> [unrelated: We've written a simple AWS S3 log format deserializer for
>> Hive. It needs some polishing - if anyone needs it or wants to polish it
>> up for inclusion, let me know]
>> 
>> --
>> Andraz Tori, CTO
>> Zemanta Ltd, New York, London, Ljubljana
>> www.zemanta.com
>> mail: andraz@zemanta.com
>> tel: +386 41 515 767
>> twitter: andraz, skype: minmax_test
>> 
>> 
>> 
>> 
> 
> 
> --
> http://nandz.blogspot.com
> http://foodieforlife.blogspot.com

Re: counting different regexes in a single pass

Posted by Saurabh Nanda <sa...@gmail.com>.

I think you can do that with regex replace and GROUP BY. Something
like this- select replaced_col, count(1) from (select
regex_replace(original_col, '.*(text1|text2|text3).*', '$1') as
replaced_col from table)  a group by a.replaced_col

On 7/28/09, Andraz Tori <an...@zemanta.com> wrote:
> Hi,
> I am a beginner at Hive's SQL so I am sorry if this question is answered
> somewhere else. I tried to find the answer in Wiki, but no luck.
>
> I have a dataset in which one of the columns is text. I need to count
> number of records that match certain regex on that column. There are a
> number of different regexes that I need to count records for and it
> seems that there should be a way to do it in one pass through the data
> with hive.
>
> So how do I do it?
>
> This doesn't seem to work (on Hive 0.3):
> select sum(textcolumn LIKE "%%sometext%%") as CountA, sum(textcolumn
> LIKE "%%othertext%%") as CountB from sometable;
>
>
> Thank you for your help in advance!
>
>
> [unrelated: We've written a simple AWS S3 log format deserializer for
> Hive. It needs some polishing - if anyone needs it or wants to polish it
> up for inclusion, let me know]
>
> --
> Andraz Tori, CTO
> Zemanta Ltd, New York, London, Ljubljana
> www.zemanta.com
> mail: andraz@zemanta.com
> tel: +386 41 515 767
> twitter: andraz, skype: minmax_test
>
>
>
>


-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com

Re: counting different regexes in a single pass

Posted by Zheng Shao <zs...@gmail.com>.

Hi Andraz,

I just opened a JIRA for AWS S3 log format.
Can you attach a patch file to: https://issues.apache.org/jira/browse/HIVE-693 ?

For your question, I think the approach suggested by David Lerman
should work fine.

Thanks,
Zheng

On Mon, Jul 27, 2009 at 11:35 AM, Andraz Tori<an...@zemanta.com> wrote:
> Hi,
> I am a beginner at Hive's SQL so I am sorry if this question is answered
> somewhere else. I tried to find the answer in Wiki, but no luck.
>
> I have a dataset in which one of the columns is text. I need to count
> number of records that match certain regex on that column. There are a
> number of different regexes that I need to count records for and it
> seems that there should be a way to do it in one pass through the data
> with hive.
>
> So how do I do it?
>
> This doesn't seem to work (on Hive 0.3):
> select sum(textcolumn LIKE "%%sometext%%") as CountA, sum(textcolumn
> LIKE "%%othertext%%") as CountB from sometable;
>
>
> Thank you for your help in advance!
>
>
> [unrelated: We've written a simple AWS S3 log format deserializer for
> Hive. It needs some polishing - if anyone needs it or wants to polish it
> up for inclusion, let me know]
>
> --
> Andraz Tori, CTO
> Zemanta Ltd, New York, London, Ljubljana
> www.zemanta.com
> mail: andraz@zemanta.com
> tel: +386 41 515 767
> twitter: andraz, skype: minmax_test
>
>
>
>



-- 
Yours,
Zheng