Posted to user@hive.apache.org by Adam Phelps <am...@opendns.com> on 2011/05/23 22:03:48 UTC

Hive question, summing second-level domain names

(As an FYI, I'm relatively new to Hive and have no previous SQL 
experience, so I've been struggling a bit with the Language Manual, 
which seems to assume prior SQL experience.)

Suppose I have a table with a column that contains domain names 
(e.g. hadoop.apache.org).  I want to produce a count per second-level 
domain, i.e. hadoop.apache.org and hive.apache.org would count in the 
same bucket.

Now I could count things for a particular second-level domain like this:

SELECT
   year, month, day, hour, COUNT(1) as count
FROM
   domainlog
WHERE
   year = 2011 AND
   month = 05 AND
   day = 15 AND
   (
     domain RLIKE ".*[.]apache[.]org"
   )
GROUP BY
   year, month, day, hour

However, I'm not seeing a way to sum over all second-level domains 
rather than just a particular one.  I basically want to group everything 
using a regular expression along the lines of ".*[.][^.]*[.][^.]*" and 
then output one line with a count for each common portion.  Any pointers 
in the right direction would be welcome.

Thanks
- Adam

Re: Hive question, summing second-level domain names

Posted by Adam Phelps <am...@opendns.com>.
It does seem like regexp_extract() is approximately what I need, but I 
can't quite get its usage right for what I'm attempting to do.

As an example, here's an attempt where I narrow things down to only 
subdomains of facebook.com:

SELECT
   regexp_extract(qname, '^((.*[.])|(.{0}))[^.]+\\.[^.]+\\.', 1),
   COUNT(1) as count
FROM
   querylog
WHERE
   year = 2011 AND
   month = 05 AND
   day = 15 AND
   hour = 00 AND
   qname RLIKE "^((.*[.])|(.{0}))facebook[.]com[.]$"
GROUP BY
   regexp_extract(qname, '^((.*[.])|(.{0}))[^.]+\\.[^.]+\\.', 1)

What I want to output in this case is something like
facebook.com  XXXXXX

What I actually get goes like this:

0-101.channel. 52
0-105.channel. 50
0-109.channel. 70
0-11.channel. 96
0-112.channel. 52
0-116.channel. 53
0-123.channel. 48

This is actually producing a count for every subdomain of facebook.com, 
rather than rolling them up into the second-level domain itself.  I tried 
different int values for the index argument of regexp_extract() but 
didn't find one that did what I need.
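
(One variant that might do it, as an untested sketch only: capture the
last two dot-separated labels in group 1 and group on that.  The table,
partition columns, and RLIKE filter are copied from the query above;
the trailing dot in the qname data is made optional.)

-- group 1 captures the last two labels, e.g. facebook.com
SELECT
   regexp_extract(qname, '([^.]+\\.[^.]+)\\.?$', 1) AS sld,
   COUNT(1) as count
FROM
   querylog
WHERE
   year = 2011 AND
   month = 05 AND
   day = 15 AND
   hour = 00 AND
   qname RLIKE "^((.*[.])|(.{0}))facebook[.]com[.]$"
GROUP BY
   regexp_extract(qname, '([^.]+\\.[^.]+)\\.?$', 1)

Dropping the RLIKE line should then give one row per second-level domain.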

Thanks
- Adam


On 5/23/11 1:11 PM, Loren Siebert wrote:
> Use regexp_extract() and group on that.
> 	http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF#String_Functions
>
> For example, to get the class C IP block from an IP address in a column called 'host':
>
>      regexp_extract(host, '(.*)\\.\\d{1,3}', 1) classc


Re: Hive question, summing second-level domain names

Posted by Loren Siebert <lo...@siebert.org>.
Use regexp_extract() and group on that.
	http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF#String_Functions

For example, to get the class C IP block from an IP address in a column called 'host':

    regexp_extract(host, '(.*)\\.\\d{1,3}', 1) classc
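
An expression like this can be grouped on directly.  A minimal sketch,
assuming a hypothetical table named weblog with a host column:

    SELECT
       regexp_extract(host, '(.*)\\.\\d{1,3}', 1) AS classc,
       COUNT(1) as count
    FROM
       weblog
    GROUP BY
       regexp_extract(host, '(.*)\\.\\d{1,3}', 1)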

   


RE: hive storing a byte array

Posted by Steven Wong <sw...@netflix.com>.
I claim no experience in storing blobs in Hive, but it sounds to me like the array/list approach would be quite inefficient, in terms of both size and run time.



Re: hive storing a byte array

Posted by Luke Forehand <lu...@networkedinsights.com>.
Steven,

Thanks for your reply!  I have written it the way you mentioned, based on
an earlier post in this mailing list.  I'm concerned about having to
base64-encode/decode the data, though, and I'm wondering how much this
will impact my job's run time.

I have also written a UDF that emits a byte array, stored in a field of
type array<tinyint>.  When reading this field, the ObjectInspector is a
ListObjectInspector with a primitive Java byte ObjectInspector for the
list elements.  Reading the field in the UDF feels clunky because I have
to iterate over the list, copying each element into a byte array, before
I can use it.
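
(For concreteness, the array<tinyint> layout would look roughly like this
in DDL; the table and column names are hypothetical:)

-- One TINYINT element per byte; a UDF reads the column through a
-- ListObjectInspector, as described above.
CREATE TABLE blobs_as_bytes (
  id      STRING,
  payload ARRAY<TINYINT>
);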

Given the two approaches, which one do you think has the lower
performance overhead?

Thanks,
Luke



RE: hive storing a byte array

Posted by Steven Wong <sw...@netflix.com>.
Hive does not support a blob data type. An option is to store your binary data encoded as a string (for example with base64) and define the column in Hive as a string.
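
A minimal sketch of that layout (table and column names are hypothetical;
the base64 encoding itself is done by whatever writes the data, or by a
custom UDF; later Hive releases added a BINARY type and base64()/unbase64()
built-ins, but those are not assumed here):

-- Sketch only; names are hypothetical.  The payload column holds base64 text.
CREATE TABLE blob_table (
  id      STRING,
  payload STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- A consuming UDF would decode the payload string back to bytes, e.g.
--   SELECT my_decode_udf(payload) FROM blob_table;
-- where my_decode_udf is hypothetical.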




hive storing a byte array

Posted by Luke Forehand <lu...@networkedinsights.com>.
Hello,

Can someone please provide an example in Hive, how I can store a
serialized object in a field?  A field type of byte array or binary or
blob is really what I was looking for, but if something slightly less
trivial is involved some instruction would be much appreciated.  This
object is used in a custom UDF later on in the processing pipeline.

-Luke