Posted to user@hive.apache.org by Adam Phelps <am...@opendns.com> on 2011/05/23 22:03:48 UTC
Hive question, summing second-level domain names
(As an FYI, I'm relatively new to Hive and have no prior SQL
experience, so I've been struggling a bit with the Language Manual,
which seems to assume prior SQL knowledge.)
Suppose I have a table with a column that contains domain names
(e.g. hadoop.apache.org). I want to count rows by second-level
domain, so that hadoop.apache.org and hive.apache.org would count
in the same bucket.
Now I could count things for a particular second-level domain like this:
SELECT
  year, month, day, hour, COUNT(1) AS count
FROM
  domainlog
WHERE
  year = 2011 AND
  month = 05 AND
  day = 15 AND
  (
    domain RLIKE ".*[.]apache[.]org"
  )
GROUP BY
  year, month, day, hour
However, I'm not seeing a way to sum over all second-level domains
rather than a particular one. I basically want to group everything
using a regular expression along the lines of ".*[.][^.]*[.][^.]*"
and then output one line with a count per common portion. Any
pointers in the right direction would be welcome.
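In case it helps a later reader of the archive: the grouping idea can be sketched outside Hive first. Here is a minimal Python sketch (the sample domains are just illustrations) of a regex that keeps only the last two labels of a name, which is the key you would group on:

```python
import re

# Capture the last two dot-separated labels of a domain name,
# e.g. "hadoop.apache.org" -> "apache.org".
SLD_PATTERN = re.compile(r'([^.]+\.[^.]+)$')

def second_level_domain(name):
    m = SLD_PATTERN.search(name)
    return m.group(1) if m else None

# Both hosts fall into the same bucket:
print(second_level_domain('hadoop.apache.org'))  # apache.org
print(second_level_domain('hive.apache.org'))    # apache.org
```

Since Hive's regexp_extract() uses Java regexes, which treat this pattern the same way, the same expression could plausibly serve as both the SELECT expression and the GROUP BY key, e.g. regexp_extract(domain, '([^.]+\\.[^.]+)$', 1).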
Thanks
- Adam
Re: Hive question, summing second-level domain names
Posted by Adam Phelps <am...@opendns.com>.
It does seem like that function is approximately what I need, but I
can't quite get the usage right for what I'm attempting. As an
example, here's an attempt where I narrow it down to only
subdomains of facebook.com:
SELECT
  regexp_extract(qname, '^((.*[.])|(.{0}))[^.]+\\.[^.]+\\.', 1),
  COUNT(1) AS count
FROM
  querylog
WHERE
  year = 2011 AND
  month = 05 AND
  day = 15 AND
  hour = 00 AND
  qname RLIKE "^((.*[.])|(.{0}))facebook[.]com[.]$"
GROUP BY
  regexp_extract(qname, '^((.*[.])|(.{0}))[^.]+\\.[^.]+\\.', 1)
What I want to output in this case is something like
facebook.com XXXXXX
What I actually get goes like this:
0-101.channel. 52
0-105.channel. 50
0-109.channel. 70
0-11.channel. 96
0-112.channel. 52
0-116.channel. 53
0-123.channel. 48
That is actually counting every subdomain of facebook.com
separately, rather than the second-level domain itself. I tried
different integer values for the group-index (third) argument of
regexp_extract() but didn't find one that did what I need.
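For what it's worth, the behavior described above is consistent with how capturing groups are numbered: in the pattern, group 1 is the leading subdomain part, and the second-level-domain portion ([^.]+\.[^.]+\.) is not inside any capturing group, so no index will return it. A small Python sketch (Java's regex engine, which Hive uses, numbers groups the same way; the sample qname is made up) with that portion wrapped in its own group:

```python
import re

# Same shape as the pattern in the query above, but with the
# second-level-domain portion wrapped in its own capturing group.
# Groups: 1 = ((.*[.])|(.{0})), 2 and 3 = its alternatives,
#         4 = the second-level-domain part.
pat = re.compile(r'^((.*[.])|(.{0}))([^.]+\.[^.]+\.)$')

m = pat.match('0-11.channel.facebook.com.')
print(m.group(1))  # 0-11.channel.  <- what the original query extracted
print(m.group(4))  # facebook.com.  <- the part actually wanted
```

So, if this analysis is right, adding parentheses around [^.]+\\.[^.]+\\. in the query and passing the corresponding group index (4, with the grouping above) to regexp_extract() should yield facebook.com. for every matching row.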
Thanks
- Adam
Re: Hive question, summing second-level domain names
Posted by Loren Siebert <lo...@siebert.org>.
Use regexp_extract() and group on that.
http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF#String_Functions
For example, to get the class C IP block from an IP address in a column called 'host':
regexp_extract(host, '(.*)\\.\\d{1,3}', 1) classc
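To make that concrete, here is the same extraction tried in Python (Hive's regexp_extract() uses Java regexes, which handle this pattern the same way; the address below is made up):

```python
import re

# Strip the final octet to get the class C block, mirroring
# regexp_extract(host, '(.*)\.\d{1,3}', 1): the greedy (.*)
# backtracks to the last dot, leaving the final octet outside group 1.
m = re.search(r'(.*)\.\d{1,3}', '192.168.1.42')
print(m.group(1))  # 192.168.1
```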
RE: hive storing a byte array
Posted by Steven Wong <sw...@netflix.com>.
I claim no experience with storing blobs in Hive, but it sounds to me like using an array/list will be quite inefficient, in terms of both size and run time.
Re: hive storing a byte array
Posted by Luke Forehand <lu...@networkedinsights.com>.
Steven,
Thanks for your reply! I have written it the way you mentioned, based
on an earlier post to this mailing list. I'm concerned about having
to encode/decode the string as base64, and I'm wondering how much
this will impact my job's run time.
I have also written a UDF that emits a byte array, stored in a field
of type array<tinyint>. When reading this field, the ObjectInspector
is a ListObjectInspector with primitive Java byte elements. Reading
the field in the UDF feels clunky because I have to iterate over the
list, copying each element into a byte array, before I can use it.
Given both approaches, which one do you think has the least performance
overhead?
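For a rough sense of the base64 overhead specifically, a small sketch (plain Python, sample payload made up):

```python
import base64

payload = bytes(range(256)) * 4       # 1024 raw bytes
encoded = base64.b64encode(payload)   # what would be stored as a Hive string

print(len(payload))   # 1024
print(len(encoded))   # 1368 -- roughly 4/3 the raw size
assert base64.b64decode(encoded) == payload  # lossless round trip
```

So base64 costs about 33% extra storage plus an encode/decode pass per row. Whether that beats array<tinyint> would need measuring on real data, but the per-element boxing and iteration overhead of a list representation is typically well above 33%.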
Thanks,
Luke
RE: hive storing a byte array
Posted by Steven Wong <sw...@netflix.com>.
Hive does not support a blob data type. An option is to store your binary data encoded as a string (e.g. base64) and define the column in Hive as a string.
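That store-as-string approach can be sketched like this (plain Python standing in for the serialization and UDF sides; the object below is a made-up stand-in for whatever gets serialized):

```python
import base64
import pickle

obj = {'model': 'example', 'weights': [0.1, 0.2, 0.3]}  # hypothetical object

# Write side: serialize, then base64-encode so the stored value
# is a plain string suitable for a Hive STRING column.
cell = base64.b64encode(pickle.dumps(obj)).decode('ascii')

# Read side (e.g. inside a UDF): decode the string back to bytes,
# then deserialize to recover the original object.
restored = pickle.loads(base64.b64decode(cell))
assert restored == obj
```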
hive storing a byte array
Posted by Luke Forehand <lu...@networkedinsights.com>.
Hello,
Can someone please provide an example of how to store a serialized
object in a Hive field? A field type of byte array, binary, or blob
is really what I was looking for, but if something less trivial is
involved, some instruction would be much appreciated. This object is
used by a custom UDF later in the processing pipeline.
-Luke