You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by bichonfrise74 <bi...@gmail.com> on 2011/03/18 23:40:22 UTC
Apache Web Log Question
Hi,
I am trying to use this:
add jar ../build/contrib/hive_contrib.jar;
CREATE TABLE apachelog (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^
\"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^
\"]*|\".*\"))?",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
And it works great. My problem is how query the table with respect to the
time column since it still has this format: [01/Mar/2011:13:01:10 -0700]
So, I do not know how to execute these kinds of queries:
1. select time, count(*) from weblog where time < '01/Mar/2010';
2. select hours, count(*) from weblog group by hours;
Re: Apache Web Log Question
Posted by Loren Siebert <lo...@siebert.org>.
Ty this:
hive > select time, to_date(from_unixtime(unix_timestamp(regexp_extract(time, '^.([^:]*):.*', 1),"dd/MMM/yyyy"))) yyyymmdd from apachelog limit 10;
You can do the same sort of thing to get the hhmmss into a format that is more lexicographically friendly.
There’s more info on these functions here: http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF
-Loren
On Mar 18, 2011, at 3:40 PM, bichonfrise74 wrote:
> Hi,
>
> I am trying to use this:
>
> add jar ../build/contrib/hive_contrib.jar;
>
> CREATE TABLE apachelog (
> host STRING,
> identity STRING,
> user STRING,
> time STRING,
> request STRING,
> status STRING,
> size STRING,
> referer STRING,
> agent STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> WITH SERDEPROPERTIES (
> "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
> "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
> )
> STORED AS TEXTFILE;
>
> And it works great. My problem is how query the table with respect to the time column since it still has this format: [01/Mar/2011:13:01:10 -0700]
>
> So, I do not know how to execute these kinds of queries:
>
> 1. select time, count(*) from weblog where time < '01/Mar/2010';
> 2. select hours, count(*) from weblog group by hours;