You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by bichonfrise74 <bi...@gmail.com> on 2011/03/18 23:40:22 UTC

Apache Web Log Question

Hi,

I am trying to use this:

add jar ../build/contrib/hive_contrib.jar;

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^
\"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^
\"]*|\".*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;


And it works great. My problem is how query the table with respect to the
time column since it still has this format: [01/Mar/2011:13:01:10 -0700]

So, I do not know how to execute these kinds of queries:

1. select time, count(*) from weblog where time < '01/Mar/2010';
2. select hours, count(*) from weblog group by hours;

Re: Apache Web Log Question

Posted by Loren Siebert <lo...@siebert.org>.

Ty this:
hive > select time, to_date(from_unixtime(unix_timestamp(regexp_extract(time, '^.([^:]*):.*', 1),"dd/MMM/yyyy"))) yyyymmdd from apachelog limit 10; 

You can do the same sort of thing to get the hhmmss into a format that is more lexicographically friendly.

There’s more info on these functions here: http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF

-Loren

On Mar 18, 2011, at 3:40 PM, bichonfrise74 wrote:

> Hi, 
> 
> I am trying to use this:
> 
> add jar ../build/contrib/hive_contrib.jar;
> 
> CREATE TABLE apachelog (
>   host STRING,
>   identity STRING,
>   user STRING,
>   time STRING,
>   request STRING,
>   status STRING,
>   size STRING,
>   referer STRING,
>   agent STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> WITH SERDEPROPERTIES (
>   "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
>   "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
> )
> STORED AS TEXTFILE;
> 
> And it works great. My problem is how query the table with respect to the time column since it still has this format: [01/Mar/2011:13:01:10 -0700]  
> 
> So, I do not know how to execute these kinds of queries:
> 
> 1. select time, count(*) from weblog where time < '01/Mar/2010';
> 2. select hours, count(*) from weblog group by hours;