Posted to user@couchdb.apache.org by Seth Falcon <se...@userprimary.net> on 2009/06/17 19:34:40 UTC
using couchdb for access log analysis
Hi all,
I've been exploring using couchdb to store aggregated access log
data. I would really appreciate a bit of feedback on the approach
I've taken.
First, the problem I'm trying to solve. The input data consists of web
server access logs. I want to generate a report of how many times
each page was viewed over different time windows. For example, I'd
like to be able to request the report for the day 2009-06-16 as well
as a particular hour in a given day.
After reading the wiki page about stats aggregation, I started by
pre-reducing into one-minute chunks. Each minute's worth of log
data results in a list of pages and view counts over that minute. For
each page/count pair I insert a document like the following into
couchdb:
{
  "_id" : "2009-06-16T13h30_abcdefg",
  "url" : "/foo/bar/baz",
  "view_count" : 13
}
The ID is the timestamp representing the minute chunk prepended to the
md5 digest of the url.
One idea for a view that will allow querying different time units is
to emit keys like the following for the above example doc:
["hour", "2009-06-16T13", "abcdefg"]
["day", "2009-06-16", "abcdefg"]
["month", "2009-06", "abcdefg"]
Then one can query with startkey=["hour", "2009-06-16T13", true] and
endkey=["hour", "2009-06-16T13", {}] to get a particular hour using
group=true and a reduce function that sums the view_count.
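As a sketch of that view (assuming the minute stamp and digest are recovered by splitting the _id, and naming the functions map/reduce just for illustration; CouchDB itself supplies emit()):

```javascript
// Map: emit one row per time granularity, keyed by
// [unit, truncated timestamp, url digest].
function map(doc) {
  var parts = doc._id.split("_");  // "<minute stamp>_<md5 of url>"
  var stamp = parts[0];            // e.g. "2009-06-16T13h30"
  var digest = parts[1];           // e.g. "abcdefg"
  emit(["hour",  stamp.substring(0, 13), digest], doc.view_count);
  emit(["day",   stamp.substring(0, 10), digest], doc.view_count);
  emit(["month", stamp.substring(0, 7),  digest], doc.view_count);
}

// Reduce: sum the per-minute view counts; the same code serves
// rereduce, since a sum of partial sums is the total sum.
function reduce(keys, values, rereduce) {
  var total = 0;
  for (var i = 0; i < values.length; i++) total += values[i];
  return total;
}
```

With group=true, each distinct [unit, stamp, digest] key then collapses to one row holding the summed count.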
Does this seem like a reasonable approach? Would it be better to
create separate views for hour, day, and month and avoid the
array-valued keys? In a previous post, I goofed up in understanding
how startkey/endkey queries work. Am I making a similar error in
thinking with the above array-valued key approach? I'm thinking this
is a different case because the time unit is always an exact match in
startkey and endkey.
Anyhow, I'd really appreciate any suggestions for improvement or
confirmation that this should kinda-sorta work.
Cheers,
+ seth
Re: using couchdb for access log analysis
Posted by Blair Nilsson <bl...@gmail.com>.
We are doing something pretty much the same here: aggregating data for
a bunch of solar water heating units (we are just after the max
values, but you can use the approach for pretty much anything).
Overall it works just fine :)
Here are the maps and reduces; you may want to adapt them or write your own.
The map:
function(doc) {
  // Zero-pad a number to two digits.
  var pad = function(n) { return ((n < 10) ? "0" : "") + n; };
  // Format a Date as "YYYY/MM/DD HH:MM:SS".
  var displayDate = function(date) {
    return date.getFullYear() + "/" + pad(date.getMonth() + 1) + "/" +
           pad(date.getDate()) + " " + pad(date.getHours()) + ":" +
           pad(date.getMinutes()) + ":" + pad(date.getSeconds());
  };
  if (doc.type === "Solar Reading") {
    var date = new Date(doc.date);
    // Emit the same reading at coarser and coarser granularities:
    // 1 = minute, 2 = ten minutes, 3 = hour, 4 = day, 5 = month.
    date.setSeconds(0);
    emit([doc.site, 1, displayDate(date)], [doc.panel, doc.inlet, doc.outlet]);
    date.setMinutes(Math.floor(date.getMinutes() / 10) * 10);
    emit([doc.site, 2, displayDate(date)], [doc.panel, doc.inlet, doc.outlet]);
    date.setMinutes(0);
    emit([doc.site, 3, displayDate(date)], [doc.panel, doc.inlet, doc.outlet]);
    date.setHours(0);
    emit([doc.site, 4, displayDate(date)], [doc.panel, doc.inlet, doc.outlet]);
    date.setDate(1);
    emit([doc.site, 5, displayDate(date)], [doc.panel, doc.inlet, doc.outlet]);
  }
}
The reduce. Note that you end up in rereduce land pretty quickly, as
we found; happily, this function handles rereduce just fine, because
its result has the same [panel, inlet, outlet] shape as its inputs.
function(keys, values, rereduce) {
  // Element-wise max over [panel, inlet, outlet] triples. The same
  // code works for both the reduce and rereduce passes.
  var panel = values[0][0];
  var inlet = values[0][1];
  var outlet = values[0][2];
  for (var i = 1; i < values.length; i++) {
    if (values[i][0] > panel) panel = values[i][0];
    if (values[i][1] > inlet) inlet = values[i][1];
    if (values[i][2] > outlet) outlet = values[i][2];
  }
  return [panel, inlet, outlet];
}
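Since element-wise max is associative, rereducing partial maxima gives the same answer as reducing the raw readings in one pass. A standalone check of that logic (the readings are made up):

```javascript
// Element-wise max over [panel, inlet, outlet] triples, mirroring
// the reduce above.
function maxTriple(values) {
  var out = values[0].slice();
  for (var i = 1; i < values.length; i++) {
    for (var j = 0; j < 3; j++) {
      if (values[i][j] > out[j]) out[j] = values[i][j];
    }
  }
  return out;
}

var readings = [[60, 20, 45], [72, 18, 50], [65, 25, 40]];
// Reducing all readings at once and rereducing chunked partial
// results both yield [72, 25, 50].
```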
I'm not sure if what we are doing is better or worse, but that is the
approach we are using.