Posted to user@couchdb.apache.org by Seth Falcon <se...@userprimary.net> on 2009/06/17 19:34:40 UTC

using couchdb for access log analysis

Hi all,

I've been exploring using couchdb to store aggregated access log
data.  I would really appreciate a bit of feedback on the approach
I've taken.

First, the problem I'm trying to solve.  The input data consists of web
server access logs.  I want to generate a report of how many times
each page was viewed over different time windows.  For example, I'd
like to be able to request the report for the day 2009-06-16 as well
as a particular hour in a given day.

After reading the wiki page about stats aggregation, I started by
pre-reducing into one-minute chunks.  Each minute's worth of log
data results in a list of pages and view counts over that minute.  For
each page/count pair I insert a document like the following into
couchdb:

     { 
        "_id" : "2009-06-16T13h30_abcdefg",
        "url" : "/foo/bar/baz",
        "view_count" : 13
     }

The ID is the timestamp representing the minute chunk prepended to the
md5 digest of the url.

One idea for a view that will allow querying different time units is
to emit keys like the following for the above example doc:

    ["hour",  "2009-06-16T13", "abcdefg"]
    ["day",   "2009-06-16",    "abcdefg"]
    ["month", "2009-06",       "abcdefg"]

Then one can query with startkey=["hour", "2009-06-16T13", true] and
endkey=["hour", "2009-06-16T13", {}] to get a particular hour using
group=true and a reduce function that sums the view_count.
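
A minimal sketch of the view described above, not a tested design.  It
assumes the _id always has the form <minute timestamp>_<digest>, as in
the example doc.  The names mapPageViews and reducePageViews and the
emit stub are hypothetical and exist only so the snippet runs outside
CouchDB; in a design document the functions would be anonymous and
emit is supplied by the view server.

```javascript
// Stand-in for CouchDB's emit(), collecting rows so the sketch runs standalone.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Map: one row per time unit, keyed [unit, period, digest].
function mapPageViews(doc) {
  // Assumes _id looks like "2009-06-16T13h30_abcdefg".
  var sep = doc._id.indexOf("_");
  if (sep < 0 || typeof doc.view_count !== "number") {
    return;
  }
  var stamp = doc._id.slice(0, sep);    // "2009-06-16T13h30"
  var digest = doc._id.slice(sep + 1);  // "abcdefg"
  emit(["hour", stamp.slice(0, 13), digest], doc.view_count);
  emit(["day", stamp.slice(0, 10), digest], doc.view_count);
  emit(["month", stamp.slice(0, 7), digest], doc.view_count);
}

// Reduce: sum the counts. Partial sums are summed the same way, so the
// rereduce pass needs no special handling.
function reducePageViews(keys, values, rereduce) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i];
  }
  return total;
}

// Two one-minute docs for the same page within the same hour.
mapPageViews({ _id: "2009-06-16T13h30_abcdefg", url: "/foo/bar/baz", view_count: 13 });
mapPageViews({ _id: "2009-06-16T13h31_abcdefg", url: "/foo/bar/baz", view_count: 7 });
```

With group=true, the two docs above share the same "hour" key, so the
reduce would return a single row for that key with the summed count.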

Does this seem like a reasonable approach?  Would it be better to
create separate views for hour, day, and month and avoid the
array-valued keys?  In a previous post, I goofed up in understanding
how startkey/endkey queries work.  Am I making a similar error in
thinking with the above array-valued key approach?  I'm thinking this
is a different case because the time unit is always an exact match in
startkey and endkey.
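
The range endpoints rely on CouchDB ordering view keys by type: null,
false, true, numbers, strings, arrays, objects.  Any string digest
therefore sorts after true and before {}.  The sketch below models only
that type ordering.  typeRank, compareKeys, and inRange are hypothetical
helpers, and strings are compared with plain < whereas CouchDB uses ICU
collation, so treat it as an illustration rather than a
reimplementation.

```javascript
// Rank of each JSON type in CouchDB's view collation order.
function typeRank(v) {
  if (v === null) return 0;
  if (v === false) return 1;
  if (v === true) return 2;
  if (typeof v === "number") return 3;
  if (typeof v === "string") return 4;
  if (Array.isArray(v)) return 5;
  return 6; // objects
}

// Simplified comparison: type rank first, then value; arrays element-wise.
function compareKeys(a, b) {
  var ra = typeRank(a), rb = typeRank(b);
  if (ra !== rb) return ra - rb;
  if (ra === 5) {
    for (var i = 0; i < Math.min(a.length, b.length); i++) {
      var c = compareKeys(a[i], b[i]);
      if (c !== 0) return c;
    }
    return a.length - b.length;
  }
  if (ra === 3 || ra === 4) return a < b ? -1 : (a > b ? 1 : 0);
  return 0; // same-rank null/booleans; objects are not compared further here
}

// True when startkey <= key <= endkey under the simplified ordering.
function inRange(key, startkey, endkey) {
  return compareKeys(startkey, key) <= 0 && compareKeys(key, endkey) <= 0;
}

var start = ["hour", "2009-06-16T13", true];
var end = ["hour", "2009-06-16T13", {}];
var sameHour = inRange(["hour", "2009-06-16T13", "abcdefg"], start, end);
var otherHour = inRange(["hour", "2009-06-16T14", "abcdefg"], start, end);
var dayKey = inRange(["day", "2009-06-16", "abcdefg"], start, end);
```

Under this model only the key for the requested hour falls inside the
range; the key for the next hour and the "day" key both fall outside it.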

Anyhow, I'd really appreciate any suggestions for improvement or
confirmation that this should kinda-sorta work.

Cheers,

+ seth




Re: using couchdb for access log analysis

Posted by Blair Nilsson <bl...@gmail.com>.
We are doing pretty much the same thing here: aggregating data for a
bunch of solar water heating units (we are just after the max values,
but you can use it for pretty much anything).
Overall it works just fine :)

Here are the map and reduce functions; you may want to adapt them or
write your own.

The map:

function(doc) {
  // Format a Date as "YYYY/MM/DD hh:mm:ss", zero-padded so keys sort by time.
  var displayDate = function(date) {
       var year = date.getFullYear();
       var month = date.getMonth() + 1;
       month = ((month < 10) ? "0" : "") + month;
       var day = date.getDate();
       day = ((day < 10) ? "0" : "") + day;
       var hours = date.getHours();
       hours = ((hours < 10) ? "0" : "") + hours;
       var minutes = date.getMinutes();
       minutes = ((minutes < 10) ? "0" : "") + minutes;
       var seconds = date.getSeconds();
       seconds = ((seconds < 10) ? "0" : "") + seconds;
       return year + "/" + month + "/" + day + " " + hours + ":" + minutes + ":" + seconds;
  };

  // Only index solar readings.
  if (doc.type === "Solar Reading") {
       var date = new Date(doc.date);
       // Keys are [site, level, period start]; each level truncates the date further.
       date.setSeconds(0);                                        // level 1: minute
       emit([doc.site, 1, displayDate(date)], [doc.panel, doc.inlet, doc.outlet]);
       date.setMinutes(Math.floor(date.getMinutes() / 10) * 10);  // level 2: ten minutes
       emit([doc.site, 2, displayDate(date)], [doc.panel, doc.inlet, doc.outlet]);
       date.setMinutes(0);                                        // level 3: hour
       emit([doc.site, 3, displayDate(date)], [doc.panel, doc.inlet, doc.outlet]);
       date.setHours(0);                                          // level 4: day
       emit([doc.site, 4, displayDate(date)], [doc.panel, doc.inlet, doc.outlet]);
       date.setDate(1);                                           // level 5: month
       emit([doc.site, 5, displayDate(date)], [doc.panel, doc.inlet, doc.outlet]);
  }
}
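
A sketch of how the keys emitted by this map could be queried for one
site's hourly maxima over a single day.  The database name solar, the
design document readings, the view name max_by_period, and the site id
site-1 are all hypothetical; only the startkey, endkey, and group query
parameters come from CouchDB's view API.

```javascript
// Build a view query URL for the level 3 (hourly) rows of one site on one day.
function hourlyQuery(site, day) {
  // Keys have the form [site, level, "YYYY/MM/DD hh:mm:ss"].
  var startkey = [site, 3, day + " 00:00:00"];
  var endkey = [site, 3, day + " 23:59:59"];
  return "/solar/_design/readings/_view/max_by_period" +
    "?startkey=" + encodeURIComponent(JSON.stringify(startkey)) +
    "&endkey=" + encodeURIComponent(JSON.stringify(endkey)) +
    "&group=true";
}

var url = hourlyQuery("site-1", "2009/06/16");
```

group=true makes the reduce run once per distinct key, so each hour of
the day comes back as its own row.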


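The reduce. Note that you end up in rereduce land pretty quickly, as we
found. Happily, that is just fine here: the reduce returns the same
[panel, inlet, outlet] shape it takes in, so it can be fed its own output.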

function(keys, values, rereduce) {
	// Running maxima, seeded from the first [panel, inlet, outlet] value.
	var panel = values[0][0];
	var inlet = values[0][1];
	var outlet = values[0][2];
	for (var i = 1; i < values.length; i++) {
		if (values[i][0] > panel) {
			panel = values[i][0];
		}

		if (values[i][1] > inlet) {
			inlet = values[i][1];
		}

		if (values[i][2] > outlet) {
			outlet = values[i][2];
		}
	}
	return [panel, inlet, outlet];
}
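
To illustrate why rereduce is harmless for a max: a quick check that
reducing in chunks and then reducing the partial results gives the same
answer as reducing everything at once.  The function name maxReadings
and the sample readings are made up for the example.

```javascript
// Same idea as the reduce above, named so it can be called directly.
function maxReadings(keys, values, rereduce) {
  var result = values[0].slice();
  for (var i = 1; i < values.length; i++) {
    for (var j = 0; j < 3; j++) {
      if (values[i][j] > result[j]) {
        result[j] = values[i][j];
      }
    }
  }
  return result;
}

// Hypothetical [panel, inlet, outlet] readings.
var readings = [[61, 20, 35], [58, 22, 41], [70, 19, 38], [66, 25, 33]];

// Reduce everything in one pass.
var direct = maxReadings(null, readings, false);

// Reduce two chunks separately, then rereduce the partial results.
var partials = [
  maxReadings(null, readings.slice(0, 2), false),
  maxReadings(null, readings.slice(2), false)
];
var combined = maxReadings(null, partials, true);
```

Both direct and combined end up holding the per-column maxima, which is
why the rereduce flag can be ignored in this reduce.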





I'm not sure if what we are doing is better or worse, but here is the
approach we are using.