Posted to common-dev@hadoop.apache.org by Doug Cutting <do...@cottrell-cutting.net> on 2008/05/01 23:12:58 UTC
ongoing · Wide Finder 2
Anyone want to play? The goal is to find a small program that quickly
computes some statistics over 45GB of log data on a 32-core box. Hadoop
seems like a good candidate. Streaming? Pig? Java?
http://www.tbray.org/ongoing/When/200x/2008/05/01/Wide-Finder-2
Doug
Re: ongoing · Wide Finder 2
Posted by Chris K Wensel <ch...@wensel.net>.
Or Cascading (+Groovy).
I should have a release of my Groovy Cascading builder out by this weekend...
def APACHE_COMMON_REGEX = /^([^ ]*) +[^ ]* +[^ ]* +\[([^]]*)\] +\"([^ ]*) ([^ ]*) [^ ]*\" ([^ ]*) ([^ ]*).*$/
def APACHE_COMMON_GROUPS = [1, 2, 3, 4, 5, 6]
def APACHE_COMMON_FIELDS = ["ip", "time", "method", "url", "status", "size"]
def URL_PATTERN = /\/ongoing\/When\/\d\d\dx\/\d\d\d\d\/\d\d\/\d\d\/[^ .]+/
def cascading = new Cascading()
def builder = cascading.builder();
Flow flow = builder.flow("widefinder") {
  source(input, scheme: text())
  // parse apache log
  regexParser(pattern: APACHE_COMMON_REGEX, groups: APACHE_COMMON_GROUPS, declared: APACHE_COMMON_FIELDS)
  // throw away tuples that don't match
  filter(arguments: ["url"], pattern: URL_PATTERN)
  // throw away unused fields
  project(arguments: ["url"])
  group(groupBy: ["url"])
  // creates 'count' field, by default
  count()
  // group/sort on 'count', reverse the sort order
  group(["count"], reverse: true)
  sink(output, delete: true)
}
flow.complete() // execute the flow
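
For anyone who wants to sanity-check what the flow above should compute without Hadoop or Cascading, here is a rough plain-Java sketch of the same steps (parse, filter on url, count per url, sort by count descending) run in memory. It is not the Cascading API. The class name, topUrls and logLine are made up for illustration, the "]" inside the bracket class is escaped to be safe in Java, and it assumes the filter keeps only urls that match URL_PATTERN in full.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical in-memory equivalent of the flow; names are illustrative only.
class WideFinderSketch {
    // Same shape as APACHE_COMMON_REGEX, written as a Java string literal.
    static final Pattern APACHE_COMMON = Pattern.compile(
        "^([^ ]*) +[^ ]* +[^ ]* +\\[([^\\]]*)\\] +\"([^ ]*) ([^ ]*) [^ ]*\" ([^ ]*) ([^ ]*).*$");

    // Same shape as URL_PATTERN.
    static final Pattern URL_PATTERN = Pattern.compile(
        "/ongoing/When/\\d\\d\\dx/\\d\\d\\d\\d/\\d\\d/\\d\\d/[^ .]+");

    // Parse each line, keep matching urls, count them, sort by count descending.
    static List<Map.Entry<String, Long>> topUrls(List<String> lines) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            Matcher m = APACHE_COMMON.matcher(line);
            if (!m.matches()) {
                continue; // not a common-log line
            }
            String url = m.group(4); // 4th group is the "url" field
            if (!URL_PATTERN.matcher(url).matches()) {
                continue; // assumption: the whole url must match
            }
            counts.merge(url, 1L, Long::sum);
        }
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return sorted;
    }

    // Builds a made-up common-log line for the given url (test data only).
    static String logLine(String url) {
        return "1.2.3.4 - - [01/May/2008:10:00:00 -0700] \"GET " + url + " HTTP/1.1\" 200 1234";
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            logLine("/ongoing/When/200x/2008/05/01/Wide-Finder-2"),
            logLine("/ongoing/When/200x/2007/09/20/Wide-Finder"),
            logLine("/ongoing/When/200x/2008/05/01/Wide-Finder-2"),
            logLine("/ongoing/ongoing.atom"));
        // Prints each kept url with its hit count, most-hit first.
        for (Map.Entry<String, Long> e : topUrls(lines)) {
            System.out.println(e.getValue() + " " + e.getKey());
        }
    }
}
```

The HashMap stands in for the group/count steps and the final sort for the reversed group on 'count'; urls with equal counts come out in no particular order.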
Chris K Wensel
chris@wensel.net
http://chris.wensel.net/
http://www.cascading.org/