Posted to commits@beam.apache.org by me...@apache.org on 2018/05/01 17:41:21 UTC
[beam-site] 01/04: Document Java extensions for parsing Apache
HTTPD logfiles and Useragent strings
This is an automated email from the ASF dual-hosted git repository.
mergebot-role pushed a commit to branch mergebot
in repository https://gitbox.apache.org/repos/asf/beam-site.git
commit 5e7f1b2ccc3a04ccad148d25ea6204badd1c2e85
Author: Niels Basjes <ni...@basjes.nl>
AuthorDate: Thu Apr 19 13:44:09 2018 +0200
Document Java extensions for parsing Apache HTTPD logfiles and Useragent strings
---
src/documentation/sdks/java-extensions.md | 182 ++++++++++++++++++++++++++++++
1 file changed, 182 insertions(+)
diff --git a/src/documentation/sdks/java-extensions.md b/src/documentation/sdks/java-extensions.md
index 7742345..3b1524f 100644
--- a/src/documentation/sdks/java-extensions.md
+++ b/src/documentation/sdks/java-extensions.md
@@ -58,3 +58,185 @@ PCollection<KV<String, Iterable<KV<String, Integer>>>> groupedAndSorted =
grouped.apply(
SortValues.<String, String, Integer>create(BufferedExternalSorter.options()));
```
+
+## Parsing Apache HTTPD and NGINX access log files
+
+The Apache HTTPD webserver writes log files that contain valuable information about the requests made to
+the webserver. The format of these log files is a configuration option of the Apache HTTPD server, so parsing them
+into useful data elements is normally hard to do.
+
+To solve this problem, a library was created that works in combination with Apache Beam.
+
+The basic idea is that you construct a parser by simply
+telling it the log format configuration with which the lines were written.
+
+### Basic usage
+Full documentation can be found here [https://github.com/nielsbasjes/logparser](https://github.com/nielsbasjes/logparser)
+
+First you put something like this in your pom.xml file:
+
+    <dependency>
+      <groupId>nl.basjes.parse.httpdlog</groupId>
+      <artifactId>httpdlog-parser</artifactId>
+      <version>5.0</version>
+    </dependency>
+
+Check [https://github.com/nielsbasjes/logparser](https://github.com/nielsbasjes/logparser) for the latest version.
+
+Assume we have a logformat variable that looks something like this:
+
+    String logformat = "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"";
+
+**Step 1: What CAN we get from this line?**
+
+To figure out what values we CAN get from this line we instantiate the parser with a dummy class
+that does not have ANY @Field annotations or setters. The "Object" class will do just fine for this purpose.
+
+    Parser<Object> dummyParser = new HttpdLoglineParser<Object>(Object.class, logformat);
+    List<String> possiblePaths = dummyParser.getPossiblePaths();
+    for (String path: possiblePaths) {
+        System.out.println(path);
+    }
+
+You will get a list that looks something like this:
+
+    IP:connection.client.host
+    NUMBER:connection.client.logname
+    STRING:connection.client.user
+    TIME.STAMP:request.receive.time
+    TIME.DAY:request.receive.time.day
+    TIME.MONTHNAME:request.receive.time.monthname
+    TIME.MONTH:request.receive.time.month
+    TIME.YEAR:request.receive.time.year
+    TIME.HOUR:request.receive.time.hour
+    TIME.MINUTE:request.receive.time.minute
+    TIME.SECOND:request.receive.time.second
+    TIME.MILLISECOND:request.receive.time.millisecond
+    TIME.ZONE:request.receive.time.timezone
+    HTTP.FIRSTLINE:request.firstline
+    HTTP.METHOD:request.firstline.method
+    HTTP.URI:request.firstline.uri
+    HTTP.QUERYSTRING:request.firstline.uri.query
+    STRING:request.firstline.uri.query.*
+    HTTP.PROTOCOL:request.firstline.protocol
+    HTTP.PROTOCOL.VERSION:request.firstline.protocol.version
+    STRING:request.status.last
+    BYTESCLF:response.body.bytes
+    HTTP.URI:request.referer
+    HTTP.QUERYSTRING:request.referer.query
+    STRING:request.referer.query.*
+    HTTP.USERAGENT:request.user-agent
+
+Some of these lines end in a `*`.
+This is a wildcard that can be replaced with any 'name' if you need a specific value.
+You can also leave the '*' in place and get everything that is found in the actual log line.
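
The naming convention can be made concrete with a small sketch. This is only an illustration of how a wildcard path relates to a specific field name; `WildcardPaths` and `concretePath` are hypothetical helpers and are not part of the logparser library.

```java
// Hypothetical helper, not part of the logparser library: it only
// illustrates how a wildcard path maps to a concrete field path.
class WildcardPaths {

    // Replaces the trailing '*' of a wildcard path with a concrete name.
    static String concretePath(String wildcardPath, String name) {
        if (!wildcardPath.endsWith(".*")) {
            throw new IllegalArgumentException("Not a wildcard path: " + wildcardPath);
        }
        // Drop the '*' but keep the '.' separator in front of it.
        return wildcardPath.substring(0, wildcardPath.length() - 1) + name;
    }
}
```

For example, `WildcardPaths.concretePath("STRING:request.firstline.uri.query.*", "img")` yields `STRING:request.firstline.uri.query.img`, which is the form used for the `img` query parameter in Step 2.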
+
+**Step 2: Create the receiving POJO**
+
+We need to create the receiving record class, which is simply a POJO that does not need any interface or inheritance.
+In this class we create setters that will be called when the specified field has been found in the line.
+
+We can now add to this class a setter that receives a single value for the field named in the @Field annotation:
+
+    @Field("IP:connection.client.host")
+    public void setIP(final String value) {
+        ip = value;
+    }
+
+If we also want the name of the field, we can use a setter with two parameters:
+
+    @Field("STRING:request.firstline.uri.query.img")
+    public void setQueryImg(final String name, final String value) {
+        results.put(name, value);
+    }
+
+This latter form is very handy because it lets us obtain all values for a wildcard field:
+
+    @Field("STRING:request.firstline.uri.query.*")
+    public void setQueryStringValues(final String name, final String value) {
+        results.put(name, value);
+    }
+
+Instead of using the annotations on the setters, we can also simply tell the parser the name of the setter that must be
+called when an element is found.
+
+    parser.addParseTarget("setIP", "IP:connection.client.host");
+    parser.addParseTarget("setQueryImg", "STRING:request.firstline.uri.query.img");
+    parser.addParseTarget("setQueryStringValues", "STRING:request.firstline.uri.query.*");
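
Putting the setters together, a receiving class could look roughly like the sketch below. Only the setter names and signatures come from the text above; the `ip` and `results` fields, their types, and the getters are assumptions for illustration. The setters carry no annotations, so they would be registered through `addParseTarget`. The class is declared without `public` only so the snippet compiles from any file name; in a project it would normally be a public class in its own file.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of a receiving record. The field names, the map type, and the
// getters are assumptions; only the setter signatures come from the text.
class WebEvent {
    private String ip;
    private final Map<String, String> results = new TreeMap<>();

    // Single-value setter: receives only the value of the field.
    public void setIP(final String value) {
        ip = value;
    }

    // Two-argument setter: receives the field name and its value.
    public void setQueryImg(final String name, final String value) {
        results.put(name, value);
    }

    // Wildcard target: called once for every matching field that is found.
    public void setQueryStringValues(final String name, final String value) {
        results.put(name, value);
    }

    public String getIp() {
        return ip;
    }

    public Map<String, String> getResults() {
        return results;
    }
}
```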
+
+### Using this in Apache Beam
+
+Assume a String (the full log line) comes in and an instance of the WebEvent class goes out
+(where WebEvent already has the needed setters). The final code when using this in an Apache Beam project
+will then look something like this:
+```
+  PCollection<WebEvent> filledWebEvents = input
+    .apply("Extract Elements from logline",
+      ParDo.of(new DoFn<String, WebEvent>() {
+        private Parser<WebEvent> parser;
+
+        @Setup
+        public void setup() throws NoSuchMethodException {
+          parser = new HttpdLoglineParser<>(WebEvent.class, getLogFormat());
+          parser.addParseTarget("setIP", "IP:connection.client.host");
+          parser.addParseTarget("setQueryImg", "STRING:request.firstline.uri.query.img");
+          parser.addParseTarget("setQueryStringValues", "STRING:request.firstline.uri.query.*");
+        }
+
+        @ProcessElement
+        public void processElement(ProcessContext c) throws InvalidDissectorException, MissingDissectorsException, DissectionFailure {
+          c.output(parser.parse(c.element()));
+        }
+      }));
+
+```
+
+
+## Analyzing the Useragent string
+
+This is a Java library that tries to parse and analyze the useragent string and extract as many relevant attributes as possible.
+
+### Getting the Beam UDF
+You can get the prebuilt UDF from Maven Central.
+If you use a Maven based project, simply add this dependency to your Apache Beam application.
+
+    <dependency>
+      <groupId>nl.basjes.parse.useragent</groupId>
+      <artifactId>yauaa-beam</artifactId>
+      <version>4.2</version>
+    </dependency>
+
+Check https://github.com/nielsbasjes/yauaa for the latest version.
+
+### Example usage
+Assume you have a PCollection with your records.
+In most cases I see (clickstream data), these records (in this example the class is called "WebEvent")
+contain the useragent string in a field, and the parsed results must be added to other fields of the same record.
+
+Now you must do two things:
+
+ 1) Determine the names of the fields you need.
+ 2) Add an instance of the (abstract) UserAgentAnalysisDoFn function and implement the functions as shown in the example below. Use the YauaaField annotation to mark the setter for each requested field.
+
+Note that the names of the two setters are not important; the system looks at the annotation.
+
+    .apply("Extract Elements from Useragent",
+        ParDo.of(new UserAgentAnalysisDoFn<WebEvent>() {
+            @Override
+            public String getUserAgentString(WebEvent record) {
+                return record.useragent;
+            }
+
+            @SuppressWarnings("unused")
+            @YauaaField("DeviceClass")
+            public void setDC(WebEvent record, String value) {
+                record.deviceClass = value;
+            }
+
+            @SuppressWarnings("unused")
+            @YauaaField("AgentNameVersion")
+            public void setANV(WebEvent record, String value) {
+                record.agentNameVersion = value;
+            }
+        }));
+
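
The snippet above reads and writes three fields on the record directly. A minimal sketch of a class with only those fields is shown below. The field names are taken from the snippet; the `String` types and the use of `Serializable` are assumptions. Beam needs a way to encode the elements of a PCollection, and making the class `Serializable` is one common way to allow that; other coders may suit a real pipeline better. The class is declared without `public` only so the snippet compiles from any file name.

```java
import java.io.Serializable;

// Sketch of the record type assumed by the useragent example.
// Only the field names come from the snippet; the rest is an assumption.
class WebEvent implements Serializable {
    private static final long serialVersionUID = 1L;

    // Input: the raw useragent string taken from the clickstream record.
    public String useragent;

    // Outputs: filled in by the annotated setters of the analysis DoFn.
    public String deviceClass;
    public String agentNameVersion;
}
```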