Posted to commits@beam.apache.org by me...@apache.org on 2018/05/01 17:41:21 UTC

[beam-site] 01/04: Document Java extensions for parsing Apache HTTPD logfiles and Useragent strings

This is an automated email from the ASF dual-hosted git repository.

mergebot-role pushed a commit to branch mergebot
in repository https://gitbox.apache.org/repos/asf/beam-site.git

commit 5e7f1b2ccc3a04ccad148d25ea6204badd1c2e85
Author: Niels Basjes <ni...@basjes.nl>
AuthorDate: Thu Apr 19 13:44:09 2018 +0200

    Document Java extensions for parsing Apache HTTPD logfiles and Useragent strings
---
 src/documentation/sdks/java-extensions.md | 182 ++++++++++++++++++++++++++++++
 1 file changed, 182 insertions(+)

diff --git a/src/documentation/sdks/java-extensions.md b/src/documentation/sdks/java-extensions.md
index 7742345..3b1524f 100644
--- a/src/documentation/sdks/java-extensions.md
+++ b/src/documentation/sdks/java-extensions.md
@@ -58,3 +58,185 @@ PCollection<KV<String, Iterable<KV<String, Integer>>>> groupedAndSorted =
     grouped.apply(
         SortValues.<String, String, Integer>create(BufferedExternalSorter.options()));
 ```
+
+## Parsing Apache HTTPD and NGINX access log files
+
+The Apache HTTPD webserver creates logfiles that contain valuable information about the requests that have been made to
+this webserver. The format of these logfiles is a configuration option in the Apache HTTPD server, so parsing them
+into useful data elements is normally very hard to do.
+
+To solve this problem in an easy way a library was created that works in combination with Apache Beam.
+
+The basic idea is that you should be able to construct a parser by simply
+telling it which configuration options were used to write the line.
+
+### Basic usage
+Full documentation can be found at [https://github.com/nielsbasjes/logparser](https://github.com/nielsbasjes/logparser).
+
+First you put something like this in your pom.xml file:
+
+    <dependency>
+        <groupId>nl.basjes.parse.httpdlog</groupId>
+        <artifactId>httpdlog-parser</artifactId>
+        <version>5.0</version>
+    </dependency>
+
+Check [https://github.com/nielsbasjes/logparser](https://github.com/nielsbasjes/logparser) for the latest version.
+
+Assume we have a logformat variable that looks something like this:
+
+    String logformat = "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"";
+
+**Step 1: What CAN we get from this line?**
+
+To figure out what values we CAN get from this line we instantiate the parser with a dummy class
+that does not have ANY @Field annotations or setters. The "Object" class will do just fine for this purpose.
+
+    Parser<Object> dummyParser = new HttpdLoglineParser<Object>(Object.class, logformat);
+    List<String> possiblePaths = dummyParser.getPossiblePaths();
+    for (String path: possiblePaths) {
+        System.out.println(path);
+    }
+
+You will get a list that looks something like this:
+
+    IP:connection.client.host
+    NUMBER:connection.client.logname
+    STRING:connection.client.user
+    TIME.STAMP:request.receive.time
+    TIME.DAY:request.receive.time.day
+    TIME.MONTHNAME:request.receive.time.monthname
+    TIME.MONTH:request.receive.time.month
+    TIME.YEAR:request.receive.time.year
+    TIME.HOUR:request.receive.time.hour
+    TIME.MINUTE:request.receive.time.minute
+    TIME.SECOND:request.receive.time.second
+    TIME.MILLISECOND:request.receive.time.millisecond
+    TIME.ZONE:request.receive.time.timezone
+    HTTP.FIRSTLINE:request.firstline
+    HTTP.METHOD:request.firstline.method
+    HTTP.URI:request.firstline.uri
+    HTTP.QUERYSTRING:request.firstline.uri.query
+    STRING:request.firstline.uri.query.*
+    HTTP.PROTOCOL:request.firstline.protocol
+    HTTP.PROTOCOL.VERSION:request.firstline.protocol.version
+    STRING:request.status.last
+    BYTESCLF:response.body.bytes
+    HTTP.URI:request.referer
+    HTTP.QUERYSTRING:request.referer.query
+    STRING:request.referer.query.*
+    HTTP.USERAGENT:request.user-agent
+
+Some of these lines contain a '*'.
+This is a wildcard that can be replaced with any 'name' if you need a specific value.
+You can also leave the '*' in place and get everything that is found in the actual log line.
+
+**Step 2: Create the receiving POJO**
+
+We need to create the receiving record class, which is simply a POJO that does not need any interface or inheritance.
+In this class we create setters that will be called when the specified field has been found in the line.
+
+So we can now add to this class a setter that simply receives a single value as specified using the @Field annotation:
+
+    @Field("IP:connection.client.host")
+    public void setIP(final String value) {
+        ip = value;
+    }
+
+If we also want to receive the name of the field, we can do this:
+
+    @Field("STRING:request.firstline.uri.query.img")
+    public void setQueryImg(final String name, final String value) {
+        results.put(name, value);
+    }
+
+This latter form is very handy because this way we can obtain all values for a wildcard field:
+
+    @Field("STRING:request.firstline.uri.query.*")
+    public void setQueryStringValues(final String name, final String value) {
+        results.put(name, value);
+    }
+
+Instead of using the annotations on the setters, we can also simply tell the parser the name of the setter that must be
+called when an element is found.
+
+    parser.addParseTarget("setIP",                  "IP:connection.client.host");
+    parser.addParseTarget("setQueryImg",            "STRING:request.firstline.uri.query.img");
+    parser.addParseTarget("setQueryStringValues",   "STRING:request.firstline.uri.query.*");
+
+### Using this in Apache Beam
+
+Assuming we have a String (being the full log line) coming in and an instance of the WebEvent class coming out
+(where the WebEvent class already has the needed setters), the final code when using this in an Apache Beam project
+will end up looking something like this:
+```
+        PCollection<WebEvent> filledWebEvents = input
+            .apply("Extract Elements from logline",
+                ParDo.of(new DoFn<String, WebEvent>() {
+                    private Parser<WebEvent> parser;
+
+                    @Setup
+                    public void setup() throws NoSuchMethodException {
+                        parser = new HttpdLoglineParser<>(WebEvent.class, getLogFormat());
+                        parser.addParseTarget("setIP",                  "IP:connection.client.host");
+                        parser.addParseTarget("setQueryImg",            "STRING:request.firstline.uri.query.img");
+                        parser.addParseTarget("setQueryStringValues",   "STRING:request.firstline.uri.query.*");
+                    }
+
+                    @ProcessElement
+                    public void processElement(ProcessContext c) throws InvalidDissectorException, MissingDissectorsException, DissectionFailure {
+                        c.output(parser.parse(c.element()));
+                    }
+                }));
+
+```
+
+
+## Analyzing the Useragent string
+
+This is a Java library that tries to parse and analyze the useragent string and extract as many relevant attributes as possible.
+
+### Getting the Beam UDF
+You can get the prebuilt UDF from Maven Central.
+If you use a Maven-based project, simply add this dependency to your Apache Beam application.
+
+    <dependency>
+      <groupId>nl.basjes.parse.useragent</groupId>
+      <artifactId>yauaa-beam</artifactId>
+      <version>4.2</version>
+    </dependency>
+
+Check [https://github.com/nielsbasjes/yauaa](https://github.com/nielsbasjes/yauaa) for the latest version.
+
+### Example usage
+Assume you have a PCollection with your records.
+In most cases I see (clickstream data), these records (in this example this class is called "WebEvent")
+contain the useragent string in a field, and the parsed results must be added to other fields of the same record.
+
+Now you must do two things:
+
+  1) Determine the names of the fields you need.
+  2) Add an instance of the (abstract) UserAgentAnalysisDoFn function and implement the functions as shown in the example below. Use the YauaaField annotation to get the setter for the requested fields.
+
+Note that the names of the two setters are not important; the system looks at the annotation.
+
+    .apply("Extract Elements from Useragent",
+        ParDo.of(new UserAgentAnalysisDoFn<WebEvent>() {
+            @Override
+            public String getUserAgentString(WebEvent record) {
+                return record.useragent;
+            }
+
+            @SuppressWarnings("unused")
+            @YauaaField("DeviceClass")
+            public void setDC(WebEvent record, String value) {
+                record.deviceClass = value;
+            }
+
+            @SuppressWarnings("unused")
+            @YauaaField("AgentNameVersion")
+            public void setANV(WebEvent record, String value) {
+                record.agentNameVersion = value;
+            }
+        }));
+

-- 
To stop receiving notification emails like this one, please contact
mergebot-role@apache.org.