Posted to commits@beam.apache.org by me...@apache.org on 2018/05/01 17:41:23 UTC
[beam-site] 03/04: Moved the 3rd party extensions to a separate page

This is an automated email from the ASF dual-hosted git repository.

mergebot-role pushed a commit to branch mergebot
in repository https://gitbox.apache.org/repos/asf/beam-site.git

commit f8762f6328bcb584e5fc4b4e11bc14e5b870d195
Author: Niels Basjes <nb...@bol.com>
AuthorDate: Thu Apr 26 23:08:21 2018 +0200

    Moved the 3rd party extensions to a separate page
---
 src/_includes/section-menu/sdks.html      |   1 +
 src/documentation/sdks/java-extensions.md | 198 ------------------------------
 src/documentation/sdks/java-thirdparty.md | 100 +++++++++++++++
 src/documentation/sdks/java.md            |   2 +
 4 files changed, 103 insertions(+), 198 deletions(-)

diff --git a/src/_includes/section-menu/sdks.html b/src/_includes/section-menu/sdks.html
index faace4e..729258f 100644
--- a/src/_includes/section-menu/sdks.html
+++ b/src/_includes/section-menu/sdks.html
@@ -9,6 +9,7 @@
                                                                                                                                    alt="External link."></a>
     </li>
     <li><a href="{{ site.baseurl }}/documentation/sdks/java-extensions/">Java SDK extensions</a></li>
+    <li><a href="{{ site.baseurl }}/documentation/sdks/java-thirdparty/">Java 3rd party extensions</a></li>
     <li><a href="{{ site.baseurl }}/documentation/sdks/java/nexmark/">Nexmark benchmark suite</a></li>
   </ul>
 </li>
diff --git a/src/documentation/sdks/java-extensions.md b/src/documentation/sdks/java-extensions.md
index aeabc9f..7742345 100644
--- a/src/documentation/sdks/java-extensions.md
+++ b/src/documentation/sdks/java-extensions.md
@@ -58,201 +58,3 @@ PCollection<KV<String, Iterable<KV<String, Integer>>>> groupedAndSorted =
     grouped.apply(
         SortValues.<String, String, Integer>create(BufferedExternalSorter.options()));
 ```
-
-## Parsing HTTPD/NGINX access logs.
-
-The Apache HTTPD webserver creates logfiles that contain valuable information about the requests that have been done to
-the webserver. The format of these config files is a configuration option in the Apache HTTPD server so parsing this
-into useful data elements is normally very hard to do.
-
-To solve this problem in an easy way a library was created that works in combination with Apache Beam
-and is capable of doing this for both the Apache HTTPD and NGINX.
-
-The basic idea is that the logformat specification is the schema used to create the line. 
-THis parser is simply initialized with this schema and the list of fields you want to extract.
-
-### Basic usage
-Full documentation can be found here [https://github.com/nielsbasjes/logparser](https://github.com/nielsbasjes/logparser) 
-
-First you put something like this in your pom.xml file:
-
-    <dependency>
-      <groupId>nl.basjes.parse.httpdlog</groupId>
-      <artifactId>httpdlog-parser</artifactId>
-      <version>5.0</version>
-    </dependency>
-
-Check [https://github.com/nielsbasjes/logparser](https://github.com/nielsbasjes/logparser) for the latest version.
-
-Assume we have a logformat variable that looks something like this:
-
-    String logformat = "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"";
-
-**Step 1: What CAN we get from this line?**
-
-To figure out what values we CAN get from this line we instantiate the parser with a dummy class
-that does not have ANY @Field annotations or setters. The "Object" class will do just fine for this purpose.
-
-    Parser<Object> dummyParser = new HttpdLoglineParser<Object>(Object.class, logformat);
-    List<String> possiblePaths = dummyParser.getPossiblePaths();
-    for (String path: possiblePaths) {
-      System.out.println(path);
-    }
-
-You will get a list that looks something like this:
-
-    IP:connection.client.host
-    NUMBER:connection.client.logname
-    STRING:connection.client.user
-    TIME.STAMP:request.receive.time
-    TIME.DAY:request.receive.time.day
-    TIME.MONTHNAME:request.receive.time.monthname
-    TIME.MONTH:request.receive.time.month
-    TIME.YEAR:request.receive.time.year
-    TIME.HOUR:request.receive.time.hour
-    TIME.MINUTE:request.receive.time.minute
-    TIME.SECOND:request.receive.time.second
-    TIME.MILLISECOND:request.receive.time.millisecond
-    TIME.ZONE:request.receive.time.timezone
-    HTTP.FIRSTLINE:request.firstline
-    HTTP.METHOD:request.firstline.method
-    HTTP.URI:request.firstline.uri
-    HTTP.QUERYSTRING:request.firstline.uri.query
-    STRING:request.firstline.uri.query.*
-    HTTP.PROTOCOL:request.firstline.protocol
-    HTTP.PROTOCOL.VERSION:request.firstline.protocol.version
-    STRING:request.status.last
-    BYTESCLF:response.body.bytes
-    HTTP.URI:request.referer
-    HTTP.QUERYSTRING:request.referer.query
-    STRING:request.referer.query.*
-    HTTP.USERAGENT:request.user-agent
-
-Now some of these lines contain a * .
-This is a wildcard that can be replaced with any 'name' if you need a specific value.
-You can also leave the '*' and get everything that is found in the actual log line.
-
-**Step 2 Create the receiving POJO**
-
-We need to create the receiving record class that is simply a POJO that does not need any interface or inheritance.
-In this class we create setters that will be called when the specified field has been found in the line.
-
-So we can now add to this class a setter that simply receives a single value as specified using the @Field annotation:
-
-    @Field("IP:connection.client.host")
-    public void setIP(final String value) {
-      ip = value;
-    }
-
-If we really want the name of the field we can also do this
-
-    @Field("STRING:request.firstline.uri.query.img")
-    public void setQueryImg(final String name, final String value) {
-      results.put(name, value);
-    }
-
-This latter form is very handy because this way we can obtain all values for a wildcard field
-
-    @Field("STRING:request.firstline.uri.query.*")
-    public void setQueryStringValues(final String name, final String value) {
-      results.put(name, value);
-    }
-
-Instead of using the annotations on the setters we can also simply tell the parser the name of th setter that must be 
-called when an element is found.
-
-    parser.addParseTarget("setIP",                  "IP:connection.client.host");
-    parser.addParseTarget("setQueryImg",            "STRING:request.firstline.uri.query.img");
-    parser.addParseTarget("setQueryStringValues",   "STRING:request.firstline.uri.query.*");
-
-### Example
-
-Assuming we have a String (being the full log line) comming in and an instance of the WebEvent class comming out
-(where the WebEvent already the has the needed setters) the final code when using this in an Apache Beam project 
-will end up looking something like this
-
-    PCollection<WebEvent> filledWebEvents = input
-      .apply("Extract Elements from logline",
-        ParDo.of(new DoFn<String, WebEvent>() {
-          private Parser<WebEvent> parser;
-    
-          @Setup
-          public void setup() throws NoSuchMethodException {
-            parser = new HttpdLoglineParser<>(WebEvent.class, getLogFormat());
-            parser.addParseTarget("setIP",                  "IP:connection.client.host");
-            parser.addParseTarget("setQueryImg",            "STRING:request.firstline.uri.query.img");
-            parser.addParseTarget("setQueryStringValues",   "STRING:request.firstline.uri.query.*");
-          }
-    
-          @ProcessElement
-          public void processElement(ProcessContext c) throws InvalidDissectorException, MissingDissectorsException, DissectionFailure {
-            c.output(parser.parse(c.element()));
-          }
-        })
-      );
-
-## Analyzing the Useragent string
-
-This is a java library that tries to parse and analyze the useragent string and extract as many relevant attributes as possible.
-
-### Basic usage
-You can get the prebuilt UDF from maven central.
-If you use a maven based project simply add this dependency to your Apache Beam application.
-
-    <dependency>
-      <groupId>nl.basjes.parse.useragent</groupId>
-      <artifactId>yauaa-beam</artifactId>
-      <version>4.2</version>
-    </dependency>
-
-Check https://github.com/nielsbasjes/yauaa for the latest version.
-
-### Example
-Assume you have a PCollection with your records.
-In most cases I see (clickstream data) these records (in this example this class is called "WebEvent") 
-contain the useragent string in a field and the parsed results must be added to these fields.
-
-Now you must do two things:
-
-  1) Determine the names of the fields you need. Simply call getAllPossibleFieldNamesSorted() to get the list of possible fieldnames you can ask for.
-
-    UserAgentAnalyzer.newBuilder().build()
-      .getAllPossibleFieldNamesSorted()
-        .forEach(field -> System.out.println(field));
-
-and you get something like this:
-
-    DeviceClass
-    DeviceName
-    DeviceBrand
-    DeviceCpu
-    DeviceCpuBits
-    DeviceFirmwareVersion
-    DeviceVersion
-    OperatingSystemClass
-    OperatingSystemName
-    OperatingSystemVersion
-    ...
-
-  2) Add an instance of the (abstract) UserAgentAnalysisDoFn function and implement the functions as shown in the example below. Use the YauaaField annotation to get the setter for the requested fields.
-
-Note that the name of the two setters is not important, the system looks at the annotation.
-
-    .apply("Extract Elements from Useragent",
-      ParDo.of(new UserAgentAnalysisDoFn<WebEvent>() {
-        @Override
-        public String getUserAgentString(WebEvent record) {
-          return record.useragent;
-        }
-
-        @YauaaField("DeviceClass")
-        public void setDC(WebEvent record, String value) {
-          record.deviceClass = value;
-        }
-
-        @YauaaField("AgentNameVersion")
-        public void setANV(WebEvent record, String value) {
-          record.agentNameVersion = value;
-        }
-    }));
-
diff --git a/src/documentation/sdks/java-thirdparty.md b/src/documentation/sdks/java-thirdparty.md
new file mode 100644
index 0000000..af5f745
--- /dev/null
+++ b/src/documentation/sdks/java-thirdparty.md
@@ -0,0 +1,100 @@
+---
+layout: section
+title: "Beam 3rd Party Java Extensions"
+section_menu: section-menu/sdks.html
+permalink: /documentation/sdks/java-thirdparty/
+---
+# Apache Beam 3rd Party Java Extensions
+
+These are some of the 3rd party Java libraries that may be useful for specific applications.
+
+## Parsing HTTPD/NGINX access logs.
+
+### Summary
+The Apache HTTPD webserver creates logfiles that contain valuable information about the requests made to
+the webserver. The format of these log files is a configuration option in the Apache HTTPD server, so parsing them
+into useful data elements is normally hard to do.
+
+To solve this problem in an easy way a library was created that works in combination with Apache Beam
+and is capable of doing this for both the Apache HTTPD and NGINX.
+
+The basic idea is that the logformat specification is the schema used to create the line. 
+This parser is simply initialized with this schema and the list of fields you want to extract.
+
+### Project page
+[https://github.com/nielsbasjes/logparser](https://github.com/nielsbasjes/logparser) 
+
+### License
+Apache License 2.0
+
+### Download
+    <dependency>
+      <groupId>nl.basjes.parse.httpdlog</groupId>
+      <artifactId>httpdlog-parser</artifactId>
+      <version>5.0</version>
+    </dependency>
+
+### Code example
+
+Assuming a WebEvent class that has the setters setIP, setQueryImg and setQueryStringValues:
+
+    PCollection<WebEvent> filledWebEvents = input
+      .apply("Extract Elements from logline",
+        ParDo.of(new DoFn<String, WebEvent>() {
+          private Parser<WebEvent> parser;
+    
+          @Setup
+          public void setup() throws NoSuchMethodException {
+            parser = new HttpdLoglineParser<>(WebEvent.class, 
+                "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\"");
+            parser.addParseTarget("setIP",                  "IP:connection.client.host");
+            parser.addParseTarget("setQueryImg",            "STRING:request.firstline.uri.query.img");
+            parser.addParseTarget("setQueryStringValues",   "STRING:request.firstline.uri.query.*");
+          }
+    
+          @ProcessElement
+          public void processElement(ProcessContext c) throws InvalidDissectorException, MissingDissectorsException, DissectionFailure {
+            c.output(parser.parse(c.element()));
+          }
+        })
+      );
+
+
+## Analyzing the Useragent string
+
+### Summary
+Parse and analyze the useragent string and extract as many relevant attributes as possible.
+
+### Project page
+[https://github.com/nielsbasjes/yauaa](https://github.com/nielsbasjes/yauaa) 
+
+### License
+Apache License 2.0
+
+### Download
+    <dependency>
+      <groupId>nl.basjes.parse.useragent</groupId>
+      <artifactId>yauaa-beam</artifactId>
+      <version>4.2</version>
+    </dependency>
+
+### Code example
+    PCollection<WebEvent> filledWebEvents = input
+        .apply("Extract Elements from Useragent",
+          ParDo.of(new UserAgentAnalysisDoFn<WebEvent>() {
+            @Override
+            public String getUserAgentString(WebEvent record) {
+              return record.useragent;
+            }
+    
+            @YauaaField("DeviceClass")
+            public void setDC(WebEvent record, String value) {
+              record.deviceClass = value;
+            }
+    
+            @YauaaField("AgentNameVersion")
+            public void setANV(WebEvent record, String value) {
+              record.agentNameVersion = value;
+            }
+        }));
+
diff --git a/src/documentation/sdks/java.md b/src/documentation/sdks/java.md
index 826929e..f5be0fd 100644
--- a/src/documentation/sdks/java.md
+++ b/src/documentation/sdks/java.md
@@ -33,3 +33,5 @@ The Java SDK has the following extensions:
 - [join-library]({{site.baseurl}}/documentation/sdks/java-extensions/#join-library) provides inner join, outer left join, and outer right join functions.
 - [sorter]({{site.baseurl}}/documentation/sdks/java-extensions/#sorter) is an efficient and scalable sorter for large iterables.
 - [Nexmark]({{site.baseurl}}/documentation/sdks/java/nexmark) is a benchmark suite that runs in batch and streaming modes.
+
+In addition, several [3rd party Java libraries]({{site.baseurl}}/documentation/sdks/java-thirdparty/) exist.
\ No newline at end of file
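
Note on the new java-thirdparty.md page: both code examples assume a WebEvent record class that the page itself does not show. A minimal sketch of such a class, pieced together from the setter signatures in the text removed from java-extensions.md, might look like the following. The fields ip and results come from those removed snippets, and useragent, deviceClass and agentNameVersion come from the Useragent example. Implementing Serializable and keeping the class package-private are assumptions of this sketch, not something the commit states.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical record class matching the setters used in the examples.
// Kept package-private so the sketch compiles in any file; in a real
// project it would more likely be a public class in WebEvent.java.
// Serializable is an assumption, added so the class can be encoded
// with plain Java serialization.
class WebEvent implements Serializable {
  // Filled by the access log parser.
  String ip;
  final Map<String, String> results = new HashMap<>();

  // Read and filled by the Useragent example.
  String useragent;
  String deviceClass;
  String agentNameVersion;

  // Receives the single value found for IP:connection.client.host.
  public void setIP(final String value) {
    ip = value;
  }

  // Receives the field name together with the value for one query parameter.
  public void setQueryImg(final String name, final String value) {
    results.put(name, value);
  }

  // Receives every name/value pair matched by the wildcard field.
  public void setQueryStringValues(final String name, final String value) {
    results.put(name, value);
  }
}
```

Because the examples register the setters by name through addParseTarget, the sketch needs no @Field annotations and no dependency on the parser library.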
