Posted to commits@beam.apache.org by me...@apache.org on 2018/05/01 17:41:23 UTC
[beam-site] 03/04: Moved the 3rd party extensions to a separate page
This is an automated email from the ASF dual-hosted git repository.
mergebot-role pushed a commit to branch mergebot
in repository https://gitbox.apache.org/repos/asf/beam-site.git
commit f8762f6328bcb584e5fc4b4e11bc14e5b870d195
Author: Niels Basjes <nb...@bol.com>
AuthorDate: Thu Apr 26 23:08:21 2018 +0200
Moved the 3rd party extensions to a separate page
---
src/_includes/section-menu/sdks.html | 1 +
src/documentation/sdks/java-extensions.md | 198 ------------------------------
src/documentation/sdks/java-thirdparty.md | 100 +++++++++++++++
src/documentation/sdks/java.md | 2 +
4 files changed, 103 insertions(+), 198 deletions(-)
diff --git a/src/_includes/section-menu/sdks.html b/src/_includes/section-menu/sdks.html
index faace4e..729258f 100644
--- a/src/_includes/section-menu/sdks.html
+++ b/src/_includes/section-menu/sdks.html
@@ -9,6 +9,7 @@
alt="External link."></a>
</li>
<li><a href="{{ site.baseurl }}/documentation/sdks/java-extensions/">Java SDK extensions</a></li>
+ <li><a href="{{ site.baseurl }}/documentation/sdks/java-thirdparty/">Java 3rd party extensions</a></li>
<li><a href="{{ site.baseurl }}/documentation/sdks/java/nexmark/">Nexmark benchmark suite</a></li>
</ul>
</li>
diff --git a/src/documentation/sdks/java-extensions.md b/src/documentation/sdks/java-extensions.md
index aeabc9f..7742345 100644
--- a/src/documentation/sdks/java-extensions.md
+++ b/src/documentation/sdks/java-extensions.md
@@ -58,201 +58,3 @@ PCollection<KV<String, Iterable<KV<String, Integer>>>> groupedAndSorted =
grouped.apply(
SortValues.<String, String, Integer>create(BufferedExternalSorter.options()));
```
-
-## Parsing HTTPD/NGINX access logs.
-
-The Apache HTTPD webserver creates logfiles that contain valuable information about the requests that have been done to
-the webserver. The format of these config files is a configuration option in the Apache HTTPD server so parsing this
-into useful data elements is normally very hard to do.
-
-To solve this problem in an easy way a library was created that works in combination with Apache Beam
-and is capable of doing this for both the Apache HTTPD and NGINX.
-
-The basic idea is that the logformat specification is the schema used to create the line.
-THis parser is simply initialized with this schema and the list of fields you want to extract.
-
-### Basic usage
-Full documentation can be found here [https://github.com/nielsbasjes/logparser](https://github.com/nielsbasjes/logparser)
-
-First you put something like this in your pom.xml file:
-
- <dependency>
- <groupId>nl.basjes.parse.httpdlog</groupId>
- <artifactId>httpdlog-parser</artifactId>
- <version>5.0</version>
- </dependency>
-
-Check [https://github.com/nielsbasjes/logparser](https://github.com/nielsbasjes/logparser) for the latest version.
-
-Assume we have a logformat variable that looks something like this:
-
- String logformat = "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"";
-
-**Step 1: What CAN we get from this line?**
-
-To figure out what values we CAN get from this line we instantiate the parser with a dummy class
-that does not have ANY @Field annotations or setters. The "Object" class will do just fine for this purpose.
-
- Parser<Object> dummyParser = new HttpdLoglineParser<Object>(Object.class, logformat);
- List<String> possiblePaths = dummyParser.getPossiblePaths();
- for (String path: possiblePaths) {
- System.out.println(path);
- }
-
-You will get a list that looks something like this:
-
- IP:connection.client.host
- NUMBER:connection.client.logname
- STRING:connection.client.user
- TIME.STAMP:request.receive.time
- TIME.DAY:request.receive.time.day
- TIME.MONTHNAME:request.receive.time.monthname
- TIME.MONTH:request.receive.time.month
- TIME.YEAR:request.receive.time.year
- TIME.HOUR:request.receive.time.hour
- TIME.MINUTE:request.receive.time.minute
- TIME.SECOND:request.receive.time.second
- TIME.MILLISECOND:request.receive.time.millisecond
- TIME.ZONE:request.receive.time.timezone
- HTTP.FIRSTLINE:request.firstline
- HTTP.METHOD:request.firstline.method
- HTTP.URI:request.firstline.uri
- HTTP.QUERYSTRING:request.firstline.uri.query
- STRING:request.firstline.uri.query.*
- HTTP.PROTOCOL:request.firstline.protocol
- HTTP.PROTOCOL.VERSION:request.firstline.protocol.version
- STRING:request.status.last
- BYTESCLF:response.body.bytes
- HTTP.URI:request.referer
- HTTP.QUERYSTRING:request.referer.query
- STRING:request.referer.query.*
- HTTP.USERAGENT:request.user-agent
-
-Now some of these lines contain a * .
-This is a wildcard that can be replaced with any 'name' if you need a specific value.
-You can also leave the '*' and get everything that is found in the actual log line.
-
-**Step 2 Create the receiving POJO**
-
-We need to create the receiving record class that is simply a POJO that does not need any interface or inheritance.
-In this class we create setters that will be called when the specified field has been found in the line.
-
-So we can now add to this class a setter that simply receives a single value as specified using the @Field annotation:
-
- @Field("IP:connection.client.host")
- public void setIP(final String value) {
- ip = value;
- }
-
-If we really want the name of the field we can also do this
-
- @Field("STRING:request.firstline.uri.query.img")
- public void setQueryImg(final String name, final String value) {
- results.put(name, value);
- }
-
-This latter form is very handy because this way we can obtain all values for a wildcard field
-
- @Field("STRING:request.firstline.uri.query.*")
- public void setQueryStringValues(final String name, final String value) {
- results.put(name, value);
- }
-
-Instead of using the annotations on the setters we can also simply tell the parser the name of th setter that must be
-called when an element is found.
-
- parser.addParseTarget("setIP", "IP:connection.client.host");
- parser.addParseTarget("setQueryImg", "STRING:request.firstline.uri.query.img");
- parser.addParseTarget("setQueryStringValues", "STRING:request.firstline.uri.query.*");
-
-### Example
-
-Assuming we have a String (being the full log line) comming in and an instance of the WebEvent class comming out
-(where the WebEvent already the has the needed setters) the final code when using this in an Apache Beam project
-will end up looking something like this
-
- PCollection<WebEvent> filledWebEvents = input
- .apply("Extract Elements from logline",
- ParDo.of(new DoFn<String, WebEvent>() {
- private Parser<WebEvent> parser;
-
- @Setup
- public void setup() throws NoSuchMethodException {
- parser = new HttpdLoglineParser<>(WebEvent.class, getLogFormat());
- parser.addParseTarget("setIP", "IP:connection.client.host");
- parser.addParseTarget("setQueryImg", "STRING:request.firstline.uri.query.img");
- parser.addParseTarget("setQueryStringValues", "STRING:request.firstline.uri.query.*");
- }
-
- @ProcessElement
- public void processElement(ProcessContext c) throws InvalidDissectorException, MissingDissectorsException, DissectionFailure {
- c.output(parser.parse(c.element()));
- }
- })
- );
-
-## Analyzing the Useragent string
-
-This is a java library that tries to parse and analyze the useragent string and extract as many relevant attributes as possible.
-
-### Basic usage
-You can get the prebuilt UDF from maven central.
-If you use a maven based project simply add this dependency to your Apache Beam application.
-
- <dependency>
- <groupId>nl.basjes.parse.useragent</groupId>
- <artifactId>yauaa-beam</artifactId>
- <version>4.2</version>
- </dependency>
-
-Check https://github.com/nielsbasjes/yauaa for the latest version.
-
-### Example
-Assume you have a PCollection with your records.
-In most cases I see (clickstream data) these records (in this example this class is called "WebEvent")
-contain the useragent string in a field and the parsed results must be added to these fields.
-
-Now you must do two things:
-
- 1) Determine the names of the fields you need. Simply call getAllPossibleFieldNamesSorted() to get the list of possible fieldnames you can ask for.
-
- UserAgentAnalyzer.newBuilder().build()
- .getAllPossibleFieldNamesSorted()
- .forEach(field -> System.out.println(field));
-
-and you get something like this:
-
- DeviceClass
- DeviceName
- DeviceBrand
- DeviceCpu
- DeviceCpuBits
- DeviceFirmwareVersion
- DeviceVersion
- OperatingSystemClass
- OperatingSystemName
- OperatingSystemVersion
- ...
-
- 2) Add an instance of the (abstract) UserAgentAnalysisDoFn function and implement the functions as shown in the example below. Use the YauaaField annotation to get the setter for the requested fields.
-
-Note that the name of the two setters is not important, the system looks at the annotation.
-
- .apply("Extract Elements from Useragent",
- ParDo.of(new UserAgentAnalysisDoFn<WebEvent>() {
- @Override
- public String getUserAgentString(WebEvent record) {
- return record.useragent;
- }
-
- @YauaaField("DeviceClass")
- public void setDC(WebEvent record, String value) {
- record.deviceClass = value;
- }
-
- @YauaaField("AgentNameVersion")
- public void setANV(WebEvent record, String value) {
- record.agentNameVersion = value;
- }
- }));
-
diff --git a/src/documentation/sdks/java-thirdparty.md b/src/documentation/sdks/java-thirdparty.md
new file mode 100644
index 0000000..af5f745
--- /dev/null
+++ b/src/documentation/sdks/java-thirdparty.md
@@ -0,0 +1,100 @@
+---
+layout: section
+title: "Beam 3rd Party Java Extensions"
+section_menu: section-menu/sdks.html
+permalink: /documentation/sdks/java-thirdparty/
+---
+# Apache Beam 3rd Party Java Extensions
+
+These are some of the 3rd party Java libraries that may be useful for specific applications.
+
+## Parsing HTTPD/NGINX access logs.
+
+### Summary
+The Apache HTTPD webserver creates logfiles that contain valuable information about the requests made to
+the webserver. The format of these log files is a configuration option in the Apache HTTPD server, so parsing them
+into useful data elements is normally very hard to do.
+
+To solve this problem in an easy way a library was created that works in combination with Apache Beam
+and is capable of doing this for both the Apache HTTPD and NGINX.
+
+The basic idea is that the logformat specification is the schema used to create the line.
+This parser is simply initialized with this schema and the list of fields you want to extract.
+
+### Project page
+[https://github.com/nielsbasjes/logparser](https://github.com/nielsbasjes/logparser)
+
+### License
+Apache License 2.0
+
+### Download
+ <dependency>
+ <groupId>nl.basjes.parse.httpdlog</groupId>
+ <artifactId>httpdlog-parser</artifactId>
+ <version>5.0</version>
+ </dependency>
+
+### Code example
+
+Assuming a WebEvent class that has the setters setIP, setQueryImg and setQueryStringValues:
+
+ PCollection<WebEvent> filledWebEvents = input
+ .apply("Extract Elements from logline",
+ ParDo.of(new DoFn<String, WebEvent>() {
+ private Parser<WebEvent> parser;
+
+ @Setup
+ public void setup() throws NoSuchMethodException {
+ parser = new HttpdLoglineParser<>(WebEvent.class,
+ "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\"");
+ parser.addParseTarget("setIP", "IP:connection.client.host");
+ parser.addParseTarget("setQueryImg", "STRING:request.firstline.uri.query.img");
+ parser.addParseTarget("setQueryStringValues", "STRING:request.firstline.uri.query.*");
+ }
+
+ @ProcessElement
+ public void processElement(ProcessContext c) throws InvalidDissectorException, MissingDissectorsException, DissectionFailure {
+ c.output(parser.parse(c.element()));
+ }
+ })
+ );
+
+
+## Analyzing the Useragent string
+
+### Summary
+Parse and analyze the useragent string and extract as many relevant attributes as possible.
+
+### Project page
+[https://github.com/nielsbasjes/yauaa](https://github.com/nielsbasjes/yauaa)
+
+### License
+Apache License 2.0
+
+### Download
+ <dependency>
+ <groupId>nl.basjes.parse.useragent</groupId>
+ <artifactId>yauaa-beam</artifactId>
+ <version>4.2</version>
+ </dependency>
+
+### Code example
+ PCollection<WebEvent> filledWebEvents = input
+ .apply("Extract Elements from Useragent",
+ ParDo.of(new UserAgentAnalysisDoFn<WebEvent>() {
+ @Override
+ public String getUserAgentString(WebEvent record) {
+ return record.useragent;
+ }
+
+ @YauaaField("DeviceClass")
+ public void setDC(WebEvent record, String value) {
+ record.deviceClass = value;
+ }
+
+ @YauaaField("AgentNameVersion")
+ public void setANV(WebEvent record, String value) {
+ record.agentNameVersion = value;
+ }
+ }));
+
diff --git a/src/documentation/sdks/java.md b/src/documentation/sdks/java.md
index 826929e..f5be0fd 100644
--- a/src/documentation/sdks/java.md
+++ b/src/documentation/sdks/java.md
@@ -33,3 +33,5 @@ The Java SDK has the following extensions:
- [join-library]({{site.baseurl}}/documentation/sdks/java-extensions/#join-library) provides inner join, outer left join, and outer right join functions.
- [sorter]({{site.baseurl}}/documentation/sdks/java-extensions/#sorter) is an efficient and scalable sorter for large iterables.
- [Nexmark]({{site.baseurl}}/documentation/sdks/java/nexmark) is a benchmark suite that runs in batch and streaming modes.
+
+In addition, several [3rd party Java libraries]({{site.baseurl}}/documentation/sdks/java-thirdparty/) exist.
\ No newline at end of file
--
To stop receiving notification emails like this one, please contact
mergebot-role@apache.org.
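
Editor's sketch, not part of the commit: both code examples on the new java-thirdparty.md page assume a WebEvent class but the page no longer shows one. Below is a minimal guess at what such a record class could look like. The setter names and signatures follow the snippets removed from java-extensions.md, and the public fields useragent, deviceClass and agentNameVersion follow the Yauaa example. The ip field, the results map and its TreeMap type are assumptions made for illustration only.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical record class. Only the setter names and the three
// user agent fields come from the diff; the rest is assumed.
class WebEvent {
    // Filled from the access log line (assumed storage).
    public String ip;
    public final Map<String, String> results = new TreeMap<>();

    // Read and filled in the user agent example.
    public String useragent;
    public String deviceClass;
    public String agentNameVersion;

    // Parse target for "IP:connection.client.host": receives only the value.
    public void setIP(final String value) {
        ip = value;
    }

    // Parse target for "STRING:request.firstline.uri.query.img":
    // the two-argument form also receives the name of the field.
    public void setQueryImg(final String name, final String value) {
        results.put(name, value);
    }

    // Parse target for the wildcard "STRING:request.firstline.uri.query.*":
    // stores every name/value pair it is handed.
    public void setQueryStringValues(final String name, final String value) {
        results.put(name, value);
    }
}
```

The class needs no interface or base class; the parser is told which setter to call through addParseTarget, as in the setup() method of the example in the diff. To use it as a PCollection element type, Beam would additionally need a coder for it, which this sketch leaves out.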