Posted to commits@beam.apache.org by me...@apache.org on 2018/08/20 21:40:38 UTC

[beam-site] 01/11: Add blog post "A review of input streaming connectors"

This is an automated email from the ASF dual-hosted git repository.

mergebot-role pushed a commit to branch mergebot
in repository https://gitbox.apache.org/repos/asf/beam-site.git

commit 97c2ac51e3eece847dab1323b144386f1d0c89ab
Author: Julien Phalip <jp...@gmail.com>
AuthorDate: Thu Aug 2 21:04:18 2018 -0700

    Add blog post "A review of input streaming connectors"
---
 ...2018-08-XX-review-input-streaming-connectors.md | 224 +++++++++++++++++++++
 1 file changed, 224 insertions(+)

diff --git a/src/_posts/2018-08-XX-review-input-streaming-connectors.md b/src/_posts/2018-08-XX-review-input-streaming-connectors.md
new file mode 100644
index 0000000..7591ba2
--- /dev/null
+++ b/src/_posts/2018-08-XX-review-input-streaming-connectors.md
@@ -0,0 +1,224 @@
+---
+layout: post
+title:  "A review of input streaming connectors"
+date:   2018-08-XX 00:00:01 -0800
+excerpt_separator: <!--more-->
+categories: blog
+authors:
+  - lkulighin
+  - julienphalip
+---
+
+In this post, you'll learn about the current state of support for input streaming connectors in [Apache Beam](https://beam.apache.org/). For more context, you'll also learn about the corresponding state of support in [Apache Spark](https://spark.apache.org/).<!--more-->
+
+With batch processing, you might load data from any source, including a database system. Even if there are no specific SDKs available for those database systems, you can often resort to using a [JDBC](https://en.wikipedia.org/wiki/Java_Database_Connectivity) driver. With streaming, implementing a proper data pipeline is arguably more challenging as generally fewer source types are available. For that reason, this article particularly focuses on the streaming use case.
+
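+For example, here is a minimal sketch of such a batch read using Beam's [JdbcIO](https://beam.apache.org/documentation/sdks/javadoc/2.5.0/org/apache/beam/sdk/io/jdbc/JdbcIO.html) connector (from the `beam-sdks-java-io-jdbc` module). The driver, connection URL, and query are placeholder values:
+
+```java
+import org.apache.beam.sdk.Pipeline;
+import org.apache.beam.sdk.coders.KvCoder;
+import org.apache.beam.sdk.coders.StringUtf8Coder;
+import org.apache.beam.sdk.coders.VarIntCoder;
+import org.apache.beam.sdk.io.jdbc.JdbcIO;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+
+public class JdbcReadExample {
+  public static void main(String[] args) {
+    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
+
+    // Batch-read rows over JDBC; driver, URL, and query are placeholders.
+    PCollection<KV<Integer, String>> rows = pipeline.apply(
+        JdbcIO.<KV<Integer, String>>read()
+            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
+                "org.postgresql.Driver", "jdbc:postgresql://localhost:5432/mydb"))
+            .withQuery("SELECT id, name FROM users")
+            .withCoder(KvCoder.of(VarIntCoder.of(), StringUtf8Coder.of()))
+            .withRowMapper(resultSet -> KV.of(resultSet.getInt(1), resultSet.getString(2))));
+
+    pipeline.run().waitUntilFinish();
+  }
+}
+```
+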
+## Connectors for Java
+
+Beam has an official [Java SDK](https://beam.apache.org/documentation/sdks/java/) and several execution engines, called [runners](https://beam.apache.org/documentation/runners/capability-matrix/). In most cases it is fairly easy to transfer existing Beam pipelines written in Java or Scala to a Spark environment by using the [Spark Runner](https://beam.apache.org/documentation/runners/spark/).
+
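+For instance, the following minimal sketch selects the Spark Runner purely through pipeline options; the Spark master URL is a placeholder, and the pipeline's transforms are unchanged from what you would run on any other runner:
+
+```java
+import org.apache.beam.runners.spark.SparkPipelineOptions;
+import org.apache.beam.runners.spark.SparkRunner;
+import org.apache.beam.sdk.Pipeline;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+
+public class SparkRunnerExample {
+  public static void main(String[] args) {
+    // Select the Spark Runner via options; the pipeline code itself stays runner-agnostic.
+    SparkPipelineOptions options =
+        PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
+    options.setRunner(SparkRunner.class);
+    options.setSparkMaster("local[4]"); // placeholder; point at a real cluster in production
+
+    Pipeline pipeline = Pipeline.create(options);
+    // ... apply the same transforms you would use with any other runner ...
+    pipeline.run().waitUntilFinish();
+  }
+}
+```
+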
+Spark is written in Scala and has a [Java API](https://spark.apache.org/docs/latest/api/java/). Spark's source code compiles to [Java bytecode](https://en.wikipedia.org/wiki/Java_(programming_language)#Java_JVM_and_Bytecode) and the binaries are run by a [Java Virtual Machine](https://en.wikipedia.org/wiki/Java_virtual_machine). Scala code is interoperable with Java and therefore has native compatibility with Java libraries (and vice versa).
+
+Spark offers two approaches to streaming: [Discretized Streaming](https://spark.apache.org/docs/latest/streaming-programming-guide.html) (or DStreams) and [Structured Streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html). DStreams are a basic abstraction that represents a continuous series of [Resilient Distributed Datasets](https://spark.apache.org/docs/latest/rdd-programming-guide.html) (or RDDs). Structured Streaming was introduced more recently and is based on a model where live data is continuously appended to a table structure.
+
+Spark Structured Streaming supports [file sources](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/streaming/DataStreamReader.html) (local filesystems and HDFS-compatible systems like Cloud Storage or S3) and [Kafka](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html) as streaming inputs. Spark maintains built-in connectors for DStreams aimed at third-party services, such as Kafka or Flume, while other connectors are available through linking external dependencies, as shown in the table below.
+
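+The following minimal sketch contrasts the two approaches in Java: reading text files as a DStream, and reading Kafka records with Structured Streaming. The master URLs, input path, broker address, and topic name are all placeholders:
+
+```java
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.streaming.Durations;
+import org.apache.spark.streaming.api.java.JavaDStream;
+import org.apache.spark.streaming.api.java.JavaStreamingContext;
+
+public class SparkStreamingSketch {
+
+  // DStreams: a discretized stream of RDDs, driven by a StreamingContext
+  // with a fixed batch interval.
+  static void dstreamsExample() throws InterruptedException {
+    JavaStreamingContext jssc =
+        new JavaStreamingContext("local[2]", "dstreams-sketch", Durations.seconds(10));
+    JavaDStream<String> lines = jssc.textFileStream("hdfs:///tmp/input/"); // placeholder path
+    lines.print();
+    jssc.start();
+    jssc.awaitTermination();
+  }
+
+  // Structured Streaming: live data is treated as a continuously growing table,
+  // queried through the DataFrame API. The Kafka source requires linking the
+  // spark-sql-kafka-0-10 dependency.
+  static void structuredStreamingExample() throws Exception {
+    SparkSession spark =
+        SparkSession.builder().master("local[2]").appName("structured-sketch").getOrCreate();
+    Dataset<Row> kafkaRows = spark.readStream()
+        .format("kafka")
+        .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker address
+        .option("subscribe", "my-topic")                  // placeholder topic name
+        .load();
+    kafkaRows.writeStream().format("console").start().awaitTermination();
+  }
+}
+```
+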
+Below are the main streaming input connectors available for Beam and Spark DStreams in Java (a short KafkaIO example follows the table):
+
+<table>
+  <tr>
+   <td>
+   </td>
+   <td>
+   </td>
+   <td><strong>Apache Beam</strong>
+   </td>
+   <td><strong>Apache Spark DStreams</strong>
+   </td>
+  </tr>
+  <tr>
+   <td rowspan="2" >File Systems
+   </td>
+   <td>Local<br>(Using the <code>file://</code> URI)
+   </td>
+   <td><a href="https://beam.apache.org/documentation/sdks/javadoc/2.5.0/org/apache/beam/sdk/io/TextIO.html">TextIO</a>
+   </td>
+   <td><a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/StreamingContext.html#textFileStream-java.lang.String-">textFileStream</a><br>(Spark treats most Unix systems as HDFS-compatible, but the location should be accessible from all nodes)
+   </td>
+  </tr>
+  <tr>
+   <td>HDFS<br>(Using the <code>hdfs://</code> URI)
+   </td>
+   <td><a href="https://beam.apache.org/documentation/sdks/javadoc/2.5.0/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.html">HadoopFileSystemOptions</a>
+   </td>
+   <td><a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/util/HdfsUtils.html">HdfsUtils</a>
+   </td>
+  </tr>
+  <tr>
+   <td rowspan="2" >Object Stores
+   </td>
+   <td>Cloud Storage<br>(Using the <code>gs://</code> URI)
+   </td>
+   <td rowspan="2" ><a href="https://beam.apache.org/documentation/sdks/javadoc/2.5.0/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.html">HadoopFileSystemOptions</a>
+   </td>
+   <td rowspan="2" ><a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#hadoopConfiguration--">hadoopConfiguration</a>
+<p>
+and <a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/StreamingContext.html#textFileStream-java.lang.String-">textFileStream</a>
+   </td>
+  </tr>
+  <tr>
+   <td>S3<br>(Using the <code>s3://</code> URI)
+   </td>
+  </tr>
+  <tr>
+   <td rowspan="3" >Messaging Queues
+   </td>
+   <td>Kafka
+   </td>
+   <td><a href="https://beam.apache.org/documentation/sdks/javadoc/2.5.0/org/apache/beam/sdk/io/kafka/KafkaIO.html">KafkaIO</a>
+   </td>
+   <td><a href="https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html">spark-streaming-kafka</a>
+   </td>
+  </tr>
+  <tr>
+   <td>Kinesis
+   </td>
+   <td><a href="https://beam.apache.org/documentation/sdks/javadoc/2.5.0/org/apache/beam/sdk/io/kinesis/KinesisIO.html">KinesisIO</a>
+   </td>
+   <td><a href="https://spark.apache.org/docs/latest/streaming-kinesis-integration.html">spark-streaming-kinesis</a>
+   </td>
+  </tr>
+  <tr>
+   <td>Cloud Pub/Sub
+   </td>
+   <td><a href="https://beam.apache.org/documentation/sdks/javadoc/2.5.0/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.html">PubsubIO</a>
+   </td>
+   <td><a href="https://github.com/apache/bahir/tree/master/streaming-pubsub">Spark-streaming-pubsub</a> from <a href="http://bahir.apache.org">Apache Bahir</a>
+   </td>
+  </tr>
+  <tr>
+   <td>Other
+   </td>
+   <td>Custom receivers
+   </td>
+   <td><a href="https://beam.apache.org/documentation/io/authoring-overview/#read-transforms">Read Transforms</a>
+   </td>
+   <td><a href="https://spark.apache.org/docs/latest/streaming-custom-receivers.html">receiverStream</a>
+   </td>
+  </tr>
+</table>
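+
+As a concrete starting point, here is a minimal sketch that consumes one of the sources from the table above, Kafka, using Beam's [KafkaIO](https://beam.apache.org/documentation/sdks/javadoc/2.5.0/org/apache/beam/sdk/io/kafka/KafkaIO.html). The broker addresses and topic name are placeholders:
+
+```java
+import org.apache.beam.sdk.Pipeline;
+import org.apache.beam.sdk.io.kafka.KafkaIO;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+import org.apache.kafka.common.serialization.LongDeserializer;
+import org.apache.kafka.common.serialization.StringDeserializer;
+
+public class KafkaReadExample {
+  public static void main(String[] args) {
+    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
+
+    // Read an unbounded stream of key/value records from Kafka.
+    PCollection<KV<Long, String>> records = pipeline.apply(
+        KafkaIO.<Long, String>read()
+            .withBootstrapServers("broker-1:9092,broker-2:9092") // placeholder brokers
+            .withTopic("my-topic")                               // placeholder topic
+            .withKeyDeserializer(LongDeserializer.class)
+            .withValueDeserializer(StringDeserializer.class)
+            .withoutMetadata()); // drop Kafka metadata, keeping just the KV pairs
+
+    pipeline.run().waitUntilFinish();
+  }
+}
+```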
+
+## Connectors for Python
+
+Beam has an official [Python SDK](https://beam.apache.org/documentation/sdks/python/) that currently supports a subset of the streaming features available in the Java SDK. Active development is underway to bridge the gap between the feature sets of the two SDKs. Currently for Python, the [Direct Runner](https://beam.apache.org/documentation/runners/direct/) and [Dataflow Runner](https://beam.apache.org/documentation/runners/dataflow/) are supported, and [several streaming options](https://beam.apache.org/documentation/sdks/python-streaming/) were introduced in beta in version 2.5.0.
+
+Spark also has a Python SDK called [PySpark](http://spark.apache.org/docs/latest/api/python/pyspark.html). As mentioned earlier, Scala code compiles to a bytecode that is executed by the JVM. PySpark uses [Py4J](https://www.py4j.org/), a library that enables Python programs to interact with the JVM and therefore access Java libraries, interact with Java objects, and register callbacks from Java. This allows PySpark to access native Spark objects like RDDs. Spark Structured Streaming supports [file sources](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/streaming/DataStreamReader.html) (local filesystems and HDFS-compatible systems like Cloud Storage or S3) and [Kafka](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html) as streaming inputs.
+
+Below are the main streaming input connectors available for Beam and Spark DStreams in Python:
+
+<table>
+  <tr>
+   <td>
+   </td>
+   <td>
+   </td>
+   <td><strong>Apache Beam</strong>
+   </td>
+   <td><strong>Apache Spark DStreams</strong>
+   </td>
+  </tr>
+  <tr>
+   <td rowspan="2" >File Systems
+   </td>
+   <td>Local
+   </td>
+   <td><a href="https://beam.apache.org/documentation/sdks/pydoc/2.5.0/apache_beam.io.textio.html">io.textio</a>
+   </td>
+   <td><a href="http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.StreamingContext.textFileStream">textFileStream</a>
+   </td>
+  </tr>
+  <tr>
+   <td>HDFS
+   </td>
+   <td><a href="https://beam.apache.org/documentation/sdks/pydoc/2.5.0/apache_beam.io.hadoopfilesystem.html">io.hadoopfilesystem</a>
+   </td>
+   <td><a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#hadoopConfiguration--">hadoopConfiguration</a> (Access through <code>sc._jsc</code> with Py4J)
+and <a href="http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.StreamingContext.textFileStream">textFileStream</a>
+   </td>
+  </tr>
+  <tr>
+   <td rowspan="2" >Object stores
+   </td>
+   <td>Google Cloud Storage
+   </td>
+   <td><a href="https://beam.apache.org/documentation/sdks/pydoc/2.5.0/apache_beam.io.gcp.gcsio.html">io.gcp.gcsio</a>
+   </td>
+   <td rowspan="2" ><a href="http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.StreamingContext.textFileStream">textFileStream</a>
+   </td>
+  </tr>
+  <tr>
+   <td>S3
+   </td>
+   <td>N/A
+   </td>
+  </tr>
+  <tr>
+   <td rowspan="3" >Messaging Queues
+   </td>
+   <td>Kafka
+   </td>
+   <td>N/A
+   </td>
+   <td><a href="http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.kafka.KafkaUtils">KafkaUtils</a>
+   </td>
+  </tr>
+  <tr>
+   <td>Kinesis
+   </td>
+   <td>N/A
+   </td>
+   <td><a href="http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#module-pyspark.streaming.kinesis">KinesisUtils</a>
+   </td>
+  </tr>
+  <tr>
+   <td>Cloud Pub/Sub
+   </td>
+   <td><a href="https://beam.apache.org/documentation/sdks/pydoc/2.5.0/apache_beam.io.gcp.pubsub.html">io.gcp.pubsub</a>
+   </td>
+   <td>N/A
+   </td>
+  </tr>
+  <tr>
+   <td>Other
+   </td>
+   <td>Custom receivers
+   </td>
+   <td><a href="https://beam.apache.org/documentation/sdks/python-custom-io/">BoundedSource and RangeTracker</a>
+   </td>
+   <td>N/A
+   </td>
+  </tr>
+</table>
+
+## Connectors for other languages
+
+### **Scala**
+
+Since Scala code is interoperable with Java and therefore has native compatibility with Java libraries (and vice versa), you can use the same Java connectors described above in your Scala programs. Apache Beam also has a [Scala SDK](https://github.com/spotify/scio) open-sourced [by Spotify](https://labs.spotify.com/2017/10/16/big-data-processing-at-spotify-the-road-to-scio-part-1/).
+
+### **Go**
+
+A [Go SDK](https://beam.apache.org/documentation/sdks/go/) for Apache Beam is under active development. It is currently experimental and is not recommended for production.
+
+### **R**
+
+Apache Beam does not have an official R SDK. Spark Structured Streaming is supported by an [R SDK](https://spark.apache.org/docs/latest/sparkr.html#structured-streaming), but only for [file sources](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources) as a streaming input.
+
+## Next steps
+
+We hope this article inspired you to try new and interesting ways of connecting streaming sources to your Beam pipelines!
+
+Check out the following links for further information:
+
+*   See a full list of all built-in and in-progress [I/O Transforms](https://beam.apache.org/documentation/io/built-in/) for Apache Beam.
+*   Learn about some Apache Beam mobile gaming pipeline [examples](https://beam.apache.org/get-started/mobile-gaming-example/).