Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2019/04/24 20:51:59 UTC

[GitHub] [accumulo-website] keith-turner commented on a change in pull request #171: Created docs for using Apache Spark with Accumulo

keith-turner commented on a change in pull request #171: Created docs for using Apache Spark with Accumulo
URL: https://github.com/apache/accumulo-website/pull/171#discussion_r278311345
 
 

 ##########
 File path: _docs-2/development/spark.md
 ##########
 @@ -0,0 +1,107 @@
+---
+title: Spark
+category: development
+order: 3
+---
+
+[Apache Spark] applications can read from and write to Accumulo tables.
+
+Before reading this documentation, it may help to review the [MapReduce]
+documentation, as the API created for MapReduce jobs is also used by Spark.
+
+This documentation references code from the Accumulo [Spark example].
+
+## General configuration
+
+1. Create a [shaded jar] with your Spark code and all of your dependencies (excluding
+   Spark and Hadoop). When creating the shaded jar, you should relocate Guava
+   as Accumulo uses a different version. The [pom.xml] in the [Spark example] is
+   a good reference and can be used as a starting point for a Spark application.
+
+2. Submit the job by running `spark-submit` with your shaded jar. You should pass
+   in the location of the `accumulo-client.properties` file that will be used to
+   connect to your Accumulo instance (a sketch of reading this argument in your
+   main class follows the example below).
+    ```bash
+    $SPARK_HOME/bin/spark-submit \
+      --class com.my.spark.job.MainClass \
+      --master yarn \
+      --deploy-mode client \
+      /path/to/spark-job-shaded.jar \
+      /path/to/accumulo-client.properties
+    ```
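+
+As a minimal sketch of a main class that consumes this argument (the class name
+`MainClass` and the argument handling are assumptions, not taken from the
+[Spark example]):
+
+```java
+import java.util.Properties;
+
+import org.apache.accumulo.core.client.Accumulo;
+
+public class MainClass {
+  public static void main(String[] args) throws Exception {
+    // The first application argument is the path to accumulo-client.properties
+    Properties props = Accumulo.newClientProperties().from(args[0]).build();
+    // ... create a JavaSparkContext and run the job using props ...
+  }
+}
+```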
+
+## Reading from an Accumulo table
+
+Apache Spark can read from an Accumulo table by using [AccumuloInputFormat].
+
+```java
+// Configure a Hadoop Job object with client connection info and the input table
+Job job = Job.getInstance();
+AccumuloInputFormat.configure().clientProperties(props).table(inputTable).store(job);
+// Create an RDD of the table's key/value pairs from the configured job
+JavaPairRDD<Key,Value> data = sc.newAPIHadoopRDD(job.getConfiguration(),
+    AccumuloInputFormat.class, Key.class, Value.class);
+```
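+
+The resulting RDD can be processed using normal Spark operations. Below is a
+small illustrative sketch (the per-row count is an assumption, not taken from
+the [Spark example]):
+
+```java
+// Total number of key/value entries read from the table
+long numEntries = data.count();
+
+// Number of entries per row ID
+java.util.Map<String,Long> entriesPerRow = data
+    .map(tuple -> tuple._1().getRow().toString())
+    .countByValue();
+```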
+
+## Writing to an Accumulo table
+
+There are two ways to write to an Accumulo table.
+
+### Use a BatchWriter
+
+Write your data to Accumulo by creating an AccumuloClient for each partition and writing all
+data in the partition using a BatchWriter.
+
+```java
+Properties props = Accumulo.newClientProperties()
+                    .from("/path/to/accumulo-client.properties").build();
+JavaPairRDD<Key, Value> dataToWrite = ... ;
+dataToWrite.foreachPartition(iter -> {
+  try (AccumuloClient client = Accumulo.newClient().from(props).build();
 
 Review comment:
   Could add a comment above the try, like:
   
   ```java
     // Create client inside partition so that Spark does not attempt to serialize it.
     try (AccumuloClient client = Accumulo.newClient().from(props).build();
   ```
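   
   For context, a complete sketch of this write pattern (the table name
   `mytable` and the error handling are assumptions; the diff above is
   truncated at the reviewed line):
   
   ```java
   dataToWrite.foreachPartition(iter -> {
     // Create the client inside the partition so Spark does not serialize it.
     try (AccumuloClient client = Accumulo.newClient().from(props).build();
          BatchWriter bw = client.createBatchWriter("mytable")) {
       iter.forEachRemaining(kv -> {
         Mutation m = new Mutation(kv._1().getRow());
         m.put(kv._1().getColumnFamily(), kv._1().getColumnQualifier(), kv._2());
         try {
           bw.addMutation(m);
         } catch (MutationsRejectedException e) {
           throw new RuntimeException(e);
         }
       });
     } // closing the writer flushes buffered mutations to Accumulo
   });
   ```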
