Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/01/16 22:57:50 UTC

[GitHub] [druid] jon-wei commented on a change in pull request #9171: Doc update for the new input source and the new input format

jon-wei commented on a change in pull request #9171: Doc update for the new input source and the new input format
URL: https://github.com/apache/druid/pull/9171#discussion_r367692944
 
 

 ##########
 File path: docs/development/extensions-core/hdfs.md
 ##########
 @@ -36,49 +36,110 @@ To use this Apache Druid extension, make sure to [include](../../development/ext
 |`druid.hadoop.security.kerberos.principal`|`druid@EXAMPLE.COM`| Principal user name |empty|
 |`druid.hadoop.security.kerberos.keytab`|`/etc/security/keytabs/druid.headlessUser.keytab`|Path to keytab file|empty|
 
-If you are using the Hadoop indexer, set your output directory to be a location on Hadoop and it will work.
+Besides the above settings, you also need to include all Hadoop configuration files (such as `core-site.xml` and `hdfs-site.xml`)
+in the Druid classpath. One way to do this is to copy those files into `${DRUID_HOME}/conf/_common`.
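
For example, a minimal sketch of that copy step, assuming your Hadoop client configuration lives under a hypothetical `/etc/hadoop/conf`:

```bash
# Copy the Hadoop client configuration into the Druid common classpath directory.
# /etc/hadoop/conf is a placeholder; use wherever your cluster keeps these files.
cp /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml ${DRUID_HOME}/conf/_common/
```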
+
+If you are using Hadoop ingestion, set your output directory to a location on Hadoop and it will work.
 If you want to eagerly authenticate against a secured Hadoop/HDFS cluster, you must set `druid.hadoop.security.kerberos.principal` and `druid.hadoop.security.kerberos.keytab`. This is an alternative to the cron job method that runs the `kinit` command periodically.
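
For reference, the cron job method mentioned above might look like the sketch below; the keytab path and principal reuse the example values from the table, and the renewal interval is an arbitrary placeholder:

```bash
# Hypothetical crontab entry: refresh the Kerberos ticket cache every 8 hours,
# as an alternative to eager authentication via druid.hadoop.security.kerberos.*.
0 */8 * * * kinit -k -t /etc/security/keytabs/druid.headlessUser.keytab druid@EXAMPLE.COM
```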
 
-### Configuration for Google Cloud Storage
+### Configuration for Cloud Storage
+
+You can also use AWS S3 or Google Cloud Storage as deep storage via the HDFS extension.
+
+#### Configuration for AWS S3
 
-The HDFS extension can also be used for GCS as deep storage.
+To use AWS S3 as deep storage, you need to configure `druid.storage.storageDirectory` properly.
 
 |Property|Possible Values|Description|Default|
 |--------|---------------|-----------|-------|
-|`druid.storage.type`|hdfs||Must be set.|
-|`druid.storage.storageDirectory`||gs://bucket/example/directory|Must be set.|
+|`druid.storage.type`|hdfs| |Must be set.|
+|`druid.storage.storageDirectory`|s3a://bucket/example/directory or s3n://bucket/example/directory|Path to the deep storage|Must be set.|
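
As a concrete illustration, the two properties above could be set as follows; the bucket and path are placeholders, and the file location follows the `${DRUID_HOME}/conf/_common` convention used earlier:

```bash
# Sketch: append HDFS deep storage settings pointing at an S3 bucket.
cat <<'EOF' >> ${DRUID_HOME}/conf/_common/common.runtime.properties
druid.storage.type=hdfs
druid.storage.storageDirectory=s3a://your-bucket/example/directory
EOF
```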
 
-All services that need to access GCS need to have the [GCS connector jar](https://cloud.google.com/hadoop/google-cloud-storage-connector#manualinstallation) in their class path. One option is to place this jar in <druid>/lib/ and <druid>/extensions/druid-hdfs-storage/
+You also need to include the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html), especially `hadoop-aws.jar`, in the Druid classpath.
+Run the command below to install the `hadoop-aws.jar` file under `${DRUID_HOME}/extensions/druid-hdfs-storage` on all nodes.
 
-Tested with Druid 0.9.0, Hadoop 2.7.2 and gcs-connector jar 1.4.4-hadoop2.
-
-<a name="firehose"></a>
+```bash
+java -classpath "${DRUID_HOME}/lib/*" org.apache.druid.cli.Main tools pull-deps -h "org.apache.hadoop:hadoop-aws:${HADOOP_VERSION}"
+cp ${DRUID_HOME}/hadoop-dependencies/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar ${DRUID_HOME}/extensions/druid-hdfs-storage/
+```
 
-## Native batch ingestion
+Finally, you need to add the properties below to your `core-site.xml`.
+For more configuration details, see the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html) documentation.
+
+```xml
+<property>
+  <name>fs.s3a.impl</name>
+  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
+  <description>The implementation class of the S3A Filesystem</description>
+</property>
+
+<property>
+  <name>fs.AbstractFileSystem.s3a.impl</name>
+  <value>org.apache.hadoop.fs.s3a.S3A</value>
+  <description>The implementation class of the S3A AbstractFileSystem.</description>
+</property>
+
+<property>
+  <name>fs.s3a.access.key</name>
+  <description>AWS access key ID. Omit for IAM role-based or provider-based authentication.</description>
+  <value>your access key</value>
+</property>
+
+<property>
+  <name>fs.s3a.secret.key</name>
+  <description>AWS secret key. Omit for IAM role-based or provider-based authentication.</description>
+  <value>your secret key</value>
+</property>
+```
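
Once those properties are in place, a quick sanity check, assuming the `hadoop` CLI is available and `your-bucket` is a placeholder, is to list the bucket through the S3A filesystem:

```bash
# If the S3A credentials and implementation classes are picked up correctly,
# this lists the bucket (or returns an empty listing) instead of failing to authenticate.
hadoop fs -ls s3a://your-bucket/
```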
 
-This firehose ingests events from a predefined list of files from a Hadoop filesystem.
-This firehose is _splittable_ and can be used by [native parallel index tasks](../../ingestion/native-batch.md#parallel-task).
-Since each split represents an HDFS file, each worker task of `index_parallel` will read an object.
+#### Configuration for Google Cloud Storage
 
 Review comment:
   Is there authentication configuration needed for accessing GCS? Could add that in a follow-on PR if so.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org