Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/11/24 05:56:25 UTC

[GitHub] [druid] jihoonson commented on a change in pull request #11823: Add Spark connector reader support.

jihoonson commented on a change in pull request #11823:
URL: https://github.com/apache/druid/pull/11823#discussion_r755721373



##########
File path: docs/operations/spark.md
##########
@@ -0,0 +1,279 @@
+---
+id: spark
+title: "Apache Spark Reader and Writer"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+# Apache Spark Reader and Writer for Druid
+
+## Reader
+The reader loads Druid segments from deep storage into Spark. It locates the segments to read and, if a schema is not
+provided explicitly, determines their schema by querying the brokers for the relevant metadata. Beyond that, it does
+not interact with a running Druid cluster.
+
+Sample Code:
+```scala
+import org.apache.druid.spark.DruidDataFrameReader
+
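+// Note: LocalDeepStorageConfig also needs to be imported from the connector's deep storage
+// config package (import path omitted here).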
+val deepStorageConfig = new LocalDeepStorageConfig().storageDirectory("/mnt/druid/druid-segments/")
+
+sparkSession
+  .read
+  .brokerHost("localhost")
+  .brokerPort(8082)
+  .metadataDbType("mysql")
+  .metadataUri("jdbc:mysql://druid.metadata.server:3306/druid")
+  .metadataUser("druid")
+  .metadataPassword("diurd")
+  .dataSource("dataSource")
+  .deepStorage(deepStorageConfig)
+  .druid()
+```
+
+Alternatively, the reader can be configured via a properties map with no additional import needed:
+```scala
+val properties = Map[String, String](
+  "metadata.dbType" -> "mysql",
+  "metadata.connectUri" -> "jdbc:mysql://druid.metadata.server:3306/druid",
+  "metadata.user" -> "druid",
+  "metadata.password" -> "diurd",
+  "broker.host" -> "localhost",
+  "broker.port" -> 8082,
+  "table" -> "dataSource",
+  "reader.deepStorageType" -> "local",
+  "local.storageDirectory" -> "/mnt/druid/druid-segments/"
+)
+
+sparkSession
+  .read
+  .format("druid")
+  .options(properties)
+  .load()
+```
+
+If you already know the schema of the Druid data source you're reading from, you can avoid the schema-discovery calls
+to the broker by providing the schema explicitly:
+```scala
+sparkSession
+  .read
+  .format("druid")
+  .schema(schema)
+  .options(properties)
+  .load()
+```
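+
+For example, the `schema` above might be constructed as follows. This is a minimal sketch: the `country` and `added`
+columns are hypothetical stand-ins for your data source's dimensions and metrics, and `__time` is assumed to surface
+as a long of epoch milliseconds.
+
+```scala
+import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
+
+// Hypothetical schema; replace the column names and types with those of your data source.
+val schema = StructType(Seq(
+  StructField("__time", LongType),    // assumed to be epoch milliseconds
+  StructField("country", StringType),
+  StructField("added", LongType)
+))
+```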
+
+Filters should be applied to the read-in data frame before any [Spark actions](http://spark.apache.org/docs/2.4.7/api/scala/index.html#org.apache.spark.sql.Dataset)
+are triggered, to allow predicates to be pushed down to the reader and avoid full scans of the underlying Druid data.
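+
+For example, the following sketch (reusing the `properties` map above and assuming a hypothetical `country` dimension)
+defines a filter before calling an action, so the predicate can be pushed down to the reader:
+
+```scala
+import org.apache.spark.sql.functions.col
+
+val df = sparkSession
+  .read
+  .format("druid")
+  .options(properties)
+  .load()
+
+// Define the filter first; only the action below triggers the actual read, with the
+// predicate pushed down to the Druid reader rather than scanning all underlying data.
+val filtered = df.filter(col("country") === "US")
+filtered.count()
+```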
+
+## Plugin Registries and Druid Extension Support
+One of Druid's strengths is its extensibility. Since these Spark readers and writers will not execute on a Druid cluster
+and won't have the ability to dynamically load classes or integrate with Druid's Guice injectors, Druid extensions can't
+be used directly. Instead, these connectors use a plugin registry architecture, including default plugins that support
+most functionality in `extensions-core`. Custom plugins, consisting of a string name and one or more serializable
+generator functions, must be registered before the first Spark action that depends on them is called.
+
+### ComplexMetricRegistry
+The `ComplexMetricRegistry` provides support for serializing and deserializing complex metric types between Spark and
+Druid. Support for complex metric types in Druid core extensions is provided out of the box.
+
+Users wishing to override the default behavior or who need to add support for additional complex metric types can
+use the `ComplexMetricRegistry.register` functions to associate serde functions with a given complex metric type. The
+name used to register custom behavior must match the complex metric type name reported by Druid.
+**Note that custom plugins must be registered with both the executors and the Spark driver.**
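+
+As a purely illustrative sketch (the exact `register` signature is defined by the connector, and both the
+`custom-sketch` type name and the `CustomSketchSerde` class are hypothetical), a registration might look like:
+
+```scala
+// Hypothetical: "custom-sketch" must match the complex metric type name reported by Druid,
+// and the generator function must be serializable. Run this on both the driver and executors.
+ComplexMetricRegistry.register("custom-sketch", () => new CustomSketchSerde())
+```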
+
+### SegmentReaderRegistry
+The `SegmentReaderRegistry` provides support for reading segments from deep storage. Local, HDFS, GCS, S3, and Azure
+Storage deep storage implementations are supported by default.
+
+Users wishing to override the default behavior or who need to add support for additional deep storage implementations
+can use either `SegmentReaderRegistry.registerInitializer` (to provide any necessary Jackson configuration for
+deserializing a `LoadSpec` object from a segment load spec) or `SegmentReaderRegistry.registerLoadFunction` (to register
+a function for creating a URI from a segment load spec). These two functions correspond to the first and second approach
+[outlined below](#deep-storage). **Note that custom plugins must be registered on the executors, not the Spark driver.**
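+
+As an illustrative sketch only (the exact `registerLoadFunction` signature, the load spec's shape, and the
+`myblobstore` deep storage type are all hypothetical):
+
+```scala
+// Hypothetical: builds a URI from a segment load spec for a made-up "myblobstore" deep storage
+// type. Register on the executors, where segments are actually read.
+SegmentReaderRegistry.registerLoadFunction("myblobstore", (loadSpec: Map[String, AnyRef]) =>
+  new java.net.URI(s"myblobstore://${loadSpec("bucket")}/${loadSpec("path")}")
+)
+```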
+
+### SQLConnectorRegistry
+The `SQLConnectorRegistry` provides support for configuring connectors to Druid metadata databases. Support for MySQL,
+PostgreSQL, and Derby databases is provided out of the box.
+
+Users wishing to override the default behavior or who need to add support for additional metadata database
+implementations can use the `SQLConnectorRegistry.register` function. Custom connectors should be registered on the
+driver.
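+
+Again as an illustrative sketch only (the `register` signature, the `mariadb` type name, and the connector factory
+shown here are hypothetical):
+
+```scala
+// Hypothetical: associates the metadata dbType "mariadb" with a factory for a custom connector.
+// Register on the driver before the first Spark action that needs metadata access.
+SQLConnectorRegistry.register("mariadb", () => new MariaDbSqlConnector())
+```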
+
+## Deploying to a Spark cluster
+This extension can be run on a Spark cluster in one of two ways: bundled as part of an application jar, or uploaded to
+the cluster as a library jar and included in the classpath provided to Spark applications by the application manager.
+If the second approach is used, this extension should be built with
+`mvn clean package -pl spark` and the resulting jar `druid-spark-<VERSION>.jar`
+uploaded to the Spark cluster. Application jars should then be built with a compile-time dependency on
+`org.apache.druid:druid-spark` (e.g. marked as `provided` in Maven or with `compileOnly` in Gradle).
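+
+For example, in an sbt build (assuming sbt rather than Maven or Gradle; the coordinates follow the ones above and the
+version should match the `druid-spark-<VERSION>.jar` deployed to the cluster):
+
+```scala
+// Provided-scope dependency: compiled against, but not bundled into the application jar.
+libraryDependencies += "org.apache.druid" % "druid-spark" % "<VERSION>" % Provided
+```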
+
+## Configuration Reference
+
+### Metadata Client Configs
+These properties configure the client that interacts directly with the Druid metadata server. They are used by both the
+reader and the writer. The `metadata.password` property can be provided either as a plain string that will be used
+as-is or as a serialized DynamicConfigProvider that will be resolved when the metadata client is first instantiated. If
+a custom DynamicConfigProvider is used, be sure to register the provider with the DynamicConfigProviderRegistry before use.
+
+|Key|Description|Required|Default|
+|---|-----------|--------|-------|
+|`metadata.dbType`|The metadata server's database type (e.g. `mysql`)|Yes||
+|`metadata.host`|The metadata server's host name|If using derby|`localhost`|
+|`metadata.port`|The metadata server's port|If using derby|1527|
+|`metadata.connectUri`|The URI to use to connect to the metadata server|If not using derby||
+|`metadata.user`|The user to use when connecting to the metadata server|If required by the metadata database||
+|`metadata.password`|The password to use when connecting to the metadata server. This can optionally be a serialized instance of a Druid DynamicConfigProvider or a plain string|If required by the metadata database||
+|`metadata.dbcpProperties`|The connection pooling properties to use when connecting to the metadata server|No||
+|`metadata.baseName`|The base name used when creating Druid metadata tables|No|`druid`|
+
+### Druid Client Configs
+The configuration properties used to query the Druid cluster for segment metadata. Only used in the reader.
+
+|Key|Description|Required|Default|
+|---|-----------|--------|-------|
+|`broker.host`|The hostname of a broker in the Druid cluster to read from|No|`localhost`|
+|`broker.port`|The port of the broker in the Druid cluster to read from|No|8082|
+|`broker.numRetries`|The number of times to retry a timed-out segment metadata request|No|5|
+|`broker.retryWaitSeconds`|How long (in seconds) to wait before retrying a timed-out segment metadata request|No|5|
+|`broker.timeoutMilliseconds`|How long (in milliseconds) to wait before timing out a segment metadata request|No|300000|
+
+### Reader Configs
+The properties used to configure the DataSourceReader when reading data from Druid in Spark.
+
+|Key|Description|Required|Default|
+|---|-----------|--------|-------|
+|`table`|The Druid data source to read from|Yes||
+|`reader.deepStorageType`|The type of deep storage used to back the target Druid cluster|No|`local`|
+|`reader.segments`|A hard-coded list of Druid segments to read. If set, the `table` and Druid client configurations are ignored and the specified segments are read directly. Must be deserializable into Druid DataSegment instances|No||
+|`reader.useCompactSketches`|Controls whether or not compact representations of complex metrics are used (only for metrics that support compact forms)|No|False|

Review comment:
       Then, perhaps it's better to rephrase this to something like: 
   
   ```suggestion
   |`reader.useCompactSketches`|Controls whether or not compact representations of sketch metrics are used|No|False|
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


