Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/12/14 05:04:31 UTC
[GitHub] [druid] JulianJaffePinterest edited a comment on pull request #10920: Spark Direct Readers and Writers for Druid.
JulianJaffePinterest edited a comment on pull request #10920:
URL: https://github.com/apache/druid/pull/10920#issuecomment-993161134
Calling `.partitionBy` on a `DataFrameWriter` (what you get when you call `.write()` on a DataFrame) doesn't do anything for a v2 data source that doesn't have a managed catalog, which Druid does not (see #11929 for a recent example). The [docs](https://github.com/apache/druid/blob/8392f87236d4a9795aa4e2867eea18cdf0aeb8ec/docs/operations/spark.md#writer) have a more in-depth discussion of partitioning, but the short version is that you'll either need to partition your dataframe before calling `.write()` on it or use one of the `DruidDataFrame` wrapper's convenience methods. For example, in Scala:
```scala
import org.apache.druid.spark.DruidDataFrame
import org.apache.spark.sql.SaveMode

df.partitionAndWrite("__time", "millis", "DAY", 200000).format("druid").mode(SaveMode.Overwrite).options(map).save()
```
or in Java, where the implicit wrapper has to be applied explicitly through the Scala package object:
```java
import org.apache.druid.spark.package$;
import org.apache.spark.sql.SaveMode;

package$.MODULE$.DruidDataFrame(dataset).partitionAndWrite("__time", "millis", "DAY", 200000).format("druid").mode(SaveMode.Overwrite).options(map).save();
```
If you don't want to use implicits/wrapper classes, you can also use the partitioner directly:
```java
SingleDimensionPartitioner partitioner = new SingleDimensionPartitioner(dataset);
// Partition first, then write the already-partitioned dataset.
Dataset<Row> partitionedDataset = partitioner.partition("__time", "millis", "DAY", 200000, "dim1", true);
partitionedDataset.write().format("druid").mode(SaveMode.Overwrite).options(map).save();
```
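If it helps to picture what partitioning on `__time` at `DAY` granularity means, here is a rough, Spark-free sketch of the bucketing idea. It is not the connector's implementation: the class and method names (`DayBucketSketch`, `dayBucket`, `groupByDay`) are made up for illustration, and it only assumes that timestamps are epoch milliseconds and that days are cut on UTC boundaries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of the Druid Spark connector.
public class DayBucketSketch {
    private static final long MILLIS_PER_DAY = 86_400_000L;

    // Truncate an epoch-millis timestamp to the start of its UTC day.
    // floorDiv keeps timestamps before 1970 in the correct (earlier) bucket.
    public static long dayBucket(long epochMillis) {
        return Math.floorDiv(epochMillis, MILLIS_PER_DAY) * MILLIS_PER_DAY;
    }

    // Group timestamps by day bucket so each group lines up with one DAY interval.
    public static Map<Long, List<Long>> groupByDay(List<Long> timestamps) {
        Map<Long, List<Long>> buckets = new TreeMap<>();
        for (long ts : timestamps) {
            buckets.computeIfAbsent(dayBucket(ts), k -> new ArrayList<>()).add(ts);
        }
        return buckets;
    }
}
```

The point is just that rows falling in the same day end up together before the write, so a single Spark partition doesn't straddle several segment intervals.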
Also, are you setting `writer.version` in your options map? I'm surprised to see the segments differ in version between each partition. That's what's causing the partitions to overshadow each other.
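If you aren't setting it, one thing to try is computing a single version string once on the driver and passing it in the options map, so every partition writes with the same value. A minimal sketch, assuming the writer applies `writer.version` to every segment it publishes; the helper class and method names are hypothetical, and using an ISO-8601 timestamp as the version is my assumption based on how Druid segment versions usually look:

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; only the "writer.version" key comes from the connector options.
public class WriterOptionsSketch {
    // Create one version string for the whole job (assumed format: ISO-8601 timestamp).
    public static String newVersion() {
        return Instant.now().toString();
    }

    // Copy the existing options and pin the version so all partitions share it.
    public static Map<String, String> withVersion(Map<String, String> base, String version) {
        Map<String, String> options = new HashMap<>(base);
        options.put("writer.version", version);
        return options;
    }
}
```

You would call `newVersion()` once before the write and pass `withVersion(map, version)` to `.options(...)` in place of `map`.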