Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/12/14 05:04:31 UTC

[GitHub] [druid] JulianJaffePinterest edited a comment on pull request #10920: Spark Direct Readers and Writers for Druid.

JulianJaffePinterest edited a comment on pull request #10920:
URL: https://github.com/apache/druid/pull/10920#issuecomment-993161134


   Calling `.partitionBy` on a `DataFrameWriter` (what you get when you call `.write()` on a `DataFrame`) doesn't do anything for a v2 data source that doesn't have a managed catalog, and Druid doesn't have one (see #11929 for a recent example). The [docs](https://github.com/apache/druid/blob/8392f87236d4a9795aa4e2867eea18cdf0aeb8ec/docs/operations/spark.md#writer) have a more in-depth discussion of partitioning, but the short version is that you'll need to either partition your dataframe before calling `.write()` on it or use one of the `DruidDataFrame` wrapper's convenience methods (for example,
   
   ```scala
   import org.apache.druid.spark.DruidDataFrame
   import org.apache.spark.sql.SaveMode
   
   df.partitionAndWrite("__time", "millis", "DAY", 200000).format("druid").mode(SaveMode.Overwrite).options(map).save()
   ```
   or in Java
   ```java
   import org.apache.druid.spark.package$.MODULE$.DruidDataFrame;
   
   DruidDataFrame(dataset).partitionAndWrite("__time", "millis", "DAY", 200000).format("druid").mode(SaveMode.Overwrite).options(map).save();
   ```
   )
   
   If you don't want to use implicits/wrapper classes, you can also use the partitioner directly:
   ```java
   SingleDimensionPartitioner partitioner = new SingleDimensionPartitioner(dataset);
   Dataset<Row> partitionedDataset = partitioner.partition("__time", "millis", "DAY", 200000, "dim1", true);
   partitionedDataset.write().format("druid").mode(SaveMode.Overwrite).options(map).save();
   ```
   
   Also, are you setting `writer.version` in your options map? I'm surprised to see the segments differ in version between each partition. That's what's causing the partitions to overshadow each other.
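
   If the version isn't pinned, one way to rule this out is to choose a single version string up front and put it in the options map that every write uses. A minimal sketch, assuming `writer.version` accepts an arbitrary string such as an ISO-8601 timestamp; the `WriterOptions` class and `withFixedVersion` helper are hypothetical names for illustration, not part of the connector:

   ```java
   import java.util.HashMap;
   import java.util.Map;

   // Hypothetical helper for illustration; not part of the Druid Spark connector.
   final class WriterOptions {
       private WriterOptions() {}

       // Returns a copy of `base` with `writer.version` set to `version`.
       // The caller's map is left unmodified.
       static Map<String, String> withFixedVersion(Map<String, String> base, String version) {
           Map<String, String> options = new HashMap<>(base);
           options.put("writer.version", version);
           return options;
       }
   }
   ```

   Computing the version once on the driver (for example with `Instant.now().toString()`) and passing the resulting map to `.options(...)` should mean every partition is written with the same value.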


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


