Posted to issues@kylin.apache.org by "Congling Xia (Jira)" <ji...@apache.org> on 2019/12/30 08:10:01 UTC

[jira] [Updated] (KYLIN-4320) number of replicas of Cuboid files cannot be configured for Spark engine

     [ https://issues.apache.org/jira/browse/KYLIN-4320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Congling Xia updated KYLIN-4320:
--------------------------------
    Description: 
Hi, team. I tried to change `dfs.replication` to 3 by adding the following config override:
{code:java}
kylin.engine.spark-conf.spark.hadoop.dfs.replication=3
{code}
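As far as I understand, the `spark.hadoop.` prefix means the rest of the key is copied into the Hadoop Configuration that the SparkContext exposes, so the override above should be observable like this (a hypothetical standalone check, not Kylin code):
{code:java}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical sanity check, not part of Kylin: with
// spark.hadoop.dfs.replication=3 passed to the job, the value copied into
// the Hadoop Configuration should be 3, unless some code overwrites it
// later on the driver.
public class ReplicationCheck {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("replication-check");
        try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
            System.out.println("dfs.replication = " + jsc.hadoopConfiguration().get("dfs.replication"));
        }
    }
}
{code}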
Then I get a strange result: the number of replicas of the cuboid files varies even though they are at the same cuboid level.

!cuboid_replications.png!
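(For reference, the per-file replication shown in the screenshot can be double-checked with something like the following; the path is just an example, not the real job output directory.)
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: print the replication factor of every cuboid file
// under one level directory (the directory below is made up).
public class PrintReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path levelDir = new Path("/kylin/kylin_metadata/kylin-<job_id>/<cube_name>/cuboid/level_1");
        for (FileStatus status : fs.listStatus(levelDir)) {
            System.out.println(status.getPath().getName() + " -> replication=" + status.getReplication());
        }
    }
}
{code}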

I guess it is due to the conflicting settings in SparkUtil:
{code:java}
public static void modifySparkHadoopConfiguration(SparkContext sc) throws Exception {
    sc.hadoopConfiguration().set("dfs.replication", "2"); // cuboid intermediate files, replication=2
    sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress", "true");
    sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress.type", "BLOCK");
    sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.DefaultCodec"); // or org.apache.hadoop.io.compress.SnappyCodec
}
{code}
It may be an issue with Spark property precedence. After checking the [Spark documentation|http://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties], it seems that some programmatically set properties may not take effect, and setting them this way is not the recommended approach for configuring Spark jobs.

 

Anyway, cuboid files may survive for weeks until they expire or are merged, so the configuration rewrite in `org.apache.kylin.engine.spark.SparkUtil#modifySparkHadoopConfiguration` makes those files less reliable.

Is there any way to force cuboid files to keep 3 replicas? Or shall we remove the code in SparkUtil so that `kylin.engine.spark-conf.spark.hadoop.dfs.replication` works properly? A possible compromise is sketched below.
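If removing the line entirely feels too aggressive, a possible compromise (just a rough sketch, not tested) would be to apply the hard-coded default only when `spark.hadoop.dfs.replication` has not been set explicitly:
{code:java}
public static void modifySparkHadoopConfiguration(SparkContext sc) throws Exception {
    // Only fall back to replication=2 for cuboid intermediate files when the user
    // has not overridden it, e.g. via kylin.engine.spark-conf.spark.hadoop.dfs.replication.
    if (!sc.getConf().contains("spark.hadoop.dfs.replication")) {
        sc.hadoopConfiguration().set("dfs.replication", "2");
    }
    sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress", "true");
    sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress.type", "BLOCK");
    sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.DefaultCodec"); // or org.apache.hadoop.io.compress.SnappyCodec
}
{code}
That would keep the current space-saving default for everyone else while letting an explicit spark.hadoop override win when it is present.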


> number of replicas of Cuboid files cannot be configured for Spark engine
> ------------------------------------------------------------------------
>
>                 Key: KYLIN-4320
>                 URL: https://issues.apache.org/jira/browse/KYLIN-4320
>             Project: Kylin
>          Issue Type: Bug
>          Components: Job Engine
>            Reporter: Congling Xia
>            Priority: Major
>         Attachments: cuboid_replications.png
>
>
> Hi, team. I tried to change `dfs.replication` to 3 by adding the following config override:
> {code:java}
> kylin.engine.spark-conf.spark.hadoop.dfs.replication=3
> {code}
> Then I get a strange result: the number of replicas of the cuboid files varies even though they are at the same cuboid level.
> !cuboid_replications.png!
> I guess it is due to the conflicting settings in SparkUtil:
> {code:java}
> public static void modifySparkHadoopConfiguration(SparkContext sc) throws Exception {
>     sc.hadoopConfiguration().set("dfs.replication", "2"); // cuboid intermediate files, replication=2
>     sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress", "true");
>     sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress.type", "BLOCK");
>     sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.DefaultCodec"); // or org.apache.hadoop.io.compress.SnappyCodec
> }
> {code}
> It may be an issue with Spark property precedence. After checking the [Spark documentation|http://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties], it seems that some programmatically set properties may not take effect, and setting them this way is not the recommended approach for configuring Spark jobs.
>  
> Anyway, cuboid files may survive for weeks until they expire or are merged, so the configuration rewrite in `org.apache.kylin.engine.spark.SparkUtil#modifySparkHadoopConfiguration` makes those files less reliable.
> Is there any way to force cuboid files to keep 3 replicas? Or shall we remove the code in SparkUtil so that `kylin.engine.spark-conf.spark.hadoop.dfs.replication` works properly?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)