Posted to issues@hive.apache.org by "BELUGA BEHR (JIRA)" <ji...@apache.org> on 2017/08/06 18:51:00 UTC
[jira] [Comment Edited] (HIVE-16758) Better Select Number of Replications
[ https://issues.apache.org/jira/browse/HIVE-16758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16115898#comment-16115898 ]
BELUGA BEHR edited comment on HIVE-16758 at 8/6/17 6:50 PM:
------------------------------------------------------------
[~csun] Thank you for the feedback.
The reason I came across this issue in the first place was that I had to run some tests using Hive-on-Spark on a 3-node cluster. My queries repeatedly failed immediately because the default replication value of 10 exceeded the size of my 3-node cluster, where {{dfs.replication.max}} was set to 3. After each failure, I would have to set {{dfs.replication.max}} to 10 to continue my testing, not because it was the appropriate value, but because Hive-on-Spark wouldn't work otherwise. Users should be able to run Hive-on-Spark on a 3-node cluster without additional configuration. Scaling Hive-on-Spark should require additional configuration, not the other way around.
I can change the variable name.
It's not my call regarding {{mapred.submit.replication}}. However, since it was not already being used in this context, I would not recommend introducing a deprecated configuration into new code.
was (Author: belugabehr):
[~csun] Thank you for the feedback.
The reason I came across this issue in the first place was that I had to run some tests using Hive-on-Spark on a 3-node cluster. My queries repeatedly failed immediately because the default replication value of 10 exceeded the size of my 3-node cluster, where {{dfs.replication.max}} was set to 3. After each failure, I would have to set {{dfs.replication.max}} to 10 to continue my testing. Users should be able to run Hive-on-Spark on a 3-node cluster without additional configuration. Scaling Hive-on-Spark should require additional configuration, not the other way around.
I can change the variable name.
It's not my call regarding {{mapred.submit.replication}}. However, since it was not already being used in this context, I would not recommend introducing a deprecated configuration into new code.
> Better Select Number of Replications
> ------------------------------------
>
> Key: HIVE-16758
> URL: https://issues.apache.org/jira/browse/HIVE-16758
> Project: Hive
> Issue Type: Improvement
> Reporter: BELUGA BEHR
> Assignee: BELUGA BEHR
> Priority: Minor
> Attachments: HIVE-16758.1.patch
>
>
> {{org.apache.hadoop.hive.ql.exec.SparkHashTableSinkOperator.java}}
> We should be smarter about how we pick a replication number. We should add a new configuration equivalent to {{mapreduce.client.submit.file.replication}}. Its value should be around the square root of the number of nodes rather than hard-coded.
> {code}
> public static final String DFS_REPLICATION_MAX = "dfs.replication.max";
> private int minReplication = 10;
>
> @Override
> protected void initializeOp(Configuration hconf) throws HiveException {
>   ...
>   int dfsMaxReplication = hconf.getInt(DFS_REPLICATION_MAX, minReplication);
>   // minReplication value should not cross the value of dfs.replication.max
>   minReplication = Math.min(minReplication, dfsMaxReplication);
> }
> {code}
> https://hadoop.apache.org/docs/r2.7.2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
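The square-root suggestion in the description above could be sketched roughly as follows. This is only an illustration of the arithmetic, not the attached patch: the class {{ReplicationChooser}} and both method names are hypothetical and do not exist in Hive, and the node count and {{dfs.replication.max}} value are assumed to be passed in by the caller rather than read from a Hadoop {{Configuration}}.

```java
// Hypothetical helper for illustration only; not part of Hive or Hadoop.
final class ReplicationChooser {
    private ReplicationChooser() {}

    /** Returns roughly the square root of the node count, never less than 1. */
    static int sqrtOfNodes(int nodeCount) {
        if (nodeCount < 1) {
            throw new IllegalArgumentException("nodeCount must be >= 1");
        }
        return Math.max(1, (int) Math.round(Math.sqrt(nodeCount)));
    }

    /** Clamps the requested replication into the range [1, dfsMaxReplication]. */
    static int clamp(int requested, int dfsMaxReplication) {
        return Math.max(1, Math.min(requested, dfsMaxReplication));
    }
}
```

Under these assumptions, a 3-node cluster with {{dfs.replication.max}} set to 3 would get {{clamp(sqrtOfNodes(3), 3)}}, which stays within the limit without any extra configuration, while a larger cluster would scale up from there.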