Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2016/04/27 20:42:13 UTC
[jira] [Assigned] (SPARK-14955) JDBCRelation should report an
IllegalArgumentException if stride equals 0
[ https://issues.apache.org/jira/browse/SPARK-14955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-14955:
------------------------------------
Assignee: (was: Apache Spark)
> JDBCRelation should report an IllegalArgumentException if stride equals 0
> -------------------------------------------------------------------------
>
> Key: SPARK-14955
> URL: https://issues.apache.org/jira/browse/SPARK-14955
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.5.1, 1.6.1
> Reporter: Yang Juan hu
> Priority: Minor
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> In the file https://github.com/apache/spark/blob/40ed2af587cedadc6e5249031857a922b3b234ca/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
> lines 56 and 57 contain the following:
> val stride: Long = (partitioning.upperBound / numPartitions
> - partitioning.lowerBound / numPartitions)
> If we invoke columnPartition as follows:
> columnPartition(JDBCPartitioningInfo("partitionColumn", 0, 7, 8))
> then stride is 7 / 8 - 0 / 8 = 0 under integer division, and columnPartition generates the following where clauses:
> whereClause: partitionColumn < 0
> whereClause: partitionColumn >= 0 AND partitionColumn < 0
> whereClause: partitionColumn >= 0 AND partitionColumn < 0
> whereClause: partitionColumn >= 0 AND partitionColumn < 0
> whereClause: partitionColumn >= 0 AND partitionColumn < 0
> whereClause: partitionColumn >= 0 AND partitionColumn < 0
> whereClause: partitionColumn >= 0 AND partitionColumn < 0
> whereClause: partitionColumn >= 0
> This causes data skew: the last partition ends up containing all the data.
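The clause list above can be reproduced outside Spark. The following is a minimal Python sketch modeled on the behavior described in this report, not a copy of Spark's Scala code; the function names `trunc_div` and `where_clauses` are hypothetical and chosen only for illustration.

```python
from typing import List, Optional


def trunc_div(a: int, b: int) -> int:
    """Integer division truncating toward zero, as JVM Long division does."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


def where_clauses(column: str, lower_bound: int, upper_bound: int,
                  num_partitions: int) -> List[Optional[str]]:
    """Hypothetical re-creation of the per-partition WHERE clauses."""
    # Divide first, then subtract, mirroring the stride expression in the report.
    stride = (trunc_div(upper_bound, num_partitions)
              - trunc_div(lower_bound, num_partitions))
    clauses: List[Optional[str]] = []
    current = lower_bound
    for i in range(num_partitions):
        # The first partition has no lower bound, the last has no upper bound.
        lower = f"{column} >= {current}" if i != 0 else None
        current += stride
        upper = f"{column} < {current}" if i != num_partitions - 1 else None
        if upper is None:
            clauses.append(lower)
        elif lower is None:
            clauses.append(upper)
        else:
            clauses.append(f"{lower} AND {upper}")
    return clauses


if __name__ == "__main__":
    for clause in where_clauses("partitionColumn", 0, 7, 8):
        print("whereClause:", clause)
```

With bounds 0 and 7 and 8 partitions the stride is 0, so every middle clause describes an empty range. With bounds 1 and 8 the stride is 1 and each partition covers a distinct value range.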
> I propose throwing an exception when stride equals 0, so that Spark users become aware of the data skew issue as early as possible:
> if (stride == 0) throw new IllegalArgumentException("partitioning.upperBound / numPartitions - partitioning.lowerBound / numPartitions is zero")
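A caller could apply the same check on the client side before issuing a JDBC read. The sketch below is only an illustration of that idea in Python: `checked_stride` is a hypothetical helper, not part of Spark or PySpark, and it assumes the stride is computed by dividing each bound by the partition count and then subtracting, as in the expression quoted in this report.

```python
def checked_stride(lower_bound: int, upper_bound: int, num_partitions: int) -> int:
    """Return the partition stride, or raise if it would be zero.

    Hypothetical client-side guard; it is not a Spark API.
    """
    def trunc_div(a: int, b: int) -> int:
        # Truncate toward zero, matching JVM Long division.
        q = abs(a) // abs(b)
        return q if (a < 0) == (b < 0) else -q

    stride = (trunc_div(upper_bound, num_partitions)
              - trunc_div(lower_bound, num_partitions))
    if stride == 0:
        raise ValueError(
            "upperBound / numPartitions - lowerBound / numPartitions is zero; "
            "all rows would land in a single partition"
        )
    return stride


if __name__ == "__main__":
    print(checked_stride(1, 8, 8))   # bounds from the non-skewed example
    try:
        checked_stride(0, 7, 8)      # bounds from the skewed example
    except ValueError as err:
        print("rejected:", err)
```

Failing early like this surfaces the misconfiguration before any query is sent to the database, which is the intent of the proposed exception.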
> partitionColumn must be of an integral type, so loading a big table from a DBMS requires a workaround.
> Here is a real case of exporting data from an Oracle database through PySpark.
> # Version with the data skew issue: stride = 7/8 - 0/8 = 0
> df = ssc.read.format("jdbc").options(
>     url=url,
>     dbtable="( SELECT ORA_HASH(PART_COL,7) AS PART_ID, A.* FROM DBMS_TAB A ) TAB_ALIAS",
>     fetchSize="1000",
>     partitionColumn="PART_ID",
>     numPartitions="8",
>     lowerBound="0",
>     upperBound="7").load()
> # Version without the data skew issue: stride = 8/8 - 1/8 = 1
> df = ssc.read.format("jdbc").options(
>     url=url,
>     dbtable="( SELECT ORA_HASH(PART_COL,7)+1 AS PART_ID, A.* FROM DBMS_TAB A ) TAB_ALIAS",
>     fetchSize="1000",
>     partitionColumn="PART_ID",
>     numPartitions="8",
>     lowerBound="1",
>     upperBound="8").load()