Posted to issues@spark.apache.org by "Yang Juan hu (JIRA)" <ji...@apache.org> on 2016/04/27 12:47:13 UTC
[jira] [Created] (SPARK-14955) JDBCRelation should report an
IllegalArgumentException if stride equals 0
Yang Juan hu created SPARK-14955:
------------------------------------
Summary: JDBCRelation should report an IllegalArgumentException if stride equals 0
Key: SPARK-14955
URL: https://issues.apache.org/jira/browse/SPARK-14955
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.6.1, 1.5.1
Reporter: Yang Juan hu
Priority: Minor
In the file https://github.com/apache/spark/blob/40ed2af587cedadc6e5249031857a922b3b234ca/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
lines 56 and 57 contain the following:
val stride: Long = (partitioning.upperBound / numPartitions
- partitioning.lowerBound / numPartitions)
If we invoke columnPartition as below:
columnPartition(JDBCPartitioningInfo("partitionColumn", 0, 7, 8))
then stride evaluates to 7 / 8 - 0 / 8 = 0 under integer division, and columnPartition generates the following WHERE clauses:
whereClause: partitionColumn < 0
whereClause: partitionColumn >= 0 AND partitionColumn < 0
whereClause: partitionColumn >= 0 AND partitionColumn < 0
whereClause: partitionColumn >= 0 AND partitionColumn < 0
whereClause: partitionColumn >= 0 AND partitionColumn < 0
whereClause: partitionColumn >= 0 AND partitionColumn < 0
whereClause: partitionColumn >= 0 AND partitionColumn < 0
whereClause: partitionColumn >= 0
This causes data skew: for a column whose values all lie between 0 and 7, the first seven clauses match no rows, so the last partition contains all the data.
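The clauses listed above can be reproduced with a small Python model. This is only a sketch inferred from the stride formula and the output shown in this issue, not Spark's actual code; the names trunc_div and where_clauses are hypothetical, and trunc_div is assumed to mirror Scala's truncating Long division.

```python
from typing import List, Optional


def trunc_div(a: int, b: int) -> int:
    """Integer division that truncates toward zero (assumed to match Scala's Long '/')."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def where_clauses(column: str, lower: int, upper: int,
                  num_partitions: int) -> List[Optional[str]]:
    """Hypothetical model of how the per-partition WHERE clauses are built."""
    if num_partitions <= 1:
        return [None]  # a single partition needs no predicate
    # Same shape as the expression quoted from JDBCRelation.scala.
    stride = trunc_div(upper, num_partitions) - trunc_div(lower, num_partitions)
    clauses: List[Optional[str]] = []
    current = lower
    for i in range(num_partitions):
        # The first partition has no lower bound, the last one no upper bound.
        lo = f"{column} >= {current}" if i != 0 else None
        current += stride
        hi = f"{column} < {current}" if i != num_partitions - 1 else None
        if hi is None:
            clauses.append(lo)
        elif lo is None:
            clauses.append(hi)
        else:
            clauses.append(f"{lo} AND {hi}")
    return clauses


if __name__ == "__main__":
    for clause in where_clauses("partitionColumn", 0, 7, 8):
        print("whereClause:", clause)
```

With bounds 0 and 7 and 8 partitions the stride is 0, so every boundary stays at 0; with bounds 1 and 8 the stride is 1 and each partition gets its own value.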
I propose throwing an exception when stride equals 0, so that Spark users become aware of the data skew issue as early as possible:
if (stride == 0) throw new IllegalArgumentException("partitioning.upperBound / numPartitions - partitioning.lowerBound / numPartitions is zero")
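A user can apply the same check on the driver side before calling the JDBC reader. The following is a minimal Python sketch of the proposed guard; check_stride is a hypothetical helper, not part of Spark, and it assumes Spark computes the stride with truncating integer division as in the Scala line quoted earlier.

```python
def check_stride(lower: int, upper: int, num_partitions: int) -> int:
    """Return the partition stride, or raise if it would be zero."""

    def trunc_div(a: int, b: int) -> int:
        # Truncate toward zero (assumed to match Scala's Long division).
        q = abs(a) // abs(b)
        return q if (a >= 0) == (b >= 0) else -q

    stride = trunc_div(upper, num_partitions) - trunc_div(lower, num_partitions)
    if stride == 0:
        raise ValueError(
            "upperBound / numPartitions - lowerBound / numPartitions is zero; "
            "all rows would land in one partition"
        )
    return stride


if __name__ == "__main__":
    print(check_stride(1, 8, 8))
    try:
        check_stride(0, 7, 8)
    except ValueError as err:
        print("rejected:", err)
```

Failing early here is cheaper than discovering the skew after a long-running load has funnelled the whole table through one task.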
partitionColumn must be an integral type, so loading a big table from a DBMS requires a workaround.
Below is a real case of exporting data from an Oracle database through pyspark.
# Version with the data skew issue: PART_ID ranges over 0..7, so stride = 7/8 - 0/8 = 0
df=ssc.read.format("jdbc").options( url=url,
dbtable="( SELECT ORA_HASH(PART_COL,7) AS PART_ID, A.* FROM DBMS_TAB A ) TAB_ALIAS",
fetchSize="1000",
partitionColumn="PART_ID",
numPartitions="8",
lowerBound="0",
upperBound="7").load()
# Version without the data skew issue: PART_ID is shifted to 1..8, so stride = 8/8 - 1/8 = 1
df=ssc.read.format("jdbc").options( url=url,
dbtable="( SELECT ORA_HASH(PART_COL,7)+1 AS PART_ID, A.* FROM DBMS_TAB A ) TAB_ALIAS",
fetchSize="1000",
partitionColumn="PART_ID",
numPartitions="8",
lowerBound="1",
upperBound="8").load()