You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Yang Juan hu (JIRA)" <ji...@apache.org> on 2016/04/27 12:47:13 UTC
[jira] [Created] (SPARK-14955) JDBCRelation should report an IllegalArgumentException if stride equals 0

Yang Juan hu created SPARK-14955:
------------------------------------

             Summary: JDBCRelation should report an IllegalArgumentException if stride equals 0
                 Key: SPARK-14955
                 URL: https://issues.apache.org/jira/browse/SPARK-14955
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.6.1, 1.5.1
            Reporter: Yang Juan hu
            Priority: Minor


In file https://github.com/apache/spark/blob/40ed2af587cedadc6e5249031857a922b3b234ca/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala

row 56 and 57 has following line

val stride: Long = (partitioning.upperBound / numPartitions
                      - partitioning.lowerBound / numPartitions)

if we invoke columnPartition as below: 

columnPartition( JDBCPartitioningInfo("partitionColumn", 0, 7, 8) );


columnPartition will generate following where condition:

whereClause: partitionColumn < 0
whereClause: partitionColumn >= 0 AND partitionColumn < 0
whereClause: partitionColumn >= 0 AND partitionColumn < 0
whereClause: partitionColumn >= 0 AND partitionColumn < 0
whereClause: partitionColumn >= 0 AND partitionColumn < 0
whereClause: partitionColumn >= 0 AND partitionColumn < 0
whereClause: partitionColumn >= 0 AND partitionColumn < 0
whereClause: partitionColumn >= 0

it will cause data skew, the last partition will contain all data.

Propose to throw an exception if stride equal 0, help spark user to aware  data skew issue ASAP.

if (stride == 0) return throw new IllegalArgumentException("partitioning.upperBound / numPartitions -  partitioning.lowerBound / numPartitions is zero");

partitionColumn must be an integral type, if we want to load a big table from DBMS,  we need to do some work around.


Real case to export data from ORACLE database through pyspark.

#data skew issue version

df=ssc.read.format("jdbc").options( url=url,
    dbtable="( SELECT ORA_HASH(PART_COL,7)  AS PART_ID, A.* FROM DBMS_TAB A ) TAB_ALIAS",
    fetchSize="1000",
    partitionColumn="PART_ID",
    numPartitions="8",
    lowerBound="0",
    upperBound="7").load()


#no data skew issue version
df=ssc.read.format("jdbc").options( url=url,
    dbtable="( SELECT ORA_HASH(PART_COL,7)+1  AS PART_ID, A.* FROM DBMS_TAB A ) TAB_ALIAS",
    fetchSize="1000",
    partitionColumn="PART_ID",
    numPartitions="8",
    lowerBound="1",
    upperBound="8").load()




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org