You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Bo Meng (JIRA)" <ji...@apache.org> on 2016/04/27 17:48:12 UTC

[jira] [Commented] (SPARK-14955) JDBCRelation should report an IllegalArgumentException if stride equals 0

    [ https://issues.apache.org/jira/browse/SPARK-14955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15260352#comment-15260352 ] 

Bo Meng commented on SPARK-14955:
---------------------------------

I will take a look to see what can be improved.

> JDBCRelation should report an IllegalArgumentException if stride equals 0
> -------------------------------------------------------------------------
>
>                 Key: SPARK-14955
>                 URL: https://issues.apache.org/jira/browse/SPARK-14955
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.5.1, 1.6.1
>            Reporter: Yang Juan hu
>            Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> In file https://github.com/apache/spark/blob/40ed2af587cedadc6e5249031857a922b3b234ca/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
> row 56 and 57 has following line
> val stride: Long = (partitioning.upperBound / numPartitions
>                       - partitioning.lowerBound / numPartitions)
> if we invoke columnPartition as below: 
> columnPartition( JDBCPartitioningInfo("partitionColumn", 0, 7, 8) );
> columnPartition will generate following where condition:
> whereClause: partitionColumn < 0
> whereClause: partitionColumn >= 0 AND partitionColumn < 0
> whereClause: partitionColumn >= 0 AND partitionColumn < 0
> whereClause: partitionColumn >= 0 AND partitionColumn < 0
> whereClause: partitionColumn >= 0 AND partitionColumn < 0
> whereClause: partitionColumn >= 0 AND partitionColumn < 0
> whereClause: partitionColumn >= 0 AND partitionColumn < 0
> whereClause: partitionColumn >= 0
> it will cause data skew, the last partition will contain all data.
> Propose to throw an exception if stride equal 0, help spark user to aware  data skew issue ASAP.
> if (stride == 0) return throw new IllegalArgumentException("partitioning.upperBound / numPartitions -  partitioning.lowerBound / numPartitions is zero");
> partitionColumn must be an integral type, if we want to load a big table from DBMS,  we need to do some work around.
> Real case to export data from ORACLE database through pyspark.
> #data skew issue version
> df=ssc.read.format("jdbc").options( url=url,
>     dbtable="( SELECT ORA_HASH(PART_COL,7)  AS PART_ID, A.* FROM DBMS_TAB A ) TAB_ALIAS",
>     fetchSize="1000",
>     partitionColumn="PART_ID",
>     numPartitions="8",
>     lowerBound="0",
>     upperBound="7").load()
> #no data skew issue version
> df=ssc.read.format("jdbc").options( url=url,
>     dbtable="( SELECT ORA_HASH(PART_COL,7)+1  AS PART_ID, A.* FROM DBMS_TAB A ) TAB_ALIAS",
>     fetchSize="1000",
>     partitionColumn="PART_ID",
>     numPartitions="8",
>     lowerBound="1",
>     upperBound="8").load()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org