Posted to issues@spark.apache.org by "Zack Behringer (JIRA)" <ji...@apache.org> on 2017/10/17 18:55:01 UTC

[jira] [Created] (SPARK-22299) Use OFFSET and LIMIT for JDBC DataFrameReader striping

Zack Behringer created SPARK-22299:
--------------------------------------

             Summary: Use OFFSET and LIMIT for JDBC DataFrameReader striping
                 Key: SPARK-22299
                 URL: https://issues.apache.org/jira/browse/SPARK-22299
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.2.0, 2.1.0, 2.0.0, 1.6.0, 1.5.0, 1.4.0
            Reporter: Zack Behringer
            Priority: Minor


Loading a large table (300M rows) from JDBC can be partitioned into tasks using the column, numPartitions, lowerBound and upperBound parameters on DataFrameReader.jdbc() (a sketch of that call is below), but that becomes troublesome if the column is skewed/fragmented (e.g., somebody used a global sequence for the partition column instead of a sequence specific to the table, or the table has become fragmented by deletes, etc.).
This can be worked around by using a modulus operation on the column, but that will be slow unless there is already an index on the modulus expression with the exact numPartitions value, so it doesn't scale well if you want to change the number of partitions. Another way would be to use an expression index on a hash of the partition column, but I'm not sure if JDBC striping is smart enough to create hash ranges for each stripe from hashes of the lowerBound and upperBound parameters. If it is, that is great, but it still requires a very large index just for this use case.
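
For reference, here is roughly what the column-based striping call above looks like today (the URL, table name, credentials and bounds are placeholders, not real values):

{code:scala}
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-stripe-sketch").getOrCreate()

val props = new Properties()
props.setProperty("user", "reader")       // placeholder credentials
props.setProperty("password", "secret")

// Spark turns this into numPartitions queries with non-overlapping ranges on
// the partition column, e.g. WHERE id < 30000001, WHERE id >= 30000001 AND
// id < 60000001, ... If the column is a sparse global sequence, the ranges
// hold very different numbers of rows.
val df = spark.read.jdbc(
  "jdbc:postgresql://db-host/sales",  // placeholder URL
  "events",                           // placeholder table (~300M rows)
  "id",                               // partition column
  1L,                                 // lowerBound
  300000000L,                         // upperBound
  10,                                 // numPartitions
  props)
{code}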

A less invasive approach would be to use the table's physical ordering along with OFFSET and LIMIT, so that only the total number of records to read would need to be known beforehand in order to distribute the rows evenly; no indexes would be needed. I realize that OFFSET and LIMIT are not standard SQL keywords.
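
To illustrate, this is roughly how OFFSET/LIMIT striping can be approximated by hand today with subqueries in the dbtable option, assuming the total row count is already known (PostgreSQL-style syntax; the table, column and count are placeholders, and an ORDER BY on a stable key is needed so the stripes don't overlap):

{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("offset-limit-sketch").getOrCreate()

val url = "jdbc:postgresql://db-host/sales"  // placeholder URL
val totalRows = 300000000L                   // would need a SELECT count(*) first
val numPartitions = 10
val stripeSize = (totalRows + numPartitions - 1) / numPartitions

val stripes: Seq[DataFrame] = (0 until numPartitions).map { i =>
  // Each stripe is a subquery passed as the "table", so the database does the
  // OFFSET/LIMIT work; ORDER BY keeps the stripes consistent across queries.
  val subquery =
    s"(SELECT * FROM events ORDER BY id OFFSET ${i * stripeSize} LIMIT $stripeSize) AS stripe_$i"
  spark.read.format("jdbc")
    .option("url", url)
    .option("dbtable", subquery)
    .option("user", "reader")       // placeholder credentials
    .option("password", "secret")
    .load()
}

val df = stripes.reduce(_ union _)
{code}

Each stripe loads as a single JDBC partition, so the union ends up with numPartitions partitions overall.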

I also see that a list of custom predicates can be defined. I haven't tried that yet to see whether I can embed numPartitions-specific predicates, each with its own OFFSET and LIMIT range (a sketch of the predicates overload is below).
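
For completeness, this is the predicates overload I'm referring to, sketched with the modulus workaround from above (placeholders again). As I understand it, each predicate string becomes that partition's WHERE clause, so whether an OFFSET/LIMIT pair can be embedded in a predicate would depend on the database accepting those clauses after WHERE:

{code:scala}
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-predicates-sketch").getOrCreate()

val props = new Properties()
props.setProperty("user", "reader")       // placeholder credentials
props.setProperty("password", "secret")

// One partition per predicate; each string is injected into that partition's
// WHERE clause.
val numPartitions = 10
val predicates = (0 until numPartitions)
  .map(i => s"mod(id, $numPartitions) = $i")   // placeholder column `id`
  .toArray

val df = spark.read.jdbc(
  "jdbc:postgresql://db-host/sales",  // placeholder URL
  "events",                           // placeholder table
  predicates,
  props)
{code}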

Some relational databases take quite a long time to count the number of records in order to determine the stripe size, though, so this can also be troublesome. Could a feature similar to "spark.sql.files.maxRecordsPerFile" be used in conjunction with the number of executors to read manageable batches (internally using OFFSET and LIMIT) until there are no more results available?


