Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/11/25 09:25:31 UTC

[GitHub] [spark] akhalymon-cv opened a new pull request #34709: [WIP][SPARK-37259] Add option to unwrap query to support CTE for MSSQL JDBC driver

akhalymon-cv opened a new pull request #34709:
URL: https://github.com/apache/spark/pull/34709


   ### What changes were proposed in this pull request?
   This PR adds the boolean option 'useRawQuery', which stops Spark from wrapping the query in a 'select' statement when resolving the schema, since that wrapping breaks CTEs with the MSSQL JDBC driver. When set to 'true', the user must also provide the schema of the result in the 'customSchema' option. The obvious downside of this approach is that the user has to determine the schema beforehand. The advantage is that we avoid running the query twice just to get the schema, and users can run their query without modification, unlike the solution described in https://github.com/apache/spark/pull/34693
   
   
   ### Why are the changes needed?
   These changes are needed to support CTEs when using the MSSQL JDBC driver.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Added the option 'useRawQuery' to the JDBC options. Example:
   ```
   JdbcMsSqlDF = (
       spark.read.format("jdbc")
       .option("url", f"jdbc:sqlserver://{server}:{port};databaseName={database};")
       .option("user", user)
       .option("password", password)
       .option("useRawQuery", "true")  # <----------- Do not wrap the query
       .option("customSchema", schema)
       .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
       .option("query", query)
       .load()
   )
   ```
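   For reference, the 'customSchema' value is the usual Spark DDL-style column list. A hypothetical pairing of query and schema (the column names and types below are illustrative, not taken from the PR):
   ```python
   # Hypothetical values for the 'query' and 'customSchema' options above;
   # the table and columns are made up for illustration.
   query = "WITH cte AS (SELECT id, total FROM sales) SELECT id, total FROM cte"
   schema = "id INT, total DECIMAL(18,2)"
   ```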
   
   
   ### How was this patch tested?
   The patch was tested manually; unit tests are pending since this is still a WIP.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #34709: [WIP][SPARK-37259] Add option to unwrap query to support CTE for MSSQL JDBC driver

Posted by GitBox <gi...@apache.org>.
peter-toth commented on pull request #34709:
URL: https://github.com/apache/spark/pull/34709#issuecomment-979111758


   > As far as I understand now, we will be not able to add partitioning without nesting the user query, as we are already doing. @peter-toth what do you think?
   
   Either we nest the user query (but MSSQL doesn't support nesting when the query has a CTE clause), or we split the query into WITH and SELECT clauses as my PR https://github.com/apache/spark/pull/34693 does. But splitting a query programmatically is quite tough, so I left that to the user...
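   A rough sketch of the two approaches on a toy query (the exact SQL Spark generates, including the subquery alias, may differ, and the naive string split below only works for this toy example):
   ```python
   # Illustrative sketch of the two options discussed above.
   user_query = "WITH cte AS (SELECT id FROM t) SELECT id FROM cte"

   # Option 1: nest the user query. SQL Server rejects this form when the
   # inner query starts with a WITH clause.
   nested = f"SELECT * FROM ({user_query}) spark_subq WHERE 1=0"

   # Option 2: split the query, hoisting the WITH clause outside the wrapper
   # (the approach taken in https://github.com/apache/spark/pull/34693).
   # This naive split assumes exactly one CTE with no nested parentheses.
   with_part, select_part = user_query.split(") ", 1)
   split = f"{with_part}) SELECT * FROM ({select_part}) spark_subq WHERE 1=0"
   ```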



[GitHub] [spark] AmplabJenkins commented on pull request #34709: [WIP][SPARK-37259] Add option to unwrap query to support CTE for MSSQL JDBC driver

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34709:
URL: https://github.com/apache/spark/pull/34709#issuecomment-979051704


   Can one of the admins verify this patch?



[GitHub] [spark] github-actions[bot] closed pull request #34709: [WIP][SPARK-37259] Add option to unwrap query to support CTE for MSSQL JDBC driver

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #34709:
URL: https://github.com/apache/spark/pull/34709


   



[GitHub] [spark] github-actions[bot] commented on pull request #34709: [WIP][SPARK-37259] Add option to unwrap query to support CTE for MSSQL JDBC driver

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #34709:
URL: https://github.com/apache/spark/pull/34709#issuecomment-1063512236


   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!



[GitHub] [spark] peter-toth edited a comment on pull request #34709: [WIP][SPARK-37259] Add option to unwrap query to support CTE for MSSQL JDBC driver

Posted by GitBox <gi...@apache.org>.
peter-toth edited a comment on pull request #34709:
URL: https://github.com/apache/spark/pull/34709#issuecomment-979062037


   My main issue with this approach is that it ignores the partition (`thePart` / `myWhereClause`) in `JDBCRDD.compute()`, so all tasks run the exact same query; this could work only with a single partition. We would also lose limit/sample/aggregate pushdown capability...
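   To illustrate the concern with a simplified sketch (not Spark's actual code; names are made up): each JDBC partition normally contributes its own WHERE clause, so dropping it makes every task run the same statement:
   ```python
   # Simplified sketch of per-partition query building.
   def partition_query(base_query, where_clause):
       suffix = f" WHERE {where_clause}" if where_clause else ""
       return f"SELECT * FROM ({base_query}) t{suffix}"

   clauses = ["id < 100", "id >= 100 AND id < 200", "id >= 200"]
   queries = [partition_query("SELECT id FROM t", c) for c in clauses]
   # With the raw query passed through unchanged, all three tasks would
   # execute an identical statement and fetch overlapping rows.
   ```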
   



[GitHub] [spark] KevinAppelBofa edited a comment on pull request #34709: [WIP][SPARK-37259] Add option to unwrap query to support CTE for MSSQL JDBC driver

Posted by GitBox <gi...@apache.org>.
KevinAppelBofa edited a comment on pull request #34709:
URL: https://github.com/apache/spark/pull/34709#issuecomment-981732735


   @akhalymon-cv thank you for working on the patch. I was able to test this, and both the test queries and the actual temp-table and CTE queries are working. However, using the 'query' option as in your example gives me an error; the query must instead be passed via the 'dbtable' option.
   For the temp-table query this has a similar run time to the patches @peter-toth has created; the temp table was easy to split, but I haven't been able to split the CTE yet for a comparison.
   These particular queries are currently so complicated that it would be difficult to split them up for parallelism; we currently use either 'predicates' or 'numPartitions' when parallelizing the JDBC read.
   
   Error when using query vs dbtable:
   ```
   com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near the keyword 'WITH'.
           at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:262)
           at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1632)
           at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:602)
           at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:524)
           at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7418)
           at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:3272)
           at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:247)
           at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:222)
           at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:446)
           at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:382)
           at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
           at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
           at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
           at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
           at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
           at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
           at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
           at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
           at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
           at org.apache.spark.scheduler.Task.run(Task.scala:136)
           at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)
           at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468)
           at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:748)
   ```
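   The numPartitions path mentioned above splits a numeric column range into stride-based WHERE clauses. A simplified pure-Python rendering of that idea (loosely modeled on Spark's JDBC range partitioning; the details here are assumptions, not Spark's actual implementation):
   ```python
   # Simplified sketch of range-based partitioning over a numeric column.
   def column_partition(column, lower, upper, num_partitions):
       stride = (upper - lower) // num_partitions
       clauses, current = [], lower
       for i in range(num_partitions):
           if i == 0:
               # First partition also picks up NULLs.
               clauses.append(f"{column} < {current + stride} OR {column} IS NULL")
           elif i == num_partitions - 1:
               # Last partition is open-ended on the upper side.
               clauses.append(f"{column} >= {current}")
           else:
               clauses.append(f"{column} >= {current} AND {column} < {current + stride}")
           current += stride
       return clauses
   ```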



[GitHub] [spark] akhalymon-cv commented on pull request #34709: [WIP][SPARK-37259] Add option to unwrap query to support CTE for MSSQL JDBC driver

Posted by GitBox <gi...@apache.org>.
akhalymon-cv commented on pull request #34709:
URL: https://github.com/apache/spark/pull/34709#issuecomment-979084706


   @peter-toth thanks for looking into the patch. Regarding partitioning - that's a valid point, I pruned myWhereClause too hastily without considering that it handles partitioning. Sampling is only supported by Postgres, so it doesn't affect this use case much. As for limiting, maybe the user can specify it manually in the query? One thing I can think of is modifying the query to include the partitioning conditions, which may not be easy to do. As far as I understand now, we will not be able to add partitioning without nesting the user query, as we already do. @peter-toth what do you think?
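   The "modify the query to include partitioning conditions" idea could look roughly like this naive sketch (a robust version would need real SQL parsing, which is exactly why it is not easy):
   ```python
   # Naive sketch: append a partition predicate to a raw CTE query.
   # Breaks if the outer SELECT already has its own WHERE, GROUP BY,
   # or ORDER BY clause.
   def inject_partition_predicate(raw_query, where_clause):
       if not where_clause:
           return raw_query
       return f"{raw_query} WHERE {where_clause}"

   q = inject_partition_predicate(
       "WITH cte AS (SELECT id FROM t) SELECT id FROM cte", "id < 100")
   ```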

