Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/04/13 06:44:10 UTC

[GitHub] [spark] HyukjinKwon opened a new pull request #32145: [SPARK-35045][SQL] Add an internal option to control input buffer in univocity

HyukjinKwon opened a new pull request #32145:
URL: https://github.com/apache/spark/pull/32145


   ### What changes were proposed in this pull request?
   
   This PR makes the input buffer configurable (as an internal option). This is mainly to work around uniVocity/univocity-parsers#449.
   
   ### Why are the changes needed?
   
   To work around uniVocity/univocity-parsers#449.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, it's only an internal option.
   
   ### How was this patch tested?
   
   Manually tested by modifying the unit test added in https://github.com/apache/spark/pull/31858 as below:
   
   ```diff
   diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
   index fd25a79619d..b58f0bd3661 100644
   --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
   +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
   @@ -2460,6 +2460,7 @@ abstract class CSVSuite
          Seq(line).toDF.write.text(path.getAbsolutePath)
          assert(spark.read.format("csv")
            .option("delimiter", "|")
   +        .option("inputBuffer", "128")
            .option("ignoreTrailingWhiteSpace", "true").load(path.getAbsolutePath).count() == 1)
        }
      }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #32145: [SPARK-35045][SQL] Add an internal option to control input buffer in univocity

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32145:
URL: https://github.com/apache/spark/pull/32145#issuecomment-818515111


   **[Test build #137282 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137282/testReport)** for PR 32145 at commit [`f1f92fb`](https://github.com/apache/spark/commit/f1f92fb33636ac4edf9c4d3a72beeaa3cdfc0637).




[GitHub] [spark] AmplabJenkins commented on pull request #32145: [SPARK-35045][SQL] Add an internal option to control input buffer in univocity

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32145:
URL: https://github.com/apache/spark/pull/32145#issuecomment-818554465


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41860/
   




[GitHub] [spark] HyukjinKwon commented on pull request #32145: [SPARK-35045][SQL] Add an internal option to control input buffer in univocity

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #32145:
URL: https://github.com/apache/spark/pull/32145#issuecomment-819145960


   Thx Max.




[GitHub] [spark] AmplabJenkins commented on pull request #32145: [SPARK-35045][SQL] Add an internal option to control input buffer in univocity

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32145:
URL: https://github.com/apache/spark/pull/32145#issuecomment-818565946


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41861/
   




[GitHub] [spark] HyukjinKwon commented on a change in pull request #32145: [SPARK-35045][SQL] Add an internal option to control input buffer in univocity

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #32145:
URL: https://github.com/apache/spark/pull/32145#discussion_r612176978



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
##########
@@ -211,6 +211,8 @@ class CSVOptions(
   }
   val lineSeparatorInWrite: Option[String] = lineSeparator
 
+  val inputBufferSize: Option[Int] = parameters.get("inputBuffer").map(_.toInt)

Review comment:
       ```suggestion
     val inputBufferSize: Option[Int] = parameters.get("inputBufferSize").map(_.toInt)
   ```
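
A minimal sketch of the option-parsing pattern in the suggested line, assuming a plain `Map[String, String]` in place of the parameter map that `CSVOptions` receives. The object name `InputBufferOptionSketch` and the defensive variant `inputBufferSizeSafe` are hypothetical and not part of this PR; they only illustrate that `_.toInt` throws a `NumberFormatException` on a non-numeric value.

```scala
import scala.util.Try

object InputBufferOptionSketch {
  // Mirrors the reviewed line: absent key => None, present key => parsed Int.
  // Throws NumberFormatException if the value is not a valid integer.
  def inputBufferSize(parameters: Map[String, String]): Option[Int] =
    parameters.get("inputBufferSize").map(_.toInt)

  // Hypothetical defensive variant (not in the PR): report a bad value
  // as Left(message) instead of throwing.
  def inputBufferSizeSafe(parameters: Map[String, String]): Either[String, Option[Int]] =
    parameters.get("inputBufferSize") match {
      case None => Right(None)
      case Some(raw) =>
        Try(raw.trim.toInt).toOption match {
          case Some(size) => Right(Some(size))
          case None       => Left(s"inputBufferSize must be an integer, got '$raw'")
        }
    }
}
```

Because the option is internal, leaving it unset keeps the parser's own default; the `Option[Int]` type makes that "not configured" case explicit.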






[GitHub] [spark] AmplabJenkins removed a comment on pull request #32145: [SPARK-35045][SQL] Add an internal option to control input buffer in univocity

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32145:
URL: https://github.com/apache/spark/pull/32145#issuecomment-818565946


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41861/
   




[GitHub] [spark] SparkQA commented on pull request #32145: [SPARK-35045][SQL] Add an internal option to control input buffer in univocity

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32145:
URL: https://github.com/apache/spark/pull/32145#issuecomment-818515111


   **[Test build #137282 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137282/testReport)** for PR 32145 at commit [`f1f92fb`](https://github.com/apache/spark/commit/f1f92fb33636ac4edf9c4d3a72beeaa3cdfc0637).




[GitHub] [spark] SparkQA commented on pull request #32145: [SPARK-35045][SQL] Add an internal option to control input buffer in univocity

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32145:
URL: https://github.com/apache/spark/pull/32145#issuecomment-818677264


   **[Test build #137281 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137281/testReport)** for PR 32145 at commit [`2420be1`](https://github.com/apache/spark/commit/2420be1ba077971740b896b399dfeb874d929c5e).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins commented on pull request #32145: [SPARK-35045][SQL] Add an internal option to control input buffer in univocity

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32145:
URL: https://github.com/apache/spark/pull/32145#issuecomment-818739445


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137282/
   




[GitHub] [spark] SparkQA commented on pull request #32145: [SPARK-35045][SQL] Add an internal option to control input buffer in univocity

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32145:
URL: https://github.com/apache/spark/pull/32145#issuecomment-818561544


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41861/
   




[GitHub] [spark] MaxGekk commented on a change in pull request #32145: [SPARK-35045][SQL] Add an internal option to control input buffer in univocity

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #32145:
URL: https://github.com/apache/spark/pull/32145#discussion_r612174769



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
##########
@@ -211,6 +211,8 @@ class CSVOptions(
   }
   val lineSeparatorInWrite: Option[String] = lineSeparator
 
+  val inputBuffer: Option[Int] = parameters.get("inputBuffer").map(_.toInt)

Review comment:
       maybe `inputBufferSize`?






[GitHub] [spark] SparkQA commented on pull request #32145: [SPARK-35045][SQL] Add an internal option to control input buffer in univocity

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32145:
URL: https://github.com/apache/spark/pull/32145#issuecomment-818544500


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41860/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #32145: [SPARK-35045][SQL] Add an internal option to control input buffer in univocity

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32145:
URL: https://github.com/apache/spark/pull/32145#issuecomment-818739445


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137282/
   




[GitHub] [spark] AmplabJenkins commented on pull request #32145: [SPARK-35045][SQL] Add an internal option to control input buffer in univocity

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32145:
URL: https://github.com/apache/spark/pull/32145#issuecomment-818699077


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137281/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #32145: [SPARK-35045][SQL] Add an internal option to control input buffer in univocity

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32145:
URL: https://github.com/apache/spark/pull/32145#issuecomment-818554465


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41860/
   




[GitHub] [spark] MaxGekk commented on pull request #32145: [SPARK-35045][SQL] Add an internal option to control input buffer in univocity

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on pull request #32145:
URL: https://github.com/apache/spark/pull/32145#issuecomment-818683506


   +1, LGTM. Merging to master/3.1/3.0.
   Thank you @HyukjinKwon .




[GitHub] [spark] SparkQA commented on pull request #32145: [SPARK-35045][SQL] Add an internal option to control input buffer in univocity

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32145:
URL: https://github.com/apache/spark/pull/32145#issuecomment-818565895


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41861/
   




[GitHub] [spark] SparkQA removed a comment on pull request #32145: [SPARK-35045][SQL] Add an internal option to control input buffer in univocity

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32145:
URL: https://github.com/apache/spark/pull/32145#issuecomment-818484926


   **[Test build #137281 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137281/testReport)** for PR 32145 at commit [`2420be1`](https://github.com/apache/spark/commit/2420be1ba077971740b896b399dfeb874d929c5e).




[GitHub] [spark] SparkQA commented on pull request #32145: [SPARK-35045][SQL] Add an internal option to control input buffer in univocity

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32145:
URL: https://github.com/apache/spark/pull/32145#issuecomment-818484926


   **[Test build #137281 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137281/testReport)** for PR 32145 at commit [`2420be1`](https://github.com/apache/spark/commit/2420be1ba077971740b896b399dfeb874d929c5e).




[GitHub] [spark] MaxGekk closed pull request #32145: [SPARK-35045][SQL] Add an internal option to control input buffer in univocity

Posted by GitBox <gi...@apache.org>.
MaxGekk closed pull request #32145:
URL: https://github.com/apache/spark/pull/32145


   




[GitHub] [spark] HyukjinKwon commented on a change in pull request #32145: [SPARK-35045][SQL] Add an internal option to control input buffer in univocity

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #32145:
URL: https://github.com/apache/spark/pull/32145#discussion_r612176484



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
##########
@@ -257,6 +259,7 @@ class CSVOptions(
     settings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceInRead)
     settings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceInRead)
     settings.setReadInputOnSeparateThread(false)
+    inputBuffer.foreach(settings.setInputBufferSize)

Review comment:
       ```suggestion
       inputBufferSize.foreach(settings.setInputBufferSize)
   ```
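
A rough sketch of what the `foreach` call above does, assuming a hypothetical stand-in class `ParserSettings` rather than the real univocity settings object. Only the setter name `setInputBufferSize` comes from the diff; the getter and the default of `1024` are placeholders invented for this example.

```scala
object ApplyInputBufferSketch {
  // Hypothetical stand-in for the parser settings; not the univocity class.
  final class ParserSettings {
    private var bufferSize: Int = 1024 // placeholder default for this sketch only
    def setInputBufferSize(size: Int): Unit = { bufferSize = size }
    def getInputBufferSize: Int = bufferSize
  }

  // Option.foreach calls the setter only when a value is present,
  // so an unset option leaves the settings object untouched.
  def configure(inputBufferSize: Option[Int]): ParserSettings = {
    val settings = new ParserSettings
    inputBufferSize.foreach(settings.setInputBufferSize)
    settings
  }
}
```

With `Some(128)` the setter runs once with `128`; with `None` nothing is called and the stand-in keeps its placeholder default.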

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
##########
@@ -211,6 +211,8 @@ class CSVOptions(
   }
   val lineSeparatorInWrite: Option[String] = lineSeparator
 
+  val inputBuffer: Option[Int] = parameters.get("inputBuffer").map(_.toInt)

Review comment:
       ```suggestion
     val inputBufferSize: Option[Int] = parameters.get("inputBuffer").map(_.toInt)
   ```






[GitHub] [spark] AmplabJenkins removed a comment on pull request #32145: [SPARK-35045][SQL] Add an internal option to control input buffer in univocity

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32145:
URL: https://github.com/apache/spark/pull/32145#issuecomment-818699077


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137281/
   




[GitHub] [spark] SparkQA commented on pull request #32145: [SPARK-35045][SQL] Add an internal option to control input buffer in univocity

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32145:
URL: https://github.com/apache/spark/pull/32145#issuecomment-818727032


   **[Test build #137282 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137282/testReport)** for PR 32145 at commit [`f1f92fb`](https://github.com/apache/spark/commit/f1f92fb33636ac4edf9c4d3a72beeaa3cdfc0637).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.

