
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/03/17 04:57:40 UTC

[GitHub] [spark] HyukjinKwon opened a new pull request #31858: [SPARK-34768][SQL] Respect the default input buffer size in Univocity

HyukjinKwon opened a new pull request #31858:
URL: https://github.com/apache/spark/pull/31858


   ### What changes were proposed in this pull request?
   
   This PR proposes to use Univocity's default input buffer size instead of overriding it with a hard-coded value.
   
   ### Why are the changes needed?
   
   - Firstly, it's best to trust their judgement on the default values. Also, 128 is too low.
   - Default values arguably have more test coverage in Univocity.
   - It will also fix https://github.com/uniVocity/univocity-parsers/issues/449
   - ^ is a regression compared to Spark 2.4
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. In addition, it fixes a regression.
   
   ### How was this patch tested?
   
   Manually tested, and added a unit test.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #31858: [SPARK-34768][SQL] Respect the default input buffer size in Univocity

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #31858:
URL: https://github.com/apache/spark/pull/31858#issuecomment-800873420


   > It would be nice to re-run CSV benchmarks.
   
   The fix will have to be backported through branch-3.1. I would rerun the benchmarks separately.



[GitHub] [spark] HyukjinKwon edited a comment on pull request #31858: [SPARK-34768][SQL] Respect the default input buffer size in Univocity

Posted by GitBox <gi...@apache.org>.

HyukjinKwon edited a comment on pull request #31858:
URL: https://github.com/apache/spark/pull/31858#issuecomment-800873420


   > It would be nice to re-run CSV benchmarks.
   
   The fix will have to be backported through branch-3.0. I would rerun the benchmarks separately.



[GitHub] [spark] AmplabJenkins commented on pull request #31858: [SPARK-34768][SQL] Respect the default input buffer size in Univocity

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31858:
URL: https://github.com/apache/spark/pull/31858#issuecomment-800848162


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136144/
   



[GitHub] [spark] HyukjinKwon commented on pull request #31858: [SPARK-34768][SQL] Respect the default input buffer size in Univocity

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #31858:
URL: https://github.com/apache/spark/pull/31858#issuecomment-800988031


   Thanks, Max. Merged to master, branch-3.1 and branch-3.0.



[GitHub] [spark] MaxGekk commented on a change in pull request #31858: [SPARK-34768][SQL] Respect the default input buffer size in Univocity

Posted by GitBox <gi...@apache.org>.

MaxGekk commented on a change in pull request #31858:
URL: https://github.com/apache/spark/pull/31858#discussion_r595737885



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
##########
@@ -166,8 +166,6 @@ class CSVOptions(
 
   val quoteAll = getBool("quoteAll", false)
 
-  val inputBufferSize = 128

Review comment:
       The default is `1024*1024` characters. I just wonder how much this will increase memory consumption.





[GitHub] [spark] HyukjinKwon closed pull request #31858: [SPARK-34768][SQL] Respect the default input buffer size in Univocity

Posted by GitBox <gi...@apache.org>.

HyukjinKwon closed pull request #31858:
URL: https://github.com/apache/spark/pull/31858


   



[GitHub] [spark] AmplabJenkins commented on pull request #31858: [SPARK-34768][SQL] Respect the default input buffer size in Univocity

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31858:
URL: https://github.com/apache/spark/pull/31858#issuecomment-800881345


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40732/
   



[GitHub] [spark] HyukjinKwon commented on a change in pull request #31858: [SPARK-34768][SQL] Respect the default input buffer size in Univocity

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #31858:
URL: https://github.com/apache/spark/pull/31858#discussion_r595740479



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
##########
@@ -166,8 +166,6 @@ class CSVOptions(
 
   val quoteAll = getBool("quoteAll", false)
 
-  val inputBufferSize = 128

Review comment:
       This controls the buffer size in the reader (https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/common/input/DefaultCharInputReader.java). So I think it will increase from 128 characters to roughly 1M characters, which shouldn't be a big deal.
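
A back-of-the-envelope check of the memory question raised in this review thread, as a minimal sketch. It assumes the reader keeps its buffer as a JVM `char` array (2 bytes per element) and ignores array headers; the object and method names are hypothetical and are not part of Spark or Univocity. Both sizes come from this thread: 128 is the value removed from `CSVOptions`, and `1024*1024` is the default quoted in the review.

```scala
// Hypothetical helper, not part of Spark or Univocity.
object BufferFootprint {
  // Assumption: the input buffer is a char array; a JVM char takes 2 bytes.
  val BytesPerChar: Int = java.lang.Character.BYTES

  /** Approximate heap bytes for a buffer of `chars` characters (array header ignored). */
  def bufferBytes(chars: Int): Long = chars.toLong * BytesPerChar

  def main(args: Array[String]): Unit = {
    val oldChars = 128          // value removed from CSVOptions by this PR
    val newChars = 1024 * 1024  // default size quoted in the review comment
    println(s"old buffer: ${bufferBytes(oldChars)} bytes")
    println(s"new buffer: ${bufferBytes(newChars)} bytes")
  }
}
```

Under these assumptions the buffer grows from 256 bytes to 2,097,152 bytes (2 MiB) per reader, so the total cost depends on how many parsers are alive at once.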





[GitHub] [spark] HyukjinKwon commented on pull request #31858: [SPARK-34768][SQL] Respect the default input buffer size in Univocity

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #31858:
URL: https://github.com/apache/spark/pull/31858#issuecomment-800865318


   Thanks man. Yeah, this only bandaids the issue, and does so as a side effect. I believe it's better to use default buffer size for stability, potentially better performance, etc. in any event.



[GitHub] [spark] HyukjinKwon commented on pull request #31858: [SPARK-34768][SQL] Respect the default input buffer size in Univocity

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #31858:
URL: https://github.com/apache/spark/pull/31858#issuecomment-800794256


   cc @MaxGekk can you take a look please?



[GitHub] [spark] AmplabJenkins commented on pull request #31858: [SPARK-34768][SQL] Respect the default input buffer size in Univocity

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31858:
URL: https://github.com/apache/spark/pull/31858#issuecomment-800848941


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40726/
   



[GitHub] [spark] HyukjinKwon commented on pull request #31858: [SPARK-34768][SQL] Respect the default input buffer size in Univocity

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #31858:
URL: https://github.com/apache/spark/pull/31858#issuecomment-800872570


   Yeah, it looks like this issue doesn't exist in Spark 2.4 according to our internal report. This change does fix the specific case by increasing the limit, but it's just a bandaid fix.



[GitHub] [spark] HyukjinKwon commented on pull request #31858: [SPARK-34768][SQL] Respect the default input buffer size in Univocity

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #31858:
URL: https://github.com/apache/spark/pull/31858#issuecomment-803736372


   I ran the benchmark before and after this commit, and I do see the perf improvement here.



[GitHub] [spark] AmplabJenkins commented on pull request #31858: [SPARK-34768][SQL] Respect the default input buffer size in Univocity

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31858:
URL: https://github.com/apache/spark/pull/31858#issuecomment-800812579







[GitHub] [spark] MaxGekk commented on a change in pull request #31858: [SPARK-34768][SQL] Respect the default input buffer size in Univocity

Posted by GitBox <gi...@apache.org>.

MaxGekk commented on a change in pull request #31858:
URL: https://github.com/apache/spark/pull/31858#discussion_r595736246



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
##########
@@ -2452,6 +2452,27 @@ abstract class CSVSuite
       assert(result.sameElements(exceptResults))
     }
   }
+
+  test("SPARK-34768: counting a long record with ignoreTrailingWhiteSpace set to true") {
+    val line = "XX   |XXX-XXXX            |XXXXXX              " +
+      "|XXXXXXXX|XXXXX               |XXXXXX              " +
+      "|X|XXXXXXX|XXXXXXXX|XXXX|XXXXXXXXXXXXXXX     |XXXXXXXXXXX" +
+      "|XXXXXX              |XXXXXXXXXXXXXXXXXXXXXX|XXXXXX              " +
+      "|XXXXXXXXXXXXXX|XXXXXX              |XXXXXXXXXXXXXXXXXXXXXX" +
+      "|XXXXXX              |XXXXXXXXXXXXXXXXXXXXXX|XXXXXX              " +
+      "|XXXXXXXXX|XXXXXX              |XXXXXXX|                    " +
+      "||                    ||                    " +
+      "||                    ||XXXX-XX-XX XX:XX:XX.XXXXXXX" +
+      "||XXXXX.XXXXXXXXXXXXXXX|XXXXX.XXXXXXXXXXXXXX" +
+      "|XXXXX.XXXXXXXXXXXXXXX|X|XXXXXX              |X"
+    withTempPath { path =>
+      Seq(line).toDF.write.text(path.getAbsolutePath)
+      assert(spark.read.format("csv")
+        .option("delimiter", "|")
+        .option("inferSchema", "true")

Review comment:
       Do you need this?





[GitHub] [spark] AmplabJenkins commented on pull request #31858: [SPARK-34768][SQL] Respect the default input buffer size in Univocity

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31858:
URL: https://github.com/apache/spark/pull/31858#issuecomment-800880331


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136150/
   



[GitHub] [spark] HyukjinKwon commented on pull request #31858: [SPARK-34768][SQL] Respect the default input buffer size in Univocity

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #31858:
URL: https://github.com/apache/spark/pull/31858#issuecomment-803736270


   ```diff
    Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
    Parsing quoted values:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------------------------------
   -One quoted string                                 30131          31843        1489          0.0      602627.2       1.0X
   +One quoted string                                 24185          24195          10          0.0      483694.2       1.0X
   
    Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7
    Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
    Wide rows with 1000 columns:              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------------------------------
   -Select 1000 columns                               66630          68022        1345          0.0       66630.3       1.0X
   -Select 100 columns                                27846          27948          95          0.0       27846.1       2.4X
   -Select one column                                 23184          23574         415          0.0       23184.5       2.9X
   -count()                                            6179           6272         151          0.2        6179.1      10.8X
   -Select 100 columns, one bad input field           45030          46637        1421          0.0       45029.5       1.5X
   -Select 100 columns, corrupt record field          54971          56153        1428          0.0       54971.4       1.2X
   +Select 1000 columns                               61793          62388         532          0.0       61793.4       1.0X
   +Select 100 columns                                21958          21993          34          0.0       21957.9       2.8X
   +Select one column                                 18215          18515         505          0.1       18215.0       3.4X
   +count()                                            5865           6168         296          0.2        5865.1      10.5X
   +Select 100 columns, one bad input field           39638          39739         124          0.0       39637.5       1.6X
   +Select 100 columns, corrupt record field          47290          48133         741          0.0       47290.0       1.3X
   
    Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7
    Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
    Count a dataset with 10 columns:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------------------------------
   -Select 10 columns + count()                       10923          11008          97          0.9        1092.3       1.0X
   -Select 1 column + count()                          7411           7567         138          1.3         741.1       1.5X
   -count()                                            2231           2281          43          4.5         223.1       4.9X
   +Select 10 columns + count()                        9935          10460         461          1.0         993.5       1.0X
   +Select 1 column + count()                          6786           7179         342          1.5         678.6       1.5X
   +count()                                            2281           2458         165          4.4         228.1       4.4X
   
    Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7
    Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
    Write dates and timestamps:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------------------------------
   -Create a dataset of timestamps                      835            874          34         12.0          83.5       1.0X
   -to_csv(timestamp)                                  7808           8024         191          1.3         780.8       0.1X
   -write timestamps to files                          6935           7201         239          1.4         693.5       0.1X
   -Create a dataset of dates                           947            980          28         10.6          94.7       0.9X
   -to_csv(date)                                       5058           5118          54          2.0         505.8       0.2X
   -write dates to files                               3964           4026          62          2.5         396.4       0.2X
   +Create a dataset of timestamps                      812            826          14         12.3          81.2       1.0X
   +to_csv(timestamp)                                  7548           7764         192          1.3         754.8       0.1X
   +write timestamps to files                          7052           7193         141          1.4         705.2       0.1X
   +Create a dataset of dates                           897            909          13         11.1          89.7       0.9X
   +to_csv(date)                                       4778           4787          10          2.1         477.8       0.2X
   +write dates to files                               3853           3891          33          2.6         385.3       0.2X
   
    Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7
    Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
    Read dates and timestamps:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------------------------------
   -read timestamp text from files                     1272           1296          34          7.9         127.2       1.0X
   -read timestamps from files                        22376          22850         429          0.4        2237.6       0.1X
   -infer timestamps from files                       44109          44455         345          0.2        4410.9       0.0X
   -read date text from files                          1127           1136           8          8.9         112.7       1.1X
   -read date from files                              10840          11082         245          0.9        1084.0       0.1X
   -infer date from files                             13967          14293         424          0.7        1396.7       0.1X
   -timestamp strings                                  1855           1945          91          5.4         185.5       0.7X
   -parse timestamps from Dataset[String]             23368          23580         185          0.4        2336.8       0.1X
   -infer timestamps from Dataset[String]             46081          46810         633          0.2        4608.1       0.0X
   -date strings                                       1867           1962          93          5.4         186.7       0.7X
   -parse dates from Dataset[String]                  12308          12349          36          0.8        1230.8       0.1X
   -from_csv(timestamp)                               23333          24201        1401          0.4        2333.3       0.1X
   -from_csv(date)                                    11734          11898         142          0.9        1173.4       0.1X
   +read timestamp text from files                     1259           1262           4          7.9         125.9       1.0X
   +read timestamps from files                        20030          20105          80          0.5        2003.0       0.1X
   +infer timestamps from files                       39621          39674          61          0.3        3962.1       0.0X
   +read date text from files                          1039           1068          40          9.6         103.9       1.2X
   +read date from files                               9352           9363          10          1.1         935.2       0.1X
   +infer date from files                             11465          11485          23          0.9        1146.5       0.1X
   +timestamp strings                                  1759           1812          59          5.7         175.9       0.7X
   +parse timestamps from Dataset[String]             20806          20858          75          0.5        2080.6       0.1X
   +infer timestamps from Dataset[String]             40537          40821         258          0.2        4053.7       0.0X
   +date strings                                       1808           1816          12          5.5         180.8       0.7X
   +parse dates from Dataset[String]                  12080          12311         245          0.8        1208.0       0.1X
   +from_csv(timestamp)                               20120          21503        1224          0.5        2012.0       0.1X
   +from_csv(date)                                    10607          10768         246          0.9        1060.7       0.1X
   
    Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7
    Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
    Filters pushdown:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------------------------------
   -w/o filters                                       12952          13053         157          0.0      129515.9       1.0X
   -pushdown disabled                                 12794          12820          42          0.0      127939.7       1.0X
   -w/ filters                                         1141           1181          35          0.1       11414.2      11.3X
   +w/o filters                                       13109          13249         151          0.0      131086.4       1.0X
   +pushdown disabled                                 12951          12994          63          0.0      129509.7       1.0X
   +w/ filters                                         1095           1113          15          0.1       10953.7      12.0X
   ```
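
To read the benchmark diff at a glance, the change in best time can be expressed as a fractional reduction. The sketch below does this for three rows copied from the table; the object and method names are hypothetical, and numbers from a single laptop run like these are only indicative.

```scala
// Hypothetical helper for summarising the before/after benchmark rows.
object BenchmarkDelta {
  /** Fraction by which the best time dropped: (before - after) / before. */
  def reduction(beforeMs: Long, afterMs: Long): Double =
    (beforeMs - afterMs).toDouble / beforeMs

  // (label, best time before in ms, best time after in ms), copied from the diff.
  val rows: Seq[(String, Long, Long)] = Seq(
    ("One quoted string", 30131L, 24185L),
    ("Select 1000 columns", 66630L, 61793L),
    ("w/o filters", 12952L, 13109L)
  )

  def main(args: Array[String]): Unit =
    rows.foreach { case (label, before, after) =>
      println(f"$label%-22s ${reduction(before, after) * 100}%6.1f%%")
    }
}
```

A negative value means the row got slower, as with the `w/o filters` case, where the difference is small.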



[GitHub] [spark] MaxGekk commented on pull request #31858: [SPARK-34768][SQL] Respect the default input buffer size in Univocity

Posted by GitBox <gi...@apache.org>.

MaxGekk commented on pull request #31858:
URL: https://github.com/apache/spark/pull/31858#issuecomment-800870879


   >  I believe it's better to use default buffer size for stability,
   
   I agree.
   
   > potentially better performance, etc. in any event.
   
   It would be nice to re-run CSV benchmarks. Though we can do that separately from this PR.
   
   > It will also fix uniVocity/univocity-parsers#449
   
   Could you update the PR's description, as it doesn't fix the issue?
   
   > ^ is a regression compared to Spark 2.4
   
   Are you sure about this? The related code in uniVocity is old. I guess 2.4 might have the same problem.

