You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by MaxGekk <gi...@git.apache.org> on 2018/04/02 14:49:21 UTC

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

GitHub user MaxGekk opened a pull request:

    https://github.com/apache/spark/pull/20959

    [SPARK-23846][SQL] The samplingRatio option for CSV datasource

    ## What changes were proposed in this pull request?
    
    I propose to support the `samplingRatio` option for schema inferring of CSV datasource similar to the same option of JSON datasource: 
    https://github.com/apache/spark/blob/b14993e1fcb68e1c946a671c6048605ab4afdf58/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala#L49-L50
    
    ## How was this patch tested?
    
    Added 2 tests for json and 2 tests for csv datasources. The tests checks that only subset of input dataset is used for schema inferring. 


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 csv-sampling

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20959.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20959
    
----
commit 78160368a94651a1c9d92660f6f00e0a12d8e324
Author: Maxim Gekk <ma...@...>
Date:   2018-04-02T13:28:32Z

    Adding samplingRation tests for json

commit d79976feb001606e8103813984ecaddf8a66f847
Author: Maxim Gekk <ma...@...>
Date:   2018-04-02T13:34:06Z

    Removing debug code

commit bb5cfeee18c6ce17efd99642757a1c9b6e2c1145
Author: Maxim Gekk <ma...@...>
Date:   2018-04-02T13:43:16Z

    Porting samplingRatio test from JSONSuite

commit da9813b83e3c9bd35520c82e868a2493af0315a4
Author: Maxim Gekk <ma...@...>
Date:   2018-04-02T14:00:21Z

    Support the samplingRatio option for CSV datasource

commit a0966d0302261c499dc73df7fc2b16c508934e96
Author: Maxim Gekk <ma...@...>
Date:   2018-04-02T14:27:16Z

    Adding description of samplingRation to the csv and json methods of DataFrameReader

commit ba12fca8514a2cb620224e859e8a8b5cc208bf31
Author: Maxim Gekk <ma...@...>
Date:   2018-04-02T14:41:08Z

    Adding ticket number to test titles

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88827/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89220/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #89220 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89220/testReport)** for PR 20959 at commit [`1427f73`](https://github.com/apache/spark/commit/1427f73e13b5809ac9622bf5fbd80cb178544864).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r182607885
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
    @@ -161,7 +161,8 @@ object TextInputCSVDataSource extends CSVDataSource {
           val firstRow = new CsvParser(parsedOptions.asParserSettings).parseLine(firstLine)
           val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
           val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
    -      val tokenRDD = csv.rdd.mapPartitions { iter =>
    +      val sampled: Dataset[String] = CSVUtils.sample(csv, parsedOptions)
    +      val tokenRDD = sampled.rdd.mapPartitions { iter =>
    --- End diff --
    
    ```
    $ tree .
    .
    ├── a.json
    ├── b.json
    ├── c.json
    ├── d.json
    ├── e.json
    ├── f.json
    └── h.json
    
    0 directories, 7 files
    ```
    
    ```
    $ cat *
    {"a": "a"}
    {"a": 1}
    
    {"a": 1}
    
    {"a": 1}
    
    {"a": 1}
    
    {"a": 1}
    
    {"a": 1}
    ```
    
    ```scala
    scala> spark.read.option("samplingRatio", 0.1).json("tmp").schema
    res0: org.apache.spark.sql.types.StructType = StructType(StructField(a,LongType,true))
    
    scala> spark.read.option("samplingRatio", 1).json("tmp").schema
    res1: org.apache.spark.sql.types.StructType = StructType(StructField(a,StringType,true))
    ```



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r183979243
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -832,7 +832,7 @@ def text(self, path, compression=None, lineSep=None):
         def csv(self, path, mode=None, compression=None, sep=None, quote=None, escape=None,
                 header=None, nullValue=None, escapeQuotes=None, quoteAll=None, dateFormat=None,
                 timestampFormat=None, ignoreLeadingWhiteSpace=None, ignoreTrailingWhiteSpace=None,
    -            charToEscapeQuoteEscaping=None):
    +            charToEscapeQuoteEscaping=None, samplingRatio=None):
    --- End diff --
    
    This is in DataFrameWriter .... 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r183981927
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -1279,4 +1280,72 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
           Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
         )
       }
    +
    +  test("SPARK-23846: schema inferring touches less data if samplingRatio < 1.0") {
    +    // Set default values for the DataSource parameters to make sure
    +    // that whole test file is mapped to only one partition. This will guarantee
    +    // reliable sampling of the input file.
    +    withSQLConf(
    +      "spark.sql.files.maxPartitionBytes" -> (128 * 1024 * 1024).toString,
    +      "spark.sql.files.openCostInBytes" -> (4 * 1024 * 1024).toString
    +    )(withTempPath { path =>
    +      val ds = sampledTestData.coalesce(1)
    +      ds.write.text(path.getAbsolutePath)
    +
    +      val readback = spark.read
    +        .option("inferSchema", true).option("samplingRatio", 0.1)
    +        .csv(path.getCanonicalPath)
    +      assert(readback.schema == new StructType().add("_c0", IntegerType))
    +    })
    +  }
    +
    +  test("SPARK-23846: usage of samplingRatio while parsing a dataset of strings") {
    +    val ds = sampledTestData.coalesce(1)
    +    val readback = spark.read
    +      .option("inferSchema", true).option("samplingRatio", 0.1)
    +      .csv(ds)
    +
    +    assert(readback.schema == new StructType().add("_c0", IntegerType))
    +  }
    +
    +  test("SPARK-23846: samplingRatio is out of the range (0, 1.0]") {
    +    val ds = spark.range(0, 100, 1, 1).map(_.toString)
    +
    +    val errorMsg0 = intercept[IllegalArgumentException] {
    +      spark.read.option("inferSchema", true).option("samplingRatio", -1).csv(ds)
    +    }.getMessage
    +    assert(errorMsg0.contains("samplingRatio (-1.0) should be greater than 0"))
    +
    +    val errorMsg1 = intercept[IllegalArgumentException] {
    +      spark.read.option("inferSchema", true).option("samplingRatio", 0).csv(ds)
    +    }.getMessage
    +    assert(errorMsg1.contains("samplingRatio (0.0) should be greater than 0"))
    +
    +    val sampled = spark.read.option("inferSchema", true).option("samplingRatio", 10).csv(ds)
    +    assert(sampled.count() == ds.count())
    +  }
    +
    +  test("SPARK-23846: sampling files for schema inferring in the multiLine mode") {
    +    withTempDir { dir =>
    +      Files.write(Paths.get(dir.getAbsolutePath, "0.csv"), "0.1".getBytes,
    --- End diff --
    
    `getBytes` with explicit UTF-8?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #88827 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88827/testReport)** for PR 20959 at commit [`b6a7cc8`](https://github.com/apache/spark/commit/b6a7cc852bb7f27a94eb268f4f3e5e35f97538bd).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89241/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r180940274
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
    @@ -528,6 +529,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
        * <li>`header` (default `false`): uses the first line as names of columns.</li>
        * <li>`inferSchema` (default `false`): infers the input schema automatically from data. It
        * requires one extra pass over the data.</li>
    +   * <li>`samplingRatio` (default 1.0): the sample ratio of rows used for schema inferring.</li>
    --- End diff --
    
    Yes, please. Seems this option can be useful given your investigation and @rxin's opinion.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r182606619
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
    @@ -161,7 +161,8 @@ object TextInputCSVDataSource extends CSVDataSource {
           val firstRow = new CsvParser(parsedOptions.asParserSettings).parseLine(firstLine)
           val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
           val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
    -      val tokenRDD = csv.rdd.mapPartitions { iter =>
    +      val sampled: Dataset[String] = CSVUtils.sample(csv, parsedOptions)
    +      val tokenRDD = sampled.rdd.mapPartitions { iter =>
    --- End diff --
    
    Doesn't JSON do the sampling when multiLine is enabled?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #89239 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89239/testReport)** for PR 20959 at commit [`d12c2e2`](https://github.com/apache/spark/commit/d12c2e221b446596e2322a4d86ee0fd55a09b6ba).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r180943744
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -1279,4 +1278,68 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
           Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
         )
       }
    +
    +  test("SPARK-23846: schema inferring touches less data if samplingRatio < 1.0") {
    +    // Set default values for the DataSource parameters to make sure
    +    // that whole test file is mapped to only one partition. This will guarantee
    +    // reliable sampling of the input file.
    +    withSQLConf(
    +      "spark.sql.files.maxPartitionBytes" -> (128 * 1024 * 1024).toString,
    +      "spark.sql.files.openCostInBytes" -> (4 * 1024 * 1024).toString
    +    )(withTempPath { path =>
    +      val rdd = spark.sqlContext.range(0, 100).map {row =>
    --- End diff --
    
    Can we actually do
    
    ```
            val predefinedSample = Set[Long](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
              57, 62, 68, 72)
            val value = row.getLong(0)
            if (predefinedSample.contains(value)) {
              value.toString
            } else {
              (value.toDouble + 0.1).toString
            }
    ```
    
    outside of `map`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #88817 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88817/testReport)** for PR 20959 at commit [`ba12fca`](https://github.com/apache/spark/commit/ba12fca8514a2cb620224e859e8a8b5cc208bf31).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20959


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #88818 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88818/testReport)** for PR 20959 at commit [`91c57cf`](https://github.com/apache/spark/commit/91c57cf898d7f62ee3c9acff63607fcb102bd317).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #89155 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89155/testReport)** for PR 20959 at commit [`d584cfe`](https://github.com/apache/spark/commit/d584cfe19135c1e742b1ae1f2873b827a32c0d7a).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20959
  
     _if I remember this correctly_, there was a discussion about it a lone ago and, @rxin was not sure how much it improves the perf and if it's worth, which I ended up with agreeing with. @rxin, did I recall this correctly?
    
    There's a workaround for it BTW:
    
    ```scala
    val ds = Seq("a", "b", "c", "d").toDS
    val sampledSchema = spark.read.option("inferSchema", true).csv(ds.sample(false, 0.7)).schema
    spark.read.schema(sampledSchema).csv(ds)
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89217/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #89677 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89677/testReport)** for PR 20959 at commit [`0737bf7`](https://github.com/apache/spark/commit/0737bf7717f6b1f253c9d78013065e7147279607).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r180942841
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -1279,4 +1278,68 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
           Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
         )
       }
    +
    +  test("SPARK-23846: schema inferring touches less data if samplingRatio < 1.0") {
    +    // Set default values for the DataSource parameters to make sure
    +    // that whole test file is mapped to only one partition. This will guarantee
    +    // reliable sampling of the input file.
    +    withSQLConf(
    +      "spark.sql.files.maxPartitionBytes" -> (128 * 1024 * 1024).toString,
    +      "spark.sql.files.openCostInBytes" -> (4 * 1024 * 1024).toString
    +    )(withTempPath { path =>
    +      val rdd = spark.sqlContext.range(0, 100).map {row =>
    +        val predefinedSample = Set[Long](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
    +          57, 62, 68, 72)
    +        val value = row.getLong(0)
    +        if (predefinedSample.contains(value)) {
    +          value.toString
    +        } else {
    +          (value.toDouble + 0.1).toString
    +        }
    +      }.repartition(1)
    +      rdd.write.text(path.getAbsolutePath)
    +
    +      val ds = spark.read
    +        .option("inferSchema", true)
    +        .option("samplingRatio", 0.1)
    +        .csv(path.getCanonicalPath)
    +      assert(ds.schema == new StructType().add("_c0", IntegerType))
    +    })
    +  }
    +
    +  test("SPARK-23846: usage of samplingRatio while parsing a dataset of strings") {
    +    val dstr = spark.sparkContext.parallelize(0 until 100, 1).map { i =>
    +      val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
    --- End diff --
    
    Can we take this out of `map`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #89664 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89664/testReport)** for PR 20959 at commit [`257b363`](https://github.com/apache/spark/commit/257b3638ae0db7051dd25affcaf8967a5a29db5d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #89967 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89967/testReport)** for PR 20959 at commit [`d4d9d65`](https://github.com/apache/spark/commit/d4d9d65ce28c4176c085449564c8e5f8ec0b3ff7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #89664 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89664/testReport)** for PR 20959 at commit [`257b363`](https://github.com/apache/spark/commit/257b3638ae0db7051dd25affcaf8967a5a29db5d).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    @gatorsmile @HyukjinKwon @sujithjay May I ask you to look at the PR again


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r183981806
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -1279,4 +1280,72 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
           Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
         )
       }
    +
    +  test("SPARK-23846: schema inferring touches less data if samplingRatio < 1.0") {
    +    // Set default values for the DataSource parameters to make sure
    +    // that whole test file is mapped to only one partition. This will guarantee
    +    // reliable sampling of the input file.
    +    withSQLConf(
    +      "spark.sql.files.maxPartitionBytes" -> (128 * 1024 * 1024).toString,
    +      "spark.sql.files.openCostInBytes" -> (4 * 1024 * 1024).toString
    +    )(withTempPath { path =>
    +      val ds = sampledTestData.coalesce(1)
    +      ds.write.text(path.getAbsolutePath)
    +
    +      val readback = spark.read
    +        .option("inferSchema", true).option("samplingRatio", 0.1)
    +        .csv(path.getCanonicalPath)
    +      assert(readback.schema == new StructType().add("_c0", IntegerType))
    +    })
    +  }
    +
    +  test("SPARK-23846: usage of samplingRatio while parsing a dataset of strings") {
    +    val ds = sampledTestData.coalesce(1)
    +    val readback = spark.read
    +      .option("inferSchema", true).option("samplingRatio", 0.1)
    +      .csv(ds)
    +
    +    assert(readback.schema == new StructType().add("_c0", IntegerType))
    +  }
    +
    +  test("SPARK-23846: samplingRatio is out of the range (0, 1.0]") {
    +    val ds = spark.range(0, 100, 1, 1).map(_.toString)
    +
    +    val errorMsg0 = intercept[IllegalArgumentException] {
    +      spark.read.option("inferSchema", true).option("samplingRatio", -1).csv(ds)
    +    }.getMessage
    +    assert(errorMsg0.contains("samplingRatio (-1.0) should be greater than 0"))
    +
    +    val errorMsg1 = intercept[IllegalArgumentException] {
    +      spark.read.option("inferSchema", true).option("samplingRatio", 0).csv(ds)
    +    }.getMessage
    +    assert(errorMsg1.contains("samplingRatio (0.0) should be greater than 0"))
    +
    +    val sampled = spark.read.option("inferSchema", true).option("samplingRatio", 10).csv(ds)
    --- End diff --
    
    Can we leave this case out? Looks a bit odd we test with sampling ratio more than 1. Maybe we can just leave this case out here for now.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89155/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    For usability, the workaround I suggested above has more flexibility. For example, we can make different operation (e.g, filter) on schema inference path. They are only few lines.
    
    Schema inference is discouraged in production line. I believe, for example, just taking 100 records and use the schema makes more sense.
    
    I am not against this option but I am saying I don't feel strong on this for above reasons. 



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88817/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #89239 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89239/testReport)** for PR 20959 at commit [`d12c2e2`](https://github.com/apache/spark/commit/d12c2e221b446596e2322a4d86ee0fd55a09b6ba).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #88827 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88827/testReport)** for PR 20959 at commit [`b6a7cc8`](https://github.com/apache/spark/commit/b6a7cc852bb7f27a94eb268f4f3e5e35f97538bd).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #89220 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89220/testReport)** for PR 20959 at commit [`1427f73`](https://github.com/apache/spark/commit/1427f73e13b5809ac9622bf5fbd80cb178544864).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #89879 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89879/testReport)** for PR 20959 at commit [`d4d9d65`](https://github.com/apache/spark/commit/d4d9d65ce28c4176c085449564c8e5f8ec0b3ff7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r180537052
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
    @@ -528,6 +529,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
        * <li>`header` (default `false`): uses the first line as names of columns.</li>
        * <li>`inferSchema` (default `false`): infers the input schema automatically from data. It
        * requires one extra pass over the data.</li>
    +   * <li>`samplingRatio` (default 1.0): the sample ratio of rows used for schema inferring.</li>
    --- End diff --
    
    I just did absolutely the same as `samplingRatio` for JSON datasource - `samplingRatio` is not supported as a parameter of `json()` but could be set as the option like  `.option('samplingRatio', '0.1')`.  Should I update PySpark API regarding `samplingRatio` for `json()` in separate PR? 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r179934764
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -1279,4 +1279,45 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
           Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
         )
       }
    +
    +  test("SPARK-23846: schema inferring touches less data if samplingRation < 1.0") {
    +    val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
    +      57, 62, 68, 72)
    --- End diff --
    
    `val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46)` is enough. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #89155 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89155/testReport)** for PR 20959 at commit [`d584cfe`](https://github.com/apache/spark/commit/d584cfe19135c1e742b1ae1f2873b827a32c0d7a).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r178618601
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
    @@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
           assert(df.schema === expectedSchema)
         }
       }
    +
    +  test("schema inferring touches less data if samplingRation < 1.0") {
    --- End diff --
    
    @sujithjay I created separate PR for the tests: https://github.com/apache/spark/pull/20963


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #89677 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89677/testReport)** for PR 20959 at commit [`0737bf7`](https://github.com/apache/spark/commit/0737bf7717f6b1f253c9d78013065e7147279607).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r180942904
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -1279,4 +1278,68 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
           Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
         )
       }
    +
    +  test("SPARK-23846: schema inferring touches less data if samplingRatio < 1.0") {
    +    // Set default values for the DataSource parameters to make sure
    +    // that whole test file is mapped to only one partition. This will guarantee
    +    // reliable sampling of the input file.
    +    withSQLConf(
    +      "spark.sql.files.maxPartitionBytes" -> (128 * 1024 * 1024).toString,
    +      "spark.sql.files.openCostInBytes" -> (4 * 1024 * 1024).toString
    +    )(withTempPath { path =>
    +      val rdd = spark.sqlContext.range(0, 100).map {row =>
    +        val predefinedSample = Set[Long](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
    +          57, 62, 68, 72)
    +        val value = row.getLong(0)
    +        if (predefinedSample.contains(value)) {
    +          value.toString
    +        } else {
    +          (value.toDouble + 0.1).toString
    +        }
    +      }.repartition(1)
    +      rdd.write.text(path.getAbsolutePath)
    +
    +      val ds = spark.read
    +        .option("inferSchema", true)
    +        .option("samplingRatio", 0.1)
    +        .csv(path.getCanonicalPath)
    +      assert(ds.schema == new StructType().add("_c0", IntegerType))
    +    })
    +  }
    +
    +  test("SPARK-23846: usage of samplingRatio while parsing a dataset of strings") {
    +    val dstr = spark.sparkContext.parallelize(0 until 100, 1).map { i =>
    +      val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
    +        57, 62, 68, 72)
    +      if (predefinedSample.contains(i)) {
    +        i.toString + "\n"
    +      } else {
    +        (i.toDouble + 0.1) + "\n"
    --- End diff --
    
    Hm .. does it need `\n`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #89154 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89154/testReport)** for PR 20959 at commit [`d4e815e`](https://github.com/apache/spark/commit/d4e815e8564e4c81933540896a0d85f4c8225ea2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r180981947
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -1279,4 +1278,64 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
           Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
         )
       }
    +
    +  val sampledTestData = (row: Row) => {
    --- End diff --
    
    I have to define it as `val` because `def` contains a reference to outer class `CSVSuite` and the lambda inside of `map(...)` becomes not serializable.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r182492483
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
    @@ -161,7 +161,8 @@ object TextInputCSVDataSource extends CSVDataSource {
           val firstRow = new CsvParser(parsedOptions.asParserSettings).parseLine(firstLine)
           val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
           val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
    -      val tokenRDD = csv.rdd.mapPartitions { iter =>
    +      val sampled: Dataset[String] = CSVUtils.sample(csv, parsedOptions)
    +      val tokenRDD = sampled.rdd.mapPartitions { iter =>
    --- End diff --
    
    No, it doesn't. The behavior is the same as in JSON datasource. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #89217 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89217/testReport)** for PR 20959 at commit [`9f26bb7`](https://github.com/apache/spark/commit/9f26bb78a5be28eb38c3fd178f4f88b66b8917ff).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r183982964
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -1279,4 +1280,72 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
           Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
         )
       }
    +
    +  test("SPARK-23846: schema inferring touches less data if samplingRatio < 1.0") {
    +    // Set default values for the DataSource parameters to make sure
    +    // that whole test file is mapped to only one partition. This will guarantee
    +    // reliable sampling of the input file.
    +    withSQLConf(
    +      "spark.sql.files.maxPartitionBytes" -> (128 * 1024 * 1024).toString,
    +      "spark.sql.files.openCostInBytes" -> (4 * 1024 * 1024).toString
    +    )(withTempPath { path =>
    +      val ds = sampledTestData.coalesce(1)
    +      ds.write.text(path.getAbsolutePath)
    +
    +      val readback = spark.read
    +        .option("inferSchema", true).option("samplingRatio", 0.1)
    +        .csv(path.getCanonicalPath)
    +      assert(readback.schema == new StructType().add("_c0", IntegerType))
    +    })
    +  }
    +
    +  test("SPARK-23846: usage of samplingRatio while parsing a dataset of strings") {
    +    val ds = sampledTestData.coalesce(1)
    +    val readback = spark.read
    +      .option("inferSchema", true).option("samplingRatio", 0.1)
    +      .csv(ds)
    +
    +    assert(readback.schema == new StructType().add("_c0", IntegerType))
    +  }
    +
    +  test("SPARK-23846: samplingRatio is out of the range (0, 1.0]") {
    +    val ds = spark.range(0, 100, 1, 1).map(_.toString)
    +
    +    val errorMsg0 = intercept[IllegalArgumentException] {
    +      spark.read.option("inferSchema", true).option("samplingRatio", -1).csv(ds)
    +    }.getMessage
    +    assert(errorMsg0.contains("samplingRatio (-1.0) should be greater than 0"))
    +
    +    val errorMsg1 = intercept[IllegalArgumentException] {
    +      spark.read.option("inferSchema", true).option("samplingRatio", 0).csv(ds)
    +    }.getMessage
    +    assert(errorMsg1.contains("samplingRatio (0.0) should be greater than 0"))
    +
    +    val sampled = spark.read.option("inferSchema", true).option("samplingRatio", 10).csv(ds)
    +    assert(sampled.count() == ds.count())
    +  }
    +
    +  test("SPARK-23846: sampling files for schema inferring in the multiLine mode") {
    +    withTempDir { dir =>
    +      Files.write(Paths.get(dir.getAbsolutePath, "0.csv"), "0.1".getBytes,
    +        StandardOpenOption.CREATE_NEW)
    +      for (i <- 1 until 10) {
    +        Files.write(Paths.get(dir.getAbsolutePath, s"$i.csv"), s"$i".getBytes,
    +          StandardOpenOption.CREATE_NEW)
    +      }
    +      val files = (0 until 10).map { i =>
    +        val hadoopConf = spark.sessionState.newHadoopConf()
    +        val hdfsPath = new Path(Paths.get(dir.getAbsolutePath, s"$i.csv").toString)
    +        hdfsPath.getFileSystem(hadoopConf).getFileStatus(hdfsPath)
    +      }
    +      // The test uses the internal method because public API cannot guarantee order of files
    --- End diff --
    
    ditto - https://github.com/apache/spark/pull/21056#discussion_r183982653


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    LGTM
    
    Merged to master.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #89964 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89964/testReport)** for PR 20959 at commit [`d4d9d65`](https://github.com/apache/spark/commit/d4d9d65ce28c4176c085449564c8e5f8ec0b3ff7).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #88818 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88818/testReport)** for PR 20959 at commit [`91c57cf`](https://github.com/apache/spark/commit/91c57cf898d7f62ee3c9acff63607fcb102bd317).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    I'm good with having this option given the data @MaxGekk posted. (I haven't reviewed the code - somebody else should do that before merging).
    
    `val sampledSchema = spark.read.option("inferSchema", true).csv(ds.sample(false, 0.7)).schema` is a bit clunky compared with an option that applies to all the sources.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89967/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r182488630
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
    @@ -161,7 +161,8 @@ object TextInputCSVDataSource extends CSVDataSource {
           val firstRow = new CsvParser(parsedOptions.asParserSettings).parseLine(firstLine)
           val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
           val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
    -      val tokenRDD = csv.rdd.mapPartitions { iter =>
    +      val sampled: Dataset[String] = CSVUtils.sample(csv, parsedOptions)
    +      val tokenRDD = sampled.rdd.mapPartitions { iter =>
    --- End diff --
    
    wait .. does this work when `multiLine` is enabled?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #89796 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89796/testReport)** for PR 20959 at commit [`b2c552c`](https://github.com/apache/spark/commit/b2c552c3aef2c6361e669af107330f721398a0bc).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils  with TestCsvData `


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r182608371
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
    @@ -161,7 +161,8 @@ object TextInputCSVDataSource extends CSVDataSource {
           val firstRow = new CsvParser(parsedOptions.asParserSettings).parseLine(firstLine)
           val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
           val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
    -      val tokenRDD = csv.rdd.mapPartitions { iter =>
    +      val sampled: Dataset[String] = CSVUtils.sample(csv, parsedOptions)
    +      val tokenRDD = sampled.rdd.mapPartitions { iter =>
    --- End diff --
    
    ```
    $ tree .
    .
    ├── a.json
    ├── b.json
    ├── c.json
    ├── d.json
    ├── e.json
    ├── f.json
    └── h.json
    
    0 directories, 7 files
    ```
    
    ```
    $ cat *
    {"a": "a"}
    {"a": 1}
    
    {"a": 1}
    
    {"a": 1}
    
    {"a": 1}
    
    {"a": 1}
    
    {"a": 1}
    ```
    
    ```scala
    scala> spark.read.option("samplingRatio", 0.2).option("multiLine", true).json("tmp").schema
    res0: org.apache.spark.sql.types.StructType = StructType(StructField(a,LongType,true))
    
    scala> spark.read.option("samplingRatio", 1).option("multiLine", true).json("tmp").schema
    res1: org.apache.spark.sql.types.StructType = StructType(StructField(a,StringType,true))
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by sujithjay <gi...@git.apache.org>.

Github user sujithjay commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r178567646
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
    @@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
           assert(df.schema === expectedSchema)
         }
       }
    +
    +  test("schema inferring touches less data if samplingRation < 1.0") {
    --- End diff --
    
    Hi @MaxGekk , since these test-cases in `JsonSuite` are not related to the changes made in this patch, can it be raised as a separate PR? 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    @MaxGekk Thanks for working on this!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89879/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r183979407
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -3009,6 +3009,15 @@ def test_sort_with_nulls_order(self):
                 df.select(df.name).orderBy(functions.desc_nulls_last('name')).collect(),
                 [Row(name=u'Tom'), Row(name=u'Alice'), Row(name=None)])
     
    +    def test_csv_sampling_ratio(self):
    +        rdd = self.spark.sparkContext.range(0, 100)\
    +            .map(lambda x: '0.1' if x == 1 else str(x))\
    +            .repartition(1)
    +        schema = self.spark.read.option('inferSchema', True)\
    --- End diff --
    
    Let's test `.csv(samplingRatio=0.5, ...)`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89796/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #89964 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89964/testReport)** for PR 20959 at commit [`d4d9d65`](https://github.com/apache/spark/commit/d4d9d65ce28c4176c085449564c8e5f8ec0b3ff7).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #89208 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89208/testReport)** for PR 20959 at commit [`3bceb3a`](https://github.com/apache/spark/commit/3bceb3a66064778be2bf568d3ad6382152fe9bd7).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r179934767
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -1279,4 +1279,45 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
           Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
         )
       }
    +
    +  test("SPARK-23846: schema inferring touches less data if samplingRation < 1.0") {
    +    val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
    +      57, 62, 68, 72)
    +    withTempPath { path =>
    +      val writer = Files.newBufferedWriter(Paths.get(path.getAbsolutePath),
    +        StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)
    +      for (i <- 0 until 100) {
    +        if (predefinedSample.contains(i)) {
    +          writer.write(i.toString + "\n")
    +        } else {
    +          writer.write((i.toDouble + 0.1).toString + "\n")
    +        }
    +      }
    +      writer.close()
    +
    +      val ds = spark.read
    +        .option("inferSchema", true)
    +        .option("samplingRatio", 0.1)
    +        .csv(path.getCanonicalPath)
    +      assert(ds.schema == new StructType().add("_c0", IntegerType))
    +    }
    +  }
    +
    +  test("SPARK-23846: usage of samplingRation while parsing of dataset of strings") {
    +    val dstr = spark.sparkContext.parallelize(0 until 100, 1).map { i =>
    +      val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
    +        57, 62, 68, 72)
    +      if (predefinedSample.contains(i)) {
    +        i.toString + "\n"
    +      } else {
    +        (i.toDouble + 0.1) + "\n"
    +      }
    +    }.toDS()
    +    val ds = spark.read
    +      .option("inferSchema", true)
    +      .option("samplingRatio", 0.1)
    --- End diff --
    
    Add some negative case. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #89241 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89241/testReport)** for PR 20959 at commit [`a37bf3b`](https://github.com/apache/spark/commit/a37bf3bc4034f587768903df304e5642a68f87c8).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #89154 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89154/testReport)** for PR 20959 at commit [`d4e815e`](https://github.com/apache/spark/commit/d4e815e8564e4c81933540896a0d85f4c8225ea2).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r179934772
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
    @@ -528,6 +529,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
        * <li>`header` (default `false`): uses the first line as names of columns.</li>
        * <li>`inferSchema` (default `false`): infers the input schema automatically from data. It
        * requires one extra pass over the data.</li>
    +   * <li>`samplingRatio` (default 1.0): the sample ratio of rows used for schema inferring.</li>
    --- End diff --
    
    Also need to update the PySpark API


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Let's address @gatorsmile's and mine at https://github.com/apache/spark/pull/20963 too as well. Seems fine otherwise.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on the issue:

https://github.com/apache/spark/pull/20959

@rxin I made an experiment on json files but numbers for csv are almost the same. For example, inferring schema for 50GB json:
```
scala> spark.read.option("samplingRatio", 0.000000001).json("test.json")
```
took 1.7 minute
```
scala> spark.read.option("samplingRatio", 1.0).json("test.json")
```
took 21.9 minutes.

I have looked in a profiler where Spark spends time during schema inferring for 50GB json. At least on my laptop - 75% in json parsing and 18% on disk IO. Of course, the numbers will be different in a cluster if the files would be read from s3 via network. In any way, the samplingRatio option gives us opportunity to find a balance of CPUs load and network/disk IO.

@HyukjinKwon The question is not about workaround, it is about usability:

1. For interactive queries, an user doesn't have to write the boilerplate code if there is the option.

2. If the code is used inside of a library, developers don't have to check special cases like if it is json use the samplingRatio option otherwise do sampling manually.

Additionally the behavior behind of the option could be improved in the future. For example, it will require less file reads during sampling. It would be easer to do that with the option.

---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r182656641
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
    @@ -161,7 +161,8 @@ object TextInputCSVDataSource extends CSVDataSource {
           val firstRow = new CsvParser(parsedOptions.asParserSettings).parseLine(firstLine)
           val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
           val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
    -      val tokenRDD = csv.rdd.mapPartitions { iter =>
    +      val sampled: Dataset[String] = CSVUtils.sample(csv, parsedOptions)
    +      val tokenRDD = sampled.rdd.mapPartitions { iter =>
    --- End diff --
    
    Thank you @HyukjinKwon , I didn't know the `samplingRatio` option in the case of `multiLine` means sampling of files but not lines. I will add the implementation for CSV and update the PR https://github.com/apache/spark/pull/21056 by new tests. I guess we should explicitly describe the behavior in the comments of `samplingRatio`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r180942647
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -3009,6 +3009,15 @@ def test_sort_with_nulls_order(self):
                 df.select(df.name).orderBy(functions.desc_nulls_last('name')).collect(),
                 [Row(name=u'Tom'), Row(name=u'Alice'), Row(name=None)])
     
    +    def test_csv_sampling_ratio(self):
    +        rdd = self.spark.sparkContext.range(0, 100).\
    +            map(lambda x: '0.1' if x == 1 else str(x)).\
    --- End diff --
    
    `.\nmap` -> `\n.map`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #89241 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89241/testReport)** for PR 20959 at commit [`a37bf3b`](https://github.com/apache/spark/commit/a37bf3bc4034f587768903df304e5642a68f87c8).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #89879 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89879/testReport)** for PR 20959 at commit [`d4d9d65`](https://github.com/apache/spark/commit/d4d9d65ce28c4176c085449564c8e5f8ec0b3ff7).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89664/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r183226807
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -882,6 +882,9 @@ def csv(self, path, mode=None, compression=None, sep=None, quote=None, escape=No
                                               the quote character. If None is set, the default value is
                                               escape character when escape and quote characters are
                                               different, ``\0`` otherwise..
    +        :param samplingRatio: defines fraction of rows (when ``multiLine`` is ``false``) or fraction
    +                              of files (when ``multiLine`` is ``true``) used for schema inferring.
    --- End diff --
    
    @MaxGekk, I think in CSV's case we shouldn't do sample based on the files .. JSON's case is a record per file whereas CSV allows multiple lines in a file ... 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89964/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #88817 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88817/testReport)** for PR 20959 at commit [`ba12fca`](https://github.com/apache/spark/commit/ba12fca8514a2cb620224e859e8a8b5cc208bf31).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89154/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r181165340
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
    @@ -528,6 +529,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
        * <li>`header` (default `false`): uses the first line as names of columns.</li>
        * <li>`inferSchema` (default `false`): infers the input schema automatically from data. It
        * requires one extra pass over the data.</li>
    +   * <li>`samplingRatio` (default 1.0): the sample ratio of rows used for schema inferring.</li>
    --- End diff --
    
    Please, have a look at the PR: https://github.com/apache/spark/pull/21056


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89677/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88818/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #89208 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89208/testReport)** for PR 20959 at commit [`3bceb3a`](https://github.com/apache/spark/commit/3bceb3a66064778be2bf568d3ad6382152fe9bd7).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    sure, will review this and go ahead.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #89967 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89967/testReport)** for PR 20959 at commit [`d4d9d65`](https://github.com/apache/spark/commit/d4d9d65ce28c4176c085449564c8e5f8ec0b3ff7).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r183979802
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -19,23 +19,24 @@ package org.apache.spark.sql.execution.datasources.csv
     
     import java.io.File
     import java.nio.charset.UnsupportedCharsetException
    +import java.nio.file.{Files, Paths, StandardOpenOption}
     import java.sql.{Date, Timestamp}
     import java.text.SimpleDateFormat
     import java.util.Locale
     
     import org.apache.commons.lang3.time.FastDateFormat
    +import org.apache.hadoop.fs.Path
     import org.apache.hadoop.io.SequenceFile.CompressionType
     import org.apache.hadoop.io.compress.GzipCodec
     
     import org.apache.spark.SparkException
     import org.apache.spark.sql.{AnalysisException, DataFrame, QueryTest, Row, UDT}
     import org.apache.spark.sql.catalyst.util.DateTimeUtils
    -import org.apache.spark.sql.functions.{col, regexp_replace}
     import org.apache.spark.sql.internal.SQLConf
     import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
     import org.apache.spark.sql.types._
     
    -class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
    +class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils  with TestCsvData {
    --- End diff --
    
    `  ` -> ` `.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20959: [SPARK-23846][SQL] The samplingRatio option for C...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20959#discussion_r180529086
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -1279,4 +1279,45 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
           Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
         )
       }
    +
    +  test("SPARK-23846: schema inferring touches less data if samplingRation < 1.0") {
    +    val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
    +      57, 62, 68, 72)
    --- End diff --
    
    Sampling of 100 elements return 13 elements, if it is called with `withReplacement` = `false` and `seed` = `1` . The test fails if I select only 10 elements. I took the same values for  `withReplacement` and `seed` as in JSON datasource.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89239/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89208/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #89217 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89217/testReport)** for PR 20959 at commit [`9f26bb7`](https://github.com/apache/spark/commit/9f26bb78a5be28eb38c3fd178f4f88b66b8917ff).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20959
  
    **[Test build #89796 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89796/testReport)** for PR 20959 at commit [`b2c552c`](https://github.com/apache/spark/commit/b2c552c3aef2c6361e669af107330f721398a0bc).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org