You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by HyukjinKwon <gi...@git.apache.org> on 2017/01/23 13:50:19 UTC

[GitHub] spark pull request #16680: [SPARK-16101][SQL] Refactoring CSV schema inferen...

GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/16680

    [SPARK-16101][SQL] Refactoring CSV schema inference path to be consistent with JSON

    ## What changes were proposed in this pull request?
    
    This PR refactors CSV schema inference path to be consistent with JSON data source and moves some filtering codes having the similar/same logics into `CSVUtils`
    
     It makes the methods in classes have consistent arguments with JSON ones.
    
    `UnivocityParser` and `JacksonParser` 
        
    ``` scala
    private[csv] object CSVInferSchema {
      ...
        
      def infer(
          csv: Dataset[String],
          caseSensitive: Boolean,
          options: CSVOptions): StructType = {
      ...
    ```
        
    ``` scala
    private[sql] object InferSchema {
      ...
        
      def infer(
          json: RDD[String],
          columnNameOfCorruptRecord: String,
          configOptions: JSONOptions): StructType = {
      ...
    ```
    
    These allow schema inference from `Dataset[String]` directly. This completes refactoring CSV datasource and they are now pretty consistent.
    
    ## How was this patch tested?
    
    Existing tests should cover this.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-16101-schema-inference

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16680.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16680
    
----

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    Yes, I tried to only move and does not touch the code path at my best. Let me check this again tomorrow with Scala 2.10 build and ping you again.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    **[Test build #71843 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71843/testReport)** for PR 16680 at commit [`5562911`](https://github.com/apache/spark/commit/55629117b5ff656e2c68f566bf6591be211c2e9d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    Sure!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    Let me double check tomorrow in case and then cc someone.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    **[Test build #71967 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71967/testReport)** for PR 16680 at commit [`37e0296`](https://github.com/apache/spark/commit/37e029687f1af1d5acfbe8111c6da8987a20abf1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72446/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16680: [SPARK-16101][SQL] Refactoring CSV schema inferen...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16680#discussion_r99583137
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
    @@ -170,32 +111,21 @@ class CSVFileFormat extends TextBasedFileFormat with DataSourceRegister {
             }
           }
     
    -      // Consumes the header in the iterator.
    -      CSVRelation.dropHeaderLine(file, lineIterator, csvOptions)
    -
    -      val filteredIter = lineIterator.filter { line =>
    -        line.trim.nonEmpty && !line.startsWith(commentPrefix)
    +      val linesWithoutHeader = if (csvOptions.headerFlag && file.start == 0) {
    +        // Note that if there are only comments in the first block, the header would probably
    +        // be not dropped.
    +        CSVUtils.dropHeaderLine(lines, csvOptions)
    +      } else {
    +        lines
           }
     
    +      val filteredLines = CSVUtils.filterCommentAndEmpty(linesWithoutHeader, csvOptions)
           val parser = new UnivocityParser(dataSchema, requiredSchema, csvOptions)
    -      filteredIter.flatMap(parser.parse)
    -    }
    -  }
    -
    -  /**
    -   * Returns the first line of the first non-empty file in path
    -   */
    -  private def findFirstLine(options: CSVOptions, lines: Dataset[String]): String = {
    -    import lines.sqlContext.implicits._
    -    val nonEmptyLines = lines.filter(length(trim($"value")) > 0)
    -    if (options.isCommentSet) {
    -      nonEmptyLines.filter(!$"value".startsWith(options.comment.toString)).first()
    -    } else {
    -      nonEmptyLines.first()
    +      filteredLines.flatMap(parser.parse)
         }
       }
     
    -  private def readText(
    --- End diff --
    
    Sure, either way is fine with me. I just resembled it from `JsonFileFormat.createBaseRDD`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72024/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [WIP][SPARK-16101][SQL] Refactoring CSV schema inference...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [WIP][SPARK-16101][SQL] Refactoring CSV schema inference...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71928/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    @cloud-fan, I just mainly resembled ones in JSON datasource and I am pretty sure you knew this when you added some comments. But let me just rebase this as is for now just in case maybe you are okay with them above and missed my reasons. 
    
    I know it is not always true to follow existing implementation but maybe we could rename them together if the other one does not look appropriate. (I am fine with either way but just want to be sure that you know I had some reasons).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16680: [WIP][SPARK-16101][SQL] Refactoring CSV schema in...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16680#discussion_r97706016
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala ---
    @@ -39,22 +37,76 @@ private[csv] object CSVInferSchema {
        *     3. Replace any null types with string type
        */
       def infer(
    -      tokenRdd: RDD[Array[String]],
    -      header: Array[String],
    +      csv: Dataset[String],
    +      caseSensitive: Boolean,
           options: CSVOptions): StructType = {
    -    val startType: Array[DataType] = Array.fill[DataType](header.length)(NullType)
    -    val rootTypes: Array[DataType] =
    -      tokenRdd.aggregate(startType)(inferRowType(options), mergeRowTypes)
    -
    -    val structFields = header.zip(rootTypes).map { case (thisHeader, rootType) =>
    -      val dType = rootType match {
    -        case _: NullType => StringType
    -        case other => other
    +    val firstLine: String = CSVUtils.filterCommentAndEmpty(csv, options).first()
    --- End diff --
    
    Both `CSVUtils.filterCommentAndEmpty` usages here and below should exactly the same up to my knowledge but I let them as are just simply to keep the behaviour for now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16680: [SPARK-16101][SQL] Refactoring CSV schema inferen...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16680#discussion_r99628216
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala ---
    @@ -39,22 +37,76 @@ private[csv] object CSVInferSchema {
        *     3. Replace any null types with string type
        */
       def infer(
    -      tokenRdd: RDD[Array[String]],
    -      header: Array[String],
    +      csv: Dataset[String],
    +      caseSensitive: Boolean,
           options: CSVOptions): StructType = {
    -    val startType: Array[DataType] = Array.fill[DataType](header.length)(NullType)
    -    val rootTypes: Array[DataType] =
    -      tokenRdd.aggregate(startType)(inferRowType(options), mergeRowTypes)
    -
    -    val structFields = header.zip(rootTypes).map { case (thisHeader, rootType) =>
    -      val dType = rootType match {
    -        case _: NullType => StringType
    -        case other => other
    +    val firstLine: String = CSVUtils.filterCommentAndEmpty(csv, options).first()
    --- End diff --
    
    please send a follow-up PR for this


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16680: [SPARK-16101][SQL] Refactoring CSV schema inferen...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16680#discussion_r97321669
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala ---
    @@ -1,69 +0,0 @@
    -/*
    --- End diff --
    
    These mainly into `CSVUtils`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16680: [SPARK-16101][SQL] Refactoring CSV schema inferen...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16680#discussion_r99530031
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
    @@ -170,32 +111,21 @@ class CSVFileFormat extends TextBasedFileFormat with DataSourceRegister {
             }
           }
     
    -      // Consumes the header in the iterator.
    -      CSVRelation.dropHeaderLine(file, lineIterator, csvOptions)
    -
    -      val filteredIter = lineIterator.filter { line =>
    -        line.trim.nonEmpty && !line.startsWith(commentPrefix)
    +      val linesWithoutHeader = if (csvOptions.headerFlag && file.start == 0) {
    +        // Note that if there are only comments in the first block, the header would probably
    +        // be not dropped.
    +        CSVUtils.dropHeaderLine(lines, csvOptions)
    +      } else {
    +        lines
           }
     
    +      val filteredLines = CSVUtils.filterCommentAndEmpty(linesWithoutHeader, csvOptions)
           val parser = new UnivocityParser(dataSchema, requiredSchema, csvOptions)
    -      filteredIter.flatMap(parser.parse)
    -    }
    -  }
    -
    -  /**
    -   * Returns the first line of the first non-empty file in path
    -   */
    -  private def findFirstLine(options: CSVOptions, lines: Dataset[String]): String = {
    -    import lines.sqlContext.implicits._
    -    val nonEmptyLines = lines.filter(length(trim($"value")) > 0)
    -    if (options.isCommentSet) {
    -      nonEmptyLines.filter(!$"value".startsWith(options.comment.toString)).first()
    -    } else {
    -      nonEmptyLines.first()
    +      filteredLines.flatMap(parser.parse)
         }
       }
     
    -  private def readText(
    --- End diff --
    
    seems `readAsLines` is better?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [WIP][SPARK-16101][SQL] Refactoring CSV schema inference...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    @cloud-fan, I just built it with 2.10 and checked it does not touch original code path, line by line at my best.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    **[Test build #72024 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72024/testReport)** for PR 16680 at commit [`ad09417`](https://github.com/apache/spark/commit/ad09417425cbf10d3f960a7ceeb0e72e2cae9d94).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    **[Test build #71966 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71966/testReport)** for PR 16680 at commit [`15c4dec`](https://github.com/apache/spark/commit/15c4dec8c4a35abd6e2dcf47dd2f2e9bc37c8129).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16680: [SPARK-16101][SQL] Refactoring CSV schema inferen...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/16680


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    Thank you so much.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [WIP][SPARK-16101][SQL] Refactoring CSV schema inference...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    **[Test build #71966 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71966/testReport)** for PR 16680 at commit [`15c4dec`](https://github.com/apache/spark/commit/15c4dec8c4a35abd6e2dcf47dd2f2e9bc37c8129).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16680: [SPARK-16101][SQL] Refactoring CSV schema inferen...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16680#discussion_r99530062
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala ---
    @@ -39,22 +37,76 @@ private[csv] object CSVInferSchema {
        *     3. Replace any null types with string type
        */
       def infer(
    -      tokenRdd: RDD[Array[String]],
    -      header: Array[String],
    +      csv: Dataset[String],
    --- End diff --
    
    nit: `csvLines`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    @cloud-fan, could you please take a look? I tried to not change the current behaviour and logics at my best but just re-locate them here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    **[Test build #72446 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72446/testReport)** for PR 16680 at commit [`6f7fa9b`](https://github.com/apache/spark/commit/6f7fa9b91fc5e86c3de06530486d29bbaf19079f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16680: [SPARK-16101][SQL] Refactoring CSV schema inferen...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16680#discussion_r97321455
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
    @@ -170,32 +111,21 @@ class CSVFileFormat extends TextBasedFileFormat with DataSourceRegister {
             }
           }
     
    -      // Consumes the header in the iterator.
    --- End diff --
    
    The removed two blocks below are into `CSVUtils`. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [WIP][SPARK-16101][SQL] Refactoring CSV schema inference...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    **[Test build #71844 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71844/testReport)** for PR 16680 at commit [`2b03c9d`](https://github.com/apache/spark/commit/2b03c9dbd974df46ff66163ab3de592efb91bb3b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [WIP][SPARK-16101][SQL] Refactoring CSV schema inference...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    **[Test build #71928 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71928/testReport)** for PR 16680 at commit [`0f7b9b8`](https://github.com/apache/spark/commit/0f7b9b8b17f79c83c920682f000d7c4eb4cda291).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71966/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16680: [SPARK-16101][SQL] Refactoring CSV schema inferen...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16680#discussion_r97321283
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
    @@ -60,64 +57,9 @@ class CSVFileFormat extends TextBasedFileFormat with DataSourceRegister {
     
         val csvOptions = new CSVOptions(options)
         val paths = files.map(_.getPath.toString)
    -    val lines: Dataset[String] = readText(sparkSession, csvOptions, paths)
    -    val firstLine: String = findFirstLine(csvOptions, lines)
    -    val firstRow = new CsvParser(csvOptions.asParserSettings).parseLine(firstLine)
    +    val lines: Dataset[String] = createBaseDataset(sparkSession, csvOptions, paths)
         val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
    -    val header = makeSafeHeader(firstRow, csvOptions, caseSensitive)
    --- End diff --
    
    These removed block is all into `CSVInferSchema.infer(...)`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [WIP][SPARK-16101][SQL] Refactoring CSV schema inference...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [WIP][SPARK-16101][SQL] Refactoring CSV schema inference...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    **[Test build #71928 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71928/testReport)** for PR 16680 at commit [`0f7b9b8`](https://github.com/apache/spark/commit/0f7b9b8b17f79c83c920682f000d7c4eb4cda291).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16680: [SPARK-16101][SQL] Refactoring CSV schema inferen...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16680#discussion_r97321103
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala ---
    @@ -0,0 +1,116 @@
    +/*
    --- End diff --
    
    This class might be too much. I am willing to revert this back.
    
    The only reason I make them into here is, there are similar logics for dealing with header, comment and empty strings in reading/schema inference path, and they look a bit messy and easily letting other engineers make a mistake (actually they are already different even though some cases look required to be the same but I did not fix it here to keep the original behaviour).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16680: [SPARK-16101][SQL] Refactoring CSV schema inferen...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16680#discussion_r99530531
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala ---
    @@ -0,0 +1,134 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import org.apache.spark.sql.Dataset
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +
    +object CSVUtils {
    +  /**
    +   * Filter ignorable rows for CSV dataset (lines empty and starting with `comment`).
    +   * This is currently being used in CSV schema inference.
    +   */
    +  def filterCommentAndEmpty(lines: Dataset[String], options: CSVOptions): Dataset[String] = {
    +    // Note that this was separately made by SPARK-18362. Logically, this should be the same
    +    // with the one below, `filterCommentAndEmpty` but execution path is different. One of them
    +    // might have to be removed in the near future if possible.
    +    import lines.sqlContext.implicits._
    +    val nonEmptyLines = lines.filter(length(trim($"value")) > 0)
    +    if (options.isCommentSet) {
    +      nonEmptyLines.filter(!$"value".startsWith(options.comment.toString))
    +    } else {
    +      nonEmptyLines
    +    }
    +  }
    +
    +  /**
    +   * Filter ignorable rows for CSV iterator (lines empty and starting with `comment`).
    +   * This is currently being used in CSV reading path and CSV schema inference.
    +   */
    +  def filterCommentAndEmpty(iter: Iterator[String], options: CSVOptions): Iterator[String] = {
    +    iter.filter { line =>
    +      line.trim.nonEmpty && !line.startsWith(options.comment.toString)
    +    }
    +  }
    +
    +  /**
    +   * Skip the given first line so that only data can remain in a dataset.
    +   * This is similar with `dropHeaderLine` below and currently being used in CSV schema inference.
    +   */
    +  def filterHeaderLine(
    +       iter: Iterator[String],
    +       firstLine: String,
    +       options: CSVOptions): Iterator[String] = {
    +    // Note that unlike actual CSV reading path, it simply filters the given first line. Therefore,
    +    // this skips the line same with the header if exists. One of them might have to be removed
    +    // in the near future if possible.
    +    if (options.headerFlag) {
    +      iter.filterNot(_ == firstLine)
    +    } else {
    +      iter
    +    }
    +  }
    +
    +  /**
    +   * Drop header line so that only data can remain.
    +   * This is similar with `filterHeaderLine` above and currently being used in CSV reading path.
    +   */
    +  def dropHeaderLine(iter: Iterator[String], options: CSVOptions): Iterator[String] = {
    +    val nonEmptyLines = if (options.isCommentSet) {
    +      val commentPrefix = options.comment.toString
    +      iter.dropWhile { line =>
    +        line.trim.isEmpty || line.trim.startsWith(commentPrefix)
    +      }
    +    } else {
    +      iter.dropWhile(_.trim.isEmpty)
    +    }
    +
    +    if (nonEmptyLines.hasNext) nonEmptyLines.drop(1)
    +    iter
    +  }
    +
    +  /**
    +   * Helper method that converts string representation of a character to actual character.
    +   * It handles some Java escaped strings and throws exception if given string is longer than one
    +   * character.
    +   */
    +  @throws[IllegalArgumentException]
    +  def toChar(str: String): Char = {
    +    if (str.charAt(0) == '\\') {
    +      str.charAt(1)
    +      match {
    +        case 't' => '\t'
    +        case 'r' => '\r'
    +        case 'b' => '\b'
    +        case 'f' => '\f'
    +        case '\"' => '\"' // In case user changes quote char and uses \" as delimiter in options
    +        case '\'' => '\''
    +        case 'u' if str == """\u0000""" => '\u0000'
    +        case _ =>
    +          throw new IllegalArgumentException(s"Unsupported special character for delimiter: $str")
    +      }
    +    } else if (str.length == 1) {
    +      str.charAt(0)
    +    } else {
    +      throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str")
    +    }
    +  }
    +
    +  /**
    +   * Verify if the schema is supported in CSV datasource.
    +   */
    +  def verifySchema(schema: StructType): Unit = {
    --- End diff --
    
    I think it's more related to `CsvInferSchema`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16680: [SPARK-16101][SQL] Refactoring CSV schema inferen...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16680#discussion_r97321590
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala ---
    @@ -215,32 +267,3 @@ private[csv] object CSVInferSchema {
         case _ => None
       }
     }
    -
    -private[csv] object CSVTypeCast {
    -  /**
    -   * Helper method that converts string representation of a character to actual character.
    -   * It handles some Java escaped strings and throws exception if given string is longer than one
    -   * character.
    -   */
    -  @throws[IllegalArgumentException]
    --- End diff --
    
    This is into `CSVUtils`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    thanks, merging to master!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    (I am fine with changing the name only for CSV ones for now as well. I would appreciate if you confirm please)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [WIP][SPARK-16101][SQL] Refactoring CSV schema inference...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71845/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    **[Test build #71844 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71844/testReport)** for PR 16680 at commit [`2b03c9d`](https://github.com/apache/spark/commit/2b03c9dbd974df46ff66163ab3de592efb91bb3b).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [WIP][SPARK-16101][SQL] Refactoring CSV schema inference...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    **[Test build #71845 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71845/testReport)** for PR 16680 at commit [`44a4d93`](https://github.com/apache/spark/commit/44a4d93efedca1fed8b06bdd138a12d0d67a1438).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16680: [SPARK-16101][SQL] Refactoring CSV schema inferen...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16680#discussion_r99583374
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala ---
    @@ -39,22 +37,76 @@ private[csv] object CSVInferSchema {
        *     3. Replace any null types with string type
        */
       def infer(
    -      tokenRdd: RDD[Array[String]],
    -      header: Array[String],
    +      csv: Dataset[String],
    --- End diff --
    
    This one too, I just resembled `json.InferSchema.infer` ...
    
    ```
      def infer(
          json: RDD[String],
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    Sorry I didn't compare the CSV part with JSON part closely, I'm ok to keep them consistent now. To confirm, this PR just moves codes around right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    shall we rename `InferSchema` to `JsonInferSchema`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [WIP][SPARK-16101][SQL] Refactoring CSV schema inference...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71844/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [WIP][SPARK-16101][SQL] Refactoring CSV schema inference...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    **[Test build #71843 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71843/testReport)** for PR 16680 at commit [`5562911`](https://github.com/apache/spark/commit/55629117b5ff656e2c68f566bf6591be211c2e9d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    **[Test build #71845 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71845/testReport)** for PR 16680 at commit [`44a4d93`](https://github.com/apache/spark/commit/44a4d93efedca1fed8b06bdd138a12d0d67a1438).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    **[Test build #72446 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72446/testReport)** for PR 16680 at commit [`6f7fa9b`](https://github.com/apache/spark/commit/6f7fa9b91fc5e86c3de06530486d29bbaf19079f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    **[Test build #72024 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72024/testReport)** for PR 16680 at commit [`ad09417`](https://github.com/apache/spark/commit/ad09417425cbf10d3f960a7ceeb0e72e2cae9d94).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16680: [SPARK-16101][SQL] Refactoring CSV schema inferen...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16680#discussion_r99583376
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala ---
    @@ -0,0 +1,134 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import org.apache.spark.sql.Dataset
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +
    +object CSVUtils {
    +  /**
    +   * Filter ignorable rows for CSV dataset (lines empty and starting with `comment`).
    +   * This is currently being used in CSV schema inference.
    +   */
    +  def filterCommentAndEmpty(lines: Dataset[String], options: CSVOptions): Dataset[String] = {
    +    // Note that this was separately made by SPARK-18362. Logically, this should be the same
    +    // with the one below, `filterCommentAndEmpty` but execution path is different. One of them
    +    // might have to be removed in the near future if possible.
    +    import lines.sqlContext.implicits._
    +    val nonEmptyLines = lines.filter(length(trim($"value")) > 0)
    +    if (options.isCommentSet) {
    +      nonEmptyLines.filter(!$"value".startsWith(options.comment.toString))
    +    } else {
    +      nonEmptyLines
    +    }
    +  }
    +
    +  /**
    +   * Filter ignorable rows for CSV iterator (lines empty and starting with `comment`).
    +   * This is currently being used in CSV reading path and CSV schema inference.
    +   */
    +  def filterCommentAndEmpty(iter: Iterator[String], options: CSVOptions): Iterator[String] = {
    +    iter.filter { line =>
    +      line.trim.nonEmpty && !line.startsWith(options.comment.toString)
    +    }
    +  }
    +
    +  /**
    +   * Skip the given first line so that only data can remain in a dataset.
    +   * This is similar with `dropHeaderLine` below and currently being used in CSV schema inference.
    +   */
    +  def filterHeaderLine(
    +       iter: Iterator[String],
    +       firstLine: String,
    +       options: CSVOptions): Iterator[String] = {
    +    // Note that unlike actual CSV reading path, it simply filters the given first line. Therefore,
    +    // this skips the line same with the header if exists. One of them might have to be removed
    +    // in the near future if possible.
    +    if (options.headerFlag) {
    +      iter.filterNot(_ == firstLine)
    +    } else {
    +      iter
    +    }
    +  }
    +
    +  /**
    +   * Drop header line so that only data can remain.
    +   * This is similar with `filterHeaderLine` above and currently being used in CSV reading path.
    +   */
    +  def dropHeaderLine(iter: Iterator[String], options: CSVOptions): Iterator[String] = {
    +    val nonEmptyLines = if (options.isCommentSet) {
    +      val commentPrefix = options.comment.toString
    +      iter.dropWhile { line =>
    +        line.trim.isEmpty || line.trim.startsWith(commentPrefix)
    +      }
    +    } else {
    +      iter.dropWhile(_.trim.isEmpty)
    +    }
    +
    +    if (nonEmptyLines.hasNext) nonEmptyLines.drop(1)
    +    iter
    +  }
    +
    +  /**
    +   * Helper method that converts string representation of a character to actual character.
    +   * It handles some Java escaped strings and throws exception if given string is longer than one
    +   * character.
    +   */
    +  @throws[IllegalArgumentException]
    +  def toChar(str: String): Char = {
    +    if (str.charAt(0) == '\\') {
    +      str.charAt(1)
    +      match {
    +        case 't' => '\t'
    +        case 'r' => '\r'
    +        case 'b' => '\b'
    +        case 'f' => '\f'
    +        case '\"' => '\"' // In case user changes quote char and uses \" as delimiter in options
    +        case '\'' => '\''
    +        case 'u' if str == """\u0000""" => '\u0000'
    +        case _ =>
    +          throw new IllegalArgumentException(s"Unsupported special character for delimiter: $str")
    +      }
    +    } else if (str.length == 1) {
    +      str.charAt(0)
    +    } else {
    +      throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str")
    +    }
    +  }
    +
    +  /**
    +   * Verify if the schema is supported in CSV datasource.
    +   */
    +  def verifySchema(schema: StructType): Unit = {
    --- End diff --
    
    This one resembled `JacksonUtils.verifySchema`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [WIP][SPARK-16101][SQL] Refactoring CSV schema inference...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    **[Test build #71967 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71967/testReport)** for PR 16680 at commit [`37e0296`](https://github.com/apache/spark/commit/37e029687f1af1d5acfbe8111c6da8987a20abf1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [SPARK-16101][SQL] Refactoring CSV schema inference path...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71967/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [WIP][SPARK-16101][SQL] Refactoring CSV schema inference...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71843/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16680: [WIP][SPARK-16101][SQL] Refactoring CSV schema inference...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16680
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org