
Posted to reviews@spark.apache.org by HyukjinKwon <gi...@git.apache.org> on 2016/04/09 05:37:33 UTC

[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/12268

    [SPARK-14480][SQL] Simplify CSV parsing process with a better performance

    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/browse/SPARK-14480
    
    In `CSVParser.scala`, there is a `Reader` wrapping an `Iterator`, but this causes two problems.
    
    Firstly, it was actually not faster than processing line by line with `Iterator`, due to the additional logic needed to wrap the `Iterator` in a `Reader`.
    
    Secondly, this added complexity because extra logic is needed to allow every line to be read byte by byte. As a result, it was pretty difficult to track down parsing issues (e.g. SPARK-14103).
    
    This PR removes the `CSVParser` classes and introduces new classes `UnivocityParser`, `UnivocityGenerator` and `CSVUtils` to be consistent with the JSON data source (`JacksonParser`, `JacksonGenerator` and `JacksonUtils`). Also, `DefaultSource` moves to `CSVRelation`, just like `JSONRelation`.
    
    In short, this PR includes two changes:
    
    - Parse CSV data with `Iterator` instead of `Reader`.
    - Refactor CSV data source to be consistent with JSON data source.
    
    ## How was this patch tested?
    
    Existing tests should cover this.
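
The first change described above (parsing with `Iterator` instead of `Reader`) can be illustrated with a minimal Scala sketch. This is not code from the patch: `IteratorVsReaderSketch`, `tokenize`, `parseViaReader` and `parseViaIterator` are hypothetical names, the tokenizer is a naive stand-in that ignores quoting and escaping, and the `Reader` variant is simplified (it joins all lines eagerly, which the original wrapper presumably did not do).

```scala
import java.io.{BufferedReader, StringReader}
import java.util.regex.Pattern

object IteratorVsReaderSketch {
  // Hypothetical stand-in for a real CSV tokenizer: splits on the delimiter only,
  // with no support for quotes or escapes.
  def tokenize(line: String, delimiter: Char = ','): Array[String] =
    line.split(Pattern.quote(delimiter.toString), -1)

  // Reader style (simplified): turn the lines back into one character stream,
  // then split that stream into records again before tokenizing.
  def parseViaReader(lines: Iterator[String]): Iterator[Array[String]] = {
    val buffered = new BufferedReader(new StringReader(lines.mkString("\n")))
    Iterator
      .continually(buffered.readLine())
      .takeWhile(_ != null)
      .map(line => tokenize(line))
  }

  // Iterator style: tokenize each line as it arrives, with no intermediate stream.
  def parseViaIterator(lines: Iterator[String]): Iterator[Array[String]] =
    lines.map(line => tokenize(line))
}
```

For simple input both variants produce the same tokens; the iterator variant just skips the step of re-serialising lines into a character stream and re-splitting them.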


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-14480

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12268.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12268
    
----
commit 0eff7e7ccf9c8298f0969b28e87532b70ddafc2e
Author: hyukjinkwon <gu...@gmail.com>
Date:   2016-04-09T03:27:13Z

    Simplify CSV parsing process with a better performance

commit b5a966962845011d6d56bb45d704d83eb5e06e38
Author: hyukjinkwon <gu...@gmail.com>
Date:   2016-04-09T03:36:17Z

    Remove unintentionally added test codes

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-217074901
  
    **[Test build #57834 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57834/consoleFull)** for PR 12268 at commit [`a0aed27`](https://github.com/apache/spark/commit/a0aed27b7169caee50d0e97bceb6653202ba3f04).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-221156923
  
    **[Test build #59175 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59175/consoleFull)** for PR 12268 at commit [`66b1757`](https://github.com/apache/spark/commit/66b17570a8d1ad53b5073bbfa439eb01b05413c1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r59334104
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala ---
    @@ -17,153 +17,197 @@
     
     package org.apache.spark.sql.execution.datasources.csv
     
    -import scala.util.control.NonFatal
    -
    -import org.apache.hadoop.fs.Path
    -import org.apache.hadoop.io.{NullWritable, Text}
    -import org.apache.hadoop.mapreduce.RecordWriter
    -import org.apache.hadoop.mapreduce.TaskAttemptContext
    +import java.io.CharArrayWriter
    +import java.nio.charset.{Charset, StandardCharsets}
    +
    +import com.univocity.parsers.csv.CsvWriter
    +import org.apache.hadoop.conf.Configuration
    +import org.apache.hadoop.fs.{FileStatus, Path}
    +import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
    +import org.apache.hadoop.mapred.TextInputFormat
    +import org.apache.hadoop.mapreduce.{Job, RecordWriter, TaskAttemptContext}
     import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
     
    +import org.apache.spark.broadcast.Broadcast
     import org.apache.spark.internal.Logging
     import org.apache.spark.rdd.RDD
     import org.apache.spark.sql._
     import org.apache.spark.sql.catalyst.InternalRow
    -import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
    -import org.apache.spark.sql.execution.datasources.PartitionedFile
    +import org.apache.spark.sql.catalyst.expressions.{JoinedRow, UnsafeProjection}
    +import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
    +import org.apache.spark.sql.execution.datasources.{CompressionCodecs, HadoopFileLinesReader, PartitionedFile}
     import org.apache.spark.sql.sources._
     import org.apache.spark.sql.types._
    +import org.apache.spark.util.SerializableConfiguration
    +import org.apache.spark.util.collection.BitSet
     
    -object CSVRelation extends Logging {
    -
    -  def univocityTokenizer(
    -      file: RDD[String],
    -      header: Seq[String],
    -      firstLine: String,
    -      params: CSVOptions): RDD[Array[String]] = {
    -    // If header is set, make sure firstLine is materialized before sending to executors.
    -    file.mapPartitions { iter =>
    -      new BulkCsvReader(
    -        if (params.headerFlag) iter.filterNot(_ == firstLine) else iter,
    -        params,
    -        headers = header)
    -    }
    -  }
    +/**
    + * Provides access to CSV data from pure SQL statements.
    + */
    +class DefaultSource extends FileFormat with DataSourceRegister {
    +
    +  override def shortName(): String = "csv"
    +
    +  override def toString: String = "CSV"
    +
    +  override def equals(other: Any): Boolean = other.isInstanceOf[DefaultSource]
    +
    +  override def inferSchema(
    +      sqlContext: SQLContext,
    +      options: Map[String, String],
    +      files: Seq[FileStatus]): Option[StructType] = {
    +    val csvOptions = new CSVOptions(options)
     
    -  def csvParser(
    -      schema: StructType,
    -      requiredColumns: Array[String],
    -      params: CSVOptions): Array[String] => Option[InternalRow] = {
    -    val schemaFields = schema.fields
    -    val requiredFields = StructType(requiredColumns.map(schema(_))).fields
    -    val safeRequiredFields = if (params.dropMalformed) {
    -      // If `dropMalformed` is enabled, then it needs to parse all the values
    -      // so that we can decide which row is malformed.
    -      requiredFields ++ schemaFields.filterNot(requiredFields.contains(_))
    +    // TODO: Move filtering.
    --- End diff --
    
    Actually, let me just leave it as it is. I am not 100% sure if that means what you just said.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-212227632
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-213663125
  
    **[Test build #56768 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56768/consoleFull)** for PR 12268 at commit [`92f8f38`](https://github.com/apache/spark/commit/92f8f387cec10cb61e178b312748f86bd75b1b55).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-216097353
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57498/
    Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-213658568
  
    **[Test build #56768 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56768/consoleFull)** for PR 12268 at commit [`92f8f38`](https://github.com/apache/spark/commit/92f8f387cec10cb61e178b312748f86bd75b1b55).



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r61258605
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
    @@ -0,0 +1,189 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import scala.util.control.NonFatal
    +
    +import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
    +
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
    +import org.apache.spark.sql.types.{StructField, StructType}
    +
    +/**
    + * Converts CSV string to a sequence of string
    + */
    +private[csv] object UnivocityParser extends Logging {
    --- End diff --
    
    Again naming. At least add csv to the name.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214608175
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214614924
  
    Fixed in https://github.com/apache/spark/commit/f8709218115f6c7aa4fb321865cdef8ceb443bd1



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-221156930
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59174/
    Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207897589
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207902136
  
    **[Test build #55461 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55461/consoleFull)** for PR 12268 at commit [`262f346`](https://github.com/apache/spark/commit/262f3466c67718d8f2648ebb80da3cdc01c0baf1).



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214638048
  
    cc @hvanhovell would you have some time to take a look at this?
    
    @HyukjinKwon most of us are very busy trying to get things out for 2.0 so this one will very likely slip.




[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r61271158
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
    @@ -0,0 +1,189 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import scala.util.control.NonFatal
    +
    +import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
    +
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
    +import org.apache.spark.sql.types.{StructField, StructType}
    +
    +/**
    + * Converts CSV string to a sequence of string
    + */
    +private[csv] object UnivocityParser extends Logging {
    +  /**
    +   * Convert the input iterator to a iterator having [[InternalRow]]
    +   */
    +  def parseCsv(
    +      iter: Iterator[String],
    +      schema: StructType,
    +      requiredSchema: StructType,
    +      headers: Array[String],
    +      shouldDropHeader: Boolean,
    +      options: CSVOptions): Iterator[InternalRow] = {
    +    if (shouldDropHeader) {
    +      CSVUtils.dropHeaderLine(iter, options)
    +    }
    +    val csv = CSVUtils.filterCommentAndEmpty(iter, options)
    +
    +    val schemaFields = schema.fields
    +    val requiredFields = requiredSchema.fields
    +    val safeRequiredFields = if (options.dropMalformed) {
    +      // If `dropMalformed` is enabled, then it needs to parse all the values
    +      // so that we can decide which row is malformed.
    +      requiredFields ++ schemaFields.filterNot(requiredFields.contains(_))
    +    } else {
    +      requiredFields
    +    }
    +    val safeRequiredIndices = new Array[Int](safeRequiredFields.length)
    +    schemaFields.zipWithIndex.filter {
    +      case (field, _) => safeRequiredFields.contains(field)
    +    }.foreach {
    +      case (field, index) => safeRequiredIndices(safeRequiredFields.indexOf(field)) = index
    +    }
    +    val requiredSize = requiredFields.length
    +
    +    tokenizeData(csv, options, headers).flatMap { tokens =>
    +      if (options.dropMalformed && schemaFields.length != tokens.length) {
    +        logWarning(s"Dropping malformed line: ${tokens.mkString(options.delimiter.toString)}")
    +        None
    +      } else if (options.failFast && schemaFields.length != tokens.length) {
    +        throw new RuntimeException(s"Malformed line in FAILFAST mode: " +
    +          s"${tokens.mkString(options.delimiter.toString)}")
    +      } else {
    +        val indexSafeTokens = if (options.permissive && schemaFields.length > tokens.length) {
    +          tokens ++ new Array[String](schemaFields.length - tokens.length)
    +        } else if (options.permissive && schemaFields.length < tokens.length) {
    +          tokens.take(schemaFields.length)
    --- End diff --
    
    Why do we want this? `convertTokens` can't read beyond the `schemaFields.length` right?
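
The token-length handling quoted in the diff above can be sketched in isolation as follows. This is a minimal illustration, not the patch's code: `MalformedLineSketch`, `Modes` and `alignTokens` are hypothetical names, and `Modes` merely stands in for the `dropMalformed`, `failFast` and `permissive` flags that the diff reads from `CSVOptions`. It shows what the truncation branch under discussion does: in permissive mode, short rows are padded with `null` and long rows are cut to the schema length.

```scala
object MalformedLineSketch {
  // Hypothetical stand-in for the three mode flags read from `CSVOptions` in the diff.
  final case class Modes(
      permissive: Boolean = true,
      dropMalformed: Boolean = false,
      failFast: Boolean = false)

  // Returns None when the line is dropped, otherwise the tokens to convert.
  def alignTokens(
      tokens: Array[String],
      schemaLength: Int,
      modes: Modes): Option[Array[String]] = {
    if (modes.dropMalformed && tokens.length != schemaLength) {
      // DROPMALFORMED: skip rows whose token count does not match the schema.
      None
    } else if (modes.failFast && tokens.length != schemaLength) {
      // FAILFAST: abort on the first mismatching row.
      throw new RuntimeException(
        s"Malformed line in FAILFAST mode: ${tokens.mkString(",")}")
    } else if (modes.permissive && tokens.length < schemaLength) {
      // PERMISSIVE, too few tokens: pad the tail with nulls.
      Some(tokens ++ new Array[String](schemaLength - tokens.length))
    } else if (modes.permissive && tokens.length > schemaLength) {
      // PERMISSIVE, too many tokens: keep only the first schemaLength tokens.
      Some(tokens.take(schemaLength))
    } else {
      Some(tokens)
    }
  }
}
```

Whether the truncation branch is needed depends on how the downstream conversion indexes into the tokens, which is exactly the question raised in the comment; the sketch only makes the current behaviour explicit.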



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214625432
  
    **[Test build #56965 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56965/consoleFull)** for PR 12268 at commit [`fe63ba2`](https://github.com/apache/spark/commit/fe63ba22d70c1427657b4967e769270d1956be38).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214939534
  
    **[Test build #57058 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57058/consoleFull)** for PR 12268 at commit [`ee71064`](https://github.com/apache/spark/commit/ee7106416ef17e5168a91bab044c6f6db9dbd53b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class MultivariateGaussian(`
      * `class DecisionTreeClassifier @Since(\"1.4.0\") (`
      * `class GBTClassifier @Since(\"1.4.0\") (`
      * `class RandomForestClassifier @Since(\"1.4.0\") (`
      * `  class AFTSurvivalRegressionWrapperWriter(instance: AFTSurvivalRegressionWrapper)`
      * `  class AFTSurvivalRegressionWrapperReader extends MLReader[AFTSurvivalRegressionWrapper] `
      * `class DecisionTreeRegressor @Since(\"1.4.0\") (@Since(\"1.4.0\") override val uid: String)`
      * `class GBTRegressor @Since(\"1.4.0\") (@Since(\"1.4.0\") override val uid: String)`
      * `class RandomForestRegressor @Since(\"1.4.0\") (@Since(\"1.4.0\") override val uid: String)`
      * `case class CartesianProductExec(`



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214611901
  
    **[Test build #56963 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56963/consoleFull)** for PR 12268 at commit [`f62755e`](https://github.com/apache/spark/commit/f62755e0875ae8f2947abf8a62505dd77b2ed9f5).



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r61258460
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityGenerator.scala ---
    @@ -0,0 +1,80 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import com.univocity.parsers.csv.{CsvWriter, CsvWriterSettings}
    +
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.types.StructType
    +
    +/**
    + * Converts a sequence of string to CSV string
    + */
    +private[csv] object UnivocityGenerator extends Logging {
    +  /**
    +   * Transforms a single InternalRow to CSV using Univocity
    +   *
    +   * @param rowSchema the schema object used for conversion
    +   * @param writer a CsvWriter object
    +   * @param headers headers to write
    +   * @param writeHeader true if it needs to write header
    +   * @param options CSVOptions object containing options
    +   * @param row The row to convert
    +   */
    +  def apply(
    --- End diff --
    
    Could you please use a more descriptive name, e.g. `writeToCsv`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214607369
  
    **[Test build #56959 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56959/consoleFull)** for PR 12268 at commit [`d59c7e9`](https://github.com/apache/spark/commit/d59c7e98f306fa9ff5dfe3b4caae14a2de746315).



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207729065
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207714445
  
    retest this please



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214608169
  
    **[Test build #56959 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56959/consoleFull)** for PR 12268 at commit [`d59c7e9`](https://github.com/apache/spark/commit/d59c7e98f306fa9ff5dfe3b4caae14a2de746315).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `sealed abstract class LDAModel protected[ml] (`
      * `class LocalLDAModel protected[ml] (`
      * `class DistributedLDAModel protected[ml] (`
      * `class ContinuousQueryManager(sparkSession: SparkSession) `
      * `class DataFrameReader protected[sql](sparkSession: SparkSession) extends Logging `
      * `class Dataset[T] protected[sql](`
      * `class QueryExecution(val sparkSession: SparkSession, val logical: LogicalPlan) `
      * `class FileStreamSinkLog(sparkSession: SparkSession, path: String)`
      * `class HDFSMetadataLog[T: ClassTag](sparkSession: SparkSession, path: String)`
      * `class StreamFileCatalog(sparkSession: SparkSession, path: Path) extends FileCatalog with Logging `
      * `case class PlanSubqueries(sparkSession: SparkSession) extends Rule[SparkPlan] `



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-210383858
  
    Could I please cc @falaki who I believe is the original author?



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-208812371
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55599/
    Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-216097282
  
    **[Test build #57498 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57498/consoleFull)** for PR 12268 at commit [`8e1bdf7`](https://github.com/apache/spark/commit/8e1bdf7176296eb9bd10f1249dd951abd0094191).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-216091185
  
    **[Test build #57498 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57498/consoleFull)** for PR 12268 at commit [`8e1bdf7`](https://github.com/apache/spark/commit/8e1bdf7176296eb9bd10f1249dd951abd0094191).



[GitHub] spark issue #12268: [SPARK-14480][SQL] Simplify CSV parsing process with a b...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12268
  
    **[Test build #60751 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60751/consoleFull)** for PR 12268 at commit [`7abdfc1`](https://github.com/apache/spark/commit/7abdfc111166f2bf275fc4318c0ffe8836dcbb70).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214625562
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r59180459
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVParserSuite.scala ---
    @@ -1,125 +0,0 @@
    -/*
    - * Licensed to the Apache Software Foundation (ASF) under one or more
    - * contributor license agreements.  See the NOTICE file distributed with
    - * this work for additional information regarding copyright ownership.
    - * The ASF licenses this file to You under the Apache License, Version 2.0
    - * (the "License"); you may not use this file except in compliance with
    - * the License.  You may obtain a copy of the License at
    - *
    - *    http://www.apache.org/licenses/LICENSE-2.0
    - *
    - * Unless required by applicable law or agreed to in writing, software
    - * distributed under the License is distributed on an "AS IS" BASIS,
    - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    - * See the License for the specific language governing permissions and
    - * limitations under the License.
    - */
    -
    -package org.apache.spark.sql.execution.datasources.csv
    -
    -import org.apache.spark.SparkFunSuite
    -
    -/**
    - * test cases for StringIteratorReader
    - */
    -class CSVParserSuite extends SparkFunSuite {
    --- End diff --
    
    Actually, I was a bit unsure about the scope of the tests. `CSVParserSuite` was testing the conversion from `Iterator` to `Reader`, which no longer exists. So I was thinking of adding tests for `UnivocityParser` and `UnivocityGenerator`, since they are new classes anyway.
    
    But while trying to add some more test code, I realised that it ends up being an end-to-end test, as below:
    ```scala
    test("Parse csv data correctly with univocity parser") {
      val cars = sparkContext.textFile(testFile("cars.csv"))
      val schema = StructType(
        StructField("year", IntegerType) ::
        StructField("make", StringType) ::
        StructField("model", StringType) ::
        StructField("comment", StringType) ::
        StructField("blank", StringType) :: Nil)
    
      val requiredSchema = StructType(
        StructField("year", IntegerType) ::
        StructField("make", StringType) ::
        StructField("model", StringType) :: Nil)
    
      val headers = schema.fields.map(_.name)
      val options = new CSVOptions(Map.empty)
      val filteredCars = cars.filter(_.nonEmpty)
      val firstLine = filteredCars.first()
      val dropHeaderCars = cars.filter(_ != firstLine)
    
      val parsedCars = UnivocityParser.parse(
        dropHeaderCars,
        schema,
        requiredSchema,
        headers,
        options).collect()
    
      val expectedCars = Seq(
        new GenericMutableRow(
          Array(2012, UTF8String.fromString("Tesla"), UTF8String.fromString("S"))),
        new GenericMutableRow(
          Array(1997, UTF8String.fromString("Ford"), UTF8String.fromString("E350"))),
        new GenericMutableRow(
          Array(2015, UTF8String.fromString("Chevy"), UTF8String.fromString("Volt"))))
    
      assert(parsedCars === expectedCars)
    }
    ```
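    
    The header handling in the test above can also be exercised on a plain `Iterator[String]`, without a `SparkContext`. The following is only a minimal sketch of that idea: `HeaderDrop` and `dropHeader` are hypothetical names, not part of this PR, and unlike the `filter(_ != firstLine)` call above it removes only the first non-empty line and also skips blank lines.
    
    ```scala
    object HeaderDrop {
      // Hypothetical helper (not from the PR): skips blank lines, treats the
      // first non-empty line as the header, and drops that single line.
      def dropHeader(lines: Iterator[String]): Iterator[String] = {
        val nonEmpty = lines.filter(_.trim.nonEmpty)
        // Consume the header if there is one; the rest of the iterator is data.
        if (nonEmpty.hasNext) nonEmpty.next()
        nonEmpty
      }
    }
    ```
    
    For example, `HeaderDrop.dropHeader(Iterator("year,make", "", "2012,Tesla")).toList` yields only the data line.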



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207913568
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55461/
    Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r61359944
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
    @@ -0,0 +1,189 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import scala.util.control.NonFatal
    +
    +import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
    +
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
    +import org.apache.spark.sql.types.{StructField, StructType}
    +
    +/**
    + * Converts CSV string to a sequence of string
    + */
    +private[csv] object UnivocityParser extends Logging {
    +  /**
    +   * Convert the input iterator to a iterator having [[InternalRow]]
    +   */
    +  def parseCsv(
    +      iter: Iterator[String],
    +      schema: StructType,
    +      requiredSchema: StructType,
    +      headers: Array[String],
    +      shouldDropHeader: Boolean,
    +      options: CSVOptions): Iterator[InternalRow] = {
    +    if (shouldDropHeader) {
    +      CSVUtils.dropHeaderLine(iter, options)
    +    }
    +    val csv = CSVUtils.filterCommentAndEmpty(iter, options)
    +
    +    val schemaFields = schema.fields
    +    val requiredFields = requiredSchema.fields
    +    val safeRequiredFields = if (options.dropMalformed) {
    +      // If `dropMalformed` is enabled, then it needs to parse all the values
    +      // so that we can decide which row is malformed.
    +      requiredFields ++ schemaFields.filterNot(requiredFields.contains(_))
    +    } else {
    +      requiredFields
    +    }
    +    val safeRequiredIndices = new Array[Int](safeRequiredFields.length)
    +    schemaFields.zipWithIndex.filter {
    +      case (field, _) => safeRequiredFields.contains(field)
    +    }.foreach {
    +      case (field, index) => safeRequiredIndices(safeRequiredFields.indexOf(field)) = index
    +    }
    +    val requiredSize = requiredFields.length
    +
    +    tokenizeData(csv, options, headers).flatMap { tokens =>
    +      if (options.dropMalformed && schemaFields.length != tokens.length) {
    +        logWarning(s"Dropping malformed line: ${tokens.mkString(options.delimiter.toString)}")
    +        None
    +      } else if (options.failFast && schemaFields.length != tokens.length) {
    +        throw new RuntimeException(s"Malformed line in FAILFAST mode: " +
    +          s"${tokens.mkString(options.delimiter.toString)}")
    +      } else {
    +        val indexSafeTokens = if (options.permissive && schemaFields.length > tokens.length) {
    +          tokens ++ new Array[String](schemaFields.length - tokens.length)
    +        } else if (options.permissive && schemaFields.length < tokens.length) {
    +          tokens.take(schemaFields.length)
    --- End diff --
    
    Oh, I haven't tested this yet. I am sure it will work without this logic anyway, but I think it is safer to slice the tokens here.
    
    The size of `tokens` can be larger than that of `schemaFields`. I can remove this logic if you feel strongly that it is odd, but I think it is okay to leave it in.
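    
    To make the length handling in the diff above easier to follow, here is a minimal standalone sketch of the same branching. It is an illustration only: `TokenNormalization`, `ParseMode` and `normalizeTokens` are hypothetical names, the modes stand in for `options.dropMalformed`, `options.failFast` and `options.permissive`, and the delimiter is assumed to be a comma.
    
    ```scala
    object TokenNormalization {
      // Hypothetical stand-ins for the three parse modes; not Spark's CSVOptions.
      sealed trait ParseMode
      case object Permissive extends ParseMode
      case object DropMalformed extends ParseMode
      case object FailFast extends ParseMode
    
      // Returns None when the line is dropped, otherwise exactly `schemaLength` tokens.
      def normalizeTokens(
          tokens: Array[String],
          schemaLength: Int,
          mode: ParseMode): Option[Array[String]] = {
        if (tokens.length == schemaLength) {
          Some(tokens)
        } else {
          mode match {
            case DropMalformed => None
            case FailFast =>
              throw new RuntimeException(
                s"Malformed line in FAILFAST mode: ${tokens.mkString(",")}")
            case Permissive if tokens.length < schemaLength =>
              // Too few tokens: pad the missing trailing columns with nulls.
              Some(tokens ++ new Array[String](schemaLength - tokens.length))
            case Permissive =>
              // Too many tokens: slice off the extras (the step discussed here).
              Some(tokens.take(schemaLength))
          }
        }
      }
    }
    ```
    
    With this shape, every row that survives has exactly as many tokens as the schema has fields, so later index lookups cannot go out of bounds.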



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-221156929
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #12268: [SPARK-14480][SQL] Simplify CSV parsing process with a b...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12268
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60751/
    Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207694215
  
    **[Test build #55415 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55415/consoleFull)** for PR 12268 at commit [`b5a9669`](https://github.com/apache/spark/commit/b5a966962845011d6d56bb45d704d83eb5e06e38).



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207694169
  
    cc @rxin 



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214624209
  
    **[Test build #56955 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56955/consoleFull)** for PR 12268 at commit [`ad21b8e`](https://github.com/apache/spark/commit/ad21b8eea981f61cb35de646f3568b27dd2141a3).
     * This patch passes all tests.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207909391
  
    **[Test build #55459 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55459/consoleFull)** for PR 12268 at commit [`55596e1`](https://github.com/apache/spark/commit/55596e1aeb5a1a4bcbafc24075146c1f94ac6daf).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207913567
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-215275474
  
    @hvanhovell If you think it makes sense, I will change the title of this PR and the JIRA, and add some more commits to deal with minor things (code style, etc.).



[GitHub] spark issue #12268: [SPARK-14480][SQL] Simplify CSV parsing process with a b...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/12268
  
    Maybe close this one for now?




[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-216094838
  
    **[Test build #57496 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57496/consoleFull)** for PR 12268 at commit [`bd510c2`](https://github.com/apache/spark/commit/bd510c2b309f1da0099205838dd7856737c8ab61).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207897824
  
    Could I maybe cc @liancheng and @cloud-fan to review? This resembles the structure of the JSON data source, so the class structure and the inputs/outputs of the methods are consistent with the JSON data source.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214613793
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r61358548
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityGenerator.scala ---
    @@ -0,0 +1,80 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import com.univocity.parsers.csv.{CsvWriter, CsvWriterSettings}
    +
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.types.StructType
    +
    +/**
    + * Converts a sequence of string to CSV string
    + */
    +private[csv] object UnivocityGenerator extends Logging {
    +  /**
    +   * Transforms a single InternalRow to CSV using Univocity
    +   *
    +   * @param rowSchema the schema object used for conversion
    +   * @param writer a CsvWriter object
    +   * @param headers headers to write
    +   * @param writeHeader true if it needs to write header
    +   * @param options CSVOptions object containing options
    +   * @param row The row to convert
    +   */
    +  def apply(
    --- End diff --
    
    The name was also taken from the JSON data source's `JacksonGenerator`.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-215274872
  
    @hvanhovell Thank you for the close look! I think I need to change the title of this PR and of the JIRA, because "better performance" might be too broad.
    
    The main purposes of this PR were:
     - Refactor the CSV data source to be consistent with the JSON data source.
     - Remove the unnecessary conversion from `Iterator` to `Reader`.
    
    Could I open separate JIRAs and follow-up PRs to address your comments, if that makes sense?
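The second point above can be illustrated with a minimal, standard-library-only sketch. Everything here is hypothetical: `LinesReader` is a stand-in written for this example (it is not the `StringIteratorReader` from the PR), and `tokenize` is a toy splitter with no quoting or escaping, not univocity. The sketch only shows that, when the input is already an `Iterator[String]` of lines, wrapping it in a `Reader` adds a copying layer without changing the result.

```scala
import java.io.{BufferedReader, Reader}
import java.util.regex.Pattern

object IteratorVsReader {
  // Hypothetical Reader over an Iterator[String]; illustration only.
  final class LinesReader(lines: Iterator[String]) extends Reader {
    private var current: String = ""
    private var pos: Int = 0

    override def read(buf: Array[Char], off: Int, len: Int): Int = {
      // Refill from the iterator once the current line is exhausted.
      while (pos >= current.length && lines.hasNext) {
        current = lines.next() + "\n"
        pos = 0
      }
      if (pos >= current.length) {
        -1 // end of input
      } else {
        val n = math.min(len, current.length - pos)
        current.getChars(pos, pos + n, buf, off)
        pos += n
        n
      }
    }

    override def close(): Unit = ()
  }

  // Toy tokenizer: splits on the delimiter only, keeping empty fields.
  def tokenize(line: String, delimiter: Char = ','): Array[String] =
    line.split(Pattern.quote(delimiter.toString), -1)

  // Path 1: Iterator -> Reader -> lines again -> tokens.
  def viaReader(lines: Iterator[String]): List[Array[String]] = {
    val reader = new BufferedReader(new LinesReader(lines))
    Iterator.continually(reader.readLine())
      .takeWhile(_ != null)
      .map(tokenize(_))
      .toList
  }

  // Path 2: tokens straight from the iterator, one line at a time.
  def viaIterator(lines: Iterator[String]): List[Array[String]] =
    lines.map(tokenize(_)).toList
}
```

Both paths yield the same tokens for single-line records. The sketch assumes that no record contains an embedded newline; it says nothing about the relative speed of the two paths, which is what the benchmark mentioned in the thread is for.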



[GitHub] spark issue #12268: [SPARK-14480][SQL] Simplify CSV parsing process with a b...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/12268
  
    @rxin I see. Actually, I added a benchmark for this change in the JIRA.
    So, would it be okay if I proceed as below?
    
    1. Get rid of `StringIteratorReader`.
    2. Refactor (maybe with a follow-up adding tests for `JacksonParser`, `JacksonGenerator`, `UnivocityParser` and `UnivocityGenerator`).
    3. Use the Record API from univocity and compare the performance (with a proper benchmark).
    4. Maybe other fixes addressing @hvanhovell's comments (maybe with a benchmark).
    
    If that makes sense, I would appreciate it if I could create an umbrella JIRA with those four as sub-tasks and be assigned to them.




[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-215907300
  
    @rxin, @hvanhovell May I ask for your thoughts on this?



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214615366
  
    **[Test build #56965 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56965/consoleFull)** for PR 12268 at commit [`fe63ba2`](https://github.com/apache/spark/commit/fe63ba22d70c1427657b4967e769270d1956be38).



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-212227633
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56309/
    Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214614770
  
    @rxin It looks like this is still failing:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56962
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56963




[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214624337
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56955/
    Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207714578
  
    **[Test build #55420 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55420/consoleFull)** for PR 12268 at commit [`b5a9669`](https://github.com/apache/spark/commit/b5a966962845011d6d56bb45d704d83eb5e06e38).



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207706375
  
    **[Test build #55415 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55415/consoleFull)** for PR 12268 at commit [`b5a9669`](https://github.com/apache/spark/commit/b5a966962845011d6d56bb45d704d83eb5e06e38).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-208810922
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55597/
    Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-212227517
  
    **[Test build #56309 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56309/consoleFull)** for PR 12268 at commit [`d9ea3cb`](https://github.com/apache/spark/commit/d9ea3cb5ccb8db5d8ff9e36fa1e8d4df45ea4fb2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-208812044
  
    **[Test build #55599 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55599/consoleFull)** for PR 12268 at commit [`5a95276`](https://github.com/apache/spark/commit/5a9527656e0484fb2840e2938c90f8d997035742).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r61366691
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
    @@ -0,0 +1,189 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import scala.util.control.NonFatal
    +
    +import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
    +
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
    +import org.apache.spark.sql.types.{StructField, StructType}
    +
    +/**
    + * Converts CSV string to a sequence of string
    + */
    +private[csv] object UnivocityParser extends Logging {
    +  /**
    +   * Convert the input iterator to a iterator having [[InternalRow]]
    +   */
    +  def parseCsv(
    +      iter: Iterator[String],
    +      schema: StructType,
    +      requiredSchema: StructType,
    +      headers: Array[String],
    +      shouldDropHeader: Boolean,
    +      options: CSVOptions): Iterator[InternalRow] = {
    +    if (shouldDropHeader) {
    +      CSVUtils.dropHeaderLine(iter, options)
    +    }
    +    val csv = CSVUtils.filterCommentAndEmpty(iter, options)
    +
    +    val schemaFields = schema.fields
    +    val requiredFields = requiredSchema.fields
    +    val safeRequiredFields = if (options.dropMalformed) {
    +      // If `dropMalformed` is enabled, then it needs to parse all the values
    +      // so that we can decide which row is malformed.
    +      requiredFields ++ schemaFields.filterNot(requiredFields.contains(_))
    +    } else {
    +      requiredFields
    +    }
    +    val safeRequiredIndices = new Array[Int](safeRequiredFields.length)
    +    schemaFields.zipWithIndex.filter {
    +      case (field, _) => safeRequiredFields.contains(field)
    +    }.foreach {
    +      case (field, index) => safeRequiredIndices(safeRequiredFields.indexOf(field)) = index
    +    }
    +    val requiredSize = requiredFields.length
    +
    +    tokenizeData(csv, options, headers).flatMap { tokens =>
    +      if (options.dropMalformed && schemaFields.length != tokens.length) {
    +        logWarning(s"Dropping malformed line: ${tokens.mkString(options.delimiter.toString)}")
    +        None
    +      } else if (options.failFast && schemaFields.length != tokens.length) {
    +        throw new RuntimeException(s"Malformed line in FAILFAST mode: " +
    +          s"${tokens.mkString(options.delimiter.toString)}")
    +      } else {
    +        val indexSafeTokens = if (options.permissive && schemaFields.length > tokens.length) {
    +          tokens ++ new Array[String](schemaFields.length - tokens.length)
    +        } else if (options.permissive && schemaFields.length < tokens.length) {
    +          tokens.take(schemaFields.length)
    +        } else {
    +          tokens
    +        }
    +        try {
    +          val row = convertTokens(
    +            indexSafeTokens,
    +            safeRequiredIndices,
    +            schemaFields,
    +            requiredSize,
    +            options)
    +          Some(row)
    +        } catch {
    +          case NonFatal(e) if options.dropMalformed =>
    +            logWarning("Parse exception. " +
    +              s"Dropping malformed line: ${tokens.mkString(options.delimiter.toString)}")
    +            None
    +        }
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Convert the tokens to [[InternalRow]]
    +   */
    +  private def convertTokens(
    +      tokens: Array[String],
    +      requiredIndices: Array[Int],
    +      schemaFields: Array[StructField],
    +      requiredSize: Int,
    +      options: CSVOptions): InternalRow = {
    +    val row = new GenericMutableRow(requiredSize)
    --- End diff --
    
    Oh yes! I noticed this too. The JSON data source does this as well, as far as I remember. This definitely has to be changed. I think I could do this in another PR.
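The length checks in the quoted `parseCsv` diff can be sketched in isolation as follows. This is a simplified illustration, not the code in the PR: the `Mode` type and the `normalize` helper are hypothetical names introduced here, whereas the diff itself reads boolean flags (`permissive`, `dropMalformed`, `failFast`) from `CSVOptions` and goes on to convert the tokens into an `InternalRow`.

```scala
object MalformedModes {
  // Hypothetical ADT standing in for the boolean flags on CSVOptions.
  sealed trait Mode
  case object Permissive extends Mode
  case object DropMalformed extends Mode
  case object FailFast extends Mode

  // Returns Some(tokens sized to the schema), or None when the line is dropped.
  def normalize(
      tokens: Array[String],
      schemaSize: Int,
      mode: Mode): Option[Array[String]] = {
    if (tokens.length == schemaSize) {
      Some(tokens)
    } else {
      mode match {
        case DropMalformed =>
          None
        case FailFast =>
          throw new RuntimeException(
            s"Malformed line in FAILFAST mode: ${tokens.mkString(",")}")
        case Permissive if tokens.length < schemaSize =>
          // Too few tokens: pad the missing trailing columns with nulls.
          Some(tokens ++ new Array[String](schemaSize - tokens.length))
        case Permissive =>
          // Too many tokens: keep only as many as the schema has fields.
          Some(tokens.take(schemaSize))
      }
    }
  }
}
```

For example, with a three-column schema, `normalize(Array("a"), 3, Permissive)` pads the row to three entries, while the same input in `DropMalformed` mode yields `None`.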



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-208768839
  
    **[Test build #55599 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55599/consoleFull)** for PR 12268 at commit [`5a95276`](https://github.com/apache/spark/commit/5a9527656e0484fb2840e2938c90f8d997035742).



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-217075074
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57834/
    Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-208810916
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207729042
  
    **[Test build #55420 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55420/consoleFull)** for PR 12268 at commit [`b5a9669`](https://github.com/apache/spark/commit/b5a966962845011d6d56bb45d704d83eb5e06e38).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207897590
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55457/
    Test FAILed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r61358445
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/InferSchema.scala ---
    @@ -30,22 +30,37 @@ import org.apache.spark.sql.catalyst.util.DateTimeUtils
     import org.apache.spark.sql.types._
     import org.apache.spark.unsafe.types.UTF8String
     
    -private[csv] object CSVInferSchema {
    +private[csv] object InferSchema {
     
       /**
        * Similar to the JSON schema inference
        *     1. Infer type of each row
        *     2. Merge row types to find common type
        *     3. Replace any null types with string type
        */
    -  def infer(
    -      tokenRdd: RDD[Array[String]],
    -      header: Array[String],
    -      nullValue: String = ""): StructType = {
    +  def infer(csv: RDD[String], options: CSVOptions): StructType = {
    --- End diff --
    
    Actually, `DefaultSource.inferSchema` does call this method. I intentionally gave it the same structure as `JSONRelation`. The JSON data source also has a class with the same name and the same method, so that issues can easily be fixed in both together in the future. (Actually, the main reason for this refactoring is that the two structures are inconsistent even though they could be almost identical.)
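The three inference steps listed in the doc comment of the quoted diff (infer the type of each row, merge row types, replace null types with string) can be sketched without Spark. This is a deliberately reduced illustration: `ColType` and its four cases are hypothetical stand-ins for Spark's `DataType` hierarchy, and the real `InferSchema` works on an `RDD[String]` and handles more types.

```scala
import scala.util.Try

object InferSketch {
  // Hypothetical, reduced type lattice; Spark's real one is larger.
  sealed trait ColType
  case object NullT extends ColType
  case object IntT extends ColType
  case object DoubleT extends ColType
  case object StringT extends ColType

  // Step 1: infer the type of a single field.
  def inferField(value: String, nullValue: String = ""): ColType =
    if (value == null || value == nullValue) NullT
    else if (Try(value.toInt).isSuccess) IntT
    else if (Try(value.toDouble).isSuccess) DoubleT
    else StringT

  // Step 2: merge two types into their common type.
  def merge(a: ColType, b: ColType): ColType = (a, b) match {
    case (x, y) if x == y => x
    case (NullT, t) => t
    case (t, NullT) => t
    case (IntT, DoubleT) | (DoubleT, IntT) => DoubleT
    case _ => StringT
  }

  def infer(rows: Seq[Array[String]], width: Int): Seq[ColType] = {
    val start: Seq[ColType] = Seq.fill(width)(NullT)
    val merged = rows.foldLeft(start) { (acc, row) =>
      // Short rows are padded with nulls so every column is visited.
      val cells = row.toSeq.padTo(width, null: String)
      acc.zip(cells).map { case (t, v) => merge(t, inferField(v)) }
    }
    // Step 3: columns that never saw a value fall back to string.
    merged.map(t => if (t == NullT) StringT else t)
  }
}
```

Keeping the per-field inference, the merge, and the final null replacement as separate functions is also what makes it practical to share fixes between the CSV and JSON implementations, which is the motivation given in the comment above.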



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207897587
  
    **[Test build #55457 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55457/consoleFull)** for PR 12268 at commit [`f65592b`](https://github.com/apache/spark/commit/f65592b6026f9b346b12e70faf4365e55c22d6e6).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-217075071
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-208765993
  
    **[Test build #55597 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55597/consoleFull)** for PR 12268 at commit [`9015317`](https://github.com/apache/spark/commit/90153175b32ab92c962d782ecbccbc6c5ea02dd7).



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-210971265
  
    Please excuse my pings, @cloud-fan , @rxin , @falaki , @yhuai 



[GitHub] spark pull request #12268: [SPARK-14480][SQL] Simplify CSV parsing process w...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon closed the pull request at:

    https://github.com/apache/spark/pull/12268



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207737596
  
    **[Test build #55428 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55428/consoleFull)** for PR 12268 at commit [`b5a9669`](https://github.com/apache/spark/commit/b5a966962845011d6d56bb45d704d83eb5e06e38).



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r61271578
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
    @@ -0,0 +1,189 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import scala.util.control.NonFatal
    +
    +import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
    +
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
    +import org.apache.spark.sql.types.{StructField, StructType}
    +
    +/**
    + * Converts CSV string to a sequence of string
    + */
    +private[csv] object UnivocityParser extends Logging {
    +  /**
    +   * Convert the input iterator to a iterator having [[InternalRow]]
    +   */
    +  def parseCsv(
    +      iter: Iterator[String],
    +      schema: StructType,
    +      requiredSchema: StructType,
    +      headers: Array[String],
    +      shouldDropHeader: Boolean,
    +      options: CSVOptions): Iterator[InternalRow] = {
    +    if (shouldDropHeader) {
    +      CSVUtils.dropHeaderLine(iter, options)
    +    }
    +    val csv = CSVUtils.filterCommentAndEmpty(iter, options)
    +
    +    val schemaFields = schema.fields
    +    val requiredFields = requiredSchema.fields
    +    val safeRequiredFields = if (options.dropMalformed) {
    +      // If `dropMalformed` is enabled, then it needs to parse all the values
    +      // so that we can decide which row is malformed.
    +      requiredFields ++ schemaFields.filterNot(requiredFields.contains(_))
    +    } else {
    +      requiredFields
    +    }
    +    val safeRequiredIndices = new Array[Int](safeRequiredFields.length)
    +    schemaFields.zipWithIndex.filter {
    +      case (field, _) => safeRequiredFields.contains(field)
    +    }.foreach {
    +      case (field, index) => safeRequiredIndices(safeRequiredFields.indexOf(field)) = index
    +    }
    +    val requiredSize = requiredFields.length
    +
    +    tokenizeData(csv, options, headers).flatMap { tokens =>
    +      if (options.dropMalformed && schemaFields.length != tokens.length) {
    +        logWarning(s"Dropping malformed line: ${tokens.mkString(options.delimiter.toString)}")
    +        None
    +      } else if (options.failFast && schemaFields.length != tokens.length) {
    +        throw new RuntimeException(s"Malformed line in FAILFAST mode: " +
    +          s"${tokens.mkString(options.delimiter.toString)}")
    +      } else {
    +        val indexSafeTokens = if (options.permissive && schemaFields.length > tokens.length) {
    +          tokens ++ new Array[String](schemaFields.length - tokens.length)
    +        } else if (options.permissive && schemaFields.length < tokens.length) {
    +          tokens.take(schemaFields.length)
    +        } else {
    +          tokens
    +        }
    +        try {
    +          val row = convertTokens(
    +            indexSafeTokens,
    +            safeRequiredIndices,
    +            schemaFields,
    +            requiredSize,
    +            options)
    +          Some(row)
    +        } catch {
    +          case NonFatal(e) if options.dropMalformed =>
    +            logWarning("Parse exception. " +
    +              s"Dropping malformed line: ${tokens.mkString(options.delimiter.toString)}")
    +            None
    +        }
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Convert the tokens to [[InternalRow]]
    +   */
    +  private def convertTokens(
    +      tokens: Array[String],
    +      requiredIndices: Array[Int],
    +      schemaFields: Array[StructField],
    +      requiredSize: Int,
    +      options: CSVOptions): InternalRow = {
    +    val row = new GenericMutableRow(requiredSize)
    --- End diff --
    
    I am not sure about datasources, but in a lot of places within SparkSQL we just update a single row and return that over and over.
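
The row-reuse pattern mentioned in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions: `SketchRow`, `RowReuseSketch` and `convertAll` are hypothetical names, and `SketchRow` is a stand-in for a mutable row, not Spark's actual row class. One row is allocated up front and overwritten for every record, which avoids a per-record allocation but means a consumer that buffers rows has to copy them first.

```scala
// Hypothetical stand-in for a mutable row; not Spark's actual row class.
final class SketchRow(size: Int) {
  private val values = new Array[Any](size)
  def update(i: Int, v: Any): Unit = values(i) = v
  def apply(i: Int): Any = values(i)
  // Returns an independent snapshot of the current contents.
  def copy(): SketchRow = {
    val r = new SketchRow(size)
    var i = 0
    while (i < size) { r(i) = values(i); i += 1 }
    r
  }
  def toList: List[Any] = values.toList
}

object RowReuseSketch {
  // Allocates a single row and overwrites it for each input record.
  // Every element of the returned iterator is the same instance.
  def convertAll(tokens: Iterator[Array[String]], width: Int): Iterator[SketchRow] = {
    val row = new SketchRow(width)
    tokens.map { ts =>
      var i = 0
      while (i < width) { row(i) = ts(i); i += 1 }
      row
    }
  }
}
```

A consumer that handles one row at a time can use the iterator directly. One that collects rows would call `copy()` on each element, for example `RowReuseSketch.convertAll(input, 2).map(_.copy()).toList`; without the copy, every collected element aliases the last record.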



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r61265122
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityGenerator.scala ---
    @@ -0,0 +1,80 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import com.univocity.parsers.csv.{CsvWriter, CsvWriterSettings}
    +
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.types.StructType
    +
    +/**
    + * Converts a sequence of string to CSV string
    + */
    +private[csv] object UnivocityGenerator extends Logging {
    +  /**
    +   * Transforms a single InternalRow to CSV using Univocity
    +   *
    +   * @param rowSchema the schema object used for conversion
    +   * @param writer a CsvWriter object
    +   * @param headers headers to write
    +   * @param writeHeader true if it needs to write header
    +   * @param options CSVOptions object containing options
    +   * @param row The row to convert
    +   */
    +  def apply(
    +      rowSchema: StructType,
    +      writer: CsvWriter,
    +      headers: Array[String],
    +      writeHeader: Boolean,
    +      options: CSVOptions)(row: InternalRow): Unit = {
    +    val tokens = {
    +      row.toSeq(rowSchema).map { field =>
    --- End diff --
    
    You are calling this a lot, right? So it might be better not to rely on functional constructs here. Also take a look at the `InternalRow.toSeq` method; there might be some room for improvement if you just pass in the `DataType`s directly.
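
As a rough sketch of the suggestion above, the two styles can be compared side by side. Everything here is an assumption made for illustration: `HotPathSketch`, `tokensFunctional` and `tokensLoop` are hypothetical names, and a plain `Array[Any]` stands in for the values of a row. It is not the code from this pull request.

```scala
object HotPathSketch {
  // Functional version: concise, but invokes a closure for every field.
  def tokensFunctional(values: Array[Any], nullValue: String): Array[String] =
    values.map(v => if (v == null) nullValue else v.toString)

  // Imperative version: one preallocated output array and a plain index loop.
  def tokensLoop(values: Array[Any], nullValue: String): Array[String] = {
    val out = new Array[String](values.length)
    var i = 0
    while (i < values.length) {
      val v = values(i)
      out(i) = if (v == null) nullValue else v.toString
      i += 1
    }
    out
  }
}
```

Both functions return the same tokens. Whether the loop is measurably faster depends on the Scala version and the JIT, so it would be worth benchmarking before committing to the less readable form.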



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r61258326
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityGenerator.scala ---
    @@ -0,0 +1,80 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import com.univocity.parsers.csv.{CsvWriter, CsvWriterSettings}
    +
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.types.StructType
    +
    +/**
    + * Converts a sequence of string to CSV string
    + */
    +private[csv] object UnivocityGenerator extends Logging {
    --- End diff --
    
    Are we ever going to use a different generator? Why not call it `CsvGenerator`?



[GitHub] spark issue #12268: [SPARK-14480][SQL] Simplify CSV parsing process with a b...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/12268
  
    Sure.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207750669
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r61258077
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/InferSchema.scala ---
    @@ -30,22 +30,37 @@ import org.apache.spark.sql.catalyst.util.DateTimeUtils
     import org.apache.spark.sql.types._
     import org.apache.spark.unsafe.types.UTF8String
     
    -private[csv] object CSVInferSchema {
    +private[csv] object InferSchema {
     
       /**
        * Similar to the JSON schema inference
        *     1. Infer type of each row
        *     2. Merge row types to find common type
        *     3. Replace any null types with string type
        */
    -  def infer(
    -      tokenRdd: RDD[Array[String]],
    -      header: Array[String],
    -      nullValue: String = ""): StructType = {
    +  def infer(csv: RDD[String], options: CSVOptions): StructType = {
    --- End diff --
    
    This looks very similar to `DefaultSource.inferSchema`. Why not move the common functionality into a single method?



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r59334349
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala ---
    @@ -17,153 +17,197 @@
     
     package org.apache.spark.sql.execution.datasources.csv
     
    -import scala.util.control.NonFatal
    -
    -import org.apache.hadoop.fs.Path
    -import org.apache.hadoop.io.{NullWritable, Text}
    -import org.apache.hadoop.mapreduce.RecordWriter
    -import org.apache.hadoop.mapreduce.TaskAttemptContext
    +import java.io.CharArrayWriter
    +import java.nio.charset.{Charset, StandardCharsets}
    +
    +import com.univocity.parsers.csv.CsvWriter
    +import org.apache.hadoop.conf.Configuration
    +import org.apache.hadoop.fs.{FileStatus, Path}
    +import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
    +import org.apache.hadoop.mapred.TextInputFormat
    +import org.apache.hadoop.mapreduce.{Job, RecordWriter, TaskAttemptContext}
     import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
     
    +import org.apache.spark.broadcast.Broadcast
     import org.apache.spark.internal.Logging
     import org.apache.spark.rdd.RDD
     import org.apache.spark.sql._
     import org.apache.spark.sql.catalyst.InternalRow
    -import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
    -import org.apache.spark.sql.execution.datasources.PartitionedFile
    +import org.apache.spark.sql.catalyst.expressions.{JoinedRow, UnsafeProjection}
    +import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
    +import org.apache.spark.sql.execution.datasources.{CompressionCodecs, HadoopFileLinesReader, PartitionedFile}
     import org.apache.spark.sql.sources._
     import org.apache.spark.sql.types._
    +import org.apache.spark.util.SerializableConfiguration
    +import org.apache.spark.util.collection.BitSet
     
    -object CSVRelation extends Logging {
    -
    -  def univocityTokenizer(
    -      file: RDD[String],
    -      header: Seq[String],
    -      firstLine: String,
    -      params: CSVOptions): RDD[Array[String]] = {
    -    // If header is set, make sure firstLine is materialized before sending to executors.
    -    file.mapPartitions { iter =>
    -      new BulkCsvReader(
    -        if (params.headerFlag) iter.filterNot(_ == firstLine) else iter,
    -        params,
    -        headers = header)
    -    }
    -  }
    +/**
    + * Provides access to CSV data from pure SQL statements.
    + */
    +class DefaultSource extends FileFormat with DataSourceRegister {
    +
    +  override def shortName(): String = "csv"
    +
    +  override def toString: String = "CSV"
    +
    +  override def equals(other: Any): Boolean = other.isInstanceOf[DefaultSource]
    +
    +  override def inferSchema(
    +      sqlContext: SQLContext,
    +      options: Map[String, String],
    +      files: Seq[FileStatus]): Option[StructType] = {
    +    val csvOptions = new CSVOptions(options)
     
    -  def csvParser(
    -      schema: StructType,
    -      requiredColumns: Array[String],
    -      params: CSVOptions): Array[String] => Option[InternalRow] = {
    -    val schemaFields = schema.fields
    -    val requiredFields = StructType(requiredColumns.map(schema(_))).fields
    -    val safeRequiredFields = if (params.dropMalformed) {
    -      // If `dropMalformed` is enabled, then it needs to parse all the values
    -      // so that we can decide which row is malformed.
    -      requiredFields ++ schemaFields.filterNot(requiredFields.contains(_))
    +    // TODO: Move filtering.
    --- End diff --
    
    OK. It might be worthwhile to figure out what this TODO means and clarify it while we are changing the code around it anyway, but I can understand just leaving it as is.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-217603358
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r61358822
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala ---
    @@ -17,152 +17,162 @@
     
     package org.apache.spark.sql.execution.datasources.csv
     
    -import scala.util.control.NonFatal
    -
    -import org.apache.hadoop.fs.Path
    -import org.apache.hadoop.io.{NullWritable, Text}
    -import org.apache.hadoop.mapreduce.RecordWriter
    -import org.apache.hadoop.mapreduce.TaskAttemptContext
    +import java.io.CharArrayWriter
    +import java.nio.charset.{Charset, StandardCharsets}
    +
    +import com.univocity.parsers.csv.CsvWriter
    +import org.apache.hadoop.conf.Configuration
    +import org.apache.hadoop.fs.{FileStatus, Path}
    +import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
    +import org.apache.hadoop.mapred.TextInputFormat
    +import org.apache.hadoop.mapreduce.{Job, RecordWriter, TaskAttemptContext}
     import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
     
     import org.apache.spark.internal.Logging
     import org.apache.spark.rdd.RDD
     import org.apache.spark.sql._
     import org.apache.spark.sql.catalyst.InternalRow
    -import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
    -import org.apache.spark.sql.execution.datasources.{OutputWriter, OutputWriterFactory, PartitionedFile}
    +import org.apache.spark.sql.catalyst.expressions.JoinedRow
    +import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
    +import org.apache.spark.sql.execution.datasources._
    +import org.apache.spark.sql.sources._
     import org.apache.spark.sql.types._
    +import org.apache.spark.util.SerializableConfiguration
     
    -object CSVRelation extends Logging {
    -
    -  def univocityTokenizer(
    -      file: RDD[String],
    -      header: Seq[String],
    -      firstLine: String,
    -      params: CSVOptions): RDD[Array[String]] = {
    -    // If header is set, make sure firstLine is materialized before sending to executors.
    -    file.mapPartitions { iter =>
    -      new BulkCsvReader(
    -        if (params.headerFlag) iter.filterNot(_ == firstLine) else iter,
    -        params,
    -        headers = header)
    -    }
    -  }
    +/**
    + * Provides access to CSV data from pure SQL statements.
    + */
    +class DefaultSource extends FileFormat with DataSourceRegister {
    +
    +  override def shortName(): String = "csv"
    +
    +  override def toString: String = "CSV"
    +
    +  override def hashCode(): Int = getClass.hashCode()
    +
    +  override def equals(other: Any): Boolean = other.isInstanceOf[DefaultSource]
     
    -  def csvParser(
    -      schema: StructType,
    -      requiredColumns: Array[String],
    -      params: CSVOptions): Array[String] => Option[InternalRow] = {
    -    val schemaFields = schema.fields
    -    val requiredFields = StructType(requiredColumns.map(schema(_))).fields
    -    val safeRequiredFields = if (params.dropMalformed) {
    -      // If `dropMalformed` is enabled, then it needs to parse all the values
    -      // so that we can decide which row is malformed.
    -      requiredFields ++ schemaFields.filterNot(requiredFields.contains(_))
    +  override def inferSchema(
    +      sparkSession: SparkSession,
    +      options: Map[String, String],
    +      files: Seq[FileStatus]): Option[StructType] = {
    +    val csvOptions = new CSVOptions(options)
    +
    +    // TODO: Move filtering.
    +    val paths = files.filterNot(_.getPath.getName startsWith "_").map(_.getPath.toString)
    --- End diff --
    
    I see, I cannot guarantee that. The JSON data source also skips files matching `name.startsWith("_") || name.startsWith(".")`. Let me follow this first. Can I maybe handle this together with the JSON data source, after figuring it out, in a separate PR or a follow-up?
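
If the filtering were later shared between the CSV and JSON data sources, the predicate quoted in this comment could live in one small helper. The sketch below is only an assumption about what that might look like: `PathFilterSketch`, `isHiddenOrMetadata` and `dataFiles` are hypothetical names, and it works on plain file names rather than Hadoop `FileStatus` objects.

```scala
object PathFilterSketch {
  // Hypothetical helper: true for names that start with "_" or ".",
  // which the comment above describes as skipped by the JSON data source.
  def isHiddenOrMetadata(name: String): Boolean =
    name.startsWith("_") || name.startsWith(".")

  // Keeps only the names that are not hidden or metadata files.
  def dataFiles(names: Seq[String]): Seq[String] =
    names.filterNot(isHiddenOrMetadata)
}
```

For example, given `part-00000.csv`, `_SUCCESS` and `.part-00000.csv.crc`, only `part-00000.csv` is kept.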



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207909563
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55459/
    Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-208812366
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214608176
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56959/
    Test FAILed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r59333718
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala ---
    @@ -17,153 +17,197 @@
     
     package org.apache.spark.sql.execution.datasources.csv
     
    -import scala.util.control.NonFatal
    -
    -import org.apache.hadoop.fs.Path
    -import org.apache.hadoop.io.{NullWritable, Text}
    -import org.apache.hadoop.mapreduce.RecordWriter
    -import org.apache.hadoop.mapreduce.TaskAttemptContext
    +import java.io.CharArrayWriter
    +import java.nio.charset.{Charset, StandardCharsets}
    +
    +import com.univocity.parsers.csv.CsvWriter
    +import org.apache.hadoop.conf.Configuration
    +import org.apache.hadoop.fs.{FileStatus, Path}
    +import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
    +import org.apache.hadoop.mapred.TextInputFormat
    +import org.apache.hadoop.mapreduce.{Job, RecordWriter, TaskAttemptContext}
     import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
     
    +import org.apache.spark.broadcast.Broadcast
     import org.apache.spark.internal.Logging
     import org.apache.spark.rdd.RDD
     import org.apache.spark.sql._
     import org.apache.spark.sql.catalyst.InternalRow
    -import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
    -import org.apache.spark.sql.execution.datasources.PartitionedFile
    +import org.apache.spark.sql.catalyst.expressions.{JoinedRow, UnsafeProjection}
    +import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
    +import org.apache.spark.sql.execution.datasources.{CompressionCodecs, HadoopFileLinesReader, PartitionedFile}
     import org.apache.spark.sql.sources._
     import org.apache.spark.sql.types._
    +import org.apache.spark.util.SerializableConfiguration
    +import org.apache.spark.util.collection.BitSet
     
    -object CSVRelation extends Logging {
    -
    -  def univocityTokenizer(
    -      file: RDD[String],
    -      header: Seq[String],
    -      firstLine: String,
    -      params: CSVOptions): RDD[Array[String]] = {
    -    // If header is set, make sure firstLine is materialized before sending to executors.
    -    file.mapPartitions { iter =>
    -      new BulkCsvReader(
    -        if (params.headerFlag) iter.filterNot(_ == firstLine) else iter,
    -        params,
    -        headers = header)
    -    }
    -  }
    +/**
    + * Provides access to CSV data from pure SQL statements.
    + */
    +class DefaultSource extends FileFormat with DataSourceRegister {
    +
    +  override def shortName(): String = "csv"
    +
    +  override def toString: String = "CSV"
    +
    +  override def equals(other: Any): Boolean = other.isInstanceOf[DefaultSource]
    +
    +  override def inferSchema(
    +      sqlContext: SQLContext,
    +      options: Map[String, String],
    +      files: Seq[FileStatus]): Option[StructType] = {
    +    val csvOptions = new CSVOptions(options)
     
    -  def csvParser(
    -      schema: StructType,
    -      requiredColumns: Array[String],
    -      params: CSVOptions): Array[String] => Option[InternalRow] = {
    -    val schemaFields = schema.fields
    -    val requiredFields = StructType(requiredColumns.map(schema(_))).fields
    -    val safeRequiredFields = if (params.dropMalformed) {
    -      // If `dropMalformed` is enabled, then it needs to parse all the values
    -      // so that we can decide which row is malformed.
    -      requiredFields ++ schemaFields.filterNot(requiredFields.contains(_))
    +    // TODO: Move filtering.
    --- End diff --
    
    Thanks! Actually, that comment was written by someone else. Anyway, I will create JIRAs for the TODOs I just added.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r61359253
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
    @@ -0,0 +1,189 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import scala.util.control.NonFatal
    +
    +import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
    +
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
    +import org.apache.spark.sql.types.{StructField, StructType}
    +
    +/**
    + * Converts CSV string to a sequence of string
    + */
    +private[csv] object UnivocityParser extends Logging {
    +  /**
    +   * Convert the input iterator to a iterator having [[InternalRow]]
    +   */
    +  def parseCsv(
    +      iter: Iterator[String],
    +      schema: StructType,
    +      requiredSchema: StructType,
    +      headers: Array[String],
    +      shouldDropHeader: Boolean,
    +      options: CSVOptions): Iterator[InternalRow] = {
    +    if (shouldDropHeader) {
    +      CSVUtils.dropHeaderLine(iter, options)
    +    }
    +    val csv = CSVUtils.filterCommentAndEmpty(iter, options)
    +
    +    val schemaFields = schema.fields
    +    val requiredFields = requiredSchema.fields
    +    val safeRequiredFields = if (options.dropMalformed) {
    +      // If `dropMalformed` is enabled, then it needs to parse all the values
    +      // so that we can decide which row is malformed.
    +      requiredFields ++ schemaFields.filterNot(requiredFields.contains(_))
    +    } else {
    +      requiredFields
    +    }
    +    val safeRequiredIndices = new Array[Int](safeRequiredFields.length)
    +    schemaFields.zipWithIndex.filter {
    +      case (field, _) => safeRequiredFields.contains(field)
    +    }.foreach {
    +      case (field, index) => safeRequiredIndices(safeRequiredFields.indexOf(field)) = index
    +    }
    +    val requiredSize = requiredFields.length
    +
    +    tokenizeData(csv, options, headers).flatMap { tokens =>
    +      if (options.dropMalformed && schemaFields.length != tokens.length) {
    +        logWarning(s"Dropping malformed line: ${tokens.mkString(options.delimiter.toString)}")
    +        None
    +      } else if (options.failFast && schemaFields.length != tokens.length) {
    +        throw new RuntimeException(s"Malformed line in FAILFAST mode: " +
    +          s"${tokens.mkString(options.delimiter.toString)}")
    +      } else {
    +        val indexSafeTokens = if (options.permissive && schemaFields.length > tokens.length) {
    +          tokens ++ new Array[String](schemaFields.length - tokens.length)
    +        } else if (options.permissive && schemaFields.length < tokens.length) {
    +          tokens.take(schemaFields.length)
    +        } else {
    +          tokens
    +        }
    +        try {
    +          val row = convertTokens(
    +            indexSafeTokens,
    +            safeRequiredIndices,
    +            schemaFields,
    +            requiredSize,
    +            options)
    +          Some(row)
    +        } catch {
    +          case NonFatal(e) if options.dropMalformed =>
    +            logWarning("Parse exception. " +
    +              s"Dropping malformed line: ${tokens.mkString(options.delimiter.toString)}")
    +            None
    +        }
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Convert the tokens to [[InternalRow]]
    +   */
    +  private def convertTokens(
    --- End diff --
    
    I see. Could I also handle this in a separate PR dedicated to it? The code was copied from the original as-is; I only extracted it into a function named consistently with the JSON data source's `convertXXXXX()`.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r61269972
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
    @@ -0,0 +1,189 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import scala.util.control.NonFatal
    +
    +import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
    +
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
    +import org.apache.spark.sql.types.{StructField, StructType}
    +
    +/**
    + * Converts CSV string to a sequence of string
    + */
    +private[csv] object UnivocityParser extends Logging {
    +  /**
    +   * Convert the input iterator to a iterator having [[InternalRow]]
    +   */
    +  def parseCsv(
    +      iter: Iterator[String],
    +      schema: StructType,
    +      requiredSchema: StructType,
    +      headers: Array[String],
    +      shouldDropHeader: Boolean,
    +      options: CSVOptions): Iterator[InternalRow] = {
    +    if (shouldDropHeader) {
    +      CSVUtils.dropHeaderLine(iter, options)
    +    }
    +    val csv = CSVUtils.filterCommentAndEmpty(iter, options)
    +
    +    val schemaFields = schema.fields
    +    val requiredFields = requiredSchema.fields
    +    val safeRequiredFields = if (options.dropMalformed) {
    +      // If `dropMalformed` is enabled, then it needs to parse all the values
    +      // so that we can decide which row is malformed.
    +      requiredFields ++ schemaFields.filterNot(requiredFields.contains(_))
    +    } else {
    +      requiredFields
    +    }
    +    val safeRequiredIndices = new Array[Int](safeRequiredFields.length)
    +    schemaFields.zipWithIndex.filter {
    +      case (field, _) => safeRequiredFields.contains(field)
    +    }.foreach {
    +      case (field, index) => safeRequiredIndices(safeRequiredFields.indexOf(field)) = index
    +    }
    +    val requiredSize = requiredFields.length
    +
    +    tokenizeData(csv, options, headers).flatMap { tokens =>
    +      if (options.dropMalformed && schemaFields.length != tokens.length) {
    +        logWarning(s"Dropping malformed line: ${tokens.mkString(options.delimiter.toString)}")
    +        None
    +      } else if (options.failFast && schemaFields.length != tokens.length) {
    +        throw new RuntimeException(s"Malformed line in FAILFAST mode: " +
    +          s"${tokens.mkString(options.delimiter.toString)}")
    +      } else {
    +        val indexSafeTokens = if (options.permissive && schemaFields.length > tokens.length) {
    +          tokens ++ new Array[String](schemaFields.length - tokens.length)
    +        } else if (options.permissive && schemaFields.length < tokens.length) {
    +          tokens.take(schemaFields.length)
    +        } else {
    +          tokens
    +        }
    +        try {
    +          val row = convertTokens(
    +            indexSafeTokens,
    +            safeRequiredIndices,
    +            schemaFields,
    +            requiredSize,
    +            options)
    +          Some(row)
    +        } catch {
    +          case NonFatal(e) if options.dropMalformed =>
    +            logWarning("Parse exception. " +
    +              s"Dropping malformed line: ${tokens.mkString(options.delimiter.toString)}")
    +            None
    +        }
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Convert the tokens to [[InternalRow]]
    +   */
    +  private def convertTokens(
    +      tokens: Array[String],
    +      requiredIndices: Array[Int],
    +      schemaFields: Array[StructField],
    +      requiredSize: Int,
    --- End diff --
    
    Never mind, I got it.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-216090185
  
    **[Test build #57496 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57496/consoleFull)** for PR 12268 at commit [`bd510c2`](https://github.com/apache/spark/commit/bd510c2b309f1da0099205838dd7856737c8ab61).



[GitHub] spark issue #12268: [SPARK-14480][SQL] Simplify CSV parsing process with a b...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12268
  
    **[Test build #59846 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59846/consoleFull)** for PR 12268 at commit [`7abdfc1`](https://github.com/apache/spark/commit/7abdfc111166f2bf275fc4318c0ffe8836dcbb70).



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214609981
  
    This was due to https://github.com/apache/spark/commit/d2614eaadb93a48fba27fe7de64aff942e345f8e



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207897243
  
    **[Test build #55457 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55457/consoleFull)** for PR 12268 at commit [`f65592b`](https://github.com/apache/spark/commit/f65592b6026f9b346b12e70faf4365e55c22d6e6).



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207750670
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55428/
    Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r59180462
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVParserSuite.scala ---
    @@ -1,125 +0,0 @@
    -/*
    - * Licensed to the Apache Software Foundation (ASF) under one or more
    - * contributor license agreements.  See the NOTICE file distributed with
    - * this work for additional information regarding copyright ownership.
    - * The ASF licenses this file to You under the Apache License, Version 2.0
    - * (the "License"); you may not use this file except in compliance with
    - * the License.  You may obtain a copy of the License at
    - *
    - *    http://www.apache.org/licenses/LICENSE-2.0
    - *
    - * Unless required by applicable law or agreed to in writing, software
    - * distributed under the License is distributed on an "AS IS" BASIS,
    - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    - * See the License for the specific language governing permissions and
    - * limitations under the License.
    - */
    -
    -package org.apache.spark.sql.execution.datasources.csv
    -
    -import org.apache.spark.SparkFunSuite
    -
    -/**
    - * test cases for StringIteratorReader
    - */
    -class CSVParserSuite extends SparkFunSuite {
    --- End diff --
    
    I just noticed that the JSON data source also lacks tests for `JacksonParser` and `JacksonGenerator`. Could I add tests for `JacksonParser`, `JacksonGenerator`, `UnivocityParser` and `UnivocityGenerator` in another PR or a follow-up, if you think they are needed?



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207694129
  
    I have three things to note.
    - It looks like `buildInternalScan()` is no longer called but is still there. Just in case, I did not remove it, and I tested it.
    - The diff looks large, but the logic and filtering are mostly unchanged, except that it uses `Iterator` instead of `Reader`.
    - Performance was tested with this patch. The results can be found in https://issues.apache.org/jira/browse/SPARK-14480.
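
For reference, the per-line handling that the quoted `parseCsv` diff applies on top of the `Iterator` can be sketched roughly as below. This is a simplified, self-contained sketch and not code from the patch: `ParseMode`, its three case objects, `MalformedLineSketch` and `normalize` are hypothetical names standing in for the `options.permissive`, `options.dropMalformed` and `options.failFast` checks, and the delimiter is assumed to be a comma.

```scala
// Hypothetical stand-ins for the three parse modes checked in the diff.
sealed trait ParseMode
case object Permissive extends ParseMode
case object DropMalformed extends ParseMode
case object FailFast extends ParseMode

object MalformedLineSketch {
  /**
   * Returns the tokens to convert into a row, or None when the line is dropped.
   * A line counts as malformed when its token count differs from the schema length.
   */
  def normalize(
      tokens: Array[String],
      schemaLength: Int,
      mode: ParseMode): Option[Array[String]] = {
    if (tokens.length == schemaLength) {
      Some(tokens)
    } else {
      mode match {
        case DropMalformed =>
          // Skip the malformed line entirely.
          None
        case FailFast =>
          throw new RuntimeException(
            s"Malformed line in FAILFAST mode: ${tokens.mkString(",")}")
        case Permissive if tokens.length < schemaLength =>
          // Pad the missing trailing columns with nulls.
          Some(tokens ++ new Array[String](schemaLength - tokens.length))
        case Permissive =>
          // Discard the extra trailing tokens.
          Some(tokens.take(schemaLength))
      }
    }
  }
}
```

Applied line by line, this composes with `Iterator.flatMap` in the same way as the `Some(row)` / `None` results in the diff, for example `lines.map(_.split(",")).flatMap(MalformedLineSketch.normalize(_, 3, Permissive))`.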



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207706409
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55415/
    Test FAILed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r59148216
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVParserSuite.scala ---
    @@ -1,125 +0,0 @@
    -/*
    - * Licensed to the Apache Software Foundation (ASF) under one or more
    - * contributor license agreements.  See the NOTICE file distributed with
    - * this work for additional information regarding copyright ownership.
    - * The ASF licenses this file to You under the Apache License, Version 2.0
    - * (the "License"); you may not use this file except in compliance with
    - * the License.  You may obtain a copy of the License at
    - *
    - *    http://www.apache.org/licenses/LICENSE-2.0
    - *
    - * Unless required by applicable law or agreed to in writing, software
    - * distributed under the License is distributed on an "AS IS" BASIS,
    - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    - * See the License for the specific language governing permissions and
    - * limitations under the License.
    - */
    -
    -package org.apache.spark.sql.execution.datasources.csv
    -
    -import org.apache.spark.SparkFunSuite
    -
    -/**
    - * test cases for StringIteratorReader
    - */
    -class CSVParserSuite extends SparkFunSuite {
    --- End diff --
    
    Should we add a new parser suite for csv?



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-213663159
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207913532
  
    **[Test build #55461 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55461/consoleFull)** for PR 12268 at commit [`262f346`](https://github.com/apache/spark/commit/262f3466c67718d8f2648ebb80da3cdc01c0baf1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207706408
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #12268: [SPARK-14480][SQL] Simplify CSV parsing process with a b...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/12268
  
    Sounds good. Please be surgical in each pr.
    
    Would also be great to include benchmark results in pr description. Thanks.




[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-217603359
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58046/
    Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-221157064
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59175/
    Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-221147706
  
    **[Test build #59175 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59175/consoleFull)** for PR 12268 at commit [`66b1757`](https://github.com/apache/spark/commit/66b17570a8d1ad53b5073bbfa439eb01b05413c1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214624336
  
    Build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r59192246
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVParserSuite.scala ---
    @@ -1,125 +0,0 @@
    -/*
    - * Licensed to the Apache Software Foundation (ASF) under one or more
    - * contributor license agreements.  See the NOTICE file distributed with
    - * this work for additional information regarding copyright ownership.
    - * The ASF licenses this file to You under the Apache License, Version 2.0
    - * (the "License"); you may not use this file except in compliance with
    - * the License.  You may obtain a copy of the License at
    - *
    - *    http://www.apache.org/licenses/LICENSE-2.0
    - *
    - * Unless required by applicable law or agreed to in writing, software
    - * distributed under the License is distributed on an "AS IS" BASIS,
    - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    - * See the License for the specific language governing permissions and
    - * limitations under the License.
    - */
    -
    -package org.apache.spark.sql.execution.datasources.csv
    -
    -import org.apache.spark.SparkFunSuite
    -
    -/**
    - * test cases for StringIteratorReader
    - */
    -class CSVParserSuite extends SparkFunSuite {
    --- End diff --
    
    yea makes sense, we can do it in follow-ups



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214608678
  
    retest this please



[GitHub] spark issue #12268: [SPARK-14480][SQL] Simplify CSV parsing process with a b...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12268
  
    **[Test build #60751 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60751/consoleFull)** for PR 12268 at commit [`7abdfc1`](https://github.com/apache/spark/commit/7abdfc111166f2bf275fc4318c0ffe8836dcbb70).



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-216094877
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57496/
    Test FAILed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-221156828
  
    **[Test build #59174 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59174/consoleFull)** for PR 12268 at commit [`d1f616e`](https://github.com/apache/spark/commit/d1f616e2880e1100f9ffe71981a6039720d0eff4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207729067
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55420/
    Test FAILed.



[GitHub] spark issue #12268: [SPARK-14480][SQL] Simplify CSV parsing process with a b...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12268
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214625564
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56965/
    Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-209408827
  
    @yhuai Could you please review this? I don't want to keep resolving conflicts and I am pretty sure that this is a sensible PR. 
    
    This PR touches quite a lot of files, so it keeps running into merge conflicts. I would appreciate it if this could be reviewed soon.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-209473993
  
    cc @rxin , do you know who is the original author of the CSV part?



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r59268351
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala ---
    @@ -17,153 +17,197 @@
     
     package org.apache.spark.sql.execution.datasources.csv
     
    -import scala.util.control.NonFatal
    -
    -import org.apache.hadoop.fs.Path
    -import org.apache.hadoop.io.{NullWritable, Text}
    -import org.apache.hadoop.mapreduce.RecordWriter
    -import org.apache.hadoop.mapreduce.TaskAttemptContext
    +import java.io.CharArrayWriter
    +import java.nio.charset.{Charset, StandardCharsets}
    +
    +import com.univocity.parsers.csv.CsvWriter
    +import org.apache.hadoop.conf.Configuration
    +import org.apache.hadoop.fs.{FileStatus, Path}
    +import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
    +import org.apache.hadoop.mapred.TextInputFormat
    +import org.apache.hadoop.mapreduce.{Job, RecordWriter, TaskAttemptContext}
     import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
     
    +import org.apache.spark.broadcast.Broadcast
     import org.apache.spark.internal.Logging
     import org.apache.spark.rdd.RDD
     import org.apache.spark.sql._
     import org.apache.spark.sql.catalyst.InternalRow
    -import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
    -import org.apache.spark.sql.execution.datasources.PartitionedFile
    +import org.apache.spark.sql.catalyst.expressions.{JoinedRow, UnsafeProjection}
    +import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
    +import org.apache.spark.sql.execution.datasources.{CompressionCodecs, HadoopFileLinesReader, PartitionedFile}
     import org.apache.spark.sql.sources._
     import org.apache.spark.sql.types._
    +import org.apache.spark.util.SerializableConfiguration
    +import org.apache.spark.util.collection.BitSet
     
    -object CSVRelation extends Logging {
    -
    -  def univocityTokenizer(
    -      file: RDD[String],
    -      header: Seq[String],
    -      firstLine: String,
    -      params: CSVOptions): RDD[Array[String]] = {
    -    // If header is set, make sure firstLine is materialized before sending to executors.
    -    file.mapPartitions { iter =>
    -      new BulkCsvReader(
    -        if (params.headerFlag) iter.filterNot(_ == firstLine) else iter,
    -        params,
    -        headers = header)
    -    }
    -  }
    +/**
    + * Provides access to CSV data from pure SQL statements.
    + */
    +class DefaultSource extends FileFormat with DataSourceRegister {
    +
    +  override def shortName(): String = "csv"
    +
    +  override def toString: String = "CSV"
    +
    +  override def equals(other: Any): Boolean = other.isInstanceOf[DefaultSource]
    +
    +  override def inferSchema(
    +      sqlContext: SQLContext,
    +      options: Map[String, String],
    +      files: Seq[FileStatus]): Option[StructType] = {
    +    val csvOptions = new CSVOptions(options)
     
    -  def csvParser(
    -      schema: StructType,
    -      requiredColumns: Array[String],
    -      params: CSVOptions): Array[String] => Option[InternalRow] = {
    -    val schemaFields = schema.fields
    -    val requiredFields = StructType(requiredColumns.map(schema(_))).fields
    -    val safeRequiredFields = if (params.dropMalformed) {
    -      // If `dropMalformed` is enabled, then it needs to parse all the values
    -      // so that we can decide which row is malformed.
    -      requiredFields ++ schemaFields.filterNot(requiredFields.contains(_))
    +    // TODO: Move filtering.
    --- End diff --
    
    Maybe unify this TODO into the same format as the one in the JSON relation, which is a bit clearer (e.g. "TODO: Filter files for all formats before calling buildInternalScan."), assuming this is the same filtering you are referring to? Also, maybe create a JIRA (or link to an existing one), since TODO comments are easy to lose track of.



[GitHub] spark issue #12268: [SPARK-14480][SQL] Simplify CSV parsing process with a b...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12268
  
    **[Test build #59846 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59846/consoleFull)** for PR 12268 at commit [`7abdfc1`](https://github.com/apache/spark/commit/7abdfc111166f2bf275fc4318c0ffe8836dcbb70).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-215928432
  
    Since this is almost a complete rewrite, I think we should only consider it early in the release cycle, i.e. for 2.1, not for 2.0 when we are so close.




[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-213658002
  
    @rxin Could you please review this?



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-217603319
  
    **[Test build #58046 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58046/consoleFull)** for PR 12268 at commit [`f2234e3`](https://github.com/apache/spark/commit/f2234e3f7bac02c396a8638f69baab740bc83bb1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class NoSuchPermanentFunctionException(db: String, func: String)`
      * `class NoSuchFunctionException(db: String, func: String)`
      * `case class GetExternalRowField(`



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207750612
  
    **[Test build #55428 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55428/consoleFull)** for PR 12268 at commit [`b5a9669`](https://github.com/apache/spark/commit/b5a966962845011d6d56bb45d704d83eb5e06e38).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214613795
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56963/
    Test FAILed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-212206817
  
    **[Test build #56309 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56309/consoleFull)** for PR 12268 at commit [`d9ea3cb`](https://github.com/apache/spark/commit/d9ea3cb5ccb8db5d8ff9e36fa1e8d4df45ea4fb2).



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214609239
  
    **[Test build #56961 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56961/consoleFull)** for PR 12268 at commit [`d59c7e9`](https://github.com/apache/spark/commit/d59c7e98f306fa9ff5dfe3b4caae14a2de746315).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `sealed abstract class LDAModel protected[ml] (`
      * `class LocalLDAModel protected[ml] (`
      * `class DistributedLDAModel protected[ml] (`
      * `class ContinuousQueryManager(sparkSession: SparkSession) `
      * `class DataFrameReader protected[sql](sparkSession: SparkSession) extends Logging `
      * `class Dataset[T] protected[sql](`
      * `class QueryExecution(val sparkSession: SparkSession, val logical: LogicalPlan) `
      * `class FileStreamSinkLog(sparkSession: SparkSession, path: String)`
      * `class HDFSMetadataLog[T: ClassTag](sparkSession: SparkSession, path: String)`
      * `class StreamFileCatalog(sparkSession: SparkSession, path: Path) extends FileCatalog with Logging `
      * `case class PlanSubqueries(sparkSession: SparkSession) extends Rule[SparkPlan] `



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-216097352
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-218948489
  
    **[Test build #58538 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58538/consoleFull)** for PR 12268 at commit [`cbb1674`](https://github.com/apache/spark/commit/cbb1674ecb4a82bfdb3fed97cdd14adbdd14ffb6).



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r61270381
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
    @@ -0,0 +1,189 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import scala.util.control.NonFatal
    +
    +import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
    +
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
    +import org.apache.spark.sql.types.{StructField, StructType}
    +
    +/**
    + * Converts CSV string to a sequence of string
    + */
    +private[csv] object UnivocityParser extends Logging {
    +  /**
    +   * Convert the input iterator to a iterator having [[InternalRow]]
    +   */
    +  def parseCsv(
    +      iter: Iterator[String],
    +      schema: StructType,
    +      requiredSchema: StructType,
    +      headers: Array[String],
    +      shouldDropHeader: Boolean,
    +      options: CSVOptions): Iterator[InternalRow] = {
    +    if (shouldDropHeader) {
    +      CSVUtils.dropHeaderLine(iter, options)
    +    }
    +    val csv = CSVUtils.filterCommentAndEmpty(iter, options)
    +
    +    val schemaFields = schema.fields
    +    val requiredFields = requiredSchema.fields
    +    val safeRequiredFields = if (options.dropMalformed) {
    +      // If `dropMalformed` is enabled, then it needs to parse all the values
    +      // so that we can decide which row is malformed.
    +      requiredFields ++ schemaFields.filterNot(requiredFields.contains(_))
    +    } else {
    +      requiredFields
    +    }
    +    val safeRequiredIndices = new Array[Int](safeRequiredFields.length)
    +    schemaFields.zipWithIndex.filter {
    +      case (field, _) => safeRequiredFields.contains(field)
    +    }.foreach {
    +      case (field, index) => safeRequiredIndices(safeRequiredFields.indexOf(field)) = index
    +    }
    +    val requiredSize = requiredFields.length
    +
    +    tokenizeData(csv, options, headers).flatMap { tokens =>
    +      if (options.dropMalformed && schemaFields.length != tokens.length) {
    +        logWarning(s"Dropping malformed line: ${tokens.mkString(options.delimiter.toString)}")
    +        None
    +      } else if (options.failFast && schemaFields.length != tokens.length) {
    +        throw new RuntimeException(s"Malformed line in FAILFAST mode: " +
    +          s"${tokens.mkString(options.delimiter.toString)}")
    +      } else {
    +        val indexSafeTokens = if (options.permissive && schemaFields.length > tokens.length) {
    +          tokens ++ new Array[String](schemaFields.length - tokens.length)
    --- End diff --
    
    Do you think there is a way we can do this without appending an array? Using an extra limit in `convertTokens` is probably quicker and causes less GC.
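
The suggestion above could be sketched roughly as follows. This is only an illustration under stated assumptions: the object name `PaddingFreeConversion`, the `convertTokens` signature, and the `convert` callback are hypothetical and are not taken from the PR. The idea is to treat any index past the end of a short line as null during conversion, rather than allocating a padded copy of the token array for every short row.

```scala
// Hypothetical sketch; names and signatures are assumptions, not the PR's code.
object PaddingFreeConversion {

  /**
   * Builds one output value per required column index.
   * Indices at or beyond tokens.length yield null, so a short (permissive)
   * line never needs a padded copy of the token array.
   */
  def convertTokens(
      tokens: Array[String],
      requiredIndices: Array[Int],
      convert: (Int, String) => Any): Array[Any] = {
    val row = new Array[Any](requiredIndices.length)
    var i = 0
    while (i < requiredIndices.length) {
      val idx = requiredIndices(i)
      // The bound check replaces `tokens ++ new Array[String](missing)`.
      row(i) = if (idx < tokens.length) convert(idx, tokens(idx)) else null
      i += 1
    }
    row
  }
}
```

With this shape, the only per-row allocation is the output array itself; whether that is measurably faster than padding would need to be benchmarked.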



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-218956628
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-213663160
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56768/
    Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214554406
  
    ping @rxin



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-221157062
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-218956629
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58538/
    Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-215109697
  
    @HyukjinKwon I have taken a pass. The PR looks pretty solid. I do think we can make it a bit more concise in some places, and I think we can make it a bit faster as well. Let me know what you think.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r61359353
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
    @@ -0,0 +1,189 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import scala.util.control.NonFatal
    +
    +import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
    +
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
    +import org.apache.spark.sql.types.{StructField, StructType}
    +
    +/**
    + * Converts CSV string to a sequence of string
    + */
    +private[csv] object UnivocityParser extends Logging {
    +  /**
    +   * Convert the input iterator to a iterator having [[InternalRow]]
    +   */
    +  def parseCsv(
    +      iter: Iterator[String],
    +      schema: StructType,
    +      requiredSchema: StructType,
    +      headers: Array[String],
    +      shouldDropHeader: Boolean,
    +      options: CSVOptions): Iterator[InternalRow] = {
    +    if (shouldDropHeader) {
    +      CSVUtils.dropHeaderLine(iter, options)
    +    }
    +    val csv = CSVUtils.filterCommentAndEmpty(iter, options)
    +
    +    val schemaFields = schema.fields
    +    val requiredFields = requiredSchema.fields
    +    val safeRequiredFields = if (options.dropMalformed) {
    +      // If `dropMalformed` is enabled, then it needs to parse all the values
    +      // so that we can decide which row is malformed.
    +      requiredFields ++ schemaFields.filterNot(requiredFields.contains(_))
    +    } else {
    +      requiredFields
    +    }
    +    val safeRequiredIndices = new Array[Int](safeRequiredFields.length)
    +    schemaFields.zipWithIndex.filter {
    +      case (field, _) => safeRequiredFields.contains(field)
    +    }.foreach {
    +      case (field, index) => safeRequiredIndices(safeRequiredFields.indexOf(field)) = index
    +    }
    +    val requiredSize = requiredFields.length
    +
    +    tokenizeData(csv, options, headers).flatMap { tokens =>
    +      if (options.dropMalformed && schemaFields.length != tokens.length) {
    +        logWarning(s"Dropping malformed line: ${tokens.mkString(options.delimiter.toString)}")
    +        None
    +      } else if (options.failFast && schemaFields.length != tokens.length) {
    +        throw new RuntimeException(s"Malformed line in FAILFAST mode: " +
    +          s"${tokens.mkString(options.delimiter.toString)}")
    +      } else {
    +        val indexSafeTokens = if (options.permissive && schemaFields.length > tokens.length) {
    +          tokens ++ new Array[String](schemaFields.length - tokens.length)
    --- End diff --
    
    Thanks for pointing this out. I will think about this further. Maybe I could do this in a separate PR if you think that is sensible. This code was copied from the original implementation.
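
For readers following the thread, the branch this comment is attached to decides what happens when a tokenized line does not have as many tokens as the schema has fields. The following is only a minimal sketch of that decision, reconstructed from the quoted diff. The `ParseMode` type and the `normalizeTokens` helper are hypothetical names introduced here for illustration (the diff uses the Boolean flags `options.permissive`, `options.dropMalformed` and `options.failFast`), and a plain comma stands in for `options.delimiter`.

```scala
object MalformedTokensSketch {
  // Hypothetical stand-in for the three Boolean flags on CSVOptions.
  sealed trait ParseMode
  case object Permissive extends ParseMode
  case object DropMalformed extends ParseMode
  case object FailFast extends ParseMode

  /**
   * Returns the tokens adjusted to `schemaLength`, or None when the line
   * should be dropped. Throws in FailFast mode, as the quoted diff does.
   */
  def normalizeTokens(
      tokens: Array[String],
      schemaLength: Int,
      mode: ParseMode): Option[Array[String]] = {
    if (tokens.length == schemaLength) {
      Some(tokens)
    } else {
      mode match {
        case DropMalformed =>
          None
        case FailFast =>
          throw new RuntimeException(
            s"Malformed line in FAILFAST mode: ${tokens.mkString(",")}")
        case Permissive if tokens.length < schemaLength =>
          // Pad short lines with nulls so every schema index is addressable.
          Some(tokens ++ new Array[String](schemaLength - tokens.length))
        case Permissive =>
          // Truncate long lines to the schema length.
          Some(tokens.take(schemaLength))
      }
    }
  }
}
```

Note that the padding branch allocates a new array for every short line, which is one place where a per-row cost could be reduced; whether that is what the review comment referred to is not stated in this message.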



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r61358501
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityGenerator.scala ---
    @@ -0,0 +1,80 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import com.univocity.parsers.csv.{CsvWriter, CsvWriterSettings}
    +
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.types.StructType
    +
    +/**
    + * Converts a sequence of string to CSV string
    + */
    +private[csv] object UnivocityGenerator extends Logging {
    --- End diff --
    
    Thanks! The name was also taken from the JSON data source's `JacksonGenerator`. Maybe I can rename them together once this one is merged.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r61265498
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityGenerator.scala ---
    @@ -0,0 +1,80 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import com.univocity.parsers.csv.{CsvWriter, CsvWriterSettings}
    +
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.types.StructType
    +
    +/**
    + * Converts a sequence of string to CSV string
    + */
    +private[csv] object UnivocityGenerator extends Logging {
    --- End diff --
    
    Come to think of it, why not integrate this with the `CsvOutputWriter`?



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214613784
  
    **[Test build #56963 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56963/consoleFull)** for PR 12268 at commit [`f62755e`](https://github.com/apache/spark/commit/f62755e0875ae8f2947abf8a62505dd77b2ed9f5).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r61267542
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
    @@ -0,0 +1,189 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import scala.util.control.NonFatal
    +
    +import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
    +
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
    +import org.apache.spark.sql.types.{StructField, StructType}
    +
    +/**
    + * Converts CSV string to a sequence of string
    + */
    +private[csv] object UnivocityParser extends Logging {
    +  /**
    +   * Convert the input iterator to a iterator having [[InternalRow]]
    +   */
    +  def parseCsv(
    +      iter: Iterator[String],
    +      schema: StructType,
    +      requiredSchema: StructType,
    +      headers: Array[String],
    +      shouldDropHeader: Boolean,
    +      options: CSVOptions): Iterator[InternalRow] = {
    +    if (shouldDropHeader) {
    +      CSVUtils.dropHeaderLine(iter, options)
    +    }
    +    val csv = CSVUtils.filterCommentAndEmpty(iter, options)
    +
    +    val schemaFields = schema.fields
    +    val requiredFields = requiredSchema.fields
    +    val safeRequiredFields = if (options.dropMalformed) {
    +      // If `dropMalformed` is enabled, then it needs to parse all the values
    +      // so that we can decide which row is malformed.
    +      requiredFields ++ schemaFields.filterNot(requiredFields.contains(_))
    +    } else {
    +      requiredFields
    +    }
    +    val safeRequiredIndices = new Array[Int](safeRequiredFields.length)
    +    schemaFields.zipWithIndex.filter {
    +      case (field, _) => safeRequiredFields.contains(field)
    +    }.foreach {
    +      case (field, index) => safeRequiredIndices(safeRequiredFields.indexOf(field)) = index
    +    }
    +    val requiredSize = requiredFields.length
    +
    +    tokenizeData(csv, options, headers).flatMap { tokens =>
    +      if (options.dropMalformed && schemaFields.length != tokens.length) {
    +        logWarning(s"Dropping malformed line: ${tokens.mkString(options.delimiter.toString)}")
    +        None
    +      } else if (options.failFast && schemaFields.length != tokens.length) {
    +        throw new RuntimeException(s"Malformed line in FAILFAST mode: " +
    +          s"${tokens.mkString(options.delimiter.toString)}")
    +      } else {
    +        val indexSafeTokens = if (options.permissive && schemaFields.length > tokens.length) {
    +          tokens ++ new Array[String](schemaFields.length - tokens.length)
    +        } else if (options.permissive && schemaFields.length < tokens.length) {
    +          tokens.take(schemaFields.length)
    +        } else {
    +          tokens
    +        }
    +        try {
    +          val row = convertTokens(
    +            indexSafeTokens,
    +            safeRequiredIndices,
    +            schemaFields,
    +            requiredSize,
    +            options)
    +          Some(row)
    +        } catch {
    +          case NonFatal(e) if options.dropMalformed =>
    +            logWarning("Parse exception. " +
    +              s"Dropping malformed line: ${tokens.mkString(options.delimiter.toString)}")
    +            None
    +        }
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Convert the tokens to [[InternalRow]]
    +   */
    +  private def convertTokens(
    +      tokens: Array[String],
    +      requiredIndices: Array[Int],
    +      schemaFields: Array[StructField],
    +      requiredSize: Int,
    --- End diff --
    
    Can an entry in `requiredIndices` lie outside of the `requiredSize` range? Why?
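
One reading of the quoted diff that bears on this question: in `dropMalformed` mode the index array is built from the required fields followed by every remaining schema field, so it can be longer than `requiredSize`. The sketch below illustrates that under simplifying assumptions. Plain field names stand in for `StructField`, field names are assumed unique, and the `safeRequiredIndices` helper is a hypothetical condensation of the `zipWithIndex`/`filter`/`foreach` sequence in the diff, not the PR's actual code.

```scala
object RequiredIndicesSketch {
  /**
   * For each "safe required" field, returns its position in the full schema.
   * Field names stand in for StructField; names are assumed to be unique.
   */
  def safeRequiredIndices(
      schemaFields: Array[String],
      requiredFields: Array[String],
      dropMalformed: Boolean): Array[Int] = {
    val safeRequiredFields =
      if (dropMalformed) {
        // Every column must be parsed to decide whether the row is malformed,
        // so the non-required fields are appended after the required ones.
        requiredFields ++ schemaFields.filterNot(requiredFields.contains(_))
      } else {
        requiredFields
      }
    safeRequiredFields.map(field => schemaFields.indexOf(field))
  }

  def main(args: Array[String]): Unit = {
    val schema = Array("a", "b", "c")
    val required = Array("c", "a") // requiredSize == 2

    // Without dropMalformed: one index per required field.
    println(safeRequiredIndices(schema, required, dropMalformed = false).mkString(","))
    // With dropMalformed: a third entry appears at position 2 >= requiredSize.
    println(safeRequiredIndices(schema, required, dropMalformed = true).mkString(","))
  }
}
```

Under this reading, positions at or beyond `requiredSize` exist only so that the extra columns get cast and validated. The removed `csvParser` code in `CSVRelation.scala` guarded the write with `if (subIndex < requiredSize)` so those values were never stored in the row.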



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r61260986
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala ---
    @@ -17,152 +17,162 @@
     
     package org.apache.spark.sql.execution.datasources.csv
     
    -import scala.util.control.NonFatal
    -
    -import org.apache.hadoop.fs.Path
    -import org.apache.hadoop.io.{NullWritable, Text}
    -import org.apache.hadoop.mapreduce.RecordWriter
    -import org.apache.hadoop.mapreduce.TaskAttemptContext
    +import java.io.CharArrayWriter
    +import java.nio.charset.{Charset, StandardCharsets}
    +
    +import com.univocity.parsers.csv.CsvWriter
    +import org.apache.hadoop.conf.Configuration
    +import org.apache.hadoop.fs.{FileStatus, Path}
    +import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
    +import org.apache.hadoop.mapred.TextInputFormat
    +import org.apache.hadoop.mapreduce.{Job, RecordWriter, TaskAttemptContext}
     import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
     
     import org.apache.spark.internal.Logging
     import org.apache.spark.rdd.RDD
     import org.apache.spark.sql._
     import org.apache.spark.sql.catalyst.InternalRow
    -import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
    -import org.apache.spark.sql.execution.datasources.{OutputWriter, OutputWriterFactory, PartitionedFile}
    +import org.apache.spark.sql.catalyst.expressions.JoinedRow
    +import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
    +import org.apache.spark.sql.execution.datasources._
    +import org.apache.spark.sql.sources._
     import org.apache.spark.sql.types._
    +import org.apache.spark.util.SerializableConfiguration
     
    -object CSVRelation extends Logging {
    -
    -  def univocityTokenizer(
    -      file: RDD[String],
    -      header: Seq[String],
    -      firstLine: String,
    -      params: CSVOptions): RDD[Array[String]] = {
    -    // If header is set, make sure firstLine is materialized before sending to executors.
    -    file.mapPartitions { iter =>
    -      new BulkCsvReader(
    -        if (params.headerFlag) iter.filterNot(_ == firstLine) else iter,
    -        params,
    -        headers = header)
    -    }
    -  }
    +/**
    + * Provides access to CSV data from pure SQL statements.
    + */
    +class DefaultSource extends FileFormat with DataSourceRegister {
    +
    +  override def shortName(): String = "csv"
    +
    +  override def toString: String = "CSV"
    +
    +  override def hashCode(): Int = getClass.hashCode()
    +
    +  override def equals(other: Any): Boolean = other.isInstanceOf[DefaultSource]
     
    -  def csvParser(
    -      schema: StructType,
    -      requiredColumns: Array[String],
    -      params: CSVOptions): Array[String] => Option[InternalRow] = {
    -    val schemaFields = schema.fields
    -    val requiredFields = StructType(requiredColumns.map(schema(_))).fields
    -    val safeRequiredFields = if (params.dropMalformed) {
    -      // If `dropMalformed` is enabled, then it needs to parse all the values
    -      // so that we can decide which row is malformed.
    -      requiredFields ++ schemaFields.filterNot(requiredFields.contains(_))
    +  override def inferSchema(
    +      sparkSession: SparkSession,
    +      options: Map[String, String],
    +      files: Seq[FileStatus]): Option[StructType] = {
    +    val csvOptions = new CSVOptions(options)
    +
    +    // TODO: Move filtering.
    +    val paths = files.filterNot(_.getPath.getName startsWith "_").map(_.getPath.toString)
    +    val rdd = createBaseRdd(sparkSession, csvOptions, paths)
    +    val schema = if (csvOptions.inferSchemaFlag) {
    +      InferSchema.infer(rdd, csvOptions)
         } else {
    -      requiredFields
    -    }
    -    val safeRequiredIndices = new Array[Int](safeRequiredFields.length)
    -    schemaFields.zipWithIndex.filter {
    -      case (field, _) => safeRequiredFields.contains(field)
    -    }.foreach {
    -      case (field, index) => safeRequiredIndices(safeRequiredFields.indexOf(field)) = index
    -    }
    -    val requiredSize = requiredFields.length
    -    val row = new GenericMutableRow(requiredSize)
    -
    -    (tokens: Array[String]) => {
    -      if (params.dropMalformed && schemaFields.length != tokens.length) {
    -        logWarning(s"Dropping malformed line: ${tokens.mkString(params.delimiter.toString)}")
    -        None
    -      } else if (params.failFast && schemaFields.length != tokens.length) {
    -        throw new RuntimeException(s"Malformed line in FAILFAST mode: " +
    -          s"${tokens.mkString(params.delimiter.toString)}")
    +      // By default fields are assumed to be StringType
    +      val filteredRdd = rdd.mapPartitions(CSVUtils.filterCommentAndEmpty(_, csvOptions))
    +      val firstLine = filteredRdd.first()
    +      val firstRow = UnivocityParser.tokenizeLine(firstLine, csvOptions)
    +      val header = if (csvOptions.headerFlag) {
    +        firstRow
           } else {
    -        val indexSafeTokens = if (params.permissive && schemaFields.length > tokens.length) {
    -          tokens ++ new Array[String](schemaFields.length - tokens.length)
    -        } else if (params.permissive && schemaFields.length < tokens.length) {
    -          tokens.take(schemaFields.length)
    -        } else {
    -          tokens
    -        }
    -        try {
    -          var index: Int = 0
    -          var subIndex: Int = 0
    -          while (subIndex < safeRequiredIndices.length) {
    -            index = safeRequiredIndices(subIndex)
    -            val field = schemaFields(index)
    -            // It anyway needs to try to parse since it decides if this row is malformed
    -            // or not after trying to cast in `DROPMALFORMED` mode even if the casted
    -            // value is not stored in the row.
    -            val value = CSVTypeCast.castTo(
    -              indexSafeTokens(index),
    -              field.dataType,
    -              field.nullable,
    -              params.nullValue)
    -            if (subIndex < requiredSize) {
    -              row(subIndex) = value
    -            }
    -            subIndex = subIndex + 1
    -          }
    -          Some(row)
    -        } catch {
    -          case NonFatal(e) if params.dropMalformed =>
    -            logWarning("Parse exception. " +
    -              s"Dropping malformed line: ${tokens.mkString(params.delimiter.toString)}")
    -            None
    -        }
    +        firstRow.zipWithIndex.map { case (value, index) => s"C$index" }
    +      }
    +      val schemaFields = header.map { fieldName =>
    +        StructField(fieldName.toString, StringType, nullable = true)
           }
    +      StructType(schemaFields)
         }
    +    Some(schema)
       }
     
    -  def parseCsv(
    -      tokenizedRDD: RDD[Array[String]],
    -      schema: StructType,
    -      requiredColumns: Array[String],
    -      options: CSVOptions): RDD[InternalRow] = {
    -    val parser = csvParser(schema, requiredColumns, options)
    -    tokenizedRDD.flatMap(parser(_).toSeq)
    +  override def prepareWrite(
    +      sparkSession: SparkSession,
    +      job: Job,
    +      options: Map[String, String],
    +      dataSchema: StructType): OutputWriterFactory = {
    +    val conf = job.getConfiguration
    +    val csvOptions = new CSVOptions(options)
    +    csvOptions.compressionCodec.foreach { codec =>
    +      CompressionCodecs.setCodecConfiguration(conf, codec)
    --- End diff --
    
    Just out of curiosity, can we also read compressed CSV files?



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214605811
  
    **[Test build #56955 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56955/consoleFull)** for PR 12268 at commit [`ad21b8e`](https://github.com/apache/spark/commit/ad21b8eea981f61cb35de646f3568b27dd2141a3).



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214926444
  
    **[Test build #57058 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57058/consoleFull)** for PR 12268 at commit [`ee71064`](https://github.com/apache/spark/commit/ee7106416ef17e5168a91bab044c6f6db9dbd53b).



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214925695
  
    @rxin No problem. Let me just rebase it if it has conflicts anyway. It is easier to track the changes.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r61358639
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
    @@ -0,0 +1,189 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import scala.util.control.NonFatal
    +
    +import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
    +
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
    +import org.apache.spark.sql.types.{StructField, StructType}
    +
    +/**
    + * Converts CSV string to a sequence of string
    + */
    +private[csv] object UnivocityParser extends Logging {
    --- End diff --
    
    The name was also taken from the JSON data source's `JacksonParser`. If the name looks problematic, could I rename it together with the JSON data source classes in a follow-up or separate PR, if that seems sensible?



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207736726
  
    retest this please



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-217066498
  
    **[Test build #57834 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57834/consoleFull)** for PR 12268 at commit [`a0aed27`](https://github.com/apache/spark/commit/a0aed27b7169caee50d0e97bceb6653202ba3f04).



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207909562
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #12268: [SPARK-14480][SQL] Simplify CSV parsing process with a b...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12268
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214608734
  
    **[Test build #56961 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56961/consoleFull)** for PR 12268 at commit [`d59c7e9`](https://github.com/apache/spark/commit/d59c7e98f306fa9ff5dfe3b4caae14a2de746315).



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-207897524
  
    **[Test build #55459 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55459/consoleFull)** for PR 12268 at commit [`55596e1`](https://github.com/apache/spark/commit/55596e1aeb5a1a4bcbafc24075146c1f94ac6daf).



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214939730
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57058/



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r61266412
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
    @@ -0,0 +1,189 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import scala.util.control.NonFatal
    +
    +import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
    +
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
    +import org.apache.spark.sql.types.{StructField, StructType}
    +
    +/**
    + * Converts CSV string to a sequence of string
    + */
    +private[csv] object UnivocityParser extends Logging {
    +  /**
    +   * Convert the input iterator to a iterator having [[InternalRow]]
    +   */
    +  def parseCsv(
    +      iter: Iterator[String],
    +      schema: StructType,
    +      requiredSchema: StructType,
    +      headers: Array[String],
    +      shouldDropHeader: Boolean,
    +      options: CSVOptions): Iterator[InternalRow] = {
    +    if (shouldDropHeader) {
    +      CSVUtils.dropHeaderLine(iter, options)
    +    }
    +    val csv = CSVUtils.filterCommentAndEmpty(iter, options)
    +
    +    val schemaFields = schema.fields
    +    val requiredFields = requiredSchema.fields
    +    val safeRequiredFields = if (options.dropMalformed) {
    +      // If `dropMalformed` is enabled, then it needs to parse all the values
    +      // so that we can decide which row is malformed.
    +      requiredFields ++ schemaFields.filterNot(requiredFields.contains(_))
    +    } else {
    +      requiredFields
    +    }
    +    val safeRequiredIndices = new Array[Int](safeRequiredFields.length)
    +    schemaFields.zipWithIndex.filter {
    +      case (field, _) => safeRequiredFields.contains(field)
    +    }.foreach {
    +      case (field, index) => safeRequiredIndices(safeRequiredFields.indexOf(field)) = index
    +    }
    +    val requiredSize = requiredFields.length
    +
    +    tokenizeData(csv, options, headers).flatMap { tokens =>
    +      if (options.dropMalformed && schemaFields.length != tokens.length) {
    +        logWarning(s"Dropping malformed line: ${tokens.mkString(options.delimiter.toString)}")
    +        None
    +      } else if (options.failFast && schemaFields.length != tokens.length) {
    +        throw new RuntimeException(s"Malformed line in FAILFAST mode: " +
    +          s"${tokens.mkString(options.delimiter.toString)}")
    +      } else {
    +        val indexSafeTokens = if (options.permissive && schemaFields.length > tokens.length) {
    +          tokens ++ new Array[String](schemaFields.length - tokens.length)
    +        } else if (options.permissive && schemaFields.length < tokens.length) {
    +          tokens.take(schemaFields.length)
    +        } else {
    +          tokens
    +        }
    +        try {
    +          val row = convertTokens(
    +            indexSafeTokens,
    +            safeRequiredIndices,
    +            schemaFields,
    +            requiredSize,
    +            options)
    +          Some(row)
    +        } catch {
    +          case NonFatal(e) if options.dropMalformed =>
    +            logWarning("Parse exception. " +
    +              s"Dropping malformed line: ${tokens.mkString(options.delimiter.toString)}")
    +            None
    +        }
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Convert the tokens to [[InternalRow]]
    +   */
    +  private def convertTokens(
    --- End diff --
    
    This might be a wild idea: We might be able to use an encoder here.
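
For readers following the quoted `parseCsv` diff, the length check it applies to each tokenized line can be sketched in isolation. This is a minimal sketch, not the PR's code: the `ParseMode` trait and the `alignTokens` helper are hypothetical names standing in for the boolean flags on `CSVOptions` (`permissive`, `dropMalformed`, `failFast`), and it covers only the token-count check, not the later drop on a conversion exception.

```scala
// Hypothetical stand-ins for the boolean mode flags on CSVOptions.
sealed trait ParseMode
case object Permissive extends ParseMode
case object DropMalformed extends ParseMode
case object FailFast extends ParseMode

// Returns the tokens aligned to the schema width, or None if the line is dropped.
def alignTokens(
    tokens: Array[String],
    expected: Int,
    mode: ParseMode): Option[Array[String]] = {
  if (tokens.length == expected) {
    Some(tokens)
  } else {
    mode match {
      // Drop any line whose token count does not match the schema.
      case DropMalformed => None
      // Abort on the first malformed line.
      case FailFast =>
        throw new RuntimeException(
          s"Malformed line in FAILFAST mode: ${tokens.mkString(",")}")
      // Pad short lines with nulls so every schema index is addressable.
      case Permissive if tokens.length < expected =>
        Some(tokens ++ new Array[String](expected - tokens.length))
      // Truncate long lines to the schema width.
      case Permissive =>
        Some(tokens.take(expected))
    }
  }
}

// Example: a one-token line against a three-column schema is padded with nulls.
val padded = alignTokens(Array("a"), 3, Permissive)
```

Modelling the modes as a sealed trait is only a convenience for the sketch; the quoted diff keeps them as separate boolean checks.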



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-218956508
  
    **[Test build #58538 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58538/consoleFull)** for PR 12268 at commit [`cbb1674`](https://github.com/apache/spark/commit/cbb1674ecb4a82bfdb3fed97cdd14adbdd14ffb6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #12268: [SPARK-14480][SQL] Simplify CSV parsing process with a b...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/12268
  
    @HyukjinKwon I just took another look at this. Can you separate this into multiple, smaller pull requests?
    
    IIUC, the main change you want to do is to get rid of the StringIteratorReader. Why don't you just submit a patch for only that? It'd make it much easier to review and merge. The problem with mixing multiple things into a single pull request is that it significantly increases the overhead to review, and makes it almost impossible to merge.
    
    We can start with this one change, and then do other refactorings later.
    
    Also, it would be great to show some benchmark numbers, since this is a performance improvement. The other thing I'd like to see (in a separate pull request) is to try using the Record API from univocity and compare its performance with our current data cast API.
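
The kind of comparison being asked for can be sketched as follows. This is a rough illustration under stated assumptions, not the PR's benchmark: `LineIteratorReader` is a hypothetical stand-in for a `Reader` that wraps an `Iterator[String]` (it is not Spark's `StringIteratorReader`), a plain `split(",")` stands in for the univocity tokenizer, and `timeMillis` is a naive wall-clock timer rather than a proper benchmark harness.

```scala
import java.io.{BufferedReader, Reader}

// Hypothetical Reader over an iterator of lines; each line is re-terminated
// with '\n' and handed out in chunks of whatever size the caller requests.
final class LineIteratorReader(lines: Iterator[String]) extends Reader {
  private var current: String = ""
  private var pos: Int = 0

  override def read(cbuf: Array[Char], off: Int, len: Int): Int = {
    if (len == 0) return 0
    // Refill from the iterator once the current line is used up.
    while (pos >= current.length) {
      if (!lines.hasNext) return -1
      current = lines.next() + "\n"
      pos = 0
    }
    val n = math.min(len, current.length - pos)
    current.getChars(pos, pos + n, cbuf, off)
    pos += n
    n
  }

  override def close(): Unit = ()
}

// Tokenize each line directly from the iterator.
def countFieldsDirect(lines: Iterator[String]): Long =
  lines.map(_.split(",", -1).length.toLong).sum

// Tokenize the same input after a round trip through the Reader wrapper.
def countFieldsViaReader(lines: Iterator[String]): Long = {
  val reader = new BufferedReader(new LineIteratorReader(lines))
  Iterator.continually(reader.readLine())
    .takeWhile(_ != null)
    .map(_.split(",", -1).length.toLong)
    .sum
}

// Naive wall-clock timer; returns the result and the elapsed milliseconds.
def timeMillis[A](body: => A): (A, Long) = {
  val start = System.nanoTime()
  val result = body
  (result, (System.nanoTime() - start) / 1000000L)
}
```

Calling `timeMillis(countFieldsDirect(data.iterator))` and `timeMillis(countFieldsViaReader(data.iterator))` on the same in-memory `data` gives a first impression of the wrapper's overhead. Numbers worth posting on the PR would still need warm-up runs and the real tokenizer.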




[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214939729
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214609249
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r61358888
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala ---
    @@ -17,152 +17,162 @@
     
     package org.apache.spark.sql.execution.datasources.csv
     
    -import scala.util.control.NonFatal
    -
    -import org.apache.hadoop.fs.Path
    -import org.apache.hadoop.io.{NullWritable, Text}
    -import org.apache.hadoop.mapreduce.RecordWriter
    -import org.apache.hadoop.mapreduce.TaskAttemptContext
    +import java.io.CharArrayWriter
    +import java.nio.charset.{Charset, StandardCharsets}
    +
    +import com.univocity.parsers.csv.CsvWriter
    +import org.apache.hadoop.conf.Configuration
    +import org.apache.hadoop.fs.{FileStatus, Path}
    +import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
    +import org.apache.hadoop.mapred.TextInputFormat
    +import org.apache.hadoop.mapreduce.{Job, RecordWriter, TaskAttemptContext}
     import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
     
     import org.apache.spark.internal.Logging
     import org.apache.spark.rdd.RDD
     import org.apache.spark.sql._
     import org.apache.spark.sql.catalyst.InternalRow
    -import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
    -import org.apache.spark.sql.execution.datasources.{OutputWriter, OutputWriterFactory, PartitionedFile}
    +import org.apache.spark.sql.catalyst.expressions.JoinedRow
    +import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
    +import org.apache.spark.sql.execution.datasources._
    +import org.apache.spark.sql.sources._
     import org.apache.spark.sql.types._
    +import org.apache.spark.util.SerializableConfiguration
     
    -object CSVRelation extends Logging {
    -
    -  def univocityTokenizer(
    -      file: RDD[String],
    -      header: Seq[String],
    -      firstLine: String,
    -      params: CSVOptions): RDD[Array[String]] = {
    -    // If header is set, make sure firstLine is materialized before sending to executors.
    -    file.mapPartitions { iter =>
    -      new BulkCsvReader(
    -        if (params.headerFlag) iter.filterNot(_ == firstLine) else iter,
    -        params,
    -        headers = header)
    -    }
    -  }
    +/**
    + * Provides access to CSV data from pure SQL statements.
    + */
    +class DefaultSource extends FileFormat with DataSourceRegister {
    +
    +  override def shortName(): String = "csv"
    +
    +  override def toString: String = "CSV"
    +
    +  override def hashCode(): Int = getClass.hashCode()
    +
    +  override def equals(other: Any): Boolean = other.isInstanceOf[DefaultSource]
     
    -  def csvParser(
    -      schema: StructType,
    -      requiredColumns: Array[String],
    -      params: CSVOptions): Array[String] => Option[InternalRow] = {
    -    val schemaFields = schema.fields
    -    val requiredFields = StructType(requiredColumns.map(schema(_))).fields
    -    val safeRequiredFields = if (params.dropMalformed) {
    -      // If `dropMalformed` is enabled, then it needs to parse all the values
    -      // so that we can decide which row is malformed.
    -      requiredFields ++ schemaFields.filterNot(requiredFields.contains(_))
    +  override def inferSchema(
    +      sparkSession: SparkSession,
    +      options: Map[String, String],
    +      files: Seq[FileStatus]): Option[StructType] = {
    +    val csvOptions = new CSVOptions(options)
    +
    +    // TODO: Move filtering.
    +    val paths = files.filterNot(_.getPath.getName startsWith "_").map(_.getPath.toString)
    +    val rdd = createBaseRdd(sparkSession, csvOptions, paths)
    +    val schema = if (csvOptions.inferSchemaFlag) {
    +      InferSchema.infer(rdd, csvOptions)
         } else {
    -      requiredFields
    -    }
    -    val safeRequiredIndices = new Array[Int](safeRequiredFields.length)
    -    schemaFields.zipWithIndex.filter {
    -      case (field, _) => safeRequiredFields.contains(field)
    -    }.foreach {
    -      case (field, index) => safeRequiredIndices(safeRequiredFields.indexOf(field)) = index
    -    }
    -    val requiredSize = requiredFields.length
    -    val row = new GenericMutableRow(requiredSize)
    -
    -    (tokens: Array[String]) => {
    -      if (params.dropMalformed && schemaFields.length != tokens.length) {
    -        logWarning(s"Dropping malformed line: ${tokens.mkString(params.delimiter.toString)}")
    -        None
    -      } else if (params.failFast && schemaFields.length != tokens.length) {
    -        throw new RuntimeException(s"Malformed line in FAILFAST mode: " +
    -          s"${tokens.mkString(params.delimiter.toString)}")
    +      // By default fields are assumed to be StringType
    +      val filteredRdd = rdd.mapPartitions(CSVUtils.filterCommentAndEmpty(_, csvOptions))
    +      val firstLine = filteredRdd.first()
    +      val firstRow = UnivocityParser.tokenizeLine(firstLine, csvOptions)
    +      val header = if (csvOptions.headerFlag) {
    +        firstRow
           } else {
    -        val indexSafeTokens = if (params.permissive && schemaFields.length > tokens.length) {
    -          tokens ++ new Array[String](schemaFields.length - tokens.length)
    -        } else if (params.permissive && schemaFields.length < tokens.length) {
    -          tokens.take(schemaFields.length)
    -        } else {
    -          tokens
    -        }
    -        try {
    -          var index: Int = 0
    -          var subIndex: Int = 0
    -          while (subIndex < safeRequiredIndices.length) {
    -            index = safeRequiredIndices(subIndex)
    -            val field = schemaFields(index)
    -            // It anyway needs to try to parse since it decides if this row is malformed
    -            // or not after trying to cast in `DROPMALFORMED` mode even if the casted
    -            // value is not stored in the row.
    -            val value = CSVTypeCast.castTo(
    -              indexSafeTokens(index),
    -              field.dataType,
    -              field.nullable,
    -              params.nullValue)
    -            if (subIndex < requiredSize) {
    -              row(subIndex) = value
    -            }
    -            subIndex = subIndex + 1
    -          }
    -          Some(row)
    -        } catch {
    -          case NonFatal(e) if params.dropMalformed =>
    -            logWarning("Parse exception. " +
    -              s"Dropping malformed line: ${tokens.mkString(params.delimiter.toString)}")
    -            None
    -        }
    +        firstRow.zipWithIndex.map { case (value, index) => s"C$index" }
    +      }
    +      val schemaFields = header.map { fieldName =>
    +        StructField(fieldName.toString, StringType, nullable = true)
           }
    +      StructType(schemaFields)
         }
    +    Some(schema)
       }
     
    -  def parseCsv(
    -      tokenizedRDD: RDD[Array[String]],
    -      schema: StructType,
    -      requiredColumns: Array[String],
    -      options: CSVOptions): RDD[InternalRow] = {
    -    val parser = csvParser(schema, requiredColumns, options)
    -    tokenizedRDD.flatMap(parser(_).toSeq)
    +  override def prepareWrite(
    +      sparkSession: SparkSession,
    +      job: Job,
    +      options: Map[String, String],
    +      dataSchema: StructType): OutputWriterFactory = {
    +    val conf = job.getConfiguration
    +    val csvOptions = new CSVOptions(options)
    +    csvOptions.compressionCodec.foreach { codec =>
    +      CompressionCodecs.setCodecConfiguration(conf, codec)
    --- End diff --
    
    Oh, yes it is possible since it uses `TextInputFormat`.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-216094875
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-208810188
  
    **[Test build #55597 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55597/consoleFull)** for PR 12268 at commit [`9015317`](https://github.com/apache/spark/commit/90153175b32ab92c962d782ecbccbc6c5ea02dd7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r61359080
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityGenerator.scala ---
    @@ -0,0 +1,80 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import com.univocity.parsers.csv.{CsvWriter, CsvWriterSettings}
    +
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.types.StructType
    +
    +/**
    + * Converts a sequence of string to CSV string
    + */
    +private[csv] object UnivocityGenerator extends Logging {
    +  /**
    +   * Transforms a single InternalRow to CSV using Univocity
    +   *
    +   * @param rowSchema the schema object used for conversion
    +   * @param writer a CsvWriter object
    +   * @param headers headers to write
    +   * @param writeHeader true if it needs to write header
    +   * @param options CSVOptions object containing options
    +   * @param row The row to convert
    +   */
    +  def apply(
    +      rowSchema: StructType,
    +      writer: CsvWriter,
    +      headers: Array[String],
    +      writeHeader: Boolean,
    +      options: CSVOptions)(row: InternalRow): Unit = {
    +    val tokens = {
    +      row.toSeq(rowSchema).map { field =>
    --- End diff --
    
    Thank you! Could I maybe do this in a separate PR dedicated to that purpose? This was just copied from the original code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-217599175
  
    **[Test build #58046 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58046/consoleFull)** for PR 12268 at commit [`f2234e3`](https://github.com/apache/spark/commit/f2234e3f7bac02c396a8638f69baab740bc83bb1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-221147030
  
    **[Test build #59174 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59174/consoleFull)** for PR 12268 at commit [`d1f616e`](https://github.com/apache/spark/commit/d1f616e2880e1100f9ffe71981a6039720d0eff4).



[GitHub] spark issue #12268: [SPARK-14480][SQL] Simplify CSV parsing process with a b...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12268
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59846/
    Test PASSed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-214609252
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56961/
    Test FAILed.



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12268#discussion_r61260702
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala ---
    @@ -17,152 +17,162 @@
     
     package org.apache.spark.sql.execution.datasources.csv
     
    -import scala.util.control.NonFatal
    -
    -import org.apache.hadoop.fs.Path
    -import org.apache.hadoop.io.{NullWritable, Text}
    -import org.apache.hadoop.mapreduce.RecordWriter
    -import org.apache.hadoop.mapreduce.TaskAttemptContext
    +import java.io.CharArrayWriter
    +import java.nio.charset.{Charset, StandardCharsets}
    +
    +import com.univocity.parsers.csv.CsvWriter
    +import org.apache.hadoop.conf.Configuration
    +import org.apache.hadoop.fs.{FileStatus, Path}
    +import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
    +import org.apache.hadoop.mapred.TextInputFormat
    +import org.apache.hadoop.mapreduce.{Job, RecordWriter, TaskAttemptContext}
     import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
     
     import org.apache.spark.internal.Logging
     import org.apache.spark.rdd.RDD
     import org.apache.spark.sql._
     import org.apache.spark.sql.catalyst.InternalRow
    -import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
    -import org.apache.spark.sql.execution.datasources.{OutputWriter, OutputWriterFactory, PartitionedFile}
    +import org.apache.spark.sql.catalyst.expressions.JoinedRow
    +import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
    +import org.apache.spark.sql.execution.datasources._
    +import org.apache.spark.sql.sources._
     import org.apache.spark.sql.types._
    +import org.apache.spark.util.SerializableConfiguration
     
    -object CSVRelation extends Logging {
    -
    -  def univocityTokenizer(
    -      file: RDD[String],
    -      header: Seq[String],
    -      firstLine: String,
    -      params: CSVOptions): RDD[Array[String]] = {
    -    // If header is set, make sure firstLine is materialized before sending to executors.
    -    file.mapPartitions { iter =>
    -      new BulkCsvReader(
    -        if (params.headerFlag) iter.filterNot(_ == firstLine) else iter,
    -        params,
    -        headers = header)
    -    }
    -  }
    +/**
    + * Provides access to CSV data from pure SQL statements.
    + */
    +class DefaultSource extends FileFormat with DataSourceRegister {
    +
    +  override def shortName(): String = "csv"
    +
    +  override def toString: String = "CSV"
    +
    +  override def hashCode(): Int = getClass.hashCode()
    +
    +  override def equals(other: Any): Boolean = other.isInstanceOf[DefaultSource]
     
    -  def csvParser(
    -      schema: StructType,
    -      requiredColumns: Array[String],
    -      params: CSVOptions): Array[String] => Option[InternalRow] = {
    -    val schemaFields = schema.fields
    -    val requiredFields = StructType(requiredColumns.map(schema(_))).fields
    -    val safeRequiredFields = if (params.dropMalformed) {
    -      // If `dropMalformed` is enabled, then it needs to parse all the values
    -      // so that we can decide which row is malformed.
    -      requiredFields ++ schemaFields.filterNot(requiredFields.contains(_))
    +  override def inferSchema(
    +      sparkSession: SparkSession,
    +      options: Map[String, String],
    +      files: Seq[FileStatus]): Option[StructType] = {
    +    val csvOptions = new CSVOptions(options)
    +
    +    // TODO: Move filtering.
    +    val paths = files.filterNot(_.getPath.getName startsWith "_").map(_.getPath.toString)
    --- End diff --
    
    code style: `_.getPath.getName.startsWith("_")`?
    
    Is it safe to skip every file whose name starts with an underscore?
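
A minimal sketch of the concern raised in this comment, not code from the PR. It applies the same prefix test to plain file names using the dot-notation style suggested above. The object name, the `visible` helper, and the sample file names are all hypothetical, chosen only to show that the filter drops any name starting with `_`, whether it is a marker file or a data file that happens to use that prefix.

```scala
// Illustrative only: file names and helper names are assumptions, not from the PR.
object UnderscoreFilterSketch {
  // Same predicate as in the diff, written with dot notation:
  // keep only names that do not start with an underscore.
  def visible(names: Seq[String]): Seq[String] =
    names.filterNot(_.startsWith("_"))

  def main(args: Array[String]): Unit = {
    val names = Seq(
      "_SUCCESS",          // marker-style file, intended to be skipped
      "_metadata",         // summary-style file, intended to be skipped
      "part-00000.csv",    // ordinary data file, kept
      "_2016_export.csv"   // hypothetical user data file, also skipped
    )
    // Only "part-00000.csv" survives the filter.
    println(visible(names))
  }
}
```

Under these assumptions, a user file such as `_2016_export.csv` is silently excluded along with the marker files, which is the case the question is asking about.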



[GitHub] spark pull request: [SPARK-14480][SQL] Simplify CSV parsing proces...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/12268#issuecomment-210971456
  
    I will try to take a look in the next few days.


