You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by liancheng <gi...@git.apache.org> on 2016/03/28 18:44:59 UTC

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/12002

    [SPARK-14206][SQL] buildReader() implementation for CSV

    ## What changes were proposed in this pull request?
    
    Major changes:
    
    1. Implement `FileFormat.buildReader()` for the CSV data source.
    1. Add an extra argument to `FileFormat.buildReader()`, `physicalSchema`, which is basically the result of `FileFormat.inferSchema` or user specified schema.
    
       This argument is necessary because the CSV data source needs to know all the columns of the underlying files to read the file.
    
    ## How was this patch tested?
    
    Existing tests should do the work.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark spark-14206-csv-build-reader

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12002.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12002
    
----
commit e559ba58f7a9972f856c192c436e8cf9778e35cb
Author: Cheng Lian <li...@databricks.com>
Date:   2016-03-28T16:35:20Z

    buildReader() implementation for CSV

commit dd2afe6c2e89d4ec68f0d5b00c3bea162947c4ed
Author: Cheng Lian <li...@databricks.com>
Date:   2016-03-28T16:37:49Z

    Fixes import order

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-202483265
  
    cc @marmbrus @yhuai @cloud-fan 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-202525548
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-203521435
  
    **[Test build #54534 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54534/consoleFull)** for PR 12002 at commit [`1c48f7b`](https://github.com/apache/spark/commit/1c48f7b2274039400928aa74e02b0db70380df7d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-202544872
  
    **[Test build #54330 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54330/consoleFull)** for PR 12002 at commit [`20aabfc`](https://github.com/apache/spark/commit/20aabfc77e21c203e7267a723b72a33ed18e2e21).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-202803372
  
    **[Test build #54424 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54424/consoleFull)** for PR 12002 at commit [`292ad6d`](https://github.com/apache/spark/commit/292ad6d0465e6c43a0d34e7cd82036c6b1cd7fcd).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-202942322
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57597324
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala ---
    @@ -291,6 +291,7 @@ private[sql] class DefaultSource
        */
       override def buildReader(
           sqlContext: SQLContext,
    +      physicalSchema: StructType,
           partitionSchema: StructType,
           dataSchema: StructType,
    --- End diff --
    
    Can you change this name too?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-203408476
  
    **[Test build #54518 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54518/consoleFull)** for PR 12002 at commit [`e2c9ce1`](https://github.com/apache/spark/commit/e2c9ce1f99bb1b2ddad48c7c0638f8486042dff7).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57596762
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/DefaultSource.scala ---
    @@ -91,6 +96,70 @@ class DefaultSource extends FileFormat with DataSourceRegister {
         new CSVOutputWriterFactory(csvOptions)
       }
     
    +  override def buildReader(
    +      sqlContext: SQLContext,
    +      physicalSchema: StructType,
    +      partitionSchema: StructType,
    +      dataSchema: StructType,
    +      filters: Seq[Filter],
    +      options: Map[String, String]): (PartitionedFile) => Iterator[InternalRow] = {
    +    val csvOptions = new CSVOptions(options)
    +    val headers = dataSchema.fields.map(_.name)
    +
    +    val conf = new Configuration(sqlContext.sparkContext.hadoopConfiguration)
    +    val broadcastedConf = sqlContext.sparkContext.broadcast(new SerializableConfiguration(conf))
    +
    +    (file: PartitionedFile) => {
    +      val fileSplit = {
    +        val filePath = new Path(new URI(file.filePath))
    +        new FileSplit(filePath, file.start, file.length, Array.empty)
    +      }
    +
    +      val hadoopAttemptContext = {
    +        val conf = broadcastedConf.value.value
    +        val attemptID = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0)
    +        new TaskAttemptContextImpl(conf, attemptID)
    +      }
    +
    +      val reader = new LineRecordReader()
    +      reader.initialize(fileSplit, hadoopAttemptContext)
    +
    +      val lineIterator = new RecordReaderIterator(reader).map { line =>
    +        new String(line.getBytes, 0, line.getLength, csvOptions.charset)
    +      }
    +
    +      // Skips the header line of each file if the `header` option is set to true.
    +      // TODO What if the first partitioned file consists of only comments and empty lines?
    +      if (csvOptions.headerFlag && file.start == 0) {
    +        val nonEmptyLines = if (csvOptions.isCommentSet) {
    +          val commentPrefix = csvOptions.comment.toString
    +          lineIterator.dropWhile { line =>
    +            line.trim.isEmpty || line.trim.startsWith(commentPrefix)
    +          }
    +        } else {
    +          lineIterator.dropWhile(_.trim.isEmpty)
    +        }
    +
    +        if (nonEmptyLines.hasNext) nonEmptyLines.drop(1)
    +      }
    +
    +      val unsafeRowIterator = {
    +        val tokenizedIterator = new BulkCsvReader(lineIterator, csvOptions, headers)
    +        val parser = CSVRelation.csvParser(physicalSchema, dataSchema.fieldNames, csvOptions)
    --- End diff --
    
    This is where we have to use the full physical schema of actual underlying data files. It's why the `physicalSchema` argument is added to `FileFormat.buildReader()`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57755370
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala ---
    @@ -49,14 +49,10 @@ object CSVRelation extends Logging {
         }, true)
       }
     
    -  def parseCsv(
    -      tokenizedRDD: RDD[Array[String]],
    +  def csvParser(
    --- End diff --
    
    So won't this add some overhead per-row? Were going to recompute safeRequiredIndices multiple times.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57747564
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -112,8 +105,9 @@ private[sql] object FileSourceStrategy extends Strategy with Logging {
     
           val readFile = files.fileFormat.buildReader(
             sqlContext = files.sqlContext,
    +        physicalSchema = files.dataSchema,
    --- End diff --
    
    why do we call it `physicalSchema`? It's kind of a reconciled schema that may not equal to the real physical schema.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-203473232
  
    retest this please.
    
    The last build failure was caused by an irrelevant test case.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-202941534
  
    **[Test build #54438 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54438/consoleFull)** for PR 12002 at commit [`2f1ae7f`](https://github.com/apache/spark/commit/2f1ae7f118c8e2fdc39b6a896c6f98e80f882180).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-203457833
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54519/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-203457346
  
    **[Test build #54519 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54519/consoleFull)** for PR 12002 at commit [`1c48f7b`](https://github.com/apache/spark/commit/1c48f7b2274039400928aa74e02b0db70380df7d).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-203345372
  
    **[Test build #54508 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54508/consoleFull)** for PR 12002 at commit [`84eddff`](https://github.com/apache/spark/commit/84eddfff375cf6476e82c4c490faf3416c7b08ab).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57594679
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -137,8 +131,7 @@ private[sql] object FileSourceStrategy extends Strategy with Logging {
               logInfo(s"Planning scan with bin packing, max size: $maxSplitBytes bytes")
     
               val splitFiles = selectedPartitions.flatMap { partition =>
    -            partition.files.flatMap { file =>
    -              assert(file.getLen != 0, file.toString)
    +            partition.files.filter(_.getLen > 0).flatMap { file =>
    --- End diff --
    
    Existing CSV test cases covers empty input file. This change is no longer necessary if #11999 is merged.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-203406791
  
    **[Test build #54518 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54518/consoleFull)** for PR 12002 at commit [`e2c9ce1`](https://github.com/apache/spark/commit/e2c9ce1f99bb1b2ddad48c7c0638f8486042dff7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-203386793
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-202545836
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54330/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57931452
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
    @@ -385,9 +385,9 @@ abstract class OutputWriter {
      *
      * @param location A [[FileCatalog]] that can enumerate the locations of all the files that comprise
      *                 this relation.
    - * @param partitionSchema The schmea of the columns (if any) that are used to partition the relation
    + * @param partitionSchema The schema of the columns (if any) that are used to partition the relation
      * @param dataSchema The schema of any remaining columns.  Note that if any partition columns are
    - *                   present in the actual data files as well, they are removed.
    + *                   present in the actual data files as well, they are preserved.
    --- End diff --
    
    Looks like it is pretty confusing that `dataSchema` can have different semantics.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57595668
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/DefaultSource.scala ---
    @@ -113,8 +182,7 @@ class DefaultSource extends FileFormat with DataSourceRegister {
         val pathsString = csvFiles.map(_.getPath.toUri.toString)
         val header = dataSchema.fields.map(_.name)
         val tokenizedRdd = tokenRdd(sqlContext, csvOptions, header, pathsString)
    -    val rows = CSVRelation.parseCsv(
    -      tokenizedRdd, dataSchema, requiredColumns, csvFiles, sqlContext, csvOptions)
    --- End diff --
    
    We don't actually use `sqlContext` or `csvFiles` (a list of `FileStatus`) within `parseCsv`, so removed them.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-203408489
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54518/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-202800182
  
    **[Test build #54424 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54424/consoleFull)** for PR 12002 at commit [`292ad6d`](https://github.com/apache/spark/commit/292ad6d0465e6c43a0d34e7cd82036c6b1cd7fcd).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-203386795
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54508/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-202545832
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-203520995
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-202480562
  
    **[Test build #54326 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54326/consoleFull)** for PR 12002 at commit [`dd2afe6`](https://github.com/apache/spark/commit/dd2afe6c2e89d4ec68f0d5b00c3bea162947c4ed).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-203386557
  
    **[Test build #54508 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54508/consoleFull)** for PR 12002 at commit [`84eddff`](https://github.com/apache/spark/commit/84eddfff375cf6476e82c4c490faf3416c7b08ab).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-203568930
  
    **[Test build #54534 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54534/consoleFull)** for PR 12002 at commit [`1c48f7b`](https://github.com/apache/spark/commit/1c48f7b2274039400928aa74e02b0db70380df7d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57595515
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/DefaultSource.scala ---
    @@ -91,6 +96,70 @@ class DefaultSource extends FileFormat with DataSourceRegister {
         new CSVOutputWriterFactory(csvOptions)
       }
     
    +  override def buildReader(
    +      sqlContext: SQLContext,
    +      physicalSchema: StructType,
    +      partitionSchema: StructType,
    +      dataSchema: StructType,
    +      filters: Seq[Filter],
    +      options: Map[String, String]): (PartitionedFile) => Iterator[InternalRow] = {
    +    val csvOptions = new CSVOptions(options)
    +    val headers = dataSchema.fields.map(_.name)
    +
    +    val conf = new Configuration(sqlContext.sparkContext.hadoopConfiguration)
    +    val broadcastedConf = sqlContext.sparkContext.broadcast(new SerializableConfiguration(conf))
    +
    +    (file: PartitionedFile) => {
    +      val fileSplit = {
    +        val filePath = new Path(new URI(file.filePath))
    +        new FileSplit(filePath, file.start, file.length, Array.empty)
    +      }
    +
    +      val hadoopAttemptContext = {
    +        val conf = broadcastedConf.value.value
    +        val attemptID = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0)
    +        new TaskAttemptContextImpl(conf, attemptID)
    +      }
    +
    +      val reader = new LineRecordReader()
    +      reader.initialize(fileSplit, hadoopAttemptContext)
    +
    +      val lineIterator = new RecordReaderIterator(reader).map { line =>
    +        new String(line.getBytes, 0, line.getLength, csvOptions.charset)
    +      }
    +
    +      // Skips the header line of each file if the `header` option is set to true.
    +      // TODO What if the first partitioned file consists of only comments and empty lines?
    +      if (csvOptions.headerFlag && file.start == 0) {
    +        val nonEmptyLines = if (csvOptions.isCommentSet) {
    +          val commentPrefix = csvOptions.comment.toString
    +          lineIterator.dropWhile { line =>
    +            line.trim.isEmpty || line.trim.startsWith(commentPrefix)
    +          }
    +        } else {
    +          lineIterator.dropWhile(_.trim.isEmpty)
    +        }
    +
    +        if (nonEmptyLines.hasNext) nonEmptyLines.drop(1)
    +      }
    +
    +      val unsafeRowIterator = {
    +        val tokenizedIterator = new BulkCsvReader(lineIterator, csvOptions, headers)
    +        val parser = CSVRelation.csvParser(physicalSchema, dataSchema.fieldNames, csvOptions)
    +        tokenizedIterator.flatMap(parser(_).toSeq)
    +      }
    +
    +      // Appends partition values
    +      val fullOutput = dataSchema.toAttributes ++ partitionSchema.toAttributes
    +      val joinedRow = new JoinedRow()
    +      val appendPartitionColumns = GenerateUnsafeProjection.generate(fullOutput, fullOutput)
    +
    +      unsafeRowIterator.map { dataRow =>
    +        appendPartitionColumns(joinedRow(dataRow, file.partitionValues))
    +      }
    --- End diff --
    
    Partition value appending is going to be extracted into a separate method and reused in all data sources in follow-up PRs.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57827042
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -112,8 +105,9 @@ private[sql] object FileSourceStrategy extends Strategy with Logging {
     
           val readFile = files.fileFormat.buildReader(
             sqlContext = files.sqlContext,
    +        physicalSchema = files.dataSchema,
    --- End diff --
    
    how about just `dataSchema`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-202525551
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54326/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-203701990
  
    LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57595023
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/DefaultSource.scala ---
    @@ -91,6 +96,70 @@ class DefaultSource extends FileFormat with DataSourceRegister {
         new CSVOutputWriterFactory(csvOptions)
       }
     
    +  override def buildReader(
    +      sqlContext: SQLContext,
    +      physicalSchema: StructType,
    +      partitionSchema: StructType,
    +      dataSchema: StructType,
    +      filters: Seq[Filter],
    +      options: Map[String, String]): (PartitionedFile) => Iterator[InternalRow] = {
    +    val csvOptions = new CSVOptions(options)
    +    val headers = dataSchema.fields.map(_.name)
    +
    +    val conf = new Configuration(sqlContext.sparkContext.hadoopConfiguration)
    +    val broadcastedConf = sqlContext.sparkContext.broadcast(new SerializableConfiguration(conf))
    +
    +    (file: PartitionedFile) => {
    +      val fileSplit = {
    +        val filePath = new Path(new URI(file.filePath))
    +        new FileSplit(filePath, file.start, file.length, Array.empty)
    +      }
    +
    +      val hadoopAttemptContext = {
    +        val conf = broadcastedConf.value.value
    +        val attemptID = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0)
    +        new TaskAttemptContextImpl(conf, attemptID)
    +      }
    +
    +      val reader = new LineRecordReader()
    +      reader.initialize(fileSplit, hadoopAttemptContext)
    +
    +      val lineIterator = new RecordReaderIterator(reader).map { line =>
    +        new String(line.getBytes, 0, line.getLength, csvOptions.charset)
    +      }
    --- End diff --
    
    `RecordReaderIterator` instantiation can be simplified using [`HadoopFileLinesReader`][1] introduced in PR #11960.
    
    [1]: https://github.com/apache/spark/pull/11960/files#diff-c0332718b0741d1b84c7157120b87888R33


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-203569282
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-203695683
  
    LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/12002#discussion_r57986182

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
@@ -385,9 +385,9 @@ abstract class OutputWriter {
*
* @param location A [[FileCatalog]] that can enumerate the locations of all the files that comprise
* this relation.
- * @param partitionSchema The schmea of the columns (if any) that are used to partition the relation
+ * @param partitionSchema The schema of the columns (if any) that are used to partition the relation
* @param dataSchema The schema of any remaining columns. Note that if any partition columns are
- * present in the actual data files as well, they are removed.
+ * present in the actual data files as well, they are preserved.
--- End diff --

OK. dataSchema used by HadoopFsRelation and FileFormat's buildReader is the same thing.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-203419884
  
    **[Test build #54519 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54519/consoleFull)** for PR 12002 at commit [`1c48f7b`](https://github.com/apache/spark/commit/1c48f7b2274039400928aa74e02b0db70380df7d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-202942326
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54438/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-202525074
  
    **[Test build #54326 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54326/consoleFull)** for PR 12002 at commit [`dd2afe6`](https://github.com/apache/spark/commit/dd2afe6c2e89d4ec68f0d5b00c3bea162947c4ed).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57759347
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/DefaultSource.scala ---
    @@ -91,6 +93,46 @@ class DefaultSource extends FileFormat with DataSourceRegister {
         new CSVOutputWriterFactory(csvOptions)
       }
     
    +  override def buildReader(
    +      sqlContext: SQLContext,
    +      physicalSchema: StructType,
    +      partitionSchema: StructType,
    +      dataSchema: StructType,
    +      filters: Seq[Filter],
    +      options: Map[String, String]): (PartitionedFile) => Iterator[InternalRow] = {
    +    val csvOptions = new CSVOptions(options)
    +    val headers = dataSchema.fields.map(_.name)
    +
    +    val conf = new Configuration(sqlContext.sparkContext.hadoopConfiguration)
    +    val broadcastedConf = sqlContext.sparkContext.broadcast(new SerializableConfiguration(conf))
    +
    +    (file: PartitionedFile) => {
    +      val lineIterator = {
    +        val conf = broadcastedConf.value.value
    +        new HadoopFileLinesReader(file, conf).map { line =>
    +          new String(line.getBytes, 0, line.getLength, csvOptions.charset)
    --- End diff --
    
    No, we need to create the string according to given character set. (And just in case you missed, that `line` is a `Text` rather than a `String`.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57594757
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala ---
    @@ -49,14 +49,10 @@ object CSVRelation extends Logging {
         }, true)
       }
     
    -  def parseCsv(
    -      tokenizedRDD: RDD[Array[String]],
    +  def csvParser(
    --- End diff --
    
    Refactored from the original `parseCsv` method so that the main logic doesn't depend on RDD.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57932556
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
    @@ -385,9 +385,9 @@ abstract class OutputWriter {
      *
      * @param location A [[FileCatalog]] that can enumerate the locations of all the files that comprise
      *                 this relation.
    - * @param partitionSchema The schmea of the columns (if any) that are used to partition the relation
    + * @param partitionSchema The schema of the columns (if any) that are used to partition the relation
      * @param dataSchema The schema of any remaining columns.  Note that if any partition columns are
    - *                   present in the actual data files as well, they are removed.
    + *                   present in the actual data files as well, they are preserved.
    --- End diff --
    
    Can you explain it more? Where do we preserve the partition columns in this struct?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57760192
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala ---
    @@ -49,14 +49,10 @@ object CSVRelation extends Logging {
         }, true)
       }
     
    -  def parseCsv(
    -      tokenizedRDD: RDD[Array[String]],
    +  def csvParser(
    --- End diff --
    
    This method returns a closure, while `safeRequiredIndices` is computed only once out of the scope of the closure.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57760314
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -112,8 +105,9 @@ private[sql] object FileSourceStrategy extends Strategy with Logging {
     
           val readFile = files.fileFormat.buildReader(
             sqlContext = files.sqlContext,
    +        physicalSchema = files.dataSchema,
    --- End diff --
    
    Yea... Maybe `globalSchema` or `mergedSchema`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-202803379
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-202831915
  
    **[Test build #54426 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54426/consoleFull)** for PR 12002 at commit [`e6ed363`](https://github.com/apache/spark/commit/e6ed3637c31bb0f6b139004a99066b2d3101d277).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57595420
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/DefaultSource.scala ---
    @@ -91,6 +96,70 @@ class DefaultSource extends FileFormat with DataSourceRegister {
         new CSVOutputWriterFactory(csvOptions)
       }
     
    +  override def buildReader(
    +      sqlContext: SQLContext,
    +      physicalSchema: StructType,
    +      partitionSchema: StructType,
    +      dataSchema: StructType,
    +      filters: Seq[Filter],
    +      options: Map[String, String]): (PartitionedFile) => Iterator[InternalRow] = {
    +    val csvOptions = new CSVOptions(options)
    +    val headers = dataSchema.fields.map(_.name)
    +
    +    val conf = new Configuration(sqlContext.sparkContext.hadoopConfiguration)
    +    val broadcastedConf = sqlContext.sparkContext.broadcast(new SerializableConfiguration(conf))
    +
    +    (file: PartitionedFile) => {
    +      val fileSplit = {
    +        val filePath = new Path(new URI(file.filePath))
    +        new FileSplit(filePath, file.start, file.length, Array.empty)
    +      }
    +
    +      val hadoopAttemptContext = {
    +        val conf = broadcastedConf.value.value
    +        val attemptID = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0)
    +        new TaskAttemptContextImpl(conf, attemptID)
    +      }
    +
    +      val reader = new LineRecordReader()
    +      reader.initialize(fileSplit, hadoopAttemptContext)
    +
    +      val lineIterator = new RecordReaderIterator(reader).map { line =>
    +        new String(line.getBytes, 0, line.getLength, csvOptions.charset)
    +      }
    +
    +      // Skips the header line of each file if the `header` option is set to true.
    +      // TODO What if the first partitioned file consists of only comments and empty lines?
    +      if (csvOptions.headerFlag && file.start == 0) {
    +        val nonEmptyLines = if (csvOptions.isCommentSet) {
    +          val commentPrefix = csvOptions.comment.toString
    +          lineIterator.dropWhile { line =>
    +            line.trim.isEmpty || line.trim.startsWith(commentPrefix)
    +          }
    +        } else {
    +          lineIterator.dropWhile(_.trim.isEmpty)
    +        }
    +
    +        if (nonEmptyLines.hasNext) nonEmptyLines.drop(1)
    +      }
    --- End diff --
    
    This block should probably be refactored into a separate method.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57747923
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/DefaultSource.scala ---
    @@ -91,6 +93,46 @@ class DefaultSource extends FileFormat with DataSourceRegister {
         new CSVOutputWriterFactory(csvOptions)
       }
     
    +  override def buildReader(
    +      sqlContext: SQLContext,
    +      physicalSchema: StructType,
    +      partitionSchema: StructType,
    +      dataSchema: StructType,
    +      filters: Seq[Filter],
    +      options: Map[String, String]): (PartitionedFile) => Iterator[InternalRow] = {
    +    val csvOptions = new CSVOptions(options)
    +    val headers = dataSchema.fields.map(_.name)
    +
    +    val conf = new Configuration(sqlContext.sparkContext.hadoopConfiguration)
    +    val broadcastedConf = sqlContext.sparkContext.broadcast(new SerializableConfiguration(conf))
    +
    +    (file: PartitionedFile) => {
    +      val lineIterator = {
    +        val conf = broadcastedConf.value.value
    +        new HadoopFileLinesReader(file, conf).map { line =>
    +          new String(line.getBytes, 0, line.getLength, csvOptions.charset)
    --- End diff --
    
    is it equal to `line.toString`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-202500628
  
    **[Test build #54330 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54330/consoleFull)** for PR 12002 at commit [`20aabfc`](https://github.com/apache/spark/commit/20aabfc77e21c203e7267a723b72a33ed18e2e21).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57905239
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
    @@ -385,9 +385,9 @@ abstract class OutputWriter {
      *
      * @param location A [[FileCatalog]] that can enumerate the locations of all the files that comprise
      *                 this relation.
    - * @param partitionSchema The schmea of the columns (if any) that are used to partition the relation
    + * @param partitionSchema The schema of the columns (if any) that are used to partition the relation
      * @param dataSchema The schema of any remaining columns.  Note that if any partition columns are
    - *                   present in the actual data files as well, they are removed.
    + *                   present in the actual data files as well, they are preserved.
    --- End diff --
    
    So we don't need to append `partitionSchema` to `dataSchema` in `HadoopFsRelation.schema`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/12002#discussion_r57860269

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
@@ -462,20 +462,24 @@ trait FileFormat {
/**
* Returns a function that can be used to read a single file in as an Iterator of InternalRow.
*
+ * @param dataSchema The global data schema. It can be either specified by the user, or
+ * reconciled/merged from all underlying data files. If any partition columns
+ * are contained in the files, they are preserved in this schema.
* @param partitionSchema The schema of the partition column row that will be present in each
- * PartitionedFile. These columns should be prepended to the rows that
+ * PartitionedFile. These columns should be appended to the rows that
--- End diff --

Typo fixed here. Partition values are appended rather than prepended to produced rows.

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57748886
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -112,8 +105,9 @@ private[sql] object FileSourceStrategy extends Strategy with Logging {
     
           val readFile = files.fileFormat.buildReader(
             sqlContext = files.sqlContext,
    +        physicalSchema = files.dataSchema,
    --- End diff --
    
    For example, it's not the real physical schema in ORC relation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-203408488
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57651085
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala ---
    @@ -145,15 +146,15 @@ private[sql] class DefaultSource
         (file: PartitionedFile) => {
           val conf = broadcastedConf.value.value
     
    -      // SPARK-8501: Empty ORC files always have an empty schema stored in their footer.  In this
    -      // case, `OrcFileOperator.readSchema` returns `None`, and we can simply return an empty
    -      // iterator.
    +      // SPARK-8501: Empty ORC files always have an empty schema stored in their footer. In this
    +      // case, `OrcFileOperator.readSchema` returns `None`, and we can't read the underlying file
    +      // using the given physical schema. Instead, we simply return an empty iterator.
           val maybePhysicalSchema = OrcFileOperator.readSchema(Seq(file.filePath), Some(conf))
           if (maybePhysicalSchema.isEmpty) {
             Iterator.empty
           } else {
             val physicalSchema = maybePhysicalSchema.get
    -        OrcRelation.setRequiredColumns(conf, physicalSchema, dataSchema)
    +        OrcRelation.setRequiredColumns(conf, physicalSchema, requiredSchema)
    --- End diff --
    
    Note that this `physicalSchema` is NOT the one passed to `buildReader`, but the one just read from physical data file to be scanned. Otherwise it breaks existing test cases related to case sensitivity.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-202889081
  
    **[Test build #54438 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54438/consoleFull)** for PR 12002 at commit [`2f1ae7f`](https://github.com/apache/spark/commit/2f1ae7f118c8e2fdc39b6a896c6f98e80f882180).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-202832136
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57759763
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
    @@ -385,9 +385,9 @@ abstract class OutputWriter {
      *
      * @param location A [[FileCatalog]] that can enumerate the locations of all the files that comprise
      *                 this relation.
    - * @param partitionSchema The schmea of the columns (if any) that are used to partition the relation
    + * @param partitionSchema The schema of the columns (if any) that are used to partition the relation
      * @param dataSchema The schema of any remaining columns.  Note that if any partition columns are
    - *                   present in the actual data files as well, they are removed.
    + *                   present in the actual data files as well, they are preserved.
    --- End diff --
    
    The `dataSchema` you mentioned above is argument of `buildReader()`, which does get partitioned columns stripped. But it's different from `FileFormat.dataSchema`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-202803380
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54424/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57595359
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/DefaultSource.scala ---
    @@ -91,6 +96,70 @@ class DefaultSource extends FileFormat with DataSourceRegister {
         new CSVOutputWriterFactory(csvOptions)
       }
     
    +  override def buildReader(
    +      sqlContext: SQLContext,
    +      physicalSchema: StructType,
    +      partitionSchema: StructType,
    +      dataSchema: StructType,
    +      filters: Seq[Filter],
    +      options: Map[String, String]): (PartitionedFile) => Iterator[InternalRow] = {
    +    val csvOptions = new CSVOptions(options)
    +    val headers = dataSchema.fields.map(_.name)
    +
    +    val conf = new Configuration(sqlContext.sparkContext.hadoopConfiguration)
    +    val broadcastedConf = sqlContext.sparkContext.broadcast(new SerializableConfiguration(conf))
    +
    +    (file: PartitionedFile) => {
    +      val fileSplit = {
    +        val filePath = new Path(new URI(file.filePath))
    +        new FileSplit(filePath, file.start, file.length, Array.empty)
    +      }
    +
    +      val hadoopAttemptContext = {
    +        val conf = broadcastedConf.value.value
    +        val attemptID = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0)
    +        new TaskAttemptContextImpl(conf, attemptID)
    +      }
    +
    +      val reader = new LineRecordReader()
    +      reader.initialize(fileSplit, hadoopAttemptContext)
    +
    +      val lineIterator = new RecordReaderIterator(reader).map { line =>
    +        new String(line.getBytes, 0, line.getLength, csvOptions.charset)
    +      }
    +
    +      // Skips the header line of each file if the `header` option is set to true.
    +      // TODO What if the first partitioned file consists of only comments and empty lines?
    --- End diff --
    
    Unlike RDD, here we no longer have a global view of all the input data. Thus filtering the first header line becomes difficult. Here we only handle the simplest and most common use case, namely we can find a valid header in the first block of any given input file. If the first block contains only comments and empty lines, this branch fails. Haven't got a good solution for this yet.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-202805523
  
    **[Test build #54426 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54426/consoleFull)** for PR 12002 at commit [`e6ed363`](https://github.com/apache/spark/commit/e6ed3637c31bb0f6b139004a99066b2d3101d277).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57860229
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala ---
    @@ -276,38 +276,26 @@ private[sql] class DefaultSource
             file.getName == ParquetFileWriter.PARQUET_METADATA_FILE
       }
     
    -  /**
    -   * Returns a function that can be used to read a single file in as an Iterator of InternalRow.
    -   *
    -   * @param partitionSchema The schema of the partition column row that will be present in each
    -   *                        PartitionedFile.  These columns should be prepended to the rows that
    -   *                        are produced by the iterator.
    -   * @param dataSchema The schema of the data that should be output for each row.  This may be a
    -   *                   subset of the columns that are present in the file if  column pruning has
    -   *                   occurred.
    -   * @param filters A set of filters than can optionally be used to reduce the number of rows output
    -   * @param options A set of string -> string configuration options.
    -   * @return
    -   */
    --- End diff --
    
    This ScalaDoc duplicates the one in `FileFormat.buildReader()`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-202832138
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54426/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-203457828
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57748633
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
    @@ -385,9 +385,9 @@ abstract class OutputWriter {
      *
      * @param location A [[FileCatalog]] that can enumerate the locations of all the files that comprise
      *                 this relation.
    - * @param partitionSchema The schmea of the columns (if any) that are used to partition the relation
    + * @param partitionSchema The schema of the columns (if any) that are used to partition the relation
      * @param dataSchema The schema of any remaining columns.  Note that if any partition columns are
    - *                   present in the actual data files as well, they are removed.
    + *                   present in the actual data files as well, they are preserved.
    --- End diff --
    
    But when we append partitioned values, we do `val fullOutput = dataSchema.toAttributes ++ partitionSchema.toAttributes`. It looks like the partitioned columns are excluded in `dataSchema`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57601851
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala ---
    @@ -291,6 +291,7 @@ private[sql] class DefaultSource
        */
       override def buildReader(
           sqlContext: SQLContext,
    +      physicalSchema: StructType,
           partitionSchema: StructType,
           dataSchema: StructType,
    --- End diff --
    
    Sure, done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57594467
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -78,14 +79,6 @@ private[sql] object FileSourceStrategy extends Strategy with Logging {
           val dataColumns =
             l.resolve(files.dataSchema, files.sqlContext.sessionState.analyzer.resolver)
     
    -      val bucketColumns =
    -        AttributeSet(
    -          files.bucketSpec
    -            .map(_.bucketColumnNames)
    -            .getOrElse(Nil)
    -            .map(l.resolveQuoted(_, files.sqlContext.conf.resolver)
    -              .getOrElse(sys.error(""))))
    -
    --- End diff --
    
    This variable isn't used anywhere.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/12002#discussion_r57650888

I found that partition columns are NOT removed from `dataSchema`. This ScalaDoc should probably be a typo.

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/12002


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57845556
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -112,8 +105,9 @@ private[sql] object FileSourceStrategy extends Strategy with Logging {
     
           val readFile = files.fileFormat.buildReader(
             sqlContext = files.sqlContext,
    +        physicalSchema = files.dataSchema,
    --- End diff --
    
    To keep consistent with the previous `buildIntenralScan`, I prefer to call them `dataSchema` and `requiredSchema`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12002#issuecomment-203569287
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54534/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14206][SQL] buildReader() implementatio...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12002#discussion_r57760761
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala ---
    @@ -49,14 +49,10 @@ object CSVRelation extends Logging {
         }, true)
       }
     
    -  def parseCsv(
    -      tokenizedRDD: RDD[Array[String]],
    +  def csvParser(
    --- End diff --
    
    Oh oops, misread the return type. yah makes sense.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org