You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2016/03/25 15:50:52 UTC

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/11960

    [SPARK-14158][SQL] implement buildReader for json data source

    ## What changes were proposed in this pull request?
    
    This PR implements buildReader for json data source and enable it in the new data source code path.
    
    
    ## How was this patch tested?
    
    existing tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark json

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11960.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11960
    
----
commit 3efd4a11b6be933cc53c654fe88271441646a82a
Author: Wenchen Fan <we...@databricks.com>
Date:   2016-03-25T14:47:25Z

    implement buildReader for json data source

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11960#issuecomment-202647651
  
    **[Test build #54394 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54394/consoleFull)** for PR 11960 at commit [`e69551b`](https://github.com/apache/spark/commit/e69551b39ca796ecd608b1d85079db818f284b71).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11960#issuecomment-202670681
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11960#discussion_r57599706
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JSONRelation.scala ---
    @@ -120,6 +122,39 @@ class DefaultSource extends FileFormat with DataSourceRegister {
         }
       }
     
    +  override def buildReader(
    +      sqlContext: SQLContext,
    +      partitionSchema: StructType,
    +      dataSchema: StructType,
    +      filters: Seq[Filter],
    +      options: Map[String, String]): PartitionedFile => Iterator[InternalRow] = {
    +    val conf = new Configuration(sqlContext.sparkContext.hadoopConfiguration)
    +    val broadcastedConf =
    +      sqlContext.sparkContext.broadcast(new SerializableConfiguration(conf))
    +
    +    val parsedOptions: JSONOptions = new JSONOptions(options)
    +    val columnNameOfCorruptRecord = parsedOptions.columnNameOfCorruptRecord
    +      .getOrElse(sqlContext.conf.columnNameOfCorruptRecord)
    +
    +    val fullSchema = dataSchema.toAttributes ++ partitionSchema.toAttributes
    +    val joinedRow = new JoinedRow()
    +
    +    file => {
    +      val lines = new HadoopFileLinesReader(file, broadcastedConf.value.value).map(_.toString)
    +
    +      val rows = JacksonParser.parseJson(
    +        lines,
    +        dataSchema,
    --- End diff --
    
    @yhuai @cloud-fan Is it OK if the schema passed to `parseJson` is different but compatible to the underlying input data files? I assume yes? Otherwise we might need the `physicalSchema` argument added in #12002.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11960#issuecomment-201633176
  
    **[Test build #54232 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54232/consoleFull)** for PR 11960 at commit [`dd6cef3`](https://github.com/apache/spark/commit/dd6cef3e2bd879eb96cf48a36923715035471e00).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11960#discussion_r57599911
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/TestJsonData.scala ---
    @@ -32,7 +32,7 @@ private[json] trait TestJsonData {
               "double":1.7976931348623157E308,
               "boolean":true,
               "null":null
    -      }"""  :: Nil)
    +      }"""  :: Nil, 1)
    --- End diff --
    
    Why all the partitioning changes in testing code are necessary?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11960#discussion_r57465176
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -58,7 +58,8 @@ private[sql] object FileSourceStrategy extends Strategy with Logging {
       def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
         case PhysicalOperation(projects, filters, l@LogicalRelation(files: HadoopFsRelation, _, _))
           if (files.fileFormat.toString == "TestFileFormat" ||
    -         files.fileFormat.isInstanceOf[parquet.DefaultSource]) &&
    +         files.fileFormat.isInstanceOf[parquet.DefaultSource] ||
    +         files.fileFormat.isInstanceOf[json.DefaultSource]) &&
              files.sqlContext.conf.parquetFileScan =>
    --- End diff --
    
    oh, I think you want to rename that flag since this code path is not just for parquet anymore.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11960#issuecomment-202504866
  
    **[Test build #54323 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54323/consoleFull)** for PR 11960 at commit [`59ae3b5`](https://github.com/apache/spark/commit/59ae3b5e46cf1b0cacfe71e97a4c6f00402d9f3a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11960#issuecomment-201633729
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11960#issuecomment-202670684
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54394/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11960#issuecomment-201320791
  
    **[Test build #54180 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54180/consoleFull)** for PR 11960 at commit [`3efd4a1`](https://github.com/apache/spark/commit/3efd4a11b6be933cc53c654fe88271441646a82a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11960#discussion_r57462957
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JSONRelation.scala ---
    @@ -120,6 +122,39 @@ class DefaultSource extends FileFormat with DataSourceRegister {
         }
       }
     
    +  override def buildReader(
    +      sqlContext: SQLContext,
    +      partitionSchema: StructType,
    +      dataSchema: StructType,
    +      filters: Seq[Filter],
    +      options: Map[String, String]): PartitionedFile => Iterator[InternalRow] = {
    +    val conf = new Configuration(sqlContext.sparkContext.hadoopConfiguration)
    +    val broadcastedConf =
    +      sqlContext.sparkContext.broadcast(new SerializableConfiguration(conf))
    +
    +    val parsedOptions: JSONOptions = new JSONOptions(options)
    +    val columnNameOfCorruptRecord = parsedOptions.columnNameOfCorruptRecord
    +      .getOrElse(sqlContext.conf.columnNameOfCorruptRecord)
    +
    +    val fullSchema = dataSchema.toAttributes ++ partitionSchema.toAttributes
    +    val joinedRow = new JoinedRow()
    +    val appendPartitionColumns = GenerateUnsafeProjection.generate(fullSchema, fullSchema)
    --- End diff --
    
    This object needs to be created at the executor side.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11960#discussion_r57502420
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala ---
    @@ -0,0 +1,50 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import java.net.URI
    +
    +import org.apache.hadoop.conf.Configuration
    +import org.apache.hadoop.fs.Path
    +import org.apache.hadoop.io.Text
    +import org.apache.hadoop.mapreduce._
    +import org.apache.hadoop.mapreduce.lib.input.{FileSplit, LineRecordReader}
    +import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
    +
    +/**
    + * An adaptor from a [[PartitionedFile]] to an [[Iterator]] of [[Text]], which are all of the lines
    + * in that file.
    + */
    +class HadoopFileLinesReader(file: PartitionedFile, conf: Configuration) extends Iterator[Text] {
    +  private val iterator = {
    +    val fileSplit = new FileSplit(
    +      new Path(new URI(file.filePath)),
    +      file.start,
    +      file.length,
    +      Array.empty)
    --- End diff --
    
    yes, and I just abstracted this logic from the parquet relation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11960#discussion_r57602134
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/TestJsonData.scala ---
    @@ -32,7 +32,7 @@ private[json] trait TestJsonData {
               "double":1.7976931348623157E308,
               "boolean":true,
               "null":null
    -      }"""  :: Nil)
    +      }"""  :: Nil, 1)
    --- End diff --
    
    Even the FileSourceStraetgy can handle empty files, it's still good to not write out empty files, that's the reason I updated these tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/11960#issuecomment-202739924
  
    merging to master!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11960#issuecomment-202669899
  
    **[Test build #54394 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54394/consoleFull)** for PR 11960 at commit [`e69551b`](https://github.com/apache/spark/commit/e69551b39ca796ecd608b1d85079db818f284b71).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11960#discussion_r57600394
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/TestJsonData.scala ---
    @@ -32,7 +32,7 @@ private[json] trait TestJsonData {
               "double":1.7976931348623157E308,
               "boolean":true,
               "null":null
    -      }"""  :: Nil)
    +      }"""  :: Nil, 1)
    --- End diff --
    
    Because our `FileSourceStrategy` can't handle empty files, see https://github.com/apache/spark/pull/11999


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/11960#issuecomment-201320551
  
    cc @marmbrus  @liancheng @yhuai 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11960#issuecomment-201345704
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11960#issuecomment-202505765
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54323/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11960#issuecomment-202459534
  
    **[Test build #54323 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54323/consoleFull)** for PR 11960 at commit [`59ae3b5`](https://github.com/apache/spark/commit/59ae3b5e46cf1b0cacfe71e97a4c6f00402d9f3a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11960#discussion_r57601454
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/TestJsonData.scala ---
    @@ -32,7 +32,7 @@ private[json] trait TestJsonData {
               "double":1.7976931348623157E308,
               "boolean":true,
               "null":null
    -      }"""  :: Nil)
    +      }"""  :: Nil, 1)
    --- End diff --
    
    I guess we should get https://github.com/apache/spark/pull/11999 in first?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11960#issuecomment-201633733
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54232/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11960#discussion_r57602697
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/TestJsonData.scala ---
    @@ -32,7 +32,7 @@ private[json] trait TestJsonData {
               "double":1.7976931348623157E308,
               "boolean":true,
               "null":null
    -      }"""  :: Nil)
    +      }"""  :: Nil, 1)
    --- End diff --
    
    Will we lose test coverage if we make this kind of changes?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11960#issuecomment-202505761
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11960#discussion_r57465015
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala ---
    @@ -0,0 +1,50 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import java.net.URI
    +
    +import org.apache.hadoop.conf.Configuration
    +import org.apache.hadoop.fs.Path
    +import org.apache.hadoop.io.Text
    +import org.apache.hadoop.mapreduce._
    +import org.apache.hadoop.mapreduce.lib.input.{FileSplit, LineRecordReader}
    +import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
    +
    +/**
    + * An adaptor from a [[PartitionedFile]] to an [[Iterator]] of [[Text]], which are all of the lines
    + * in that file.
    + */
    +class HadoopFileLinesReader(file: PartitionedFile, conf: Configuration) extends Iterator[Text] {
    +  private val iterator = {
    +    val fileSplit = new FileSplit(
    +      new Path(new URI(file.filePath)),
    +      file.start,
    +      file.length,
    +      Array.empty)
    --- End diff --
    
    Let's add a comment why we pass in `Array.empty` right now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/11960


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11960#discussion_r57600622
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JSONRelation.scala ---
    @@ -120,6 +122,39 @@ class DefaultSource extends FileFormat with DataSourceRegister {
         }
       }
     
    +  override def buildReader(
    +      sqlContext: SQLContext,
    +      partitionSchema: StructType,
    +      dataSchema: StructType,
    +      filters: Seq[Filter],
    +      options: Map[String, String]): PartitionedFile => Iterator[InternalRow] = {
    +    val conf = new Configuration(sqlContext.sparkContext.hadoopConfiguration)
    +    val broadcastedConf =
    +      sqlContext.sparkContext.broadcast(new SerializableConfiguration(conf))
    +
    +    val parsedOptions: JSONOptions = new JSONOptions(options)
    +    val columnNameOfCorruptRecord = parsedOptions.columnNameOfCorruptRecord
    +      .getOrElse(sqlContext.conf.columnNameOfCorruptRecord)
    +
    +    val fullSchema = dataSchema.toAttributes ++ partitionSchema.toAttributes
    +    val joinedRow = new JoinedRow()
    +
    +    file => {
    +      val lines = new HadoopFileLinesReader(file, broadcastedConf.value.value).map(_.toString)
    +
    +      val rows = JacksonParser.parseJson(
    +        lines,
    +        dataSchema,
    --- End diff --
    
    For JSON, it's ok, the parser doesn't need the whole schema for parsing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11960#issuecomment-201345555
  
    **[Test build #54180 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54180/consoleFull)** for PR 11960 at commit [`3efd4a1`](https://github.com/apache/spark/commit/3efd4a11b6be933cc53c654fe88271441646a82a).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class HadoopFileLinesReader(file: PartitionedFile, conf: Configuration) extends Iterator[Text] `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11960#issuecomment-201345705
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54180/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11960#issuecomment-201587897
  
    **[Test build #54232 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54232/consoleFull)** for PR 11960 at commit [`dd6cef3`](https://github.com/apache/spark/commit/dd6cef3e2bd879eb96cf48a36923715035471e00).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/11960#issuecomment-201373343
  
    Overall looks good. Let's fix the serialization issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11960#discussion_r57657913
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/TestJsonData.scala ---
    @@ -32,7 +32,7 @@ private[json] trait TestJsonData {
               "double":1.7976931348623157E308,
               "boolean":true,
               "null":null
    -      }"""  :: Nil)
    +      }"""  :: Nil, 1)
    --- End diff --
    
    hmmm, it's a good point, let me revert them and just remove that wrong assert.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11960#discussion_r57465871
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala ---
    @@ -0,0 +1,50 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import java.net.URI
    +
    +import org.apache.hadoop.conf.Configuration
    +import org.apache.hadoop.fs.Path
    +import org.apache.hadoop.io.Text
    +import org.apache.hadoop.mapreduce._
    +import org.apache.hadoop.mapreduce.lib.input.{FileSplit, LineRecordReader}
    +import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
    +
    +/**
    + * An adaptor from a [[PartitionedFile]] to an [[Iterator]] of [[Text]], which are all of the lines
    + * in that file.
    + */
    +class HadoopFileLinesReader(file: PartitionedFile, conf: Configuration) extends Iterator[Text] {
    +  private val iterator = {
    +    val fileSplit = new FileSplit(
    +      new Path(new URI(file.filePath)),
    +      file.start,
    +      file.length,
    +      Array.empty)
    --- End diff --
    
    Will it be used by csv and text?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14158][SQL] implement buildReader for j...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11960#discussion_r57464019
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -58,7 +58,8 @@ private[sql] object FileSourceStrategy extends Strategy with Logging {
       def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
         case PhysicalOperation(projects, filters, l@LogicalRelation(files: HadoopFsRelation, _, _))
           if (files.fileFormat.toString == "TestFileFormat" ||
    -         files.fileFormat.isInstanceOf[parquet.DefaultSource]) &&
    +         files.fileFormat.isInstanceOf[parquet.DefaultSource] ||
    +         files.fileFormat.isInstanceOf[json.DefaultSource]) &&
              files.sqlContext.conf.parquetFileScan =>
    --- End diff --
    
    This looks not right. `files.sqlContext.conf.parquetFileScan` should go with `files.fileFormat.isInstanceOf[parquet.DefaultSource]`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org