Posted to reviews@spark.apache.org by falaki <gi...@git.apache.org> on 2016/01/06 08:05:17 UTC

[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

GitHub user falaki opened a pull request:

    https://github.com/apache/spark/pull/10615

    [SPARK-12420][SQL] Have a built-in CSV data source implementation

    CSV is the most common data format in the "small data" world. It is often the first format people want to try when they use Spark on a single node. Having to rely on a third-party component for this leads to a poor experience for new users. This PR merges the popular spark-csv data source package (https://github.com/databricks/spark-csv) into SparkSQL.
    
    This is a first PR to bring the functionality to the Spark 2.0 master branch. We will complete the items outlined in the design document (see the JIRA attachment) in follow-up pull requests.
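
    As a quick illustration, here is a minimal usage sketch based on the API exercised by the
    tests in this PR (the file paths are hypothetical):

        // Read a CSV file that has a header row, inferring column types.
        val df = sqlContext.read
          .format("csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("/tmp/cars.csv")  // hypothetical path

        // Write the result back out as CSV, keeping the header.
        df.write
          .format("csv")
          .option("header", "true")
          .save("/tmp/cars-out")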
    
    Spark-csv was developed and maintained by several members of the open source community:
    @dtpeacock: Type inference
    @mohitjaggi: Integration with uniVocity-parsers 
    @JoshRosen: Build and style checking 
    @aley: Support for comments
    @pashields: Support for compression codecs
    @HyukjinKwon: Several bug fixes
    @rbolkey: Tests and improvements
    @huangjs: Updating API
    @vlyubin: Support for insert
    @brkyvz: Test refactoring
    @rxin: Documentation
    @andy327: Null values
    @yhuai: Documentation
    @akirakw: Documentation
    @dennishuo: Documentation
    @petro-rudenko: Increasing max characters per-column
    @saurfang: Documentation
    @kmader: Tests
    @cvengros: Documentation
    @MarkRijckenberg: Documentation
    @msperlich: Improving compression codec handling
    @thoralf-gutierrez: Documentation
    @lebigot: Documentation
    @sryza: Python documentation
    @xguo27: Documentation
    @darabos: License text in build file
    @jamesblau: Nullable quote character
    @karma243: Java documentation
    @gasparms: Improving double and float type cast
    @MarcinKosinski: R documentation

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/falaki/spark SPARK-12420

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10615.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10615
    
----
commit f3e99bde657ece010929e04f622ccdf75588af0d
Author: Hossein <ho...@databricks.com>
Date:   2016-01-06T00:40:13Z

    Added univocity-parsers as a dependency

commit 29c15c84b511363f47dca327f791f2c5e28ffcea
Author: Hossein <ho...@databricks.com>
Date:   2016-01-06T00:42:20Z

    Added inline implementation of spark-csv in SparkSQL

commit c9900d800fddb69a74f54dcd3b1dfc0afea8e8ee
Author: Hossein <ho...@databricks.com>
Date:   2016-01-06T00:48:07Z

    Minor style and comments with some TODOs

commit da314cb9cb323b5800175e15a49fe48f5c5c5e75
Author: Hossein <ho...@databricks.com>
Date:   2016-01-06T06:37:30Z

    Ported tests from spark-csv

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169285920
  
    Merged build finished. Test FAILed.


[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by falaki <gi...@git.apache.org>.
Github user falaki commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169490033
  
    @steveloughran this adds uniVocity-parsers as a dependency of SparkSQL. With the assembly jar distribution, that jar will be included. If/when Spark moves away from the assembly jar distribution, the uniVocity-parsers jar will need to be on the classpath in the same way as other dependencies.
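
    For example, a build that consumes Spark without the assembly jar could declare the parser
    directly (a hedged sketch; the Maven coordinates are real, but the version below is
    illustrative and not taken from this PR):

        // build.sbt
        libraryDependencies += "com.univocity" % "univocity-parsers" % "1.5.6"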


[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10615#discussion_r49826113
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVParser.scala ---
    @@ -0,0 +1,243 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import java.io.{OutputStreamWriter, ByteArrayOutputStream, StringReader}
    +
    +import com.univocity.parsers.csv.{CsvParserSettings, CsvWriterSettings, CsvParser, CsvWriter}
    +
    +import org.apache.spark.Logging
    +
    +/**
    +  * Read and parse CSV-like input
    +  *
    +  * @param params Parameters object
    +  * @param headers headers for the columns
    +  */
    +private[sql] abstract class CsvReader(params: CSVParameters, headers: Seq[String]) {
    +
    +  protected lazy val parser: CsvParser = {
    +    val settings = new CsvParserSettings()
    +    val format = settings.getFormat
    +    format.setDelimiter(params.delimiter)
    +    format.setLineSeparator(params.rowSeparator)
    +    format.setQuote(params.quote)
    +    format.setQuoteEscape(params.escape)
    +    format.setComment(params.comment)
    +    settings.setIgnoreLeadingWhitespaces(params.ignoreLeadingWhiteSpaceFlag)
    +    settings.setIgnoreTrailingWhitespaces(params.ignoreTrailingWhiteSpaceFlag)
    +    settings.setReadInputOnSeparateThread(false)
    +    settings.setInputBufferSize(params.inputBufferSize)
    +    settings.setMaxColumns(params.maxColumns)
    +    settings.setNullValue(params.nullValue)
    +    settings.setMaxCharsPerColumn(params.maxCharsPerColumn)
    +    if (headers != null) settings.setHeaders(headers: _*)
    +
    +    new CsvParser(settings)
    +  }
    +}
    +
    +/**
    +  * Converts a sequence of strings to a CSV string
    +  *
    +  * @param params Parameters object for configuration
    +  * @param headers headers for columns
    +  */
    +private[sql] class LineCsvWriter(params: CSVParameters, headers: Seq[String]) extends Logging {
    +  private val writerSettings = new CsvWriterSettings
    +  private val format = writerSettings.getFormat
    +
    +  format.setDelimiter(params.delimiter)
    +  format.setLineSeparator(params.rowSeparator)
    +  format.setQuote(params.quote)
    +  format.setQuoteEscape(params.escape)
    +  format.setComment(params.comment)
    +
    +  writerSettings.setNullValue(params.nullValue)
    +  writerSettings.setEmptyValue(params.nullValue)
    +  writerSettings.setSkipEmptyLines(true)
    +  writerSettings.setQuoteAllFields(false)
    +  writerSettings.setHeaders(headers: _*)
    +
    +  def writeRow(row: Seq[String], includeHeader: Boolean): String = {
    +    val buffer = new ByteArrayOutputStream()
    +    val outputWriter = new OutputStreamWriter(buffer)
    +    val writer = new CsvWriter(outputWriter, writerSettings)
    +
    +    if (includeHeader) {
    +      writer.writeHeaders()
    +    }
    +    writer.writeRow(row.toArray: _*)
    +    writer.close()
    +    buffer.toString.stripLineEnd
    +  }
    +}
    +
    +/**
    +  * Parser for parsing a line at a time. Not efficient for bulk data.
    +  *
    +  * @param params Parameters object
    +  */
    +private[sql] class LineCsvReader(params: CSVParameters)
    +  extends CsvReader(params, null) {
    +  /**
    +    * parse a line
    +    *
    +    * @param line a String with no newline at the end
    +    * @return array of strings where each string is a field in the CSV record
    +    */
    +  def parseLine(line: String): Array[String] = {
    +    parser.beginParsing(new StringReader(line))
    +    val parsed = parser.parseNext()
    +    parser.stopParsing()
    +    parsed
    +  }
    +}
    +
    +/**
    +  * Parser for parsing lines in bulk. Use this when efficiency is desired.
    +  *
    +  * @param iter iterator over lines in the file
    +  * @param params Parameters object
    +  * @param headers headers for the columns
    +  */
    +private[sql] class BulkCsvReader(
    +    iter: Iterator[String],
    +    params: CSVParameters,
    +    headers: Seq[String])
    +  extends CsvReader(params, headers) with Iterator[Array[String]] {
    +
    +  private val reader = new StringIteratorReader(iter)
    +  parser.beginParsing(reader)
    +  private var nextRecord = parser.parseNext()
    +
    +  /**
    +    * get the next parsed line.
    +    * @return array of strings where each string is a field in the CSV record
    +    */
    +  override def next(): Array[String] = {
    +    val curRecord = nextRecord
    +    if (curRecord != null) {
    +      nextRecord = parser.parseNext()
    +    } else {
    +      throw new NoSuchElementException("next record is null")
    +    }
    +    curRecord
    +  }
    +
    +  override def hasNext: Boolean = nextRecord != null
    +
    +}
    +
    +/**
    +  * A Reader that "reads" from a sequence of lines. Spark's textFile method removes the
    +  * newline at the end of each line, but the Univocity parser requires a Reader that
    +  * provides access to the data to be parsed with the newlines present.
    +  * @param iter iterator over RDD[String]
    +  */
    +private class StringIteratorReader(val iter: Iterator[String]) extends java.io.Reader {
    +
    +  private var next: Long = 0
    +  private var length: Long = 0  // length of input so far
    +  private var start: Long = 0
    +  private var str: String = null   // current string from iter
    +
    +  /**
    +    * Fetch the next string from iter if we are done with the current one, and
    +    * pretend there is a newline at the end of every string we get from iter
    +    */
    +  private def refill(): Unit = {
    +    if (length == next) {
    +      if (iter.hasNext) {
    +        str = iter.next()
    +        start = length
    +        length += (str.length + 1) // allowance for newline removed by SparkContext.textFile()
    +      } else {
    +        str = null
    +      }
    +    }
    +  }
    +
    +  /**
    +    * Read the next character; if at the end of the string, pretend there is a newline
    +    */
    +  override def read(): Int = {
    +    refill()
    +    if (next >= length) {
    +      -1
    +    } else {
    +      val cur = next - start
    +      next += 1
    +      if (cur == str.length) '\n' else str.charAt(cur.toInt)
    +    }
    +  }
    +
    +  /**
    +    * read from str into cbuf
    +    */
    +  override def read(cbuf: Array[Char], off: Int, len: Int): Int = {
    +    refill()
    +    var n = 0
    +    if ((off < 0) || (off > cbuf.length) || (len < 0) ||
    +      ((off + len) > cbuf.length) || ((off + len) < 0)) {
    +      throw new IndexOutOfBoundsException()
    +    } else if (len == 0) {
    +      n = 0
    +    } else {
    +      if (next >= length) {   // end of input
    +        n = -1
    +      } else {
    +        n = Math.min(length - next, len).toInt // lesser of amount of input available or buf size
    +        if (n == length - next) {
    +          str.getChars((next - start).toInt, (next - start + n - 1).toInt, cbuf, off)
    --- End diff --
    
    does this actually do anything?
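
    For context, `String.getChars(srcBegin, srcEnd, dst, dstBegin)` copies the half-open range
    [srcBegin, srcEnd). A standalone sketch (hypothetical values, not code from the PR) of what
    this branch appears intended to do: copy the tail of the current line and re-insert the
    newline that SparkContext.textFile() stripped.

        val str = "a,b"                  // current line, newline already removed
        val n = str.length + 1           // remaining chars, counting the synthetic '\n'
        val cbuf = new Array[Char](n)
        str.getChars(0, n - 1, cbuf, 0)  // copies "a,b"; srcEnd is exclusive
        cbuf(n - 1) = '\n'               // re-insert the stripped line terminator
        assert(new String(cbuf) == "a,b\n")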



[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169453637
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48869/
    Test FAILed.


[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169486663
  
    **[Test build #48879 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48879/consoleFull)** for PR 10615 at commit [`1856ed3`](https://github.com/apache/spark/commit/1856ed33dc4b677b0f3c83f61c100640c3f8e801).


[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169548304
  
    Merged build finished. Test FAILed.


[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169500494
  
    Merged build finished. Test FAILed.


[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169253681
  
    Merged build finished. Test FAILed.


[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169573499
  
    **[Test build #48905 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48905/consoleFull)** for PR 10615 at commit [`1e312a5`](https://github.com/apache/spark/commit/1e312a525c85ec08f2aa76870fe812716f6699a0).


[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10615#discussion_r49377643
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala ---
    @@ -0,0 +1,228 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import java.math.BigDecimal
    +import java.sql.{Date, Timestamp}
    +import java.text.NumberFormat
    +import java.util.Locale
    +
    +
    +import scala.util.Try
    +import scala.util.control.Exception._
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.types._
    +import org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion
    +
    +private[sql] object CSVInferSchema {
    +
    +  /**
    +    * Similar to the JSON schema inference
    +    *     1. Infer type of each row
    +    *     2. Merge row types to find common type
    +    *     3. Replace any null types with string type
    +    * TODO(hossein): Can we reuse JSON schema inference? [SPARK-12670]
    +    */
    +  def apply(
    +      tokenRdd: RDD[Array[String]],
    +      header: Array[String],
    +      nullValue: String = ""): StructType = {
    +
    +    val startType: Array[DataType] = Array.fill[DataType](header.length)(NullType)
    +    val rootTypes: Array[DataType] =
    +      tokenRdd.aggregate(startType)(inferRowType(nullValue), mergeRowTypes)
    +
    +    val structFields = header.zip(rootTypes).map { case (thisHeader, rootType) =>
    +      StructField(thisHeader, rootType, nullable = true)
    +    }
    +
    +    StructType(structFields)
    +  }
    +
    +  private def inferRowType(nullValue: String)
    +      (rowSoFar: Array[DataType], next: Array[String]): Array[DataType] = {
    +    var i = 0
    +    while (i < math.min(rowSoFar.length, next.length)) {  // May have columns on right missing.
    +      rowSoFar(i) = inferField(rowSoFar(i), next(i), nullValue)
    +      i += 1
    +    }
    +    rowSoFar
    +  }
    +
    +  private[csv] def mergeRowTypes(
    +      first: Array[DataType],
    +      second: Array[DataType]): Array[DataType] = {
    +
    +    first.zipAll(second, NullType, NullType).map { case ((a, b)) =>
    +      val tpe = findTightestCommonType(a, b).getOrElse(StringType)
    +      tpe match {
    +        case _: NullType => StringType
    +        case other => other
    +      }
    +    }
    +  }
    +
    +  /**
    +    * Infer the type of a string field. Given the known type Double and a string "1", there
    +    * is no point checking whether it is an Int, as the final type must be Double or higher.
    +    */
    +  private[csv] def inferField(
    +      typeSoFar: DataType, field: String, nullValue: String = ""): DataType = {
    +    if (field == null || field.isEmpty || field == nullValue) {
    +      typeSoFar
    +    } else {
    +      typeSoFar match {
    +        case NullType => tryParseInteger(field)
    +        case IntegerType => tryParseInteger(field)
    +        case LongType => tryParseLong(field)
    +        case DoubleType => tryParseDouble(field)
    +        case TimestampType => tryParseTimestamp(field)
    +        case StringType => StringType
    +        case other: DataType =>
    +          throw new UnsupportedOperationException(s"Unexpected data type $other")
    +      }
    +    }
    +  }
    +
    +
    +  private def tryParseInteger(field: String): DataType = if ((allCatch opt field.toInt).isDefined) {
    +    IntegerType
    +  } else {
    +    tryParseLong(field)
    +  }
    +
    +  private def tryParseLong(field: String): DataType = if ((allCatch opt field.toLong).isDefined) {
    +    LongType
    +  } else {
    +    tryParseDouble(field)
    +  }
    +
    +  private def tryParseDouble(field: String): DataType = {
    +    if ((allCatch opt field.toDouble).isDefined) {
    +      DoubleType
    +    } else {
    +      tryParseTimestamp(field)
    +    }
    +  }
    +
    +  def tryParseTimestamp(field: String): DataType = {
    +    if ((allCatch opt Timestamp.valueOf(field)).isDefined) {
    +      TimestampType
    +    } else {
    +      stringType()
    +    }
    +  }
    +
    +  // Defining a function to return the StringType constant is necessary in order to work around
    +  // a Scala compiler issue which leads to runtime incompatibilities with certain Spark versions;
    +  // see issue #128 for more details.
    +  private def stringType(): DataType = {
    +    StringType
    +  }
    +
    +  private val numericPrecedence: IndexedSeq[DataType] = HiveTypeCoercion.numericPrecedence
    +
    +  /**
    +    * Copied from the internal Spark API
    +    * [[org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion]]
    +    */
    +  val findTightestCommonType: (DataType, DataType) => Option[DataType] = {
    +    case (t1, t2) if t1 == t2 => Some(t1)
    +    case (NullType, t1) => Some(t1)
    +    case (t1, NullType) => Some(t1)
    +
    +    // Promote numeric types to the highest of the two and all numeric types to unlimited decimal
    +    case (t1, t2) if Seq(t1, t2).forall(numericPrecedence.contains) =>
    +      val index = numericPrecedence.lastIndexWhere(t => t == t1 || t == t2)
    +      Some(numericPrecedence(index))
    +
    +    case _ => None
    +  }
    +}
    +
    +object CSVTypeCast {
    +
    +  /**
    +   * Casts given string datum to specified type.
    +   * Currently we do not support complex types (ArrayType, MapType, StructType).
    +   *
    +   * For string types, this is simply the datum.
    +   * For other nullable types, this is null if the string datum is empty.
    +   *
    +   * @param datum string value
    +   * @param castType SparkSQL type
    +   */
    +  private[csv] def castTo(
    --- End diff --
    
    We should probably do this with expressions now that we are in Spark.
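
    One way to read this suggestion (a hedged sketch against Catalyst's internal API of this
    era, not code from the PR) is to reuse the existing `Cast` expression rather than
    hand-written per-type conversions:

        import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
        import org.apache.spark.sql.types.IntegerType

        // Evaluate a Catalyst Cast over a literal string datum; yields the boxed Integer 1997.
        val casted = Cast(Literal("1997"), IntegerType).eval()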


[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169548127
  
    **[Test build #48883 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48883/consoleFull)** for PR 10615 at commit [`0fd4bd3`](https://github.com/apache/spark/commit/0fd4bd3cd177e23c46db56b2a08a12b85c57355f).
     * This patch **fails from timeout after a configured wait of `250m`**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10615#discussion_r49147704
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -0,0 +1,341 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import java.nio.charset.UnsupportedCharsetException
    +import java.io.File
    +import java.sql.Timestamp
    +
    +import org.apache.spark.SparkException
    +import org.apache.spark.sql.{DataFrame, QueryTest, Row}
    +import org.apache.spark.sql.test.{SQLTestUtils, SharedSQLContext}
    +import org.apache.spark.sql.types._
    +
    +class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
    +  private val carsFile = "cars.csv"
    +  private val carsFile8859 = "cars_iso-8859-1.csv"
    +  private val carsTsvFile = "cars.tsv"
    +  private val carsAltFile = "cars-alternative.csv"
    +  private val carsUnbalancedQuotesFile = "cars-unbalanced-quotes.csv"
    +  private val carsNullFile = "cars-null.csv"
    +  private val emptyFile = "empty.csv"
    +  private val commentsFile = "comments.csv"
    +  private val disableCommentsFile = "disable_comments.csv"
    +
    +  private def testFile(fileName: String): String = {
    +    Thread.currentThread().getContextClassLoader.getResource(fileName).toString
    +  }
    +
    +  /** Verifies data and schema. */
    +  private def verifyCars(
    +      df: DataFrame,
    +      withHeader: Boolean,
    +      numCars: Int = 3,
    +      numFields: Int = 5,
    +      checkHeader: Boolean = true,
    +      checkValues: Boolean = true,
    +      checkTypes: Boolean = false): Unit = {
    +
    +    val numColumns = numFields
    +    val numRows = if (withHeader) numCars else numCars + 1
    +    // schema
    +    assert(df.schema.fieldNames.length === numColumns)
    +    assert(df.collect().length === numRows)
    +
    +    if (checkHeader) {
    +      if (withHeader) {
    +        assert(df.schema.fieldNames === Array("year", "make", "model", "comment", "blank"))
    +      } else {
    +        assert(df.schema.fieldNames === Array("C0", "C1", "C2", "C3", "C4"))
    +      }
    +    }
    +
    +    if (checkValues) {
    +      val yearValues = List("2012", "1997", "2015")
    +      val actualYears = if (!withHeader) "year" :: yearValues else yearValues
    +      val years = if (withHeader) df.select("year").collect() else df.select("C0").collect()
    +
    +      years.zipWithIndex.foreach { case (year, index) =>
    +        if (checkTypes) {
    +          assert(year === Row(actualYears(index).toInt))
    +        } else {
    +          assert(year === Row(actualYears(index)))
    +        }
    +      }
    +    }
    +  }
    +
    +  test("simple csv test") {
    +    val cars = sqlContext
    +      .read
    +      .format("csv")
    +      .option("header", "false")
    +      .load(testFile(carsFile))
    +
    +    verifyCars(cars, withHeader = false, checkTypes = false)
    +  }
    +
    +  test("simple csv test with type inference") {
    +    val cars = sqlContext
    +      .read
    +      .format("csv")
    +      .option("header", "true")
    +      .option("inferSchema", "true")
    +      .load(testFile(carsFile))
    +
    +    verifyCars(cars, withHeader = true, checkTypes = true)
    +  }
    +
    +  test("test with alternative delimiter and quote") {
    +    val cars = sqlContext.read
    +      .format("csv")
    +      .options(Map("quote" -> "\'", "delimiter" -> "|", "header" -> "true"))
    +      .load(testFile(carsAltFile))
    +
    +    verifyCars(cars, withHeader = true)
    +  }
    +
    +  test("bad encoding name") {
    +    val exception = intercept[UnsupportedCharsetException] {
    +      sqlContext
    +        .read
    +        .format("csv")
    +        .option("charset", "1-9588-osi")
    +        .load(testFile(carsFile8859))
    +    }
    +
    +    assert(exception.getMessage.contains("1-9588-osi"))
    +  }
    +
    +  ignore("test different encoding") {
    +    // scalastyle:off
    +    sqlContext.sql(
    +      s"""
    +         |CREATE TEMPORARY TABLE carsTable USING csv
    +         |OPTIONS (path "${testFile(carsFile8859)}", header "true",
    +         |charset "iso-8859-1", delimiter "þ")
    +      """.stripMargin.replaceAll("\n", " "))
    +    // scalastyle:on
    +
    +    verifyCars(sqlContext.table("carsTable"), withHeader = true)
    +  }
    +
    +  test("DDL test with tab separated file") {
    +    sqlContext.sql(
    +      s"""
    +         |CREATE TEMPORARY TABLE carsTable USING csv
    +         |OPTIONS (path "${testFile(carsTsvFile)}", header "true", delimiter "\t")
    +      """.stripMargin.replaceAll("\n", " "))
    +
    +    verifyCars(sqlContext.table("carsTable"), numFields = 6, withHeader = true, checkHeader = false)
    +  }
    +
    +  test("DDL test parsing decimal type") {
    +    sqlContext.sql(
    +      s"""
    +         |CREATE TEMPORARY TABLE carsTable
    +         |(yearMade double, makeName string, modelName string, priceTag decimal,
    +         | comments string, grp string)
    +         |USING csv
    +         |OPTIONS (path "${testFile(carsTsvFile)}", header "true", delimiter "\t")
    +      """.stripMargin.replaceAll("\n", " "))
    +
    +    assert(
    +      sqlContext.sql("SELECT makeName FROM carsTable where priceTag > 60000").collect().size === 1)
    +  }
    +
    +  test("test for DROPMALFORMED parsing mode") {
    +    val cars = sqlContext.read
    +      .format("csv")
    +      .options(Map("header" -> "true", "mode" -> "dropmalformed"))
    +      .load(testFile(carsFile))
    +
    +    assert(cars.select("year").collect().size === 2)
    +  }
    +
    +  test("test for FAILFAST parsing mode") {
    +    val exception = intercept[SparkException]{
    +      sqlContext.read
    +      .format("csv")
    +      .options(Map("header" -> "true", "mode" -> "failfast"))
    +      .load(testFile(carsFile)).collect()
    +    }
    +
    +    assert(exception.getMessage.contains("Malformed line in FAILFAST mode: 2015,Chevy,Volt"))
    +  }
    +
    +  test("test with null quote character") {
    +    val cars = sqlContext.read
    +      .format("csv")
    +      .option("header", "true")
    +      .option("quote", "")
    +      .load(testFile(carsUnbalancedQuotesFile))
    +
    +    verifyCars(cars, withHeader = true, checkValues = false)
    +
    +  }
    +
    +  test("test with empty file and known schema") {
    +    val result = sqlContext.read
    +      .format("csv")
    +      .schema(StructType(List(StructField("column", StringType, false))))
    +      .load(testFile(emptyFile))
    +
    +    assert(result.collect.size === 0)
    +    assert(result.schema.fieldNames.size === 1)
    +  }
    +
    +
    +  test("DDL test with empty file") {
    +    sqlContext.sql(s"""
    +           |CREATE TEMPORARY TABLE carsTable
    +           |(yearMade double, makeName string, modelName string, comments string, grp string)
    +           |USING csv
    +           |OPTIONS (path "${testFile(emptyFile)}", header "false")
    +      """.stripMargin.replaceAll("\n", " "))
    +
    +    assert(sqlContext.sql("SELECT count(*) FROM carsTable").collect().head(0) === 0)
    +  }
    +
    +  test("DDL test with schema") {
    +    sqlContext.sql(s"""
    +           |CREATE TEMPORARY TABLE carsTable
    +           |(yearMade double, makeName string, modelName string, comments string, blank string)
    +           |USING csv
    +           |OPTIONS (path "${testFile(carsFile)}", header "true")
    +      """.stripMargin.replaceAll("\n", " "))
    +
    +    val cars = sqlContext.table("carsTable")
    +    verifyCars(cars, withHeader = true, checkHeader = false, checkValues = false)
    +    assert(
    +      cars.schema.fieldNames === Array("yearMade", "makeName", "modelName", "comments", "blank"))
    +  }
    +
    +  test("save csv") {
    +    withTempDir { dir =>
    +      val csvDir = new File(dir, "csv").getCanonicalPath
    +      val cars = sqlContext.read
    +        .format("csv")
    +        .option("header", "true")
    +        .load(testFile(carsFile))
    +
    +      cars.repartition(1).write
    +        .format("csv")
    +        .option("header", "true")
    +        .save(csvDir)
    +
    +      val carsCopy = sqlContext.read
    +        .format("csv")
    +        .option("header", "true")
    +        .load(csvDir)
    +
    +      verifyCars(carsCopy, withHeader = true)
    +    }
    +  }
    +
    +  test("save csv with quote") {
    +    withTempDir { dir =>
    +      val csvDir = new File(dir, "csv").getCanonicalPath
    +      val cars = sqlContext.read
    +        .format("csv")
    +        .option("header", "true")
    +        .load(testFile(carsFile))
    +
    +      cars.repartition(1).write
    --- End diff --
    
    Ditto.


[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169548309
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48883/
    Test FAILed.


[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169610548
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48905/
    Test FAILed.


[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169850601
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48975/
    Test PASSed.


[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169285294
  
    **[Test build #48852 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48852/consoleFull)** for PR 10615 at commit [`b31cb89`](https://github.com/apache/spark/commit/b31cb893dfcd87d1269a4a932d34fed830fe55ce).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169284984
  
    **[Test build #48852 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48852/consoleFull)** for PR 10615 at commit [`b31cb89`](https://github.com/apache/spark/commit/b31cb893dfcd87d1269a4a932d34fed830fe55ce).


[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10615#discussion_r49147677
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -0,0 +1,341 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import java.nio.charset.UnsupportedCharsetException
    +import java.io.File
    +import java.sql.Timestamp
    +
    +import org.apache.spark.SparkException
    +import org.apache.spark.sql.{DataFrame, QueryTest, Row}
    +import org.apache.spark.sql.test.{SQLTestUtils, SharedSQLContext}
    +import org.apache.spark.sql.types._
    +
    +class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
    +  private val carsFile = "cars.csv"
    +  private val carsFile8859 = "cars_iso-8859-1.csv"
    +  private val carsTsvFile = "cars.tsv"
    +  private val carsAltFile = "cars-alternative.csv"
    +  private val carsUnbalancedQuotesFile = "cars-unbalanced-quotes.csv"
    +  private val carsNullFile = "cars-null.csv"
    +  private val emptyFile = "empty.csv"
    +  private val commentsFile = "comments.csv"
    +  private val disableCommentsFile = "disable_comments.csv"
    +
    +  private def testFile(fileName: String): String = {
    +    Thread.currentThread().getContextClassLoader.getResource(fileName).toString
    +  }
    +
    +  /** Verifies data and schema. */
    +  private def verifyCars(
    +      df: DataFrame,
    +      withHeader: Boolean,
    +      numCars: Int = 3,
    +      numFields: Int = 5,
    +      checkHeader: Boolean = true,
    +      checkValues: Boolean = true,
    +      checkTypes: Boolean = false): Unit = {
    +
    +    val numColumns = numFields
    +    val numRows = if (withHeader) numCars else numCars + 1
    +    // schema
    +    assert(df.schema.fieldNames.length === numColumns)
    +    assert(df.collect().length === numRows)
    +
    +    if (checkHeader) {
    +      if (withHeader) {
    +        assert(df.schema.fieldNames === Array("year", "make", "model", "comment", "blank"))
    +      } else {
    +        assert(df.schema.fieldNames === Array("C0", "C1", "C2", "C3", "C4"))
    +      }
    +    }
    +
    +    if (checkValues) {
    +      val yearValues = List("2012", "1997", "2015")
    +      val actualYears = if (!withHeader) "year" :: yearValues else yearValues
    +      val years = if (withHeader) df.select("year").collect() else df.select("C0").collect()
    +
    +      years.zipWithIndex.foreach { case (year, index) =>
    +        if (checkTypes) {
    +          assert(year === Row(actualYears(index).toInt))
    +        } else {
    +          assert(year === Row(actualYears(index)))
    +        }
    +      }
    +    }
    +  }
    +
    +  test("simple csv test") {
    +    val cars = sqlContext
    +      .read
    +      .format("csv")
    +      .option("header", "false")
    +      .load(testFile(carsFile))
    +
    +    verifyCars(cars, withHeader = false, checkTypes = false)
    +  }
    +
    +  test("simple csv test with type inference") {
    +    val cars = sqlContext
    +      .read
    +      .format("csv")
    +      .option("header", "true")
    +      .option("inferSchema", "true")
    +      .load(testFile(carsFile))
    +
    +    verifyCars(cars, withHeader = true, checkTypes = true)
    +  }
    +
    +  test("test with alternative delimiter and quote") {
    +    val cars = sqlContext.read
    +      .format("csv")
    +      .options(Map("quote" -> "\'", "delimiter" -> "|", "header" -> "true"))
    +      .load(testFile(carsAltFile))
    +
    +    verifyCars(cars, withHeader = true)
    +  }
    +
    +  test("bad encoding name") {
    +    val exception = intercept[UnsupportedCharsetException] {
    +      sqlContext
    +        .read
    +        .format("csv")
    +        .option("charset", "1-9588-osi")
    +        .load(testFile(carsFile8859))
    +    }
    +
    +    assert(exception.getMessage.contains("1-9588-osi"))
    +  }
    +
    +  ignore("test different encoding") {
    +    // scalastyle:off
    +    sqlContext.sql(
    +      s"""
    +         |CREATE TEMPORARY TABLE carsTable USING csv
    +         |OPTIONS (path "${testFile(carsFile8859)}", header "true",
    +         |charset "iso-8859-1", delimiter "þ")
    +      """.stripMargin.replaceAll("\n", " "))
    +    // scalastyle:on
    +
    +    verifyCars(sqlContext.table("carsTable"), withHeader = true)
    +  }
    +
    +  test("DDL test with tab separated file") {
    +    sqlContext.sql(
    +      s"""
    +         |CREATE TEMPORARY TABLE carsTable USING csv
    +         |OPTIONS (path "${testFile(carsTsvFile)}", header "true", delimiter "\t")
    +      """.stripMargin.replaceAll("\n", " "))
    +
    +    verifyCars(sqlContext.table("carsTable"), numFields = 6, withHeader = true, checkHeader = false)
    +  }
    +
    +  test("DDL test parsing decimal type") {
    +    sqlContext.sql(
    +      s"""
    +         |CREATE TEMPORARY TABLE carsTable
    +         |(yearMade double, makeName string, modelName string, priceTag decimal,
    +         | comments string, grp string)
    +         |USING csv
    +         |OPTIONS (path "${testFile(carsTsvFile)}", header "true", delimiter "\t")
    +      """.stripMargin.replaceAll("\n", " "))
    +
    +    assert(
    +      sqlContext.sql("SELECT makeName FROM carsTable where priceTag > 60000").collect().size === 1)
    +  }
    +
    +  test("test for DROPMALFORMED parsing mode") {
    +    val cars = sqlContext.read
    +      .format("csv")
    +      .options(Map("header" -> "true", "mode" -> "dropmalformed"))
    +      .load(testFile(carsFile))
    +
    +    assert(cars.select("year").collect().size === 2)
    +  }
    +
    +  test("test for FAILFAST parsing mode") {
    +    val exception = intercept[SparkException]{
    +      sqlContext.read
    +      .format("csv")
    +      .options(Map("header" -> "true", "mode" -> "failfast"))
    +      .load(testFile(carsFile)).collect()
    +    }
    +
    +    assert(exception.getMessage.contains("Malformed line in FAILFAST mode: 2015,Chevy,Volt"))
    +  }
    +
    +  test("test with null quote character") {
    +    val cars = sqlContext.read
    +      .format("csv")
    +      .option("header", "true")
    +      .option("quote", "")
    +      .load(testFile(carsUnbalancedQuotesFile))
    +
    +    verifyCars(cars, withHeader = true, checkValues = false)
    +
    +  }
    +
    +  test("test with empty file and known schema") {
    +    val result = sqlContext.read
    +      .format("csv")
    +      .schema(StructType(List(StructField("column", StringType, false))))
    +      .load(testFile(emptyFile))
    +
    +    assert(result.collect.size === 0)
    +    assert(result.schema.fieldNames.size === 1)
    +  }
    +
    +
    +  test("DDL test with empty file") {
    +    sqlContext.sql(s"""
    +           |CREATE TEMPORARY TABLE carsTable
    +           |(yearMade double, makeName string, modelName string, comments string, grp string)
    +           |USING csv
    +           |OPTIONS (path "${testFile(emptyFile)}", header "false")
    +      """.stripMargin.replaceAll("\n", " "))
    +
    +    assert(sqlContext.sql("SELECT count(*) FROM carsTable").collect().head(0) === 0)
    +  }
    +
    +  test("DDL test with schema") {
    +    sqlContext.sql(s"""
    +           |CREATE TEMPORARY TABLE carsTable
    +           |(yearMade double, makeName string, modelName string, comments string, blank string)
    +           |USING csv
    +           |OPTIONS (path "${testFile(carsFile)}", header "true")
    +      """.stripMargin.replaceAll("\n", " "))
    +
    +    val cars = sqlContext.table("carsTable")
    +    verifyCars(cars, withHeader = true, checkHeader = false, checkValues = false)
    +    assert(
    +      cars.schema.fieldNames === Array("yearMade", "makeName", "modelName", "comments", "blank"))
    +  }
    +
    +  test("save csv") {
    +    withTempDir { dir =>
    +      val csvDir = new File(dir, "csv").getCanonicalPath
    +      val cars = sqlContext.read
    +        .format("csv")
    +        .option("header", "true")
    +        .load(testFile(carsFile))
    +
    +      cars.repartition(1).write
    --- End diff --
    
    A trivial and ignorable point, though:
    maybe `coalesce(1)`?
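
    For comparison, a minimal sketch of the suggested variant, reusing the variables from the
    test above: `coalesce(1)` narrows the existing partitions without the full shuffle that
    `repartition(1)` triggers.

        cars.coalesce(1).write
          .format("csv")
          .option("header", "true")
          .save(csvDir)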


[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169253676
  
    **[Test build #48842 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48842/consoleFull)** for PR 10615 at commit [`da314cb`](https://github.com/apache/spark/commit/da314cb9cb323b5800175e15a49fe48f5c5c5e75).
     * This patch **fails RAT tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169855191
  
    Cool!


[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169453618
  
    **[Test build #48869 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48869/consoleFull)** for PR 10615 at commit [`e364c28`](https://github.com/apache/spark/commit/e364c284f2d37540aa2487220b417fa433198361).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169610545
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169500341
  
    **[Test build #48879 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48879/consoleFull)** for PR 10615 at commit [`1856ed3`](https://github.com/apache/spark/commit/1856ed33dc4b677b0f3c83f61c100640c3f8e801).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10615#discussion_r48932909
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala ---
    @@ -0,0 +1,231 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import java.math.BigDecimal
    +import java.sql.{Date, Timestamp}
    +import java.text.NumberFormat
    +import java.util.Locale
    +
    +
    +import scala.util.Try
    +import scala.util.control.Exception._
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.types._
    +
    +private[sql] object CSVInferSchema {
    +
    +  /**
    +    * Similar to the JSON schema inference. [[org.apache.spark.sql.json.InferSchema]]
    +    *     1. Infer type of each row
    +    *     2. Merge row types to find common type
    +    *     3. Replace any null types with string type
    +    */
    +  def apply(tokenRdd: RDD[Array[String]], header: Array[String]): StructType = {
    +
    +    val startType: Array[DataType] = Array.fill[DataType](header.length)(NullType)
    +    val rootTypes: Array[DataType] = tokenRdd.aggregate(startType)(inferRowType, mergeRowTypes)
    +
    +    val stuctFields = header.zip(rootTypes).map { case (thisHeader, rootType) =>
    --- End diff --
    
    And here, `stuctFields` looks like a typo for `structFields`.
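
For context, the three steps in the Scaladoc quoted above amount to a fold over the token
rows. A minimal local sketch of the same idea, with hypothetical `inferField` and
`mergeType` helpers standing in for the PR's `inferRowType` and `mergeRowTypes` (an
illustration of the technique, not the PR's actual code):

    import org.apache.spark.sql.types._
    import scala.util.Try

    // Hypothetical type widening: merge the running column type with a new one.
    def mergeType(a: DataType, b: DataType): DataType = (a, b) match {
      case (NullType, t) => t
      case (t, NullType) => t
      case (x, y) if x == y => x
      case (IntegerType, DoubleType) | (DoubleType, IntegerType) => DoubleType
      case _ => StringType // fall back to string on any other conflict
    }

    // Hypothetical per-field inference based on what the token parses as.
    def inferField(current: DataType, field: String): DataType = {
      if (field == null || field.isEmpty) current
      else if (Try(field.toInt).isSuccess) mergeType(current, IntegerType)
      else if (Try(field.toDouble).isSuccess) mergeType(current, DoubleType)
      else StringType
    }

    // Steps 1 and 2: infer each row's types and merge them column-wise.
    val rows = Seq(Array("1", "2.0", "a"), Array("3", "4", "b"))
    val merged = rows.foldLeft(Array.fill[DataType](3)(NullType)) { (types, row) =>
      types.zip(row).map { case (t, v) => inferField(t, v) }
    }

    // Step 3: a column that never saw a value stays NullType and becomes StringType.
    val structFields = Seq("C0", "C1", "C2").zip(merged).map { case (name, t) =>
      StructField(name, if (t == NullType) StringType else t, nullable = true)
    }
    val schema = StructType(structFields)
    // schema: struct<C0:int, C1:double, C2:string>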




[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by falaki <gi...@git.apache.org>.
Github user falaki closed the pull request at:

    https://github.com/apache/spark/pull/10615




[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169825583
  
    **[Test build #48975 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48975/consoleFull)** for PR 10615 at commit [`319e0ed`](https://github.com/apache/spark/commit/319e0edb17d02eb994bc1cd104a29df8c47a9c59).




[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-171883737
  
    **[Test build #2384 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2384/consoleFull)** for PR 10615 at commit [`319e0ed`](https://github.com/apache/spark/commit/319e0edb17d02eb994bc1cd104a29df8c47a9c59).




[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169253615
  
    **[Test build #48842 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48842/consoleFull)** for PR 10615 at commit [`da314cb`](https://github.com/apache/spark/commit/da314cb9cb323b5800175e15a49fe48f5c5c5e75).




[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169253683
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48842/
    Test FAILed.




[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by mohitjaggi <gi...@git.apache.org>.
Github user mohitjaggi commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169852430
  
    this is great...thanks @falaki 




[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169285302
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48852/
    Test FAILed.




[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-171884149
  
    **[Test build #2384 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2384/consoleFull)** for PR 10615 at commit [`319e0ed`](https://github.com/apache/spark/commit/319e0edb17d02eb994bc1cd104a29df8c47a9c59).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169285300
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169285927
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48851/
    Test FAILed.




[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10615#discussion_r49147610
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVParser.scala ---
    @@ -0,0 +1,243 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import java.io.{OutputStreamWriter, ByteArrayOutputStream, StringReader}
    +
    +import com.univocity.parsers.csv.{CsvParserSettings, CsvWriterSettings, CsvParser, CsvWriter}
    +
    +import org.apache.spark.Logging
    +
    +/**
    +  * Read and parse CSV-like input
    +  *
    +  * @param params Parameters object
    +  * @param headers headers for the columns
    +  */
    +private[sql] abstract class CsvReader(params: CSVParameters, headers: Seq[String]) {
    +
    +  protected lazy val parser: CsvParser = {
    +    val settings = new CsvParserSettings()
    +    val format = settings.getFormat
    +    format.setDelimiter(params.delimiter)
    +    format.setLineSeparator(params.rowSeparator)
    +    format.setQuote(params.quote)
    +    format.setQuoteEscape(params.escape)
    +    format.setComment(params.comment)
    +    settings.setIgnoreLeadingWhitespaces(params.ignoreLeadingWhiteSpaceFlag)
    +    settings.setIgnoreTrailingWhitespaces(params.ignoreTrailingWhiteSpaceFlag)
    +    settings.setReadInputOnSeparateThread(false)
    +    settings.setInputBufferSize(params.inputBufferSize)
    +    settings.setMaxColumns(params.maxColumns)
    +    settings.setNullValue(params.nullValue)
    +    settings.setMaxCharsPerColumn(params.maxCharsPerColumn)
    +    if (headers != null) settings.setHeaders(headers: _*)
    +
    +    new CsvParser(settings)
    +  }
    +}
    +
    +/**
    +  * Converts a sequence of string to CSV string
    +  *
    +  * @param params Parameters object for configuration
    +  * @param headers headers for columns
    +  */
    +private[sql] class LineCsvWriter(params: CSVParameters, headers: Seq[String]) extends Logging {
    +  private val writerSettings = new CsvWriterSettings
    +  private val format = writerSettings.getFormat
    +
    +  format.setDelimiter(params.delimiter)
    +  format.setLineSeparator(params.rowSeparator)
    +  format.setQuote(params.quote)
    +  format.setQuoteEscape(params.escape)
    +  format.setComment(params.comment)
    +
    +  writerSettings.setNullValue(params.nullValue)
    +  writerSettings.setEmptyValue(params.nullValue)
    +  writerSettings.setSkipEmptyLines(true)
    +  writerSettings.setQuoteAllFields(false)
    +  writerSettings.setHeaders(headers: _*)
    +
    +  def writeRow(row: Seq[String], includeHeader: Boolean): String = {
    +    val buffer = new ByteArrayOutputStream()
    +    val outputWriter = new OutputStreamWriter(buffer)
    +    val writer = new CsvWriter(outputWriter, writerSettings)
    +
    +    if (includeHeader) {
    +      writer.writeHeaders()
    +    }
    +    writer.writeRow(row.toArray: _*)
    +    writer.close()
    +    buffer.toString.stripLineEnd
    +  }
    +}
    +
    +/**
    +  * Parser for parsing a line at a time. Not efficient for bulk data.
    +  *
    +  * @param params Parameters object
    +  */
    +private[sql] class LineCsvReader(params: CSVParameters)
    +  extends CsvReader(params, null) {
    +  /**
    +    * parse a line
    +    *
    +    * @param line a String with no newline at the end
    +    * @return array of strings where each string is a field in the CSV record
    +    */
    +  def parseLine(line: String): Array[String] = {
    +    parser.beginParsing(new StringReader(line))
    +    val parsed = parser.parseNext()
    +    parser.stopParsing()
    +    parsed
    +  }
    +}
    +
    +/**
    +  * Parser for parsing lines in bulk. Use this when efficiency is desired.
    +  *
    +  * @param iter iterator over lines in the file
    +  * @param params Parameters object
    +  * @param headers headers for the columns
    +  */
    +private[sql] class BulkCsvReader(
    +    iter: Iterator[String],
    +    params: CSVParameters,
    +    headers: Seq[String])
    +  extends CsvReader(params, headers) with Iterator[Array[String]] {
    +
    +  private val reader = new StringIteratorReader(iter)
    +  parser.beginParsing(reader)
    +  private var nextRecord = parser.parseNext()
    +
    +  /**
    +    * get the next parsed line.
    +    * @return array of strings where each string is a field in the CSV record
    +    */
    +  override def next(): Array[String] = {
    +    val curRecord = nextRecord
    +    if(curRecord != null) {
    +      nextRecord = parser.parseNext()
    +    } else {
    +      throw new NoSuchElementException("next record is null")
    +    }
    +    curRecord
    +  }
    +
    +  override def hasNext: Boolean = nextRecord != null
    +
    +}
    +
    +/**
    +  * A Reader that "reads" from a sequence of lines. Spark's textFile method removes newlines at
    +  * the end of each line; the Univocity parser requires a Reader that provides access to the data
    +  * to be parsed and needs the newlines to be present.
    +  * @param iter iterator over RDD[String]
    +  */
    +private class StringIteratorReader(val iter: Iterator[String]) extends java.io.Reader {
    +
    +  private var next: Long = 0
    +  private var length: Long = 0  // length of input so far
    +  private var start: Long = 0
    +  private var str: String = null   // current string from iter
    +
    +  /**
    +    * fetch next string from iter, if done with current one
    +    * pretend there is a newline at the end of every string we get from iter
    +    */
    +  private def refill(): Unit = {
    +    if (length == next) {
    +      if (iter.hasNext) {
    +        str = iter.next()
    +        start = length
    +        length += (str.length + 1) // allowance for newline removed by SparkContext.textFile()
    +      } else {
    +        str = null
    +      }
    +    }
    +  }
    +
    +  /**
    +    * read the next character, if at end of string pretend there is a new line
    +    */
    +  override def read(): Int = {
    +    refill()
    +    if (next >= length) {
    +      -1
    +    } else {
    +      val cur = next - start
    +      next += 1
    +      if (cur == str.length) '\n' else str.charAt(cur.toInt)
    +    }
    +  }
    +
    +  /**
    +    * read from str into cbuf
    +    */
    +  override def read(cbuf: Array[Char], off: Int, len: Int): Int = {
    +    refill()
    +    var n = 0
    +    if ((off < 0) || (off > cbuf.length) || (len < 0) ||
    +      ((off + len) > cbuf.length) || ((off + len) < 0)) {
    +      throw new IndexOutOfBoundsException()
    +    } else if (len == 0) {
    +      n = 0
    +    } else {
    +      if (next >= length) {   // end of input
    +        n = -1
    +      } else {
    +        n = Math.min(length - next, len).toInt // lesser of amount of input available or buf size
    +        if (n == length - next) {
    +          str.getChars((next - start).toInt, (next - start + n - 1).toInt, cbuf, off)
    +          cbuf(off + n - 1) = '\n'
    +        } else {
    +          str.getChars((next - start).toInt, (next - start + n).toInt, cbuf, off)
    +        }
    +        next += n
    +        if (n < len) {
    +          val m = read(cbuf, off + n, len - n)  // have more space, fetch more input from iter
    +          if(m != -1) n += m
    +        }
    +      }
    +    }
    +
    --- End diff --
    
    I think this is trivial and ignorable, though.
    If you are going to submit more commits, this extra newline might be better removed.
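
For anyone skimming the diff above: the single-line parsing path it adds is a thin
wrapper over uniVocity's parser. A minimal standalone sketch of that pattern (the
settings values here are illustrative defaults, not the PR's CSVParameters):

    import java.io.StringReader
    import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

    val settings = new CsvParserSettings()
    val format = settings.getFormat
    format.setDelimiter(',')   // params.delimiter in the PR
    format.setQuote('"')       // params.quote in the PR
    settings.setIgnoreLeadingWhitespaces(true)
    settings.setIgnoreTrailingWhitespaces(true)

    // Same begin/parse/stop sequence as LineCsvReader.parseLine in the diff.
    val parser = new CsvParser(settings)
    parser.beginParsing(new StringReader("a,\"b, c\",d"))
    val fields: Array[String] = parser.parseNext() // Array("a", "b, c", "d")
    parser.stopParsing()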





[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169610229
  
    **[Test build #48905 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48905/consoleFull)** for PR 10615 at commit [`1e312a5`](https://github.com/apache/spark/commit/1e312a525c85ec08f2aa76870fe812716f6699a0).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169850361
  
    **[Test build #48975 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48975/consoleFull)** for PR 10615 at commit [`319e0ed`](https://github.com/apache/spark/commit/319e0edb17d02eb994bc1cd104a29df8c47a9c59).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169453630
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10615#discussion_r48932885
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala ---
    @@ -0,0 +1,231 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import java.math.BigDecimal
    +import java.sql.{Date, Timestamp}
    +import java.text.NumberFormat
    +import java.util.Locale
    +
    +
    +import scala.util.Try
    +import scala.util.control.Exception._
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.types._
    +
    +private[sql] object CSVInferSchema {
    +
    +  /**
    +    * Similar to the JSON schema inference. [[org.apache.spark.sql.json.InferSchema]]
    --- End diff --
    
    Oh, actually I found several trivial typos. Here there is a broken link: `[[org.apache.spark.sql.json.InferSchema]]` should be `[[org.apache.spark.sql.execution.datasources.json.InferSchema]]`.




[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10615#discussion_r48933011
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala ---
    @@ -0,0 +1,305 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.csv
    +
    +import java.nio.charset.Charset
    +
    +import org.apache.hadoop.fs.{FileStatus, Path}
    +import org.apache.hadoop.io.{Text, NullWritable, LongWritable}
    +import org.apache.hadoop.mapred.TextInputFormat
    +import org.apache.hadoop.mapreduce.RecordWriter
    +import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
    +
    +import com.google.common.base.Objects
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql._
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.sources._
    +import org.apache.hadoop.mapreduce.{TaskAttemptContext, Job}
    +import org.apache.spark.sql.types._
    +
    +private[csv] class CSVRelation(
    +    private val inputRDD: Option[RDD[String]],
    +    override val paths: Array[String],
    +    private val maybeDataSchema: Option[StructType],
    +    override val userDefinedPartitionColumns: Option[StructType],
    +    private val parameters: Map[String, String])
    +    (@transient val sqlContext: SQLContext) extends HadoopFsRelation with Serializable {
    +
    +
    +  override lazy val dataSchema: StructType = maybeDataSchema match {
    +    case Some(structType) => structType
    +    case None => inferSchema(paths)
    +  }
    +
    +  private val params = new CSVParameters(parameters)
    +
    +  @transient
    +  private var cachedRDD: Option[RDD[String]] = None
    +
    +  private def readText(location: String): RDD[String] = {
    +    if (Charset.forName(params.charset) == Charset.forName("UTF-8")) {
    +      sqlContext.sparkContext.textFile(location)
    +    } else {
    +      sqlContext.sparkContext.hadoopFile[LongWritable, Text, TextInputFormat](location)
    +        .mapPartitions { _.map { pair =>
    +            new String(pair._2.getBytes, 0, pair._2.getLength, params.charset)
    +          }
    +        }
    +    }
    +  }
    +
    +  private def baseRdd(inputPaths: Array[String]): RDD[String] = {
    +    inputRDD.getOrElse {
    +      cachedRDD.getOrElse {
    +        val rdd = readText(inputPaths.mkString(","))
    +        cachedRDD = Some(rdd)
    +        rdd
    +      }
    +    }
    +  }
    +
    +  private def tokenRdd(header: Array[String], inputPaths: Array[String]): RDD[Array[String]] = {
    +    val rdd = baseRdd(inputPaths)
    +    // Make sure firstLine is materialized before sending to executors
    +    val firstLine = if (params.headerFlag) findFirstLine(rdd) else null
    +    CSVRelation.univocityTokenizer(rdd, header, firstLine, params)
    +  }
    +
    +  /**
    +    * This eliminates unneeded columns before producing an RDD
    +    * containing all of its tuples as Row objects. It reads all the tokens of each line
    +    * and then drops unneeded tokens, without casting or type-checking, by mapping
    +    * the indices produced by `requiredColumns` onto the indices of the tokens.
    +    * TODO: Switch to using buildInternalScan
    +    */
    +  override def buildScan(requiredColumns: Array[String], inputs: Array[FileStatus]): RDD[Row] = {
    +    val pathsString = inputs.map(_.getPath.toUri.toString)
    +    val header = schema.fields.map(_.name)
    +    val tokenizedRdd = tokenRdd(header, pathsString)
    +    CSVRelation.parseCsv(tokenizedRdd, schema, requiredColumns, inputs, sqlContext, params)
    +  }
    +
    +  override def prepareJobForWrite(job: Job): OutputWriterFactory = {
    +    new CSVOutputWriterFactory(params)
    +  }
    +
    +  override def hashCode(): Int = Objects.hashCode(paths.toSet, dataSchema, schema, partitionColumns)
    +
    +  override def equals(other: Any): Boolean = other match {
    +    case that: CSVRelation => {
    +      val equalPath = paths.toSet == that.paths.toSet
    +      val equalDataSchema = dataSchema == that.dataSchema
    +      val equalSchema = schema == that.schema
    +      val equalPartitionColums = partitionColumns == that.partitionColumns
    +
    +      equalPath && equalDataSchema && equalSchema && equalPartitionColums
    +    }
    +    case _ => false
    +  }
    +
    +  private def inferSchema(paths: Array[String]): StructType = {
    +    val rdd = baseRdd(Array(paths.head))
    +    val firstLine = findFirstLine(rdd)
    +    val firstRow = new LineCsvReader(params).parseLine(firstLine)
    +
    +    val header = if (params.headerFlag) {
    +      firstRow
    +    } else {
    +      firstRow.zipWithIndex.map { case (value, index) => s"C$index" }
    +    }
    +
    +    val parsedRdd = tokenRdd(header, paths)
    +    if (params.inferSchemaFlag) {
    +      CSVInferSchema(parsedRdd, header)
    +    } else {
    +      // By default fields are assumed to be StringType
    +      val schemaFields = header.map { fieldName =>
    +        StructField(fieldName.toString, StringType, nullable = true)
    +      }
    +      StructType(schemaFields)
    +    }
    +  }
    +
    +  /**
    +    * Returns the first line of the first non-empty file in path
    +    */
    +  private def findFirstLine(rdd: RDD[String]): String = {
    +    if (params.isCommentSet) {
    +      rdd.take(params.MAX_COMMENT_LINES_IN_HEADER)
    +        .find(!_.startsWith(params.comment.toString))
    +        .getOrElse(sys.error(s"No uncommented header line in " +
    +          s"first ${params.MAX_COMMENT_LINES_IN_HEADER} lines"))
    +    } else {
    +      rdd.first()
    +    }
    +  }
    +}
    +
    +object CSVRelation extends Logging {
    +
    +  def univocityTokenizer(
    +      file: RDD[String],
    +      header: Seq[String],
    +      firstLine: String,
    +      params: CSVParameters): RDD[Array[String]] = {
    +    // If header is set, make sure firstLine is materialized before sending to executors.
    +    file.mapPartitionsWithIndex({
    +      case (split, iter) => new BulkCsvReader(
    +        if (params.headerFlag) iter.filterNot(_ == firstLine) else iter,
    +        params,
    +        headers = header)
    +    }, true)
    +  }
    +
    +
    +  def parseCsv(
    +      tokenizedRDD: RDD[Array[String]],
    +      schema: StructType,
    +      requiredColumns: Array[String],
    +      inputs: Array[FileStatus],
    +      sqlContext: SQLContext,
    +      params: CSVParameters): RDD[Row] = {
    +
    +    val schemaFields = schema.fields
    +    val requiredFields = StructType(requiredColumns.map(schema(_))).fields
    +    val safeRequiredFields = if (params.dropMalformed) {
    +      // If `dropMalformed` is enabled, then it needs to parse all the values
    +      // so that we can decide which row is malformed.
    +      requiredFields ++ schemaFields.filterNot(requiredFields.contains(_))
    +    } else {
    +      requiredFields
    +    }
    +    //println("safeRequiredFields: " + safeRequiredFields.mkString("\n\t"))
    --- End diff --
    
    Lastly, I believe this commented-out line was meant to be removed.
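
Once this lands, the end-user API should need nothing beyond the usual
DataFrameReader. A hypothetical usage sketch (the option names mirror the spark-csv
parameters that CSVParameters reads in this diff, e.g. header, inferSchema, and
comment; the path is made up):

    // Read a CSV file with the built-in source instead of the spark-csv package.
    val df = sqlContext.read
      .format("csv")
      .option("header", "true")       // first uncommented line names the columns
      .option("inferSchema", "true")  // run CSVInferSchema instead of all-strings
      .option("comment", "#")         // skip lines starting with '#'
      .load("data/people.csv")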




[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169451909
  
    **[Test build #48869 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48869/consoleFull)** for PR 10615 at commit [`e364c28`](https://github.com/apache/spark/commit/e364c284f2d37540aa2487220b417fa433198361).




[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by falaki <gi...@git.apache.org>.
Github user falaki commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-172107705
  
    This PR was merged with https://github.com/apache/spark/pull/10766




[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169314710
  
    Is this going to require the new parser JAR on the classpath everywhere, or will everything excluding CSV parsing still work without it?




[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169500498
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48879/
    Test FAILed.




[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169508841
  
    **[Test build #48883 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48883/consoleFull)** for PR 10615 at commit [`0fd4bd3`](https://github.com/apache/spark/commit/0fd4bd3cd177e23c46db56b2a08a12b85c57355f).




[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10615#issuecomment-169850598
  
    Merged build finished. Test PASSed.

