Posted to issues@spark.apache.org by "Herman van Hövell (Jira)" <ji...@apache.org> on 2023/03/06 19:29:00 UTC
[jira] [Created] (SPARK-42690) Implement CSV/JSON parsing functions

Herman van Hövell created SPARK-42690:
-----------------------------------------

             Summary: Implement CSV/JSON parsing functions
                 Key: SPARK-42690
                 URL: https://issues.apache.org/jira/browse/SPARK-42690
             Project: Spark
          Issue Type: New Feature
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Herman van Hövell


Implement the following two methods in the Spark Connect DataFrameReader:

{code:java}
/**
 * Loads a `Dataset[String]` storing JSON objects (<a href="http://jsonlines.org/">JSON Lines
 * text format or newline-delimited JSON</a>) and returns the result as a `DataFrame`.
 *
 * Unless the schema is specified using the `schema` function, this function goes through the
 * input once to determine the input schema.
 *
 * @param jsonDataset input Dataset with one JSON object per record
 * @since 3.4.0
 */
def json(jsonDataset: Dataset[String]): DataFrame

/**
 * Loads a `Dataset[String]` storing CSV rows and returns the result as a `DataFrame`.
 *
 * If the schema is not specified using the `schema` function and the `inferSchema` option is
 * enabled, this function goes through the input once to determine the input schema.
 *
 * If the schema is not specified using the `schema` function and the `inferSchema` option is
 * disabled, it treats all columns as string types and reads only the first line to determine
 * the names and the number of fields.
 *
 * If `enforceSchema` is set to `false`, only the CSV header in the first line is checked
 * to conform to the specified or inferred schema.
 *
 * @note if the `header` option is set to `true` when calling this API, all lines identical to
 * the header are removed, if any exist.
 *
 * @param csvDataset input Dataset with one CSV row per record
 * @since 3.4.0
 */
def csv(csvDataset: Dataset[String]): DataFrame
{code}
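
For reference, the calls described above could be exercised roughly as follows. This is a minimal sketch against the existing (non-Connect) Spark SQL API, which already exposes `json(Dataset[String])` and `csv(Dataset[String])`; it assumes the Connect versions are meant to behave the same way. The object name `ParseDatasetExample`, the helper names, and the sample rows are made up for illustration, and Spark SQL is assumed to be on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical example object; not part of the proposed change.
object ParseDatasetExample {

  // Parses an in-memory Dataset[String] of JSON Lines records.
  // No schema is supplied, so Spark scans the input once to infer it.
  def parseJson(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val jsonLines: Dataset[String] = Seq(
      """{"name": "Alice", "age": 30}""",
      """{"name": "Bob", "age": 25}"""
    ).toDS()
    spark.read.json(jsonLines)
  }

  // Parses an in-memory Dataset[String] of CSV rows.
  // With `header` enabled, the first line supplies the column names and
  // lines identical to the header are dropped from the data.
  def parseCsv(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val csvLines: Dataset[String] = Seq("name,age", "Alice,30", "Bob,25").toDS()
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(csvLines)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used here only for illustration.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("parse-dataset-example")
      .getOrCreate()
    parseJson(spark).printSchema()
    parseCsv(spark).printSchema()
    spark.stop()
  }
}
```

In both cases the resulting schema depends on the data rather than on anything the caller declares, which is the property that motivates a dedicated message.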
 

To support this we need a new Spark Connect proto message. We cannot use Project because the output schema is not known upfront.

 
{code:java}
message Parse {
  // (Required) Input relation to Parse. The input is expected to have a single text column.
  Relation input = 1;
  // (Required) The expected format of the text.
  ParseFormat format = 2;
  enum ParseFormat {
    PARSE_FORMAT_UNSPECIFIED = 0;
    PARSE_FORMAT_CSV = 1;
    PARSE_FORMAT_JSON = 2;
  }
}
{code}
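
To make the shape of the message concrete, here is a small self-contained Scala model of it. This is only a sketch: `Relation`, `TextRows`, `Parse`, `ParseFormat` and `ParsePlan` below are hypothetical, simplified stand-ins written for illustration, not the classes protoc would generate. The `validate` helper is likewise an assumption about how the "(Required)" comments might be checked, given that under proto3 semantics an unset enum field reads as its zero value.

```scala
// Hypothetical stand-in for the proto enum; ids 0, 1, 2 mirror the proto numbers.
object ParseFormat extends Enumeration {
  val Unspecified, Csv, Json = Value
}

// Hypothetical, simplified relation tree.
sealed trait Relation

// Illustrative leaf relation that yields a single text column.
final case class TextRows(rows: Seq[String]) extends Relation

// Mirrors `message Parse`: an input relation plus the expected text format.
final case class Parse(input: Relation, format: ParseFormat.Value) extends Relation

object ParsePlan {
  // Rejects a Parse whose required fields were left unset.
  // Returns Left with a reason, or Right with the unchanged message.
  def validate(parse: Parse): Either[String, Parse] =
    if (parse.input == null) Left("Parse.input is required")
    else if (parse.format == ParseFormat.Unspecified)
      Left("Parse.format must be CSV or JSON")
    else Right(parse)
}
```

For example, `ParsePlan.validate(Parse(TextRows(Seq("a,b")), ParseFormat.Csv))` is accepted, while the same call with `ParseFormat.Unspecified` is rejected. Note that the message carries no schema at all: the output columns would only be known once the input is actually parsed.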
 

 


