Posted to issues@spark.apache.org by "Herman van Hövell (Jira)" <ji...@apache.org> on 2023/03/06 19:29:00 UTC
[jira] [Created] (SPARK-42690) Implement CSV/JSON parsing functions
Herman van Hövell created SPARK-42690:
-----------------------------------------
Summary: Implement CSV/JSON parsing functions
Key: SPARK-42690
URL: https://issues.apache.org/jira/browse/SPARK-42690
Project: Spark
Issue Type: New Feature
Components: Connect
Affects Versions: 3.4.0
Reporter: Herman van Hövell
Implement the following two methods in DataFrameReader:
{code:java}
/**
* Loads a `Dataset[String]` storing JSON objects (<a href="http://jsonlines.org/">JSON Lines
* text format or newline-delimited JSON</a>) and returns the result as a `DataFrame`.
*
* Unless the schema is specified using the `schema` function, this function goes through the
* input once to determine the input schema.
*
* @param jsonDataset input Dataset with one JSON object per record
* @since 3.4.0
*/
def json(jsonDataset: Dataset[String]): DataFrame
/**
* Loads a `Dataset[String]` storing CSV rows and returns the result as a `DataFrame`.
*
* If the schema is not specified using the `schema` function and the `inferSchema` option is
* enabled, this function goes through the input once to determine the input schema.
*
* If the schema is not specified using the `schema` function and the `inferSchema` option is
* disabled, it treats all columns as string types and reads only the first line to determine
* the names and the number of fields.
*
* If `enforceSchema` is set to `false`, only the CSV header in the first line is checked
* for conformance to the specified or inferred schema.
*
* @note if the `header` option is set to `true` when calling this API, all lines identical to
* the header are removed, if any exist.
*
* @param csvDataset input Dataset with one CSV row per record
* @since 3.4.0
*/
def csv(csvDataset: Dataset[String]): DataFrame
{code}
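For reference, the behaviour these two methods are expected to provide can be sketched with a classic (non-Connect) SparkSession, where `DataFrameReader.json(Dataset[String])` and `DataFrameReader.csv(Dataset[String])` already exist. This is only an illustration: it assumes `spark-sql` is on the classpath and a local master, and the object and helper names (`ParseDatasetExample`, `parseJson`, `parseCsv`) are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, used only to illustrate the two reader methods.
object ParseDatasetExample {

  // Parse JSON Lines held in a Dataset[String]; the schema is inferred from the data.
  def parseJson(spark: SparkSession, lines: Seq[String]): DataFrame = {
    import spark.implicits._
    spark.read.json(lines.toDS())
  }

  // Parse CSV rows held in a Dataset[String]; the first row is treated as the header
  // and column types are inferred from the remaining rows.
  def parseCsv(spark: SparkSession, lines: Seq[String]): DataFrame = {
    import spark.implicits._
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(lines.toDS())
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, classic SparkSession rather than a Connect client.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("parse-dataset-example")
      .getOrCreate()

    parseJson(spark, Seq("""{"id": 1, "name": "a"}""", """{"id": 2, "name": "b"}""")).printSchema()
    parseCsv(spark, Seq("id,name", "1,a", "2,b")).printSchema()

    spark.stop()
  }
}
```

In both cases the resulting schema depends on the contents of the input, not on anything the caller declares ahead of time.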
For this we need a new message. We cannot use a Project relation because the output schema is not known up front; it is only determined once the input has been parsed.
{code:java}
message Parse {
// (Required) Input relation to parse. The input is expected to have a single text column.
Relation input = 1;
// (Required) The expected format of the text.
ParseFormat format = 2;
enum ParseFormat {
PARSE_FORMAT_UNSPECIFIED = 0;
PARSE_FORMAT_CSV = 1;
PARSE_FORMAT_JSON = 2;
}
}
{code}
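One way the server side could interpret such a message is sketched below. This is not the actual planner code: the `ParseFormat` trait is a hand-written stand-in for the generated protobuf enum, `transformParse` is a hypothetical name, and the sketch assumes the input relation has already been resolved to a `DataFrame` with a single string column.

```scala
import org.apache.spark.sql.{DataFrame, Dataset}

// Hypothetical sketch of dispatching on the proposed Parse message.
object ParseSketch {

  // Stand-in for the generated protobuf enum Parse.ParseFormat (assumption).
  sealed trait ParseFormat
  case object Unspecified extends ParseFormat
  case object Csv extends ParseFormat
  case object Json extends ParseFormat

  // Turn a single-column text relation into a parsed DataFrame.
  def transformParse(format: ParseFormat, input: DataFrame): DataFrame = {
    require(input.schema.fields.length == 1, "Parse expects a single text column")

    val spark = input.sparkSession
    import spark.implicits._
    val text: Dataset[String] = input.as[String]

    format match {
      case Csv  => spark.read.csv(text)   // schema comes from the CSV rows
      case Json => spark.read.json(text)  // schema is inferred from the JSON objects
      case Unspecified =>
        throw new IllegalArgumentException("Parse format must be specified")
    }
  }
}
```

Because the output columns are only known after the reader has looked at the data, the result cannot be described ahead of time as a list of projection expressions, which is the motivation for a dedicated message.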