Posted to issues@spark.apache.org by "Burak Yavuz (Jira)" <ji...@apache.org> on 2019/12/23 17:21:00 UTC

[jira] [Updated] (SPARK-30334) Add metadata around semi-structured columns to Spark

     [ https://issues.apache.org/jira/browse/SPARK-30334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Burak Yavuz updated SPARK-30334:
--------------------------------
    Description: 
Semi-structured data is widely used in the data industry to report events in a variety of formats. Click events in product analytics can be stored as JSON. Some application logs take the form of delimited key=value text. Some data may be in XML.

The goal of this project is to be able to signal to Spark that such a column exists, which will then enable Spark to "auto-parse" the column on the fly. The proposal is to store this information as part of the column metadata, in the following fields:

 - format: The format of the semi-structured column, e.g. json, xml, avro

 - options: Options for parsing these columns
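
A minimal sketch of what these two fields might hold, written in plain Python rather than against Spark's API. The proposal only names the fields format and options; the dictionary layout, the "delimited" format name, and the option names pair_delimiter and key_value_delimiter are assumptions made for illustration.

```python
# Hypothetical column metadata for a JSON column, modeled as a plain dict.
json_column_metadata = {
    "format": "json",  # e.g. json, xml, avro
    "options": {},     # parser-specific options; none needed here
}

# Hypothetical metadata for a delimited key=value column.
# The format name and the option names are assumptions, not part of the proposal.
kv_column_metadata = {
    "format": "delimited",
    "options": {"pair_delimiter": "|", "key_value_delimiter": "="},
}


def describe(metadata):
    """Return a short, stable summary of how a column would be parsed."""
    options = ", ".join(
        f"{key}={value!r}" for key, value in sorted(metadata["options"].items())
    )
    summary = f"format={metadata['format']}"
    return f"{summary} ({options})" if options else summary
```

Keeping options as a free-form map would let each format define its own parser settings without changing the metadata layout.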

Then imagine having the following data:
{code:java}
+------------+-------+--------------------+
|     ts     | event |        raw         |
+------------+-------+--------------------+
| 2019-10-12 | click | {"field":"value"}  |
+------------+-------+--------------------+ {code}
With raw marked as a JSON column, the query

SELECT raw.field FROM data

will return "value".
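
The lookup that raw.field implies can be sketched in plain Python as below. This is not how Spark would implement it; the helper name resolve_json_field is hypothetical, and returning None for malformed input or a missing field is an assumption.

```python
import json


def resolve_json_field(raw, field):
    """Parse a JSON string and return one top-level field.

    Returns None when the input is not valid JSON, is not a JSON object,
    or does not contain the field (an assumed behavior, not from the proposal).
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    return parsed.get(field)


# The sample row from the table above.
row = {"ts": "2019-10-12", "event": "click", "raw": '{"field":"value"}'}
print(resolve_json_field(row["raw"], "field"))  # prints: value
```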

Or, given the following data:
{code:java}
+------------+-------+----------------------+
|     ts     | event |         raw          |
+------------+-------+----------------------+
| 2019-10-12 | click | field1=v1|field2=v2  |
+------------+-------+----------------------+ {code}
With raw marked as delimited key=value text, the query

SELECT raw.field1 FROM data

will return v1.
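
For the delimited case, the parsing step might look like the following plain-Python sketch. The function name parse_key_value and its parameter names are hypothetical; the default delimiters "|" and "=" are taken from the sample row, and in the proposal they would presumably come from the options field.

```python
def parse_key_value(raw, pair_delimiter="|", key_value_delimiter="="):
    """Split delimited key=value text into a dict of strings."""
    result = {}
    for pair in raw.split(pair_delimiter):
        key, separator, value = pair.partition(key_value_delimiter)
        if separator:  # skip fragments that contain no key/value delimiter
            result[key.strip()] = value.strip()
    return result


# The sample row from the table above.
row = {"ts": "2019-10-12", "event": "click", "raw": "field1=v1|field2=v2"}
print(parse_key_value(row["raw"]).get("field1"))  # prints: v1
```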

 

As a first step, we will introduce the function "as_json", which accomplishes this for JSON columns.

  was:
Semi-structured data is used widely in the data industry for reporting events in a wide variety of formats. Click events in product analytics can be stored as json. Some application logs can be in the form of delimited key=value text. Some data may be in xml.

The goal of this project is to be able to signal Spark that such a column exists. This will then enable Spark to "auto-parse" these columns on the fly. The proposal is to store this information as part of the column metadata, in the fields:

 - format: The format of the semi-structured column, e.g. json, xml, avro

 - options: Options for parsing these columns

Then imagine having the following data:
{code:java}
+------------+-------+--------------------+
|     ts     | event |        raw         |
+------------+-------+--------------------+
| 2019-10-12 | click | {"field":"value"}  |
+------------+-------+--------------------+ {code}
SELECT raw.field FROM data

will return "value"

or the following data
{code:java}
+------------+-------+----------------------+
|     ts     | event |         raw          |
+------------+-------+----------------------+
| 2019-10-12 | click | field1=v1|field2=v2  |
+------------+-------+----------------------+ {code}
SELECT raw.field1 FROM data

will return v1.


> Add metadata around semi-structured columns to Spark
> ----------------------------------------------------
>
>                 Key: SPARK-30334
>                 URL: https://issues.apache.org/jira/browse/SPARK-30334
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.4.4
>            Reporter: Burak Yavuz
>            Priority: Major


