Posted to issues@spark.apache.org by "Julien Genini (JIRA)" <ji...@apache.org> on 2015/09/29 15:14:04 UTC
[jira] [Created] (SPARK-10869) Auto-normalization of semi-structured schema from a dataframe
Julien Genini created SPARK-10869:
-------------------------------------
Summary: Auto-normalization of semi-structured schema from a dataframe
Key: SPARK-10869
URL: https://issues.apache.org/jira/browse/SPARK-10869
Project: Spark
Issue Type: New Feature
Components: PySpark
Affects Versions: 1.5.1
Reporter: Julien Genini
Priority: Minor
Today, you can get a multi-depth (nested) schema from a semi-structured DataFrame (XML, JSON, etc.).
Such a schema is not easy to work with in data warehousing, where it is better to normalize the data.
I propose adding an option when you get the schema (linear, default False) that returns a flat list of fields,
each with its full path, together with the list of the different node levels:
df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(linear=True)
>>
{'fields': [{'metadata': {},
             'name': 'BusinessDate',
             'nullable': True,
             'pathName': 'SiteXML.BusinessDate',
             'type': 'string'},
            {'metadata': {},
             'name': 'Id_Group',
             'nullable': True,
             'pathName': 'SiteXML.Site_List.Site.Id_Group',
             'type': 'string'},
            {'metadata': {},
             'name': 'Id_Site',
             'nullable': True,
             'pathName': 'SiteXML.Site_List.Site.Id_Site',
             'type': 'string'},
            {'metadata': {},
             'name': 'label',
             'nullable': True,
             'pathName': 'SiteXML.Site_List.Site.label',
             'type': 'string'},
            {'metadata': {},
             'name': 'label_group',
             'nullable': True,
             'pathName': 'SiteXML.Site_List.Site.label_group',
             'type': 'string'},
            {'metadata': {},
             'name': 'TimeStamp',
             'nullable': True,
             'pathName': 'SiteXML.TimeStamp',
             'type': 'string'}],
 'nodes': [{'name': '', 'nbFields': 3},
           {'name': 'SiteXML', 'nbFields': 1},
           {'name': 'SiteXML.Site_List', 'nbFields': 0},
           {'name': 'SiteXML.Site_List.Site', 'nbFields': 4}]}
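
Until such an option exists, the idea can be approximated outside Spark. The following is only a minimal sketch, not the proposed implementation: it walks the nested dict that the existing df.schema.jsonValue() returns (no arguments) and builds a flat field list plus a node list. The names linearize_schema and _walk are hypothetical, the sample schema's nesting is assumed from the path names above, and nbFields is assumed here to count the leaf fields directly under a node, which need not reproduce the counts shown above.

```python
def _walk(struct, path, fields, nodes):
    """Recursively visit a struct dict, collecting leaf fields and node levels."""
    node = {"name": path, "nbFields": 0}
    nodes.append(node)
    for field in struct["fields"]:
        full_path = f"{path}.{field['name']}" if path else field["name"]
        dtype = field["type"]
        # Look through arrays so that array<struct<...>> is treated as a node.
        while isinstance(dtype, dict) and dtype.get("type") == "array":
            dtype = dtype["elementType"]
        if isinstance(dtype, dict) and dtype.get("type") == "struct":
            _walk(dtype, full_path, fields, nodes)
        else:
            # Assumption: nbFields counts leaf fields directly under this node.
            node["nbFields"] += 1
            fields.append({
                "metadata": field.get("metadata", {}),
                "name": field["name"],
                "nullable": field.get("nullable", True),
                "pathName": full_path,
                "type": dtype,
            })


def linearize_schema(schema_json):
    """Flatten a nested schema dict (shape of StructType.jsonValue())."""
    fields, nodes = [], []
    _walk(schema_json, "", fields, nodes)
    return {"fields": fields, "nodes": nodes}


def _string(name):
    # Helper for building the illustrative schema below.
    return {"name": name, "type": "string", "nullable": True, "metadata": {}}


# Illustrative schema; the exact nesting (Site as an array of structs) is assumed.
example_schema = {
    "type": "struct",
    "fields": [{
        "name": "SiteXML", "nullable": True, "metadata": {},
        "type": {"type": "struct", "fields": [
            _string("BusinessDate"),
            {"name": "Site_List", "nullable": True, "metadata": {},
             "type": {"type": "struct", "fields": [
                 {"name": "Site", "nullable": True, "metadata": {},
                  "type": {"type": "array", "containsNull": True,
                           "elementType": {"type": "struct", "fields": [
                               _string("Id_Group"), _string("Id_Site"),
                               _string("label"), _string("label_group")]}}}]}},
            _string("TimeStamp"),
        ]},
    }],
}

linear = linearize_schema(example_schema)
for f in linear["fields"]:
    print(f["pathName"], f["type"])
```

With a real DataFrame, the input would be df.schema.jsonValue() rather than the hand-written example_schema. Map types are not descended into in this sketch; they are reported as leaf fields with their type dict unchanged.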