Posted to issues@spark.apache.org by "Julien Genini (JIRA)" <ji...@apache.org> on 2015/09/29 15:14:04 UTC

[jira] [Created] (SPARK-10869) Auto-normalization of semi-structured schema from a dataframe

Julien Genini created SPARK-10869:
-------------------------------------

             Summary: Auto-normalization of semi-structured schema from a dataframe
                 Key: SPARK-10869
                 URL: https://issues.apache.org/jira/browse/SPARK-10869
             Project: Spark
          Issue Type: New Feature
          Components: PySpark
    Affects Versions: 1.5.1
            Reporter: Julien Genini
            Priority: Minor


Today, reading a semi-structured source (XML, JSON, etc.) into a dataframe gives you a nested, multi-level schema.
That is awkward to work with in data warehousing, where it is better to normalize the data.

I propose adding an option (linear, default False) to the call that returns the schema.
When it is set, the result gives the full path of each field and the list of the different node levels:

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(linear=True)

>>
{'fields': [{'metadata': {},
             'name': 'BusinessDate',
             'nullable': True,
             'pathName': 'SiteXML.BusinessDate',
             'type': 'string'},
            {'metadata': {},
             'name': 'Id_Group',
             'nullable': True,
             'pathName': 'SiteXML.Site_List.Site.Id_Group',
             'type': 'string'},
            {'metadata': {},
             'name': 'Id_Site',
             'nullable': True,
             'pathName': 'SiteXML.Site_List.Site.Id_Site',
             'type': 'string'},
            {'metadata': {},
             'name': 'label',
             'nullable': True,
             'pathName': 'SiteXML.Site_List.Site.label',
             'type': 'string'},
            {'metadata': {},
             'name': 'label_group',
             'nullable': True,
             'pathName': 'SiteXML.Site_List.Site.label_group',
             'type': 'string'},
            {'metadata': {},
             'name': 'TimeStamp',
             'nullable': True,
             'pathName': 'SiteXML.TimeStamp',
             'type': 'string'}],
 'nodes': [{'name': '', 'nbFields': 3},
           {'name': 'SiteXML', 'nbFields': 1},
           {'name': 'SiteXML.Site_List', 'nbFields': 0},
           {'name': 'SiteXML.Site_List.Site', 'nbFields': 4}]}
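
For illustration, here is a minimal sketch of how such a linear view could be computed today from the dict that the existing df.schema.jsonValue() returns. The function name linearize_schema and the sample schema are hypothetical, not part of Spark. The nbFields rule used here (leaf fields directly under a node) is an assumption and does not necessarily reproduce the counts shown above.

```python
def linearize_schema(schema_json):
    """Flatten a nested schema dict (the shape returned by StructType.jsonValue()).

    Hypothetical helper: returns {'fields': [...], 'nodes': [...]}, where each
    leaf field carries a dotted 'pathName' and each struct becomes a node.
    """
    fields = []
    nodes = []

    def walk(struct, path):
        # Register this struct level; nbFields counts its direct leaf fields
        # (an assumed convention).
        node = {'name': path, 'nbFields': 0}
        nodes.append(node)
        for field in struct['fields']:
            field_path = path + '.' + field['name'] if path else field['name']
            dtype = field['type']
            # Look through arrays so that arrays of structs are descended into.
            while isinstance(dtype, dict) and dtype.get('type') == 'array':
                dtype = dtype['elementType']
            if isinstance(dtype, dict) and dtype.get('type') == 'struct':
                walk(dtype, field_path)
            else:
                # Leaf: keep the original type (a string, or a dict for
                # arrays/maps of primitives).
                node['nbFields'] += 1
                fields.append({
                    'metadata': field.get('metadata', {}),
                    'name': field['name'],
                    'nullable': field.get('nullable', True),
                    'pathName': field_path,
                    'type': field['type'],
                })

    walk(schema_json, '')
    return {'fields': fields, 'nodes': nodes}


def _leaf(name):
    # Builds a nullable string field in the jsonValue() layout.
    return {'name': name, 'type': 'string', 'nullable': True, 'metadata': {}}


# Made-up schema, loosely modelled on the field names in the output above.
site = {'type': 'struct', 'fields': [_leaf('Id_Site'), _leaf('label')]}
site_list = {'type': 'struct', 'fields': [
    {'name': 'Site', 'nullable': True, 'metadata': {},
     'type': {'type': 'array', 'elementType': site, 'containsNull': True}},
]}
sample = {'type': 'struct', 'fields': [
    {'name': 'SiteXML', 'nullable': True, 'metadata': {},
     'type': {'type': 'struct', 'fields': [
         _leaf('BusinessDate'),
         {'name': 'Site_List', 'nullable': True, 'metadata': {},
          'type': site_list},
         _leaf('TimeStamp'),
     ]}},
]}

result = linearize_schema(sample)
```

With a real dataframe the call would be linearize_schema(df.schema.jsonValue()). Map types are treated as leaves in this sketch.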
