Posted to issues@drill.apache.org by "Parth Chandra (JIRA)" <ji...@apache.org> on 2015/03/05 00:58:39 UTC
[jira] [Updated] (DRILL-2265) Drill data exploration function for complex data types

     [ https://issues.apache.org/jira/browse/DRILL-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Parth Chandra updated DRILL-2265:
---------------------------------
    Fix Version/s:     (was: 0.9.0)
                   Future

> Drill data exploration function for complex data types
> ------------------------------------------------------
>
>                 Key: DRILL-2265
>                 URL: https://issues.apache.org/jira/browse/DRILL-2265
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Functions - Drill
>            Reporter: Andries Engelbrecht
>            Assignee: Daniel Barclay (Drill)
>             Fix For: Future
>
>
> Drill data exploration function for complex data types
> When dealing with large volumes of complex data, it would be extremely useful to have a function that collects metadata and gives a better view of the whole data set.
> Taking JSON as an example, a data set can contain an extremely large number of JSON objects. Each object can have multiple schemas and subschemas, with further nested subschemas as well as arrays, and not every object will have all of them. When exploring this data in Drill, SQL dot notation is used to navigate the nested subschema structure, and it can become very cumbersome to get a full picture of all the data.
> The proposed function would explore the JSON objects in a data set (whether a single file with multiple objects or a single-level or multilevel directory structure) and report the combined structure of all the objects, showing every schema, subschema and array that occurs anywhere in the data set. This way a data analyst can see all the schema data that is available. Additionally, if the function can provide statistics showing how many of the objects actually contain each schema, subschema and array (and data in each), this may indicate to an analyst how valuable or important it may be to explore a given subschema or array.
> To speed up the collection of this data, the function could offer an option to set a sample size, so that it reads only a portion of the total volume and projects the results onto the whole data set. This is a very common operation in prominent RDBMS systems today. Additionally, for data that changes or grows, the metadata collection function will need to be run periodically to update the statistics.
> To make the metadata more useful, consideration should be given to storing the results in a Drill metadata structure, similar to INFORMATION_SCHEMA but dedicated to statistics metadata for analysts to use in data exploration. Security should also be designed so that only users with access to the base data can access the metadata.
> In addition to helping data analysts with data exploration, the metadata and statistics could also be used by Drill internally in the future, for example for query optimization and the creation of views.
> This example focuses specifically on JSON data, but the same approach can be applied to other complex data types that require a very detailed understanding of a complex data set.
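
The idea described above can be illustrated with a small standalone sketch. This is not Drill code and not a proposed Drill API: the names iter_paths, explore, coverage and the sample_rate parameter are hypothetical, chosen only for illustration. It assumes newline-delimited JSON input, records each nested path in dot notation (with "[]" marking an array), counts how many objects contain each path, and can optionally sample only a fraction of the objects. It samples by rate rather than by a fixed sample size, and it does not attempt to project counts onto the full data set.

```python
import json
import random
from collections import Counter


def iter_paths(value, prefix=""):
    """Yield the dotted path of every field and array found in one JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            yield path
            yield from iter_paths(child, path)
    elif isinstance(value, list):
        array_path = prefix + "[]"  # "[]" marks "any element of this array"
        yield array_path
        for item in value:
            yield from iter_paths(item, array_path)


def explore(lines, sample_rate=1.0, seed=0):
    """Count, per path, how many sampled JSON objects contain that path.

    sample_rate is a hypothetical option: the fraction of objects to inspect.
    """
    rng = random.Random(seed)
    counts = Counter()
    sampled = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if sample_rate < 1.0 and rng.random() >= sample_rate:
            continue  # object not selected for the sample
        sampled += 1
        # A set ensures each path is counted at most once per object,
        # even when it repeats across array elements.
        counts.update(set(iter_paths(json.loads(line))))
    return sampled, counts


def coverage(sampled, counts):
    """Return the fraction of sampled objects that contain each path."""
    if sampled == 0:
        return {}
    return {path: n / sampled for path, n in counts.items()}


if __name__ == "__main__":
    records = [
        '{"user": {"name": "a", "tags": [{"k": 1}, {"k": 2}]}}',
        '{"user": {"name": "b"}, "geo": {"lat": 1.0}}',
    ]
    sampled, counts = explore(records)
    for path, share in sorted(coverage(sampled, counts).items()):
        print(f"{path}: {share:.0%} of {sampled} objects")
```

A path with low coverage (here, anything under "geo" or "user.tags") is the kind of signal the description has in mind: it tells an analyst how many objects actually carry a given subschema before they write queries against it.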



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)