Posted to dev@drill.apache.org by "Jacques Nadeau (JIRA)" <ji...@apache.org> on 2013/01/11 01:52:12 UTC

[jira] [Comment Edited] (DRILL-19) Build a JSON scanner that does schema discovery

    [ https://issues.apache.org/jira/browse/DRILL-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13550633#comment-13550633 ] 

Jacques Nadeau edited comment on DRILL-19 at 1/11/13 12:50 AM:
---------------------------------------------------------------

In the heterogeneous situation, you should just capture the array as type heterogeneous and then encode the schema information with each element in the array. 

A random thought: what do you think about making your schema code output a proto IDL? For the heterogeneous array option, I'd use a type of repeated bytes, on the assumption that each bytes value holds the schema followed by the data.
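
To make the "schema followed by the data" layout concrete, here is a minimal sketch in Python. It is only one possible reading of the suggestion, not anything Drill defines: the 4-byte length prefix, the type tag names, and the use of JSON text for the data part are all assumptions made for illustration, as are the function names.

```python
import json
import struct


def type_tag(value):
    """Return a hypothetical schema tag for one decoded JSON value."""
    # bool must be tested before int because bool is a subclass of int in Python.
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "map"
    if isinstance(value, list):
        return "list"
    raise TypeError(f"unsupported JSON value: {value!r}")


def encode_element(value):
    """Encode one element as: schema length (4 bytes), schema, then data."""
    schema = type_tag(value).encode("utf-8")
    # Assumption: the data part is simply the element's JSON text.
    data = json.dumps(value).encode("utf-8")
    return struct.pack(">I", len(schema)) + schema + data


def decode_element(blob):
    """Split one bytes value back into its (schema, value) pair."""
    (schema_len,) = struct.unpack_from(">I", blob, 0)
    schema = blob[4:4 + schema_len].decode("utf-8")
    value = json.loads(blob[4 + schema_len:].decode("utf-8"))
    return schema, value


def encode_heterogeneous(array):
    """Model a 'repeated bytes' field: one self-describing bytes value per element."""
    return [encode_element(v) for v in array]
```

A reader of such a field would call decode_element on each bytes value, so no element depends on the schema of its neighbours.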

Yes. Not necessarily all the way to a .proto definition, but map to those concepts. Basically, proto is a schema definition language, and you're writing a schema extraction tool. The output should preferably be expressed in a schema definition language, and proto seems like a reasonable one to use. That way you can spend less effort recreating one.

---
I was thinking about what my proto definition looks like when I have a list containing maps, and so on. I was thinking of generating a message definition at the parent level for each map found in lists; however, I'm not sure what class name I can use to guarantee there is no name clash.
---

For naming, I'd suggest that we just carry an incrementing integer and name each message m#####, starting at m00001 and counting upward.
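
The naming scheme can be sketched roughly as follows, again in Python and only as an illustration. The class and function names (MessageNamer, discover, to_proto) are hypothetical, the scalar type mapping is an assumption, and the sketch assumes that JSON keys are valid proto field names and that every map in a list has the same shape as the first one. Lists that are empty or mix value types fall back to repeated bytes, in line with the heterogeneous array idea above.

```python
import itertools


class MessageNamer:
    """Hands out m00001, m00002, ... from one shared counter, so names never clash."""

    def __init__(self):
        self._counter = itertools.count(1)

    def next_name(self):
        return f"m{next(self._counter):05d}"


def scalar_type(value):
    """Map a JSON scalar to an assumed proto scalar type."""
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    # Nulls and nested lists are deliberately left out of this sketch.
    raise TypeError(f"unsupported value in this sketch: {value!r}")


def discover(record, namer, messages):
    """Register a flat, top-level message for a JSON map and return its name."""
    name = namer.next_name()
    fields = []
    messages[name] = fields  # registered first so parents are listed before children
    for tag, (key, value) in enumerate(record.items(), start=1):
        label = "optional"
        if isinstance(value, list):
            label = "repeated"
            if len({type(v) for v in value}) != 1:
                # Empty or heterogeneous list: fall back to opaque bytes.
                fields.append(f"  repeated bytes {key} = {tag};")
                continue
            value = value[0]  # assumption: the first element is representative
        if isinstance(value, dict):
            ftype = discover(value, namer, messages)
        else:
            ftype = scalar_type(value)
        fields.append(f"  {label} {ftype} {key} = {tag};")
    return name


def to_proto(messages):
    """Render the collected messages as proto-style text."""
    return "\n\n".join(
        "message " + name + " {\n" + "\n".join(fields) + "\n}"
        for name, fields in messages.items()
    )
```

Because every map gets its name from the same counter and is emitted at the top level, a map found inside a list is referenced from its parent by name only, for example as a field of type m00002.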
                
      was (Author: jnadeau):
    In the heterogeneous situation, you should just capture the array as type heterogeneous and then encode the schema information with each element in the array. 

Random thought, what do you think about making your schema code output a proto idl?  For the heterogeneous array option,  I'd use a type of repeated bytes with the assumption that each bytes value will be the schema followed by the data.
                  
> Build a JSON scanner that does schema discovery
> -----------------------------------------------
>
>                 Key: DRILL-19
>                 URL: https://issues.apache.org/jira/browse/DRILL-19
>             Project: Apache Drill
>          Issue Type: New Feature
>            Reporter: Jacques Nadeau
>            Assignee: Timothy Chen
>

