Posted to dev@orc.apache.org by ddrinka <gi...@git.apache.org> on 2018/09/13 23:11:26 UTC

[GitHub] orc pull request #308: Deliver a lower-case schema to OrcFile

GitHub user ddrinka opened a pull request:

    https://github.com/apache/orc/pull/308

    Deliver a lower-case schema to OrcFile

    Mixed-case struct field names don't work in Hive.  There should be a way to convert a camel-cased JSON document into ORC without having to pre-process the JSON.
    
    This pull request is a proof of concept that generates two schemas: one in the original case, which is provided to the JsonReader as usual, and a lower-cased one, which is provided to OrcFile.
    
    TypeDescription is immutable and non-trivial to manually clone using public accessors, so to make the idea clear, I do the conversion at schema ingest rather than where it's provided to OrcFile.  The downside of this approach is that automatic schema detection doesn't benefit from these changes.  A more experienced implementer could certainly do better.
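
A minimal sketch of the two-schema idea, assuming the schema is supplied as an ORC type string such as `struct<name:type,...>`. The original string stays with the JSON reader and a lower-cased copy goes to the writer. Since the type keywords in that notation are already lower case, lower-casing the whole string changes only the field names. The class and method names are hypothetical and this is not the code from the pull request.

```java
import java.util.Locale;

public class LowerCaseSchemaSketch {
    // Hypothetical helper: lower-cases every field name in a schema string.
    // Assumes type keywords (struct, int, string, ...) are already lower case,
    // so lower-casing the whole string leaves them unchanged.
    public static String toLowerCaseSchema(String schema) {
        return schema.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Schema in the JSON document's own case, as given to the reader.
        String readerSchema = "struct<userId:int,homeAddress:struct<zipCode:string>>";
        // Lower-cased copy, as given to the writer.
        String writerSchema = toLowerCaseSchema(readerSchema);
        System.out.println(readerSchema);
        System.out.println(writerSchema);
    }
}
```

`Locale.ROOT` keeps the result independent of the machine's default locale, which matters for letters such as the Turkish dotless i.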

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ddrinka/orc ddrinka-pr-lowercase-schema

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/orc/pull/308.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #308
    
----
commit cc7e909725d059b69f9a8c384aca2691b52ce0ff
Author: Douglas Drinka <dd...@...>
Date:   2018-09-13T22:59:11Z

    Deliver a lower-case schema to OrcFile

----


---

[GitHub] orc issue #308: Convert tool should create a lowercase schema

Posted by omalley <gi...@git.apache.org>.

GitHub user omalley commented on the issue:

    https://github.com/apache/orc/pull/308
  
    Actually, this doesn't match my experience. Hive doesn't support uppercase letters in the top-level column names, but it does support them in sub-structs. Furthermore, ORC from other systems can handle the uppercase letters.
    
    You might also try setting "orc.schema.evolution.case.sensitive" to false to get name-based matching of the ORC types that isn't case sensitive.
    
    In terms of this patch, can you add a parameter that allows the user to downcase the column names?


---

[GitHub] orc issue #308: Convert tool should create a lowercase schema

Posted by omalley <gi...@git.apache.org>.

GitHub user omalley commented on the issue:

    https://github.com/apache/orc/pull/308
  
    Thinking about this more, I'd like to suggest that if the user asks for a lowercase schema, the field names are lowercased at the point where the fields of a struct are parsed. That will prevent "aa" and "AA" from both being converted to "aa" without being merged into a single field.
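
The suggestion above could be sketched roughly as follows: each field name is lower-cased as it is added to the struct, so "aa" and "AA" land on the same entry instead of producing two fields with the same name. This is a standalone illustration using a plain map as a stand-in for a struct type. The class, its methods, and the choice to reject conflicting types rather than merge them are all assumptions, not ORC's actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class LowerCaseStructSketch {
    // Field name -> type name, in first-seen order (stand-in for a struct type).
    private final Map<String, String> fields = new LinkedHashMap<>();
    private final boolean lowerCase;

    public LowerCaseStructSketch(boolean lowerCase) {
        this.lowerCase = lowerCase;
    }

    // Called once per field while the struct is being parsed.
    public void addField(String name, String type) {
        String key = lowerCase ? name.toLowerCase(Locale.ROOT) : name;
        // Names that collide after lower-casing share a single entry.
        String existing = fields.putIfAbsent(key, type);
        if (existing != null && !existing.equals(type)) {
            // Simplification: a real tool would probably merge the two types.
            throw new IllegalArgumentException(
                "Conflicting types for field " + key + ": " + existing + " vs " + type);
        }
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder("struct<");
        String sep = "";
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append(sep).append(e.getKey()).append(':').append(e.getValue());
            sep = ",";
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        LowerCaseStructSketch struct = new LowerCaseStructSketch(true);
        struct.addField("aa", "int");
        struct.addField("AA", "int");   // merged with "aa"
        struct.addField("Bb", "string");
        System.out.println(struct);     // prints struct<aa:int,bb:string>
    }
}
```

Lower-casing only after the whole schema has been built would leave both "aa" and "AA" as separate fields that end up with the same name, which is the case the comment warns about.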


---