Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/04/08 21:50:51 UTC

[GitHub] [incubator-druid] gianm commented on issue #7134: overhaul 'druid-orc-extensions' and move from 'contrib' to 'core'

gianm commented on issue #7134: overhaul 'druid-orc-extensions' and move from 'contrib' to 'core'
URL: https://github.com/apache/incubator-druid/issues/7134#issuecomment-481021756

@clintropolis,

> The primary incompatibility is that the typeString of the current extension allows arbitrary renaming of columns in the ORC file, as only position and type seem to be significant. I'm not certain but I presume the reason this is allowed is detailed in these Hive issues HIVE-7189 and HIVE-4243 which chronicle how Hive would write ORC files without their real column names, instead just using _col0, _col1, etc. However, flattenSpec expressions would be a way to handle this with the new extension, as the fields could be extracted from the generic name _col0 or whatever into the desired column name manually. If we feel that we really need to continue to support the old way the extension worked, I could investigate possible mechanisms to retain this functionality of providing a typeString schema and extracting column names from it, but I don't feel that it is necessary.

I agree here - in my experience writing the typeString is tough for people, and making it optional is a good thing. It sounds like flattenSpecs are going to be able to do all the stuff that typeStrings used to be able to do, which is nice because it's better aligned with how JSON/Avro/Parquet work.
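
For illustration, here is a rough sketch of the renaming case, assuming an ORC file written by Hive with the generic column names _col0 and _col1. The target names "timestamp" and "page" are made up, and the helper build_flatten_spec is hypothetical; the snippet only builds the flattenSpec JSON fragment (one "path" field per column) in Python so it can be pasted into a parseSpec. Field discovery is switched off here on the assumption that the generic _colN names should not also be ingested.

```python
import json

# Hypothetical mapping from Hive's generic ORC column names to the
# names wanted in Druid; both sides are illustrative assumptions.
RENAMES = {"_col0": "timestamp", "_col1": "page"}


def build_flatten_spec(renames):
    """Build a flattenSpec dict with one 'path' field per renamed column."""
    return {
        # Only the explicitly listed fields are made available, so the
        # generic _colN names are not picked up as well.
        "useFieldDiscovery": False,
        "fields": [
            {"type": "path", "name": new_name, "expr": "$." + old_name}
            for old_name, new_name in renames.items()
        ],
    }


flatten_spec = build_flatten_spec(RENAMES)
print(json.dumps({"flattenSpec": flatten_spec}, indent=2))
```

The design choice is that the column names live in the flattenSpec rather than in a hand-written typeString, so no ORC type information has to be spelled out by the user.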

> Another incompatibility would be related to how the current extension handles OrcMap types. It provides a type of flattening 'magic' for maps of primitives that appear in the row with a name controlled by mapFieldNameFormat. Since the new extension would use flattenSpec, these names could be replicated to preserve existing Druid schemas with field extraction expressions.

I feel similarly about this as the typeString thing. The move to 'core' seems like a good time to better align ORC with how other input formats work.
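
As a sketch of how the old flattened map names could be replicated, the snippet below generates one "path" field per map key. Everything specific in it is an assumption for illustration: the map column "attrs", its keys, the hypothetical helper build_map_flatten_fields, the "<PARENT>_<CHILD>" pattern used to stand in for mapFieldNameFormat, and the idea that a path expression of the form $.map.key reaches into an ORC map.

```python
import json


def build_map_flatten_fields(map_column, keys, name_format="<PARENT>_<CHILD>"):
    """Return 'path' field specs named the way the old map flattening would name them.

    name_format stands in for mapFieldNameFormat; the default pattern
    used here is an assumption, not taken from the extension's source.
    """
    fields = []
    for key in keys:
        name = name_format.replace("<PARENT>", map_column).replace("<CHILD>", key)
        fields.append(
            {"type": "path", "name": name, "expr": "$.%s.%s" % (map_column, key)}
        )
    return fields


# Hypothetical map column and keys, chosen only for the example.
fields = build_map_flatten_fields("attrs", ["country", "device"])
print(json.dumps({"flattenSpec": {"useFieldDiscovery": True, "fields": fields}}, indent=2))
```

Unlike the old behavior, the keys have to be listed up front, which is the trade-off for dropping the map-specific 'magic'.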
