
Posted to issues@drill.apache.org by "Jesse Yates (JIRA)" <ji...@apache.org> on 2016/04/18 19:55:25 UTC

[jira] [Commented] (DRILL-4615) Support directory names in schema

    [ https://issues.apache.org/jira/browse/DRILL-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15246159#comment-15246159 ] 

Jesse Yates commented on DRILL-4615:
------------------------------------

I imagine this can be handled with an optional flag and a column/field separator, which seems easy enough to slide in. However, I'm not terribly familiar with the Drill code, so any pointers as to where to start would be great.

It seems like the ParquetGroupScan is already too late in the pipeline, but I'm not sure where else we can put this.
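
To make the flag-plus-separator idea concrete, here is a minimal sketch in Python. It is not Drill code: the function name parse_partition_path and the parameters self_describing and separator are all hypothetical, made up for illustration. It only shows how a path in the layout quoted below could be split into column names and values, falling back to positional names (in the style of Drill's dir0, dir1) when the flag is off or a directory has no separator.

```python
from pathlib import PurePosixPath
from typing import Dict, Tuple


def parse_partition_path(
    path: str,
    self_describing: bool = True,  # hypothetical optional flag
    separator: str = "=",          # hypothetical column/value separator
) -> Tuple[Dict[str, str], str]:
    """Split a file path into (partition columns, data file name)."""
    parts = PurePosixPath(path).parts
    # Drop a leading "/" root; the last component is the data file itself.
    dirs = [p for p in parts[:-1] if p != "/"]
    columns: Dict[str, str] = {}
    for depth, name in enumerate(dirs):
        key, sep, value = name.partition(separator)
        if self_describing and sep and key:
            # Directory looks like "column=value": use the encoded name.
            columns[key] = value
        else:
            # Otherwise fall back to a positional name: dir0, dir1, ...
            columns[f"dir{depth}"] = name
    return columns, parts[-1]


if __name__ == "__main__":
    print(parse_partition_path("/column1=1/column2=hello/data.parquet"))
    print(parse_partition_path("/column1=1/column2=hello/data.parquet",
                               self_describing=False))
```

Note that the values still come back as strings here. Mapping them to proper types is a separate question that this sketch does not attempt to answer.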

> Support directory names in schema
> ---------------------------------
>
>                 Key: DRILL-4615
>                 URL: https://issues.apache.org/jira/browse/DRILL-4615
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: Jesse Yates
>
> In Spark, partitioned Parquet output is written with directories like:
> {code}
> /column1=1
>   /column2=hello
>      /data.parquet
>   /column2=world
>      /moredata.parquet
> /column1=2
> {code}
> However, when querying these files with Drill, we end up interpreting the directories as plain strings, when what they really are is column names + values. The data files themselves contain only the remaining columns. Querying this with Drill means that the partition columns can really only hold a couple of data types (far short of what Spark/Parquet supports) and still have correct operations.
> Given the size of the data, I don't want to have to CTAS all the Parquet files (especially as they are being periodically updated).
> I think this ends up being a nice addition for general file directory reads as well. Many people already encode meaning into their directory structure, but having self-describing directories is even better.


