Posted to user@drill.apache.org by Jacques Nadeau <ja...@dremio.com> on 2015/10/01 19:37:52 UTC

Re: Making parquet data available to Tableau

Hey Chris,

Yes, this is definitely something that should be part of Drill but isn't
yet. REFRESH TABLE METADATA won't actually resolve the issue. There are
three separate pieces that will need to get resolved:

- Automatically scan workspace directories for known files/tables as part
of SHOW TABLES / INFORMATION_SCHEMA (probably only workspace-root files and
one directory down, unless a CTAS was done)
- Use Parquet footer reading to expose table definitions and validate
queries and functions.
- Update Drill to merge differing Parquet schemas using schemaless rules
(and add appropriate projections to scans as necessary)
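
For reference, the per-schema workaround Chris describes below (one Drill
view per Parquet schema, with explicit casts so Tableau sees stable types
over ODBC/JDBC) would look something like this sketch. The workspace,
path, and column names here are made-up placeholders, not from anyone's
actual setup:

```sql
-- Hypothetical Drill view that pins down column types for Tableau.
-- Workspace (dfs.views), path, and column names are illustrative only.
CREATE OR REPLACE VIEW dfs.views.`probe_5min` AS
SELECT
  CAST(`ts`       AS TIMESTAMP) AS event_time,
  CAST(`cell_id`  AS VARCHAR)   AS cell_id,
  CAST(`bytes_up` AS BIGINT)    AS bytes_up,
  CAST(`bytes_dn` AS BIGINT)    AS bytes_dn
FROM dfs.root.`/data/probes/5min`;
```

The three items above would make this boilerplate unnecessary: the footer
already carries everything the casts restate by hand.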

We should put together some JIRAs for each of these and get them done.

Thanks for bringing this up; it is something that should be fixed.


--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Sep 28, 2015 at 6:25 AM, Chris Mathews <ma...@uk2.net> wrote:

> Hi
>
> Being new to Drill I am working on a capabilities study to store telecoms
> probe data as parquet files on an HDFS server, for later analysis and
> visualisation using Tableau Desktop/Server with Drill and Zookeeper via
> ODBC/JDBC etc.
>
> We store the parquet files on the HDFS server using an in-house ETL
> platform, which amongst other things transforms the massive volumes of
> telecoms probe data into millions of parquet files, writing out the parquet
> files directly to HDFS using AvroParquetWriter. The probe data arrives at
> regular intervals (5 to 15 minutes; configurable), so for performance
> reasons we use this direct AvroParquetWriter approach rather than writing
> out intermediate files and loading them via the Drill CTAS route.
>
> There has been some success, together with some frustration. After
> extensive experimentation we have come to the conclusion that to access
> these parquet files using Tableau we have to configure Drill with
> individual views for each parquet schema, and cast the columns to specific
> data types before Tableau can access the data correctly.
>
> This is a surprise, as I thought Drill would have some way of exporting
> the schemas to Tableau, given that we defined Avro schemas for each
> parquet file and the parquet files store the schema as part of the data.
> We now find we have to generate schema definitions in Avro for the
> AvroParquetWriter phase, and also a Drill view for each schema to make
> them visible to Tableau.
>
> Also, as part of our experimentation we did create some parquet files
> using CTAS. The directory is created and the files contain the data but the
> tables do not seem to be displayed when we do a SHOW TABLES command.
>
> Are we correct in our thinking about Tableau requiring views to be
> created, or have we missed something obvious here?
>
> Will the new REFRESH TABLE METADATA <path to table> feature (Drill 1.2?)
> help us when it becomes available?
>
> Help and suggestions much appreciated.
>
> Cheers -- Chris
>
>

Re: Making parquet data available to Tableau

Posted by Boris Chmiel <bo...@yahoo.com.INVALID>.
Hi all,

As part of our experiments and use cases, we found the same limitation
described earlier: having to create a view on top of each parquet file,
re-specifying all the metadata already defined at the source, before the
data can be made available to Tableau. Using the parquet footer would be
a huge improvement in this area.

Regards,
Boris
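
The footer in question is the Thrift-serialized metadata block that every
Parquet file carries at its end, followed by a 4-byte little-endian length
for that block and the 4-byte magic `PAR1`. A minimal Python sketch that
locates it (the function name is ours, and decoding the schema itself would
need a real Parquet library such as parquet-mr or pyarrow):

```python
import struct

def read_parquet_footer_info(path):
    """Return the footer (metadata) length of a Parquet file.

    Per the Parquet format, a file ends with a 4-byte little-endian
    footer length followed by the 4-byte magic b'PAR1'.
    """
    with open(path, "rb") as f:
        f.seek(-8, 2)  # the last 8 bytes: length + magic
        footer_len = struct.unpack("<I", f.read(4))[0]
        magic = f.read(4)
    if magic != b"PAR1":
        raise ValueError("not a Parquet file: bad magic %r" % magic)
    return footer_len
```

This is exactly the metadata Drill could read to expose table definitions
without a hand-written view per schema.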


     On Thursday, 1 October 2015 at 19:38, Jacques Nadeau <ja...@dremio.com> wrote:
