Posted to dev@drill.apache.org by Parth Chandra <pa...@apache.org> on 2015/07/29 20:02:56 UTC

Hangout minutes - 2015-07-28

Attendees: Andries, Daniel, Hanifi, Jacques, Jason, Jinfeng, Khurram,
Kristine, Mehant, Neeraja, Parth, Sudheesh (host)

Minutes based on notes from Sudheesh -

1) Jacques is working on the following -
      a) RPC changes - Sudheesh/Parth reported an unexpected regression in
perf numbers. Tests are being rerun.
      b) Apache log format plugin.
      c) Support for double quotes.
      d) Allow JSON literals.

2) Parquet filter pushdown - The patch from Adam Gilmore is awaiting
review. It will conflict with Steven's work on metadata caching, so
metadata caching needs to go in first.

3) JDBC storage plugin - Patch from Magnus. Parth to follow up to get the
updated code.

4) Discussion on Embedded types -
   a) Two types of common problems are being hit -
        1) Soft schema change - Lots of initial nulls, and then a type
appears, or the type changes to a type that can be promoted to the initial
type. Drill assumes the type to be nullable INT if it cannot determine the
type. There was discussion of using nullable VARCHAR/VARBINARY instead of
nullable INT. The suggestion was to introduce some additional types (a
sketch follows this item) -
            i) Introduce a LATE binding type (the type is not yet known).
           ii) Introduce a NULL type - holds only nulls.
          iii) Schema sampling to determine the schema - use for fast
schema.
        2) Hard schema change - A schema change that cannot be
transitioned.
   b) Open questions - How do we materialize this to the user? How do
clients expect to handle schema change events? What does a BI tool like
Tableau do if a new column is introduced? What is the expectation of a
JDBC/ODBC application (what do the standards specify, if anything)?
Neeraja to follow up and specify.
   c) Proposal to add support for embedded types, where each value carries
its type information (covered in DRILL-3228). This requires a detailed
design before we begin implementation.
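
   As an illustration of the soft schema change in 4.a.1 - this is a
hypothetical sketch (the file path and data are made up, and the LATE/NULL
behavior is the proposal, not what Drill does today):

      /* rows.json - the field `b` is null for many leading records,
         and a varchar value only appears much later:
           {"a": 1, "b": null}
           {"a": 2, "b": null}
           ...
           {"a": 999, "b": "hello"}                                  */

      -- Drill reads the first batch, sees only nulls for `b`, and
      -- guesses nullable INT; when a later batch produces a VARCHAR,
      -- the scan reports a schema change mid-query. With a LATE/NULL
      -- type, `b` would stay untyped until real data arrives.
      SELECT a, b FROM dfs.`/tmp/rows.json`;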

5) Discussion on 'Insert into' (based on Mehant's post)
   a) In general, the feature is expected to behave as it does in any
database. Complications arise when the user chooses to insert a schema or
partitioning different from that of the original table.
   b) Jacques's main concern: do we want Drill to be flexible - able to
add columns and to leave columns unspecified while inserting - or do we
want it to behave like a traditional data warehouse, where we do ordinal
matching and are strict about the number of columns being inserted into
the target table?
   c) We should validate the schema where we can (e.g., Parquet); however,
we should start by validating metadata for queries and reuse that feature
in Insert, as opposed to building it into Insert.
   d) If we allow insert into with a different schema and then cannot read
the file, that would be embarrassing.
   e) If we are trying to solve a specific BI tool use case for inserts,
then we should explore going down the route of solving that specific use
case and treat the insert like CTAS today (sketched below).
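
   A sketch of what treating the insert like CTAS could look like - the
workspace and table names are hypothetical, and the INSERT INTO syntax is
assumed (Drill does not implement it yet):

      -- What Drill supports today: CTAS writes a brand-new table (a
      -- directory of files) from a query.
      CREATE TABLE dfs.tmp.`sales_2015` AS
      SELECT * FROM dfs.`/data/sales` WHERE yr = 2015;

      -- First cut under discussion: INSERT INTO appends newly written
      -- files to the existing target directory with no schema or
      -- ordinal validation, exactly as if the SELECT had gone through
      -- CTAS.
      INSERT INTO dfs.tmp.`sales_2015`
      SELECT * FROM dfs.`/data/sales_late` WHERE yr = 2015;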


6) Discussion on 'Drop table'
  a) Strict identification of tables - don't drop tables that Drill can't
query.
  b) Fail if there is a file that does not match.
  c) If impersonation is not enabled, then drop only Drill-owned tables.
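
   A sketch of the proposed behavior - the syntax is assumed to follow the
usual SQL form, and the paths are hypothetical:

      -- Allowed: the target is strictly identified as a table that
      -- Drill itself can query (e.g., a directory containing only
      -- Parquet files).
      DROP TABLE dfs.tmp.`sales_2015`;

      -- Refused: the directory contains a file that does not match a
      -- queryable format, so per (b) the statement fails rather than
      -- delete data Drill does not own.
      DROP TABLE dfs.`/data/mixed_dir`;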

   More detailed notes on #5 and #6 to be posted by Jacques.

Re: Hangout minutes - 2015-07-28

Posted by Jacques Nadeau <ja...@dremio.com>.
My key notes:

INSERT INTO
This functionality should be consistent with how Drill views data and the
world. It seems there are a number of missing foundational components that
should be built before this could work "right".

First steps that should be done before INSERT INTO, EMBEDDED and DROP:

   - Add a null or nullable ANY type to the execution flow.  Don't
   materialize this until necessary (convert new-field existence from the
   current hard schema change to a soft schema change). A sketch follows
   this list.
   - Add an INSERT INTO that works like CTAS.  This is the simplest way
   to start and defers many decisions until later. It will also solve the
   top two use cases: BI tool temporary tables and advanced user use
   cases (but is a bit of a sharp instrument).
   - Implement "thorough table identification".  Right now a directory
   can contain multiple types of files (potentially queryable and
   non-queryable).
   - Add support for Parquet schema reading, merging and validation.
   Parquet has a schema; Drill shouldn't expose it as schemaless.  This
   will lay the groundwork for a number of types of validation around
   INSERT INTO, DROP, etc. (It will also require a full deconflicting of
   implicit casting behavior between the validator and the execution
   layer.)
   - Start planning around "dot drill" files (a.k.a. DRILL-3572).  Many
   of the things that need to be supported to make these features work
   "like a database" require this.
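
To make the first bullet concrete, here is a hypothetical before/after
(the file layout and paths are made up):

   -- Two Parquet files under the same table directory:
   --   /data/t/part1.parquet  -> columns (a, b)
   --   /data/t/part2.parquet  -> columns (a, b, c)

   -- Today the appearance of `c` mid-scan is a hard schema change and
   -- can fail the query. With a null/nullable-ANY type in the execution
   -- flow, `c` would simply read as NULL for part1's rows and only
   -- materialize to a concrete type where data exists:
   SELECT a, b, c FROM dfs.`/data/t`;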



--
Jacques Nadeau
CTO and Co-Founder, Dremio
