Posted to user@drill.apache.org by Stefán Baxter <st...@activitystream.com> on 2015/12/27 14:26:18 UTC

A single users view/opinion of Drill

Hi Drillers,

I have been meaning to share some thoughts on Drill for a long time, and on
what I, or we at Activity Stream, believe would make Drill better (for us).
Please keep in mind that this is a one-sided view from a simple,
non-contributing user, and please excuse my English.

We love using Drill, and our setup includes Drill, Parquet, Avro, JSON, JDBC
sources and more. Drill offers many great things, but what initially tipped
our decision towards Drill, over Presto, was that we could use it with both
Hive/HDFS and local disk storage, along with its support for the various
data sources.

Working with Drill has not always been easy and we have spent a lot of time
adjusting to "Drill quirks", like defaulting to Double for values that show
up too late, but the "this is awesome" moments have always been more
frequent than the "I don't believe this s**t" moments (please excuse the
language).

We see the main roles of Drill as the following:

   - Run distributed and fast SQL on top of various data sources and allow
   us to mix the data into a single result
   - Eliminate ETL by supporting evolving schema


Some discussion points:

*1. Null exists, let's use it!  (some pun intended)*

   - If a field is missing let's return Null
   - Schema validation is great but in a polyglot and mixed schema
   environment that should not surprise anyone.
   - Drill has a bunch of functions to deal with null values and

*2. String is the lowest common denominator*

   - Almost all values can be converted to and from Strings

   - Let's use Strings as the default value type if values are missing
   - Instead of Double (pet peeve)

   - Let's always convert String values automatically, if functions are
   expecting other value types and the value is applicable for conversion
   - Create a warning that this is being done when it's affecting
   performance rather than throw errors
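
To illustrate what I mean by "convert and warn", here is a minimal sketch in
Python. It is not Drill code; the function name `coerce` is made up for this
example, and returning null when a value cannot be converted is my own
assumption, in the spirit of point 1.

```python
import logging

log = logging.getLogger("coercion_sketch")

def coerce(value, target):
    """Convert value to the target type; warn and return None if it cannot be converted.

    Hypothetical helper for illustration only, not part of Drill.
    """
    # Nulls and values that already have the right type pass through untouched.
    if value is None or isinstance(value, target):
        return value
    try:
        return target(value)
    except (TypeError, ValueError):
        # Assumption: log a warning and fall back to null instead of failing the query.
        log.warning("could not convert %r to %s; returning null",
                    value, target.__name__)
        return None
```

With this, `coerce("42", int)` yields an integer, while `coerce("abc", int)`
logs a warning and yields null instead of aborting the whole query.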

*3. Be as tolerant towards data as possible - Log warning rather than throw
errors*

Let's minimize the "conversion boilerplate" needed in SQL by having a
more flexible infrastructure.

   - ISO Date-String, Time-Stamp and Long are all valid Dates, let's treat
   them as such in any function or condition

   - Integers can be accurately converted to Real/Double, let's not make
   that difference matter (Going the other way is not the same)

   - Other pointers
      - Missing tables in a union could return empty data sets rather than
      throw errors
      - Empty files should always return empty result sets
      - Valid JSON files, starting with "[" and ending with "]" should be
      trimmed to be suitable for Drill
      - It seems odd that Drill only supports non-standard lists
      - An "incomplete last line" in any log file (JSON, CSV etc.) should
      be ignored as it could represent an incomplete append operation
      (live logs)
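
The file-handling pointers above can be sketched roughly like this in
Python. This only illustrates the behavior I am asking for and says nothing
about how Drill's readers are implemented; the function name `read_records`
is made up for the example.

```python
import json

def read_records(text):
    """Return records from a JSON array file or a newline-delimited JSON file.

    Hypothetical helper for illustration only, not part of Drill.
    """
    stripped = text.strip()
    if not stripped:
        return []                       # empty file -> empty result set
    if stripped.startswith("[") and stripped.endswith("]"):
        return json.loads(stripped)     # standard JSON list
    records = []
    lines = stripped.splitlines()
    for i, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            if i == len(lines) - 1:
                break                   # incomplete last line (live log): ignore it
            raise                       # a broken line in the middle is still an error
    return records
```

Only a broken final line is skipped here; a malformed line in the middle of
a file still raises, since that is more likely to be real corruption than an
append in progress.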

*4. Consistency between data/storage formats if at all possible*

Having different behavior in Parquet and Avro, for example, when it comes
to missing fields is counterintuitive and appears fragmented.

Please synchronize the way "Drill behaves" rather than fragment on how
every single format reader behaves.



This is by no means an exhaustive list but I just wanted to see if I could
get this ball rolling.


Hope you are all enjoying the holidays.

Best regards,
 -Stefán

ps.
Our only contribution to Drill is this simple UDF library:
https://github.com/activitystream/asdrill (Apache license)

Re: A single users view/opinion of Drill

Posted by Matt <bs...@gmail.com>.

>  - Let's use Strings as the default value type if values are missing
>  - Instead of Double (pet peeve)

This might be more important than just a pet peeve. Is there a reason
Drill casts by default to a more restrictive data type instead of less
restrictive strings?
