Posted to user@drill.apache.org by Stefán Baxter <st...@activitystream.com> on 2015/07/15 17:56:23 UTC

Digging deeper

Hi,

We are slowly gaining some Drill/Parquet familiarity as we research it as a
potential replacement for, or addition to, Druid (which we also like a lot).

We have, as stated earlier, come across many things that we like about
Drill/Parquet, and the "speed to value" is a killer aspect when dealing
with file-based data.

There are several things we need to understand better before we continue,
and all assistance/feedback on the following items is appreciated.

*Combining fresh (JSON/etc.) and historical (Parquet) data*

   - Is there a way to mix file types in directory queries? (a sketch of
   what we mean follows this list)
   - parquet for processed data and JSON for fresh (badge) data waiting to
   be turned into Parquet files


   - Is there a recommended way to deal with streaming/fresh data?
   - I know that there are other tools available in this domain, but I
   wonder what would be suitable for a "pure Drill" approach
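
To make the first question concrete: ideally a single directory query
could read a root that holds both formats. Something like this is what
we have in mind (the path and column names are made up):

    -- Desired: /data/events contains both Parquet files (processed) and
    -- JSON files (fresh), and one directory query reads them all.
    SELECT occurred_at, account_id, action
    FROM dfs.`/data/events`;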

*Performance and setup:*

   - Under what circumstances does Drill distribute the workload to
   different Drillbits? (an illustration follows just below)
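
For illustration, this is the kind of check we have been leaning on (the
table name is a placeholder); we assume that exchange operators in the
physical plan mean the work is spread across Drillbits:

    -- Exchange operators in the plan output (e.g. HashToRandomExchange)
    -- suggest the query will be parallelized across Drillbits.
    EXPLAIN PLAN FOR
    SELECT action, COUNT(*)
    FROM dfs.`/data/history`
    GROUP BY action;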

   - HDFS vs. S3
   - Benefits of each approach (we were going the HDFS route, but S3 seems
   to be less operational "hassle")

   - What is the ideal segment size when using S3?
   - I have seen the HDFS config discussion and wonder what the S3
   equivalent is

   - Recommended setup or basic guidelines
   - Are there any basic "rules" when it comes to machine
   count/configuration vs. volume and load?

   - Any "gotchas" regarding performance that we should be aware of?


*Drill & Parquet:*

   - What version of Parquet are you using?

   - What big-ish changes are required in Parquet to make Drill perform
   better?
   - How much effect are bloom filters expected to have on performance?
   - Are you using the page indexing?
   - Are Histograms and HyperLogLog scheduled? (I do not find them in
   their Jira)

   - When will Drill-specific changes be merged upstream into Parquet?

   - Are there any new features (that matter) in Parquet that you have not
   started using?


*Drill Features (and yes, we will surely vote for these):*

   - Update table vs. Create table
   - add new data to an existing Parquet structure (a CTAS variant to add
   data to an existing table with the same PARTITION BY structure); the
   workaround we are considering is sketched below
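
The workaround we are considering for this (purely an assumption on our
side): since directory queries read every file under a root, each new
batch could be written into a fresh subdirectory under the same root
instead of appending to existing files. Roughly:

    -- Hypothetical: dfs.tmp is a writable workspace; each run writes a
    -- new dated subdirectory, so a directory query over `events` sees
    -- both the old and the new Parquet files.
    CREATE TABLE dfs.tmp.`events/2015_07_15` AS
    SELECT occurred_at, account_id, action
    FROM dfs.`/data/fresh`;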

   - JDBC/ODBC datasources
   - for dimension information from legacy systems
   - We would be using Parquet+Cassandra (for now) unless you recommend
   something else

   - Survive unexpected EOL (incomplete files)
   - disregard the last incomplete JSON/CSV entry to allow querying of open
   log files that are being appended to by another process (a sketch of
   the kind of option we mean follows this list)
   - (Perhaps a better way exists, but I have been running this on live-log
   files with good success :) )
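
For the incomplete-files item, something along these lines is what we
are after (the option name is our own invention for illustration; we
have not found such a setting in the current release):

    -- Assumed/illustrative option: tell the JSON reader to skip records
    -- it cannot parse, e.g. a truncated trailing entry in a live log.
    ALTER SESSION SET `store.json.reader.skip_invalid_records` = true;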


I guess this is it for now :).

All the best,
 -Stefan

Re: Digging deeper

Posted by Stefán Baxter <st...@activitystream.com>.
Hi again,

I overlooked the handy UNION operator when writing the combination part
of my previous email.
(Feel free to ignore that part)

Regards,
 -Stefan
