Posted to user@drill.apache.org by Stefán Baxter <st...@activitystream.com> on 2015/07/15 17:56:23 UTC
Digging deeper
Hi,
We are slowly gaining some Drill/Parquet familiarity as we research it as a
potential replacement for, or addition to, Druid (which we also like a lot).
We have, as stated earlier, come across many things that we like regarding
Drill/Parquet, and the "speed to value" is a killer aspect when dealing with
file-based data.
There are several things we need to understand better before we continue
and all assistance/feedback is appreciated for the following items.
*Combining fresh (JSON/etc.) and historical (Parquet) data*
- Is there a way to mix file types in directory queries?
- Parquet for processed data and JSON for fresh (batch) data waiting to
be turned into Parquet files
- Is there a recommended way to deal with streaming/fresh data?
- I know that there are other tools available in this domain but I
wonder what would be suitable for a "pure Drill" approach
*Performance and setup:*
- Under what circumstances does Drill distribute the workload to
different Drillbits?
- HDFS vs. S3
- Benefits of each approach (We were going the HDFS route but S3 seems
to be less operational "hassle")
- What is the ideal segment size when using S3?
- I have seen the HDFS config discussion and wonder what the S3
equivalent is
- Recommended setup or basic guidelines
- Are there any basic "rules" when it comes to machine
count/configuration vs. volume and load?
- Any "gotchas" regarding performance that we should be aware of?
*Drill & Parquet:*
- What version of Parquet are you using?
- What big-ish changes are required in Parquet to make Drill perform
better?
- How much effect are bloom filters expected to have on performance?
- Are you using the page indexes?
- Are Histograms and HyperLogLog scheduled? (I do not find them in their
Jira)
- When will Drill specific changes be merged upstream into Parquet?
- Are there any new features (that matter) in Parquet that you have not
started using?
*Drill Features (And yes, We will surely vote for these):*
- Update table vs. Create table
- add new data to an existing Parquet structure (a CTAS variant to add
data to existing files with the same PARTITION BY structure)
- JDBC/ODBC datasources
- for dimension information from legacy systems
- We would be using Parquet+Cassandra (for now) unless you recommend
something else
- Survive unexpected EOL (incomplete files)
- disregard the last incomplete JSON/CSV entry to allow querying of open
log files that are being appended to by another process
- (Perhaps a better way exists but I have been running this on live log
files with good success :) )
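To make the first feature request above concrete: today each CTAS has to
target a fresh directory, roughly like this (workspace names, paths and
column names are made up for illustration):

```sql
-- initial load: write partitioned Parquet under a brand-new directory
CREATE TABLE dfs.tmp.`events_2015_07` PARTITION BY (event_date) AS
SELECT event_date, user_id, payload
FROM dfs.staging.`/incoming/2015-07`;
```

What we are asking for is a variant that appends new data into the existing
partitioned structure instead of requiring a new target directory each time.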
I guess this is it for now :).
All the best,
-Stefan
Re: Digging deeper
Posted by Stefán Baxter <st...@activitystream.com>.
Hi again,
I was overlooking the handy UNION operator when I asked about combining
file types in my previous email.
(Feel free to ignore it)
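For the record, the sort of combined query I had in mind, using UNION ALL
(workspace names and paths are made up for illustration):

```sql
-- historical data, already converted to Parquet
SELECT event_date, user_id, payload FROM dfs.archive.`/events/parquet`
UNION ALL
-- fresh data, still sitting in JSON and waiting for conversion
SELECT event_date, user_id, payload FROM dfs.staging.`/events/json`;
```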
Regards,
-Stefan