Posted to user@drill.apache.org by Dan Blondowski <da...@dhigroupinc.com> on 2020/01/21 13:08:05 UTC
Re: Clarification regarding Apache drill setup
unsubscribe
On 8/16/19, 7:16 AM, "Nitin Pawar" <ni...@gmail.com> wrote:
This is from my own experience, and I could be wrong on a few things, so
please wait for others to answer as well.
1. When setting up the drill cluster in prod environment to query data
ranging from several gigabytes to few terabytes hosted in s3/blob
storage/cloud storage, what are the considerations for disk space ? I
understand drill bits make use of data locality, but how does that work in
case of cloud storage like s3 ? Will the entire data from s3 be moved to
drill cluster before starting the query processing ?
It is advisable to use Parquet as your file format; it improves
performance a lot. Drill will fetch all the data it needs to process a
given query. This amount can be reduced if you arrange your folder
structure around filterable columns such as dates. When you are using
Parquet files, the files or blocks are downloaded separately by the
drillbit servers, and the data is then localized according to your query
pattern, for example when you group by, or filter and then sum. The data
generally resides in memory and starts spilling to disk depending on your
query patterns.
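To illustrate the folder-structure point above, here is a minimal sketch
in Python, not a definitive layout. It assumes a hypothetical
year/month/day directory scheme and a hypothetical storage plugin named
`s3` with a folder called `events`; none of these names come from the
original question. The only Drill feature relied on is the implicit
directory columns `dir0`, `dir1`, ..., which let a query filter on
directory levels so that fewer files have to be fetched.

```python
from datetime import date, timedelta


def partition_path(root: str, day: date) -> str:
    """Directory for one day's files under a year/month/day layout.

    The layout itself is an assumption made for this example.
    """
    return f"{root}/{day.year:04d}/{day.month:02d}/{day.day:02d}"


def paths_for_range(root: str, start: date, end: date) -> list:
    """All daily directories between start and end, inclusive."""
    days = (end - start).days + 1
    return [partition_path(root, start + timedelta(days=i)) for i in range(days)]


def drill_query(table: str, year: int, month: int) -> str:
    """Build a Drill SQL string that filters on directory levels.

    dir0 is the first directory level below the queried path (the year
    here) and dir1 is the second (the month).
    """
    return (
        f"SELECT COUNT(*) FROM {table} "
        f"WHERE dir0 = '{year:04d}' AND dir1 = '{month:02d}'"
    )


if __name__ == "__main__":
    for path in paths_for_range("events", date(2019, 8, 15), date(2019, 8, 17)):
        print(path)
    # `s3` and `events` are hypothetical plugin and folder names.
    print(drill_query("s3.`events`", 2019, 8))
```

With data written this way, a query restricted to one month only needs
the files under that month's directories rather than the whole data set.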
2. Is it possible to use s3 or other cloud storage solutions for Sort,
Hash Aggregate, and Hash Join operators spill data rather than using local
disk ?
As per my understanding, only local disks are used for aggregations that
do not fit in memory. Using cloud-based storage systems for intermediate
output incurs heavy network I/O and causes huge delays in queries.
3. Is it ok to run drill production cluster without hadoop ? Is just
zookeeper quorum enough ?
You do NOT need to set up a Hadoop cluster. Apache Drill has no
prerequisite on Hadoop for execution purposes unless you are using those
feature sets of Apache Drill.
To run a Drill cluster, a ZooKeeper quorum is more than sufficient. From
there on, based on which storage systems you use, you will need to create
storage plugins and use them.
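As a rough sketch of what such a storage plugin could look like for S3,
the Python snippet below builds a file-type plugin definition and prints
it as JSON. The bucket name `my-example-bucket` and the single `root`
workspace are placeholders assumed for this example, and credentials are
deliberately left out. Check the Drill documentation for the exact
fields your version expects before using anything like this.

```python
import json


def s3_storage_plugin(bucket: str) -> dict:
    """Return a minimal file-type storage plugin definition for an S3 bucket.

    The bucket and workspace values are placeholders; credentials are
    intentionally omitted and must be supplied separately.
    """
    return {
        "type": "file",
        "enabled": True,
        "connection": f"s3a://{bucket}",
        "workspaces": {
            "root": {
                "location": "/",
                "writable": False,
                "defaultInputFormat": None,
            }
        },
        "formats": {
            "parquet": {"type": "parquet"},
        },
    }


if __name__ == "__main__":
    # Hypothetical bucket name, for illustration only.
    print(json.dumps(s3_storage_plugin("my-example-bucket"), indent=2))
```

The printed JSON is the kind of definition that can be pasted into the
Storage tab of the Drill web UI when creating a new plugin.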
On Fri, Aug 16, 2019 at 10:38 AM Manu Mukundan <ma...@prevalent.ai>
wrote:
> Hi,
>
> My name is Manu and I am working as a Bigdata architect in a small startup
> company in Kochi, India. Our new project handles visualizing large volume
> of unstructured data in cloud storage (It can be S3, Azure blob storage or
> Google cloud storage). We are planning to use Apache Drill as SQL query
> execution engine so that we will be cloud agnostic. Unfortunately we are
> finding some key questions unanswered before moving ahead with Drill as
> our platform. Hoping you can provide some clarity and it will be much
> appreciated.
>
>
> 1. When setting up the drill cluster in prod environment to query data
> ranging from several gigabytes to few terabytes hosted in s3/blob
> storage/cloud storage, what are the considerations for disk space ? I
> understand drill bits make use of data locality, but how does that work in
> case of cloud storage like s3 ? Will the entire data from s3 be moved to
> drill cluster before starting the query processing ?
> 2. Is it possible to use s3 or other cloud storage solutions for Sort,
> Hash Aggregate, and Hash Join operators spill data rather than using local
> disk ?
> 3. Is it ok to run drill production cluster without hadoop ? Is just
> zookeeper quorum enough ?
>
>
> I totally understand how busy you can be but if you get a chance, please
> help me to get a clarity on these items. It will be really helpful
>
> Thanks again!
> Manu Mukundan
> Bigdata Architect,
> Prevalent AI,
> manu.mukundan@prevalent.ai
>
>
>
--
Nitin Pawar