Posted to user@drill.apache.org by Dan Blondowski <da...@dhigroupinc.com> on 2020/01/21 13:08:05 UTC

Re: Clarification regarding Apache drill setup

unsubscribe

On 8/16/19, 7:16 AM, "Nitin Pawar" <ni...@gmail.com> wrote:

    This is from my own experience and I could be wrong on a few points, so
    please wait for others to answer as well.
    
    
    1.  When setting up the Drill cluster in a prod environment to query data
    ranging from several gigabytes to a few terabytes hosted in S3/blob
    storage/cloud storage, what are the considerations for disk space? I
    understand drillbits make use of data locality, but how does that work in
    the case of cloud storage like S3? Will the entire data from S3 be moved to
    the Drill cluster before starting the query processing?
    
    It is advisable to use Parquet as your file format; it improves
    performance a lot. Drill will bring in all the data it needs to process a
    given query. This can be reduced if you arrange your folder structure
    around filterable columns such as dates. When you are using Parquet files,
    each of these files or blocks is downloaded separately by the drillbit
    servers, and the data is then localized based on your query patterns, for
    example when you group by, or filter and then sum. The data generally
    resides in memory and starts spilling to disk depending on your query
    patterns.
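
To illustrate the folder-structure point, here is a minimal Python sketch of
the pruning idea: if object keys encode the date in their path, whole
directories can be skipped before any data is fetched. The layout
`<table>/<yyyy>/<mm>/<dd>/<file>.parquet`, the function name `prune_keys`
and the sample keys are assumptions made up for this example; this is not
Drill's actual implementation. In Drill itself the directory levels of a
workspace are exposed as the implicit columns dir0, dir1, and so on, which
can be used in a WHERE clause to the same effect.

```python
from datetime import date


def prune_keys(keys, start, end):
    """Keep only object keys whose yyyy/mm/dd path segments fall in [start, end]."""
    kept = []
    for key in keys:
        parts = key.split("/")
        # Assumed (hypothetical) layout: <table>/<yyyy>/<mm>/<dd>/<file>.parquet
        try:
            day = date(int(parts[1]), int(parts[2]), int(parts[3]))
        except (IndexError, ValueError):
            continue  # key does not follow the assumed layout, skip it
        if start <= day <= end:
            kept.append(key)
    return kept


# Hypothetical sample keys, one file per day
keys = [
    "events/2019/08/14/part-0.parquet",
    "events/2019/08/15/part-0.parquet",
    "events/2019/08/16/part-0.parquet",
]

# Only the files for 15 and 16 August need to be read for this date range
selected = prune_keys(keys, date(2019, 8, 15), date(2019, 8, 16))
print(selected)
```

The fewer files survive the pruning step, the less data has to be pulled
from cloud storage for the query.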
    
      2.   Is it possible to use S3 or other cloud storage solutions for the
    spill data of the Sort, Hash Aggregate, and Hash Join operators rather
    than using local disk?
    As per my understanding, only local disks are used for non-memory based
    aggregations. Using cloud based storage systems for intermediate
    outputs incurs heavy network IO and causes huge delays in queries.
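
Since spilling goes to local disk, one practical disk-space consideration is
to check that the spill directories have enough free space. The following is
a minimal Python sketch using only the standard library. The function name
`spill_dirs_with_low_space`, the 50 GiB threshold and the use of the system
temp directory as a stand-in are assumptions for illustration; to my
knowledge the real directories are the ones listed under
drill.exec.spill.directories in drill-override.conf, so please verify that
against your Drill version.

```python
import shutil
import tempfile


def spill_dirs_with_low_space(spill_dirs, min_free_bytes):
    """Return the directories whose filesystem has less than min_free_bytes free."""
    low = []
    for path in spill_dirs:
        free = shutil.disk_usage(path).free  # free bytes on the filesystem holding path
        if free < min_free_bytes:
            low.append(path)
    return low


# Hypothetical stand-in for the configured spill directories
spill_dirs = [tempfile.gettempdir()]

# Report directories with less than 50 GiB free (threshold is an arbitrary example)
print(spill_dirs_with_low_space(spill_dirs, min_free_bytes=50 * 1024**3))
```

The right threshold depends on how much data your sorts, aggregations and
joins handle beyond available memory, so treat the number above as a
placeholder.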
    
    
      3.  Is it OK to run a Drill production cluster without Hadoop? Is just
    a ZooKeeper quorum enough?
    You do NOT need to set up a Hadoop cluster. Apache Drill has no
    prerequisite on Hadoop for execution purposes unless you are using the
    Hadoop-specific feature sets of Apache Drill.
    To run a Drill cluster, a ZooKeeper quorum is more than sufficient. From
    there on, based on the storage systems you use, you will need to create
    storage plugins and use them.
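
As a rough sketch of what those two pieces look like, the Python below builds
a ZooKeeper connect string (the value that goes into zk.connect in
drill-override.conf) and a file-type storage plugin definition pointing at an
S3 bucket. The host names, bucket name, plugin name and both function names
are hypothetical, and the field layout follows my recollection of the S3
plugin examples in the Drill documentation, so check it against your version
before use. Nothing is sent anywhere; as far as I know a payload of this
shape can be pasted into the Storage tab of the web UI or posted to the
REST API.

```python
import json


def zk_connect(hosts, port=2181):
    """Build a comma-separated host:port list for the zk.connect setting."""
    return ",".join(f"{host}:{port}" for host in hosts)


def s3_plugin_payload(name, bucket):
    """Build a storage plugin definition for an S3 bucket (assumed field layout)."""
    return {
        "name": name,
        "config": {
            "type": "file",
            "connection": f"s3a://{bucket}",
            "workspaces": {
                "root": {"location": "/", "writable": False, "defaultInputFormat": None}
            },
            "formats": {"parquet": {"type": "parquet"}},
            "enabled": True,
        },
    }


# Hypothetical three-node ZooKeeper quorum
print(zk_connect(["zk1", "zk2", "zk3"]))

# Hypothetical plugin and bucket names
print(json.dumps(s3_plugin_payload("s3data", "my-bucket"), indent=2))
```

Credentials for the bucket are deliberately left out here; how they are
supplied depends on your environment.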
    
    On Fri, Aug 16, 2019 at 10:38 AM Manu Mukundan <ma...@prevalent.ai>
    wrote:
    
    > Hi,
    >
    > My name is Manu and I am working as a Big Data architect in a small startup
    > company in Kochi, India. Our new project handles visualizing large volumes
    > of unstructured data in cloud storage (it can be S3, Azure Blob Storage or
    > Google Cloud Storage). We are planning to use Apache Drill as the SQL query
    > execution engine so that we will be cloud agnostic. Unfortunately, we
    > still have some key questions unanswered before moving ahead with Drill as
    > our platform. I hope you can provide some clarity; it will be much
    > appreciated.
    >
    >
    >   1.  When setting up the Drill cluster in a prod environment to query data
    > ranging from several gigabytes to a few terabytes hosted in S3/blob
    > storage/cloud storage, what are the considerations for disk space? I
    > understand drillbits make use of data locality, but how does that work in
    > the case of cloud storage like S3? Will the entire data from S3 be moved to
    > the Drill cluster before starting the query processing?
    >   2.   Is it possible to use S3 or other cloud storage solutions for the
    > spill data of the Sort, Hash Aggregate, and Hash Join operators rather
    > than using local disk?
    >   3.  Is it OK to run a Drill production cluster without Hadoop? Is just
    > a ZooKeeper quorum enough?
    >
    >
    > I totally understand how busy you can be, but if you get a chance, please
    > help me get clarity on these items. It would be really helpful.
    >
    > Thanks again!
    > Manu Mukundan
    > Bigdata Architect,
    > Prevalent AI,
    > manu.mukundan@prevalent.ai
    >
    >
    >
    
    --
    Nitin Pawar