Posted to user@drill.apache.org by Paul Rogers <pa...@yahoo.com.INVALID> on 2020/05/05 08:23:46 UTC

Re: EXTERNAL: Re: Apache Drill Sizing guide

Hi Navin,

Just wanted to let you know I've not forgotten this thread. There is much here to digest; thanks for doing the research. There are many interesting performance questions raised by the results. Our goal, however, is to predict the resources needed to run the queries.

My previous note focused on disk as that is often the primary bottleneck. Indeed, for your scan-only queries, that seems to be true. Results are thrown off a bit by the large planning time relative to query execution time. That 3 second plan time needs explanation by itself.

Then you ran the full query and found it takes 50x as long. Since it reads the same data, it must be CPU bound (since queries mostly use either CPU or disk.) I'm puzzled by what the query could be doing that uses up so much time. Can you perhaps share a bit more about the query? If you look at the query profile, what is taking the bulk of the time?
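
If it helps, the full profile for a finished query can also be pulled as JSON
from the Web UI and inspected offline; the per-operator process and wait times
usually point at the culprit. A minimal sketch in Python, assuming a drillbit
on localhost:8047 and a placeholder query id:

    import json, requests

    # fetch the stored profile for one query from the Drill Web UI
    profile = requests.get(
        "http://localhost:8047/profiles/QUERY_ID.json").json()

    # dump it and scan for the operators with the largest process times
    print(json.dumps(profile, indent=2))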

Looking at the table at the end of your note, the file contains 225,27,414 records. (Is that actually 22,527,414?) The output is on the order of 20K records, or a 1000:1 reduction. How do you get from the one to the other? Just a WHERE clause? Aggregation? Windowing?

Does the query fetch the same number of Parquet columns in both the scan-only and full query? As you know, Parquet is columnar: cost is driven not only by the number of rows, but also by the number of columns. (CSV, JSON and the like are simpler: every query reads the entire file. But, of course, that is usually more expensive.)


If it does turn out that your queries are CPU-intensive because of the kind of analysis that needs to be performed, then sizing becomes more about getting enough cores rather than about dealing with disk I/O limitations.

Can you provide a bit more info about the query and the file? I'm sure you've told us about your file structure, but it is a bit hard to find among all the e-mails. (You might consider creating a JIRA ticket with all the info to avoid having to repeat yourself.)


Thanks,
- Paul

 

    On Wednesday, April 29, 2020, 9:19:49 AM PDT, Navin Bhawsar <na...@gmail.com> wrote:  
 
 Hi Paul,

I tried to follow your suggested approach, but without much luck.

-- Scan Query

Planning = 3.481 sec

Queued = 0.007 sec

Execution = 0.591 sec

Total = 4.079 sec

Parquet_Row_group_Scan =

Avg Process Time = 0.036s

Max Process Time = 0.036s



CPU Usage after = 0.013 sec

CPU Usage before = 0.003 sec

Wall clock = 4.141 sec





Difference between CPU time and wall clock = 4.141 sec - 0.01 sec = 4.13
sec (basis for the maximum scan throughput of one Drill fragment)



Drill throughput = 1095 MB / 4.13 sec = 265 MB/s



Number of Fragments = 1



Per fragment = 4.13 / 1 = 4.13 sec





Single scan per fragment = 1095 MB / 4.13 sec = 265 MB/s (non-linear)





No. of concurrent users = 30

Total size = 1320

Non-cached read = 265 MB/s



1.3 min required to complete all 30 queries



-- Full Result Query



Planning = 3.295 sec

Queued = 0.004 sec

Execution = 02 min 34.215 sec

Total = 02 min 37.514 sec

Parquet_Row_group_Scan =

Avg Process Time = 14.843s

Max Process Time = 14.843s



CPU Usage after = 1.08 sec

CPU Usage before = 0.003 sec

Wall clock = 212.7 sec





Difference between CPU time and wall clock = 212.7 sec - 1.05 sec = 211.65
sec (basis for the maximum scan throughput of one Drill fragment)



Drill throughput = 1095 MB / 212 sec = 5.17 MB/s



Number of Fragments = 2



Per fragment = 211 / 2 = 105.5 sec





Single scan per fragment = 1095 MB / 2.25 sec = 486 MB/s
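
For reference, the arithmetic above restated as a short Python snippet (numbers
copied from the two query profiles; the wall-clock-minus-CPU window follows the
approach from your earlier mail):

    # scan-only query
    scan_wall, scan_cpu = 4.141, 0.01   # seconds
    scan_window = scan_wall - scan_cpu  # ~4.13 s
    print(1095 / scan_window)           # ~265 MB/s

    # full result query
    full_wall, full_cpu = 212.7, 1.05   # seconds
    full_window = full_wall - full_cpu  # ~211.65 s
    print(1095 / full_window)           # ~5.2 MB/s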



A few questions based on this approach:

1. Disk throughput does not match the expected disk read performance, i.e.
500 MB/s, although the disk scan per fragment is close to that.

2. Also, in this case the full query time is 50x the scan time; I am not sure
how this gives us the expected CPU count for 30 users.


Also, Drill performs differently with the same number of CPU cores (6 cores in
total): when they are split across two separate machines, Drill performs better,
especially on scan time and the level of parallelism in minor fragments.



                            Single Node    Single Node    2-Node
                            1-Core         6-Core         6-Core (3 per node)
Planning                    3.662 sec      0.553 sec      0.568 sec
Queued                      0.007 sec      0.004 sec      0.021 sec
Execution                   22.921 sec     6.833 sec      7.365 sec
Total                       26.590 sec     7.390 sec      7.954 sec
No. of fragments            2              5              7
Scan process time           21.124 sec     5.946 sec      4.726 sec
No. of minor fragments      1              4              6
Filter time                 0.288 sec      0.078 sec      0.053 sec
Scan records                225,27,414     225,27,414     225,27,414
Scan peak memory            15 MB          15 MB          15 MB
No. of row groups scanned   123            123            123
Filtered rows               20,406         20,406         20,406




Thanks,

Navin

On Fri, Apr 17, 2020 at 11:05 PM Paul Rogers <pa...@yahoo.com> wrote:

> Hi Navin,
>
>
> One more factor for you to consider. The straw-man analysis we just did
> was for a file format such as CSV in which Drill must read all data within
> each HDFS block. You said you are using Parquet. One of the great features
> of Parquet is that Drill reads only the columns needed for your query. This
> makes the analysis a bit more interesting.
>
>
> First, how much data will Drill actually read? You mentioned reading 10-15
> of 150 columns. If columns are of uniform size, that might mean reading
> only 10% of each block. The best approach is to actually measure the amount
> of disk I/O. In a previous life I used the MapR file system which provided
> a wealth of such information. Perhaps your system does also. For now, let's
> assume 10%; you can replace this with the actual ratio once you measure it.
>
>
> We said that Drill will split the 1 GB file into four 256 MB blocks and
> will need 4 fragments (cores) to read them. We've just said we'd read 10%
> of that data or about 25 MB. You'll measure query run time for just scan,
> let's say it takes 1 second. (Parquet decoding is CPU intensive.) This
> means each query reads 4 * 25 MB = 100 MB in a second. Since your disk
> system can supply 500 MB/s, you can run 5 concurrent queries. More if the
> data is cached.
>
>
> We then add the full query cost as before. We made up a ratio of 2x, so
> each query takes 1 sec for scan, 2 sec to complete on 4 cores for scan plus
> 4 cores for compute. This means we can run 5 queries every 2 seconds. Your
> 30 queries would complete in 30 / 5 * 2 = 12 seconds, well within your
> 30-second SLA.
>
>
> Now you have a choice. You can provision the full 8 * 5 = 40 cores needed
> to saturate your file system. Or, you can provision fewer, maybe run 2
> concurrent queries, so 16 cores, with all 30 queries completing in 30 / 2 *
> 2 = 30 seconds. In this case, you would enable query throttling to avoid
> overloads.
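>
> A minimal back-of-the-envelope sketch of that trade-off in Python (the
> 2-second full-query time and both concurrency levels are the made-up numbers
> from this example, not measurements):
>
>     import math
>
>     # queries run in waves of `concurrent`; each wave takes one full-query time
>     def batch_time_sec(total_queries, concurrent, sec_per_query):
>         return math.ceil(total_queries / concurrent) * sec_per_query
>
>     print(batch_time_sec(30, 5, 2))  # 40 cores, 5 at a time -> 12 seconds
>     print(batch_time_sec(30, 2, 2))  # 16 cores, 2 at a time -> 30 seconds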
>
>
> I hope this gives you a general sense for the approach: have a model,
> measure actual performance, get a ball-park estimate and test to see what
> additional factors crop up in your actual setup.
>
>
> Thanks,
>
> - Paul
>
>
>
> On Thursday, April 16, 2020, 9:42:01 PM PDT, Navin Bhawsar <
> navin.bhawsar@gmail.com> wrote:
>
>
> Thanks Paul.. I will follow suggested approach next.
> Point noted on the REST API. Do you have a suggestion for which interface
> would be best for larger result sets: ODBC, JDBC, or another reporting tool
> that gives better performance with Drill? Our reports are mainly tabular or
> pivot format.
> We had to drop JDBC as the UI client is .NET.
>
>
> Thanks,
> Navin
>
>
> On Fri, 17 Apr 2020, 07:35 Paul Rogers, <pa...@yahoo.com> wrote:
>
> Hi Navin,
>
>
> Thanks for the additional info. Let's take it step by step. I'll walk you
> through the kind of exercise you'll need to perform, using made-up numbers
> to make the exercise concrete. Running the same analysis with your results
> will give you a ball-park estimate of expected performance.
>
>
> As we'll see, you may end up being limited more by disk I/O than anything
> else.
>
>
> First, let's characterize the read performance. We can do this by limiting
> the query run to a single node (easiest if you have a single-node cluster
> available) and a single thread of execution:
>
>
> ALTER SESSION SET `planner.width.max_per_node` = 1
>
>
> Now, take a typical query, say the 1 GB scan. Modify the query to keep all
> the column references in the SELECT clause (the 15 columns you mentioned)
> but remove all other expressions, calculations, GROUP BY, etc. That is:
>
>
> SELECT col1, col2, ... col15
>
> FROM yourfile
>
>
> Then, add only the partitioning expression to the WHERE clause to limit the
> scan to the 1GB of data you expect. Also add a "select nothing" expression
> on one of the columns:
>
>
> WHERE dir0 = ... AND dir1 = ... AND col1 = "bogus"
>
>
> This query forces Drill to read the full data amount, but immediately
> throws away the data so we can time just the scan portion of the query.
>
>
> Run this query on a single node "cluster". Use top or another command to
> check CPU seconds used by Drill before and after the query. Look at the
> query profile to determine query run time. The difference between CPU and
> wall clock time tells us how much time was spent waiting for things. (You
> can also look at the scan timings in the query profile to get a better
> estimate than overall query run time.)
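>
> Purely as an illustration, the before/after measurement could be scripted
> along these lines (a Python sketch against Drill's REST query endpoint; the
> host, the drillbit PID and the query text are placeholders to substitute, and
> `planner.width.max_per_node` should be set beforehand via ALTER SYSTEM or the
> Web UI, since each REST request typically runs in its own session):
>
>     import os, time, requests
>
>     DRILL_URL = "http://localhost:8047/query.json"  # assumed drillbit address
>     DRILLBIT_PID = 12345                            # find it with jps or ps
>
>     def drillbit_cpu_seconds(pid):
>         # utime + stime from /proc/<pid>/stat (Linux), converted to seconds
>         fields = open(f"/proc/{pid}/stat").read().split()
>         return (int(fields[13]) + int(fields[14])) / os.sysconf("SC_CLK_TCK")
>
>     # substitute the scan-only query described above
>     sql = "SELECT col1, col2 FROM dfs.`/data/yourfile` WHERE col1 = 'bogus'"
>
>     cpu0, t0 = drillbit_cpu_seconds(DRILLBIT_PID), time.time()
>     requests.post(DRILL_URL, json={"queryType": "SQL", "query": sql}).raise_for_status()
>     cpu1, t1 = drillbit_cpu_seconds(DRILLBIT_PID), time.time()
>
>     print(f"wall = {t1 - t0:.2f} s, drillbit CPU = {cpu1 - cpu0:.2f} s")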
>
>
> This tells us the maximum scan throughput for one Drill fragment on one of
> your CPUs. Best to do the exercise a few times and average the results
> since your file system will read cached data in the second and subsequent
> runs.
>
>
> OK, so suppose it takes 10 seconds to scan 1 GB of data. The disk can do
> 500 MB/s, so estimate the Drill throughput as 1 GB / 10 sec = 100 MB/s.
> Your numbers will, of course, be different.
>
>
> Now we can work out the benefits of parallelism. Parquet typically uses
> 256 MB or 512 MB blocks. This limits the benefit of parallelism on a 1 GB
> file. So, is the 1 GB the size of the scanned files? Or, are you scanning 1
> GB from a set of files totaling, say, 10 GB? In either case, the best
> Drill can do is parallelize down to the block level, which will be 2 or 4
> threads (depending on block size) for a single 1 GB file. You can work out
> the real numbers based on your actual block size and file count.
>
>
> Suppose we can get a parallelism of 4 on our made-up 10 sec scan. The
> ideal result would be four fragments which each take 2.5 secs. We'd like to
> multiply by 30 to get totals. But, here is where things get non-linear.
>
>
> A single scan reads 1 GB / 2.5 sec = 400 MB/s, which is close to your
> uncached read rate. So, you get no real benefit from trying to run 30 of
> these queries in parallel, you can maybe do 1.25 (given these made-up
> numbers.) So, your real need is, say 6 cores to do reads, after which
> you've maxed out your disk subsystem. Adding more users just makes things
> slower, uses more memory, etc. So, better to have the users run queries
> sequentially.  You can use Drill's query queueing mechanism to limit load
> and smooth out the peaks.
>
>
> So, how long will it actually take to finish all 30 queries, assuming
> sufficient CPU? 30 users * 1 GB per user = 30 GB total / 500 MB/s read rate
> = 60 sec. to complete all 30 queries. This means that, given your hardware
> and data size, each user can issue one query per minute before overloading
> the disk subsystem. If all data is cached, you can do 8 GB/s. If each
> query is 1 GB in size, no other users besides Drill, then you can serve 8
> of your concurrent users per second, or 30 / 8 = 4 seconds to run all 30
> queries. Your real numbers will, of course, be different. As we noted,
> your analysis would follow the same steps, however.
>
>
> What does this mean? Users need "think time." If your users want a query
> to take no longer than 30 sec, this is another way of saying that they plan
> to do one query every 30 sec, or 2 per minute. If they need to think about
> results, maybe they will issue one query per minute or one query every two
> minutes. With uncached data, you can't go any faster than one query per
> minute per user. (Or, if Anne is working late, she has the system to
> herself and can do roughly 20 queries in that minute at 2.5 sec. per query
> limited by Parquet block size.) You have to decide if that satisfies your
> 30 second SLA.
>
>
> Now let's work out the cost of the full query. Keeping the option above,
> run the full query. Compare CPU seconds and clock times. This will tell you
> the extra CPU and network shuffles needed to do the processing requested in
> the query once Drill has the data. Let's say it takes 2x CPU seconds for
> the full query compared to the scan-only query. This tells us you need 2x
> the number of CPUs as we computed above: rather than 4 per query, maybe 8
> per user. (Again, your numbers will certainly be different.) Since we are
> CPU limited, if we needed, say, 6 cores to saturate the disk, we need 12 to
> both saturate the disk and do the needed extra processing. (Again, your
> numbers will be different.)
>
>
> This covers your "big" queries. The same analysis can be done for the
> "small" queries and a weighted total computed.
>
>
> We've not talked about memory. Scans need minimal memory (except for
> Parquet which has a bunch of buffers and worker threads; check the top
> command and the query profile to see what yours needs.)
>
>
> The rest of the query will require memory if you do joins, aggregations
> and sorts. Look at the query profile for the full run. Multiply the memory
> total by 30 for your 30 concurrent users. Divide by your node count. That
> is the minimum memory you need per node, though you should have, say, 2x to
> provide sufficient safety margin. On the other hand, if the queries run
> sequentially (because of disk saturation), then you only need memory for
> the number of actively running queries.
>
>
> All this could be put in a spreadsheet. (Maybe someone can create such a
> spreadsheet and attach it to a JIRA ticket so we can post it to the web
> site.)
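>
> Until someone builds that spreadsheet, the straw-man model above can be
> sketched in a few lines of Python (every input below is a made-up placeholder
> to be replaced with your measured values):
>
>     scan_mb          = 100   # MB actually read per query (Parquet columns only)
>     scan_sec         = 1.0   # measured scan-only query time
>     cpu_ratio        = 2.0   # measured full-query CPU vs. scan-only CPU
>     cores_per_query  = 4     # scan fragments per query (block count)
>     disk_mb_per_sec  = 500   # uncached read rate of the storage layer
>     users            = 30    # concurrent queries to drain
>     mem_per_query_mb = 512   # peak memory per query, from the profile
>     nodes            = 2
>
>     # how many queries can scan at once before the disk saturates
>     max_concurrent = disk_mb_per_sec / (scan_mb / scan_sec)
>
>     # cores to keep that many queries busy, scan plus downstream operators
>     cores_needed = max_concurrent * cores_per_query * cpu_ratio
>
>     # time to drain one query per user at the disk-limited concurrency
>     drain_sec = users / max_concurrent * (scan_sec * cpu_ratio)
>
>     # per-node memory for the queries that actually run at once, with 2x margin
>     mem_per_node_mb = max_concurrent * mem_per_query_mb / nodes * 2
>
>     print(max_concurrent, cores_needed, drain_sec, mem_per_node_mb)
>
> With the made-up numbers above this reproduces the 5 concurrent queries, 40
> cores and 12 seconds worked out earlier; the point is only the shape of the
> model, not the values.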
>
>
> Also, the above makes all this look scientific. There are, however, many
> factors we've not discussed. Is Drill the only user of the file system? How
> much variation do you get in load? There are other factors not accounted
> for. Thus, the above will give you a ball-park estimate, not a precise
> sizing. Caveat emptor and all that.
>
>
> This is the approach I've used for a couple of systems. If anyone has a
> better (i.e. simpler, more accurate) approach, please share!
>
>
> Finally, a comment about the REST API. It is a wonderful tool to power the
> Drill Web console. It is helpful for small-ish result sets (1000 rows or
> fewer.) It is not really designed for large result sets and you may run
> into performance or memory issues for large result sets. This is certainly
> something we should fix, but it is what it is for now. So, keep an eye on
> that as well.
>
>
> Thanks,
>
> - Paul
>
>
>
> On Thursday, April 16, 2020, 9:16:38 AM PDT, Navin Bhawsar <
> navin.bhawsar@gmail.com> wrote:
>
>
> *Hi Paul,*
>
> Thanks for your response.
>
>
> I have tried to add more details as advised :
>
> *Query Mix and selectivity*
>
> The query mix will be at most 30 concurrent users running ad hoc reporting
> queries via the Drill REST API, called from ASP.NET Core (HttpClient).
>
> The query mix is a combination of the following query load running on the server:
>
> 1.      Queries (5-10) aggregating data over roughly 1 GB (1-3M records)
>
> 2.      The majority of queries (15-25) aggregating roughly 100K records
>
> Most of the queries use a simple filter clause and a few use GROUP BY on
> 10-15 columns out of the 150 columns in the Parquet file.
>
> *Performance expectation: these queries should return results in seconds
> (<= 30 secs)*
>
>
> *Partitioning* - Data is already partitioned by date and business level,
> with the lowest level containing Parquet files (200-300 MB, 100K records)
>
>
>
> *Storage -*
>
> VMDK (VMware disk) with 1 TB size
>
> cached reads - 8000 MB/sec
>
> buffered disk reads - 500 MB/sec
>
> Drill queries Parquet files on HDFS
>
>
> Deployment - HDFS on-prem is hosted on an internal cloud platform (IaaS)
> <https://alm-confluence.systems.uk.hsbc/confluence/pages/viewpage.action?pageId=51218200>; spinning
> up a new environment will be quick.
>
> Thanks,
> Navin
>
>
> *From:* Paul Rogers <pa...@yahoo.com>
> *Sent:* Tuesday, April 14, 2020 12:41 AM
> *To:* user <us...@drill.apache.org>
> *Cc:* arun.ns@gmail.com;
> *Subject:* EXTERNAL: Re: Apache Drill Sizing guide
>
>
>
> Hi Navin,
>
>
>
>
>
> Ted is absolutely right. To add a bit of context, here are some of the
> factors we've considered in the past.
>
>
>
>
>
> Queries: A simple filter scan takes the minimum resources: scan the
> tables, throw away most of the data, and deliver the rows that are needed.
> Such a use case is strongly driven by scan time. As Ted suggests,
> partitioning drives down scan cost. If every query hits the full TB of
> data, you will need many machines & disks to get adequate performance.
> Depending on your hardware, if you get 100 MB/s read performance per disk,
> it will take 10,000 seconds (three hours) to read your TB of data on one
> disk. If you have 100 disks, the time drops to 100 seconds. You didn't
> mention your storage technology: these numbers are likely entirely
> different for something like S3.
>
>
>
>
>
> So, you don't want to read the full TB. By using good partitioning
> (typically by date), you might reduce the scan by a factor of 1000. Huge
> win. And this is true whether you use Drill, Spark, Presto or Python to
> read your data.
>
>
>
>
>
> The next question is the selectivity of your queries. In the simple filter
> case, are you returning a few rows or a GB of rows? The more rows, the more
> Drill must grind through the data once it is read. This internal grinding
> requires CPU and benefits from parallelism. The amount you need depends on
> the number of rows processed per query.
>
>
>
>
>
> There is little memory needed for a pure filter query. Drill reads the
> data, tosses most rows, and returns the remainder to the client. Interesting
> queries, however, do more than filtering: they might group, join, sort and
> so on. Each of these operations carries its own cost. Joins are network
> heavy (to shuffle data). Sorts want enough memory to buffer the entire
> result set to avoid slow disk-based sorts.
>
>
>
>
>
> The query profile will provide lots of good information about the row
> count, memory usage and operators in each of your queries so you can
> determine the resources needed for each. When Ted asks you to analyze each
> query, the best way to do that is to look at the query profile and see
> which resources were needed by that query.
>
>
>
>
>
> Then, there are concurrent users. What do you mean by concurrent? 40
> people who might use Drill during the day so that only a few are active at
> the same time? Or, 40 users each watching a dashboard that runs 10
> queries, updated each second, which will place a huge load on the system?
> Most humans are intermittent users. Dashboards, when overdone, can kill any
> system.
>
>
>
>
>
> Also, as Ted has said many times, if you run 40 queries a minute, and each
> takes 1 second, then concurrency turns into sequential processing. On the
> other hand, if one query uses all cluster resources for an hour, and you
> run 10 of them per hour, then the workload will fail.
>
>
>
>
>
> Once you determine the actual "concurrent concurrency" level (number of
> queries that run at the same time), work out the mix. Sum the resources for
> those concurrent queries. That tells you the cluster capacity you need
> (plus some safety margin because load is random.) Drill does have features
> to smooth out the load peaks by queuing queries. Not state-of-the-art, but
> can prevent the inevitable overloads that occur at random peak loads when
> there is not sufficient reserve capacity.
>
>
>
>
>
> You didn't mention your deployment model. In classic Hadoop days, with an
> on-prem cluster, you had to work all this out ahead of time so you could
> plan your equipment purchases 3 to 6 months in advance. In the cloud,
> however, especially with K8s, you just resize the cluster based on demand.
> Drill is not quite there yet with our K8s integration, but the team is
> making good progress and we should have a solution soon;
> contributions/feedback would be very helpful.
>
>
>
>
>
> In short, there are many factors, some rather complex. (We all know it
> should be simple, but having done this with many DBs, it just turns out
> that it never is.)
>
>
>
>
>
> We'd be happy to offer pointers if you can offer a few more specifics.
> Also, perhaps we can distill this discussion into a few pages in the Drill
> docs.
>
>
>
>
>
> Thanks,
>
>
>
> - Paul
>
>
>
>
>
> On Monday, April 13, 2020, 7:59:08 AM PDT, Ted Dunning <
> ted.dunning@gmail.com> wrote:
>
>
>
>
>
> Navin,
>
> Your specification of 40 concurrent users and data size are only a bit less
> than half the story. Without the rest of the story, nobody will be able to
> give you even general guidance beyond a useless estimate that it will take
> between roughly 1 and 40 drillbits with a gob of memory.
>
> To do better than such non-specific "guidance", you need to add some
> additional answers. For example,
>
> What is the query mix?
> How long do these queries run without any question of concurrency?
> Could that query speed be enhanced with better partitioning?
> How are you storing your data?
> What promises are you making to these concurrent users?
>
>
> On Mon, Apr 13, 2020 at 7:21 AM Navin Bhawsar <na...@gmail.com>
> wrote:
>
> > Hi Team ,
> >
> > We are planning to use Drill to query an HDFS cluster with about a terabyte
> > of data in Parquet file format. There will be approx. 40 concurrent users
> > using this environment.
> >
> > Can you please advise on the two points below, considering the above
> > workload, for optimum performance:
> > 1. Number of Drill bits
> > 2. Memory Config per drill bit
> >
> > Also, is there any sizing guide for Apache Drill to refer to?
> >
> >
> > Thanks,
> > Navin
> >
> --
> Navin Bhawsar
>
>

-- 
Navin Bhawsar