Posted to user@drill.apache.org by sreeparna bhabani <bh...@gmail.com> on 2020/05/01 17:39:32 UTC

Heap memory and performance issue in Apache drill

Hi Team,

Kindly suggest on the below problem which we are facing in Apache Drill
while running a query in the Web interface. When we execute the query from
Fiddler, it fails after about 2 minutes saying "There is not enough heap
memory to run this query using the web interface". We have tried a few
memory settings, but the issue persists. The query performance is not as
expected either. Please find the details below-

Dataset details-
File type - Parquet
Size of file - 401 MB
Number of selected columns - 16
Number of rows - 15136976
There are no sorts and no joins.

Drill Setup-
2 Drillbits are used
DRILL_HEAP - 16 G
DRILL_MAX_DIRECT_MEMORY - 32 G
planner.memory.max_query_memory_per_node - 31147483648
planner.width.max_per_node - 0 (default value; can you please share some
details on why the default is 0?)
drill.exec.memory.operator.output_batch_size - 16777216

Operator profile-
Type                    Max Process Time  Max Wait Time  Max Peak Memory
SCREEN                  0.048s            2m40s          4MB
PROJECT                 0.068s            0              4MB
UNORDERED_RECEIVER      0.088s            0.020s         418MB
SINGLE_SENDER           0.011s            2m31s          8MB
PROJECT                 4.187s            0              8MB
PARQUET_ROW_GROUP_SCAN  1.765s            0.020s         12MB
How can I find the reason for the wait time of SINGLE_SENDER? It is not clear
from the profile.
I found one Jira regarding the batch sizing of SINGLE_SENDER. I am not sure
whether it is related to this issue or not:
https://issues.apache.org/jira/browse/DRILL-7093
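
For reference, the web interface and tools like Fiddler ultimately go through Drill's REST query endpoint (POST /query.json). A minimal Python sketch of an equivalent request is below; the host, table path, and column names are placeholders, not values from this thread.

import requests

drillbit = "http://drillbit-host:8047"   # placeholder host
sql = "SELECT col1, col2 FROM dfs.`/data/myfile.parquet` LIMIT 100"

resp = requests.post(drillbit + "/query.json",
                     json={"queryType": "SQL", "query": sql},
                     timeout=300)
resp.raise_for_status()
result = resp.json()
print(result["columns"])     # list of column names
print(len(result["rows"]))   # each row arrives as a map of column name -> value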

Please let me know if any other information is required.

Thanks n Regards,
*Sreeparna Bhabani*

Re: REST query improvements [Was: Heap memory and performance issue in Apache drill]

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Charles,

Thanks. Your SuperSet integration uses the REST API, doesn't it? Once the various PRs are done, it would be interesting to try out the new version with your SuperSet integration to learn if we see any performance difference.

Thanks,
- Paul

 


Re: REST query improvements [Was: Heap memory and performance issue in Apache drill]

Posted by Charles Givre <cg...@gmail.com>.
Paul, 
Nice work!
--C



Re: REST query improvements [Was: Heap memory and performance issue in Apache drill]

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi All,

One more update. Went ahead and implemented the streaming solution for REST JSON queries. The result is that REST queries run almost as fast as native or JDBC queries: the results stream directly from the query DAG out to the HTTP client with no buffering at all.

Tested with a file of 1 GB size: 1 M rows of 20 fields, each of 50 bytes. Tests run in a debugger, single threaded. A COUNT(*) query took about 10 seconds. Running a SELECT * to JSON took about 18 seconds, presumably for the cost of encoding data as JSON and streaming 1+ GB over the network.

This will help clients that use the REST JSON query API -- but only if the client itself handles the data in a streaming way (parses rows as they arrive, processes them, and disposes of them). If the client buffers the entire result set into a data structure, then the client will run out of memory as result set sizes increase.

As noted earlier, the current JSON structure is awkward for this. A better format might be a stream of "JSON lines", in which each line is an independent JSON object. An even better format would be binary-encoded rows of some sort to avoid repeating the field names a million times.
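
As a sketch of what such a streaming client could look like (the one-JSON-object-per-line payload is hypothetical here, since "JSON lines" is only a proposal at this point):

import json
import requests

def handle(row):
    pass  # placeholder for whatever the client does with each row

# Assumes the server emits one JSON object per line ("JSON lines").
with requests.post("http://drillbit-host:8047/query.json",
                   json={"queryType": "SQL",
                         "query": "SELECT * FROM dfs.`/data/myfile.parquet`"},
                   stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            handle(json.loads(line))  # parse, process, and drop each row so memory stays flat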

FWIW, it turns out that the current design assumes uniform rows. The list of column names is emitted at the start. If a schema change occurs, the set of fields will change, but there is no way to go back and amend the column name list. Not sure if anyone actually uses schema changes, but just something to be aware of if you do.

The Web query feature (display the results in a web page) still uses the buffering approach, which is probably fine because you don't want to load a 1GB result set in the browser anyway.

See DRILL-7733 for the details.


Thanks,
- Paul

 


REST query improvements [Was: Heap memory and performance issue in Apache drill]

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi All,

Was able to reduce the memory impact of REST queries a bit by avoiding some excessive copies and duplicate in-memory objects. The changes will show up in a PR for Drill 1.18.

The approach still buffers the entire result set on the heap, which is the next thing to fix. Looks feasible to stream the results to the browser as they arrive, while keeping the same JSON structure as the current version. The current implementation sends column names, then all the data, then column types. Might make more sense to send the names and types, followed by the rows. That way, the client knows what to do with the rows as they arrive. As long as the fields are identical, changing field order should not break existing clients (unless someone implemented a brittle do-it-yourself JSON parser.)
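
Roughly, the difference in shape (the key names below are approximate and for illustration only):

# Today's ordering: names first, then all the data, then types.
current = {
    "columns": ["id", "name"],
    "rows": [{"id": "1", "name": "a"}, {"id": "2", "name": "b"}],
    "metadata": ["BIGINT", "VARCHAR"],
}

# Proposed ordering: names and types up front, rows last, so a streaming
# client knows how to handle each row as it arrives.
proposed = {
    "columns": ["id", "name"],
    "metadata": ["BIGINT", "VARCHAR"],
    "rows": [{"id": "1", "name": "a"}, {"id": "2", "name": "b"}],
}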


With streaming, Drill should be able to deliver any number of rows with no memory overhead due to REST. However, the current JSON-based approach is awkward for that amount of data.

We briefly mentioned some possible alternatives. For those of you who want to use REST to consume large data sets, do you have a favorite example of a tool that does a good job of sending such data? Might as well avoid reinventing the wheel; it would be great if Drill could just adopt the solution that works for "Tool X." Suggestions?
 

Thanks,
- Paul

 


Re: Heap memory and performance issue in Apache drill

Posted by Dobes Vandermeer <do...@gmail.com>.
I think an okay approach to take is to use CTAS to dump your result into a folder / bucket of your choice instead of trying to receive the result directly from Drill.
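
A sketch of that approach (host and table name are placeholders; dfs.tmp is typically a writable workspace by default): the CTAS still runs through the REST API, but the bulky output goes to files rather than coming back over HTTP.

import requests

ctas = """
CREATE TABLE dfs.tmp.`my_query_results` AS
SELECT col1, col2, col3
FROM dfs.`/data/myfile.parquet`
"""

resp = requests.post("http://drillbit-host:8047/query.json",
                     json={"queryType": "SQL", "query": ctas})
resp.raise_for_status()
# The REST response only carries a small summary (rows written, etc.); the
# actual result sits in files under the dfs.tmp workspace for later pickup.
print(resp.json())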

The user can run a cron job or use lifecycle policies to clean up old query results if they fail to delete them manually in the code that consumes them.

However, in my own experimentation I found that when I try to do this using the REST API it will still complain about running out of memory, even though it doesn't need to buffer any results.

I think it just used a lot of memory to perform the operation regardless of whether it needs to serialize the results as JSON.


Re: Heap memory and performance issue in Apache drill

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi All,

TL;DR: Your use case is too large for the REST API as it is currently implemented. Three alternatives:

1. Switch to JDBC/ODBC.
2. Write the results to a file rather than sending to your web client. The web client can then read the file.
3. Help us improve the scalability of the REST API.

The REST API is increasingly popular. Unfortunately, its current implementation has significant limitations. All results are held in memory until the end of the query, after which they are translated to JSON. This model was fine when the REST API was used to run a few small sample queries in the Drill Web Console, but it is not well suited to larger production use cases.


Let's roughly estimate the memory needs for your query with the current design. A 400 MB Parquet file, with compression, might translate to 4 GB uncompressed. As it turns out, none of that will be buffered in direct memory unless you also have an ORDER BY clause (where we need to hold all data in memory to do the sort.)


The real cost is the simple design of the REST API. As your query runs, the REST handler stores all rows in an on-heap map of name/string pairs: one for each column in each row of your table. This is roughly 15 M rows * 16 cols/row = 240 million keys and another 240 million string values. A quick check of the code suggests it does not do string "interning", so it is likely that each of the 15 million occurrences of each column name is a separate heap object. Verifying and fixing this would be a good short-term improvement.
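
A back-of-envelope version of that estimate (the per-pair byte costs below are assumed typical JVM object overheads, not measured values):

rows = 15_136_976
cols = 16
pairs = rows * cols                          # ~242 million name/value string pairs

per_pair_interned = 64                       # assumed: map entry + value String, names interned
per_pair_not_interned = 64 + 48              # assumed: plus a separate String per column name

print(pairs * per_pair_interned / 1e9)       # ~15 GB even in the optimistic case
print(pairs * per_pair_not_interned / 1e9)   # ~27 GB if names are not interned

Either way, the buffered form blows past the 16 GB heap configured here, which matches the error reported.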


If your data is 4 GB uncompressed, then when expanded as above, it could easily take, say, 10 GB of heap to encode as key/string pairs. The code does monitor heap size and gives you the error you reported when heap use grows too large. This obviously is not a good design, but it is how things work today. It was done quickly many years ago and has only been slightly improved since then.

For your query, with a single Parquet file, the query will run in a single minor fragment. None of the tuning parameters you mentioned will solve your REST problem because the query itself is quite simple; it is the REST handler which is causing this particular problem.

Here is a simple way to verify this. Take your query and wrap it in:

SELECT COUNT(*) FROM (<your query here>)

This will do all the work to run your query, count the results, and return a single row using the REST API. This will give you a sense of how fast the query should run if the REST API were out of the picture.


As Rafael noted, the ODBC and JDBC interfaces are designed for scale: they incrementally deliver results so that Drill need not hold the entire result set in memory. They also transfer results in a compact binary format.
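
For comparison, a minimal ODBC sketch that pulls results in batches instead of buffering everything; it assumes the Drill ODBC driver is installed and a DSN named "Drill" is configured, and the table path is a placeholder.

import pyodbc

def handle(row):
    pass  # placeholder for client-side processing

conn = pyodbc.connect("DSN=Drill", autocommit=True)
cur = conn.cursor()
cur.execute("SELECT col1, col2 FROM dfs.`/data/myfile.parquet`")

while True:
    batch = cur.fetchmany(10000)   # fetch incrementally; the full result is never held in memory
    if not batch:
        break
    for row in batch:
        handle(row)

conn.close()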

It may be useful to take a step back. It is unclear what use case you are trying to solve. If your client intends to work with all 15 M rows and 16 columns, then it needs sufficient memory to buffer these results. No human or dashboard can consume that much data. So, you must be doing additional processing. Consider pushing that processing into SQL and Drill. Or, consider writing the results to a file using a CREATE TABLE AS (CTAS) statement to avoid buffering the large result set in your client. Big data tools often transform data from one set of files to another since data is too large to buffer in memory. The REST API is perfectly suitable for running that CTAS statement.


That said, should the REST API be extended to be more scalable? Absolutely. Drill is open source. The community is encouraged to help expand Drill's capabilities. We've often discussed the idea of a session-oriented REST API so clients can fetch blocks of results without the need for server or client-side buffering. Easy enough to code.
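
Purely as an illustration of the idea (none of these endpoints exist in Drill today), a session-oriented flow might look something like this:

import requests

base = "http://drillbit-host:8047"

def handle(row):
    pass  # placeholder for client-side processing

# Hypothetical endpoints, for discussion only.
start = requests.post(base + "/query/start",
                      json={"queryType": "SQL",
                            "query": "SELECT * FROM dfs.`/data/myfile.parquet`"}).json()
qid = start["queryId"]

while True:
    page = requests.get(base + "/query/" + qid + "/results",
                        params={"maxRows": 10000}).json()
    for row in page["rows"]:
        handle(row)
    if page.get("lastPage"):
        break

requests.delete(base + "/query/" + qid)   # tell the server the client is done with the query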


The key challenge is state. With the current design, the entire query runs in a single message. If the user abandons the query, the HTTP connection closes and Drill immediately releases all resources. With a REST client using multiple messages to transfer results, how do we know when the client has abandoned the query? Impala struggled with this. Some commercial tools dump results to disk and eventually delete them after x minutes or hours of inactivity. Anyone know of how other tools solve this problem?

Thanks,
- Paul

 


Re: Heap memory and performance issue in Apache drill

Posted by Rafael Jaimes III <ra...@gmail.com>.
Hi Sreeparna,

I know your dataset is 15 million rows and 16 columns, but how big is the
result set you are expecting from that query?

I think that result set is too large for Drill's REST interface to handle,
especially with only a 16G heap. I try to keep REST queries in Drill to
about 10k rows with a limited number of columns. JDBC or ODBC can handle MUCH
larger volumes without issue.

Best,
Rafael
