Posted to issues@drill.apache.org by "Shankar (JIRA)" <ji...@apache.org> on 2015/11/21 18:43:10 UTC

[jira] [Created] (DRILL-4118) Drillbit Foreman shutdowns while executing complex query on large amount of data

Shankar created DRILL-4118:
------------------------------

             Summary: Drillbit Foreman shutdowns while executing complex query on large amount of data
                 Key: DRILL-4118
                 URL: https://issues.apache.org/jira/browse/DRILL-4118
             Project: Apache Drill
          Issue Type: Test
    Affects Versions: 1.2.0
            Reporter: Shankar


h4.{color:DarkCyan}*System config for POC:*{color}
* Servers => AWS instances
* Total Servers => 3
* Instance type => c4.xlarge
* vCPU => 4
* Memory => 7.5 GB
* Storage Type => EBS
* OS => CentOS 6.6 (x64 architecture)

h4.{color:DarkCyan}*Data :*{color}
* Data size => 15 GB gzip-compressed (equivalent to 150 GB of uncompressed data)
* Type of data => JSON (one JSON object per line)
* Persistent storage => HDFS
* Data frequency => 1 day of data only (files are split by hour, one file name per hour)
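
To put the figures above side by side, here is a rough sizing sketch in Python. It uses only the numbers reported in this issue, plus one assumption that is ours: the day is split into 24 hourly files of roughly equal size. It is only a scale comparison, not a prediction of how much memory any particular query needs.

```python
# Back-of-the-envelope sizing from the figures reported in this issue.
compressed_gb = 15.0        # one day of gzip-compressed JSON
uncompressed_gb = 150.0     # reported uncompressed equivalent
nodes = 3                   # c4.xlarge instances
memory_per_node_gb = 7.5    # memory per instance
files_per_day = 24          # ASSUMPTION: one equally sized file per hour

compression_ratio = uncompressed_gb / compressed_gb
compressed_per_file_gb = compressed_gb / files_per_day
uncompressed_per_file_gb = uncompressed_gb / files_per_day
uncompressed_per_node_gb = uncompressed_gb / nodes
cluster_memory_gb = nodes * memory_per_node_gb
data_to_memory_ratio = uncompressed_gb / cluster_memory_gb

print(f"compression ratio:          {compression_ratio:.1f}x")
print(f"compressed size per file:   {compressed_per_file_gb:.3f} GB")
print(f"uncompressed size per file: {uncompressed_per_file_gb:.2f} GB")
print(f"uncompressed data per node: {uncompressed_per_node_gb:.1f} GB")
print(f"total cluster memory:       {cluster_memory_gb:.1f} GB")
print(f"uncompressed data / memory: {data_to_memory_ratio:.2f}x")
```

Under that assumption, one day of uncompressed data is several times larger than the combined memory of the three nodes.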

h4.{color:DarkCyan}*How we set up Apache Drill:*{color}
# Version = Apache Drill 1.2.0
# Set up with the default configuration on all 3 nodes.
# Used the Drill shell to run the queries.
# Used the Drill Web Console to analyze the queries.


h4.{color:green}*Query-1 (total counts):*{color}
We ran a simple query over *1 hour of data*. The query is below:

 - select count(`timestamp`) from dfs.`/tmp/hadoop/20151120-10.json.gz`

- The query took about 120 seconds and completed successfully.
- CPU load => 1.5 (average per node)
- Memory used => 3 GB (average per node)



h4.{color:green}*Query-2 (distinct counts)  :*{color}
We ran a simple query over *1 hour of data*. The query is below:

- select count( distinct `timestamp`) from dfs.`/tmp/hadoop/20151120-10.json.gz`

- The query took about 200 seconds and completed successfully.
- CPU load => 5.5 (average per node)
- Memory used => 3.9 GB (average per node)



h4.{color:green}*Query-3 (create table using filter)  :*{color}

We ran a simple query over *1 day of data*. The query is below:

- create table tmp as select col1, col2 from dfs.`/tmp/hadoop`
where col like '%filter-text%'

- All columns are strings.
- The query took about 340 seconds and completed successfully.
- CPU load => 6.2 (average per node)
- Memory used => 4.2 GB (average per node)



h4.{color:red}*Query-4 (complex query with filters) :*{color}
We ran the query over *1 day of data*. The query is below:

select
count( distinct case when col like '%filter-text%' then sessions end ) as new_col_01,
count( distinct case when col like '%filter-text%' then sessions end ) as new_col_02,
------------------
------------------
------------------
count( distinct case when col like '%filter-text%' then sessions end ) as new_col_15
from dfs.`/tmp/hadoop`

- All columns are strings.
- The filter condition is different for each count clause.
- {color:red}From the Drill shell => *the query appeared to be still running*{color}
- {color:red}From the logs => *the Drillbit Foreman shut down*{color}
- CPU load => *85.x* (average per node)
- Memory used => *6.6 GB* (average per node)
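
For anyone trying to reproduce the shape of Query-4, the following Python sketch generates a query text with 15 `count( distinct case ... )` columns. The helper name `build_distinct_count_query` and the patterns `%filter-text-01%` through `%filter-text-15%` are hypothetical placeholders. The real filter conditions are not shown in this report and are not reconstructed here.

```python
def build_distinct_count_query(patterns, table="dfs.`/tmp/hadoop`",
                               filter_col="col", value_col="sessions"):
    """Build a Query-4-shaped statement: one distinct count per LIKE pattern.

    Hypothetical helper for illustration only; column and table names are
    the placeholders used in this report.
    """
    select_items = [
        f"count( distinct case when {filter_col} like '{pattern}' "
        f"then {value_col} end ) as new_col_{index:02d}"
        for index, pattern in enumerate(patterns, start=1)
    ]
    return "select\n" + ",\n".join(select_items) + f"\nfrom {table}"


# PLACEHOLDER patterns: stand-ins for the 15 real (different) filter texts.
patterns = [f"%filter-text-{i:02d}%" for i in range(1, 16)]
query = build_distinct_count_query(patterns)
print(query)
```

The printed statement can be pasted into the Drill shell after replacing the placeholder patterns with real filter texts.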

{color:red}=> Error from the log file of the Drillbit Foreman node{color}
----------------------------------------------------

2015-11-20 18:53:59,185 [29b058ba-2c2c-2c7b-d380-00fb51af47c2:foreman] INFO  o.a.d.e.s.schedule.BlockMapBuilder - Get block maps: Executed 1 out of 1 using 1 threads. Time: 41ms total, 41.774180ms avg, 41ms max.
2015-11-20 18:53:59,185 [29b058ba-2c2c-2c7b-d380-00fb51af47c2:foreman] INFO  o.a.d.e.s.schedule.BlockMapBuilder - Get block maps: Executed 1 out of 1 using 1 threads. Earliest start: 7.217000 μs, Latest start: 7.217000 μs, Average start: 7.217000 μs .
2015-11-20 19:06:07,320 [Drillbit-ShutdownHook#0] INFO  o.apache.drill.exec.server.Drillbit - Received shutdown request.

----------------------------------------------------
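
To measure the gap between the last Foreman log entry and the shutdown request in the excerpt above, a small Python sketch such as the following can parse the leading timestamps. The helper name `parse_log_timestamp` is ours, and the sketch assumes every log line starts with a 23-character `YYYY-MM-DD HH:MM:SS,mmm` timestamp, as the lines above do.

```python
from datetime import datetime

LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"  # e.g. 2015-11-20 18:53:59,185


def parse_log_timestamp(line: str) -> datetime:
    """Parse the leading 23-character timestamp of a drillbit log line."""
    return datetime.strptime(line[:23], LOG_TS_FORMAT)


# Two lines copied from the log excerpt in this issue.
last_foreman_line = (
    "2015-11-20 18:53:59,185 [29b058ba-2c2c-2c7b-d380-00fb51af47c2:foreman] "
    "INFO  o.a.d.e.s.schedule.BlockMapBuilder - Get block maps: Executed 1 "
    "out of 1 using 1 threads. Time: 41ms total, 41.774180ms avg, 41ms max."
)
shutdown_line = (
    "2015-11-20 19:06:07,320 [Drillbit-ShutdownHook#0] INFO  "
    "o.apache.drill.exec.server.Drillbit - Received shutdown request."
)

gap = parse_log_timestamp(shutdown_line) - parse_log_timestamp(last_foreman_line)
gap_seconds = gap.total_seconds()
print(f"Gap before shutdown request: {gap_seconds:.3f} s "
      f"({gap_seconds / 60:.1f} min)")
```

For these two lines the gap is a little over 12 minutes, and the excerpt shows no log entries in between.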




h4.*Questions:*

# Could you please suggest a solution for the above error?
# Does a Drillbit need high-end servers to process a large amount of data?
# Does Drill work well if we scale horizontally with low-spec servers (say 4 virtual CPUs, 8 GB memory) to process a large amount of data?
# Does Drill work well if we scale horizontally with low-spec servers (say 8 virtual CPUs, 16 GB memory) to process a large amount of data?
# Finally, could you please recommend a well-tuned configuration?







--
This message was sent by Atlassian JIRA
(v6.3.4#6332)