Posted to user@drill.apache.org by Anup Tiwari <an...@games24x7.com> on 2018/03/12 08:27:07 UTC

[Drill 1.10.0/1.12.0] Query Started Taking Time + frequent one or more node lost connectivity error

Hi All,
For the last couple of days I have been stuck on a problem. I have a query that
left joins 3 Drill tables (Parquet). It usually takes around 15-20 minutes, but
for the last couple of days it has been taking more than 45 minutes. When I
drilled down, I could see in the operator profile that 40% of the query time
goes to PARQUET_WRITER and 28% to PARQUET_ROW_GROUP_SCAN. I am not sure whether
the stats were the same before this issue, as earlier it executed in 15-20
minutes at most. Also, on top of this table we used to create another table,
which is now failing with the error below:
SYSTEM ERROR: BlockMissingException: Could not obtain block:
BP-1083556055-10.51.2.101-1481111327179:blk_1094763477_21022752
In the last few days I have also been getting frequent "one or more nodes lost
connectivity" errors.
I just upgraded from Drill 1.10.0 to 1.12.0, but the above issues are still
there.
Any help will be appreciated.
Regards,
Anup Tiwari

Re: [Drill 1.10.0/1.12.0] Query Started Taking Time + frequent one or more node lost connectivity error

Posted by Anup Tiwari <an...@games24x7.com>.
Hi Kunal,
Please find my answers (highlighted) to your questions below:

"It seems like you are trying to read from a source and write to a destination
with partitioning (or a HashJoin/HashAgg prior to writing)": No partitions; it
is a simple left join query with CTAS, and all tables in the join were created
in Drill (Parquet).
"Is the data highly skewed on such a column?": We are not partitioning the data
on any column; however, the join condition is based on a string column which
should not be skewed, but I will check this from my end.

I will try to share the JSON profile ASAP.
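One way to check the join key for skew (in Drill this would simply be a GROUP BY
count on the join column) can be sketched with a self-contained Python stand-in;
the function and sample data below are hypothetical, purely for illustration:

```python
from collections import Counter

def key_skew_report(keys, top_n=3):
    """Report what fraction of rows the most frequent join-key values cover.

    If one value covers a large share of the rows, a hash join or hash
    partition on this column funnels most records to one fragment.
    """
    counts = Counter(keys)
    total = len(keys)
    return [(value, count / total) for value, count in counts.most_common(top_n)]

# Synthetic example: one value dominates the join column.
sample = ["guest"] * 9000 + ["user_%d" % i for i in range(1000)]
report = key_skew_report(sample)
print(report[0])  # ('guest', 0.9) -> heavily skewed
```

If the top value covers most rows, that would explain one fragment doing nearly
all the work.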






Regards,
Anup Tiwari

Re: [Drill 1.10.0/1.12.0] Query Started Taking Time + frequent one or more node lost connectivity error

Posted by Kunal Khatua <ku...@apache.org>.
Hi Anup

It would help if you can share the profile (*.sys.drill / *.json files) so we
can explain what is happening. I don't think the user mailing list allows
attachments, so you could use an online document sharing service (e.g., Google
Drive) to share them.

Coming back to your description, it seems like you are trying to read from a
source and write to a destination with partitioning (or a HashJoin/HashAgg
prior to writing). If that is the case, the records are most likely all getting
into one fragment because of skew in the unique values of the column you are
partitioning on.

Is the data highly skewed on such a column?
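The mechanism described above can be illustrated with a small hypothetical
Python sketch (a toy model, not Drill's actual partitioning code): when rows are
hashed into fragments by key, every row carrying the dominant key value lands in
the same bucket:

```python
# Illustrative only: model how hash-partitioning a skewed key
# concentrates records in one bucket (analogous to one Drill fragment).
NUM_FRAGMENTS = 40

def bucket_for(key, num_buckets=NUM_FRAGMENTS):
    # Deterministic stand-in hash; Drill's real hash function differs.
    return sum(key.encode()) % num_buckets

rows = ["guest"] * 99_000 + ["user_%d" % i for i in range(1_000)]
counts = [0] * NUM_FRAGMENTS
for key in rows:
    counts[bucket_for(key)] += 1

busiest = max(counts)
print(busiest / len(rows))  # the dominant key's bucket holds ~99% of the rows
```

With ~30-40 fragments but one key value covering almost all rows, this matches
the observed "99.99999% of records in one fragment" behavior.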




Re: [Drill 1.10.0/1.12.0] Query Started Taking Time + frequent one or more node lost connectivity error

Posted by Anup Tiwari <an...@games24x7.com>.
Also, I have observed one thing: the query that is taking time creates ~30-40
fragments, and 99.99999% of the records are written into only one fragment.





Regards,
Anup Tiwari

Re: [Drill 1.10.0/1.12.0] Query Started Taking Time + frequent one or more node lost connectivity error

Posted by Anup Tiwari <an...@games24x7.com>.
Hi Padma,
Please find my answers (highlighted) to your questions below:

"Connection loss error can happen when ZooKeeper thinks that a node is dead
because it did not get a heartbeat from the node. It can be because the node is
busy or you have network problems. Did anything change in your network?"
Answer: No. We also cross-verified intra-node communication, and it is working
fine.

Q) Is the data static or are you adding new data? Answer: Data is static.
Q) Do you have metadata caching enabled? Answer: No.
"PARQUET_WRITER seems to indicate you are doing some kind of CTAS": This is
correct, we are doing CTAS.
"The block missing exception could possibly mean some problem with the name
node or bad disks on one of the nodes": There is no bad disk. Also, when I
checked for that file with the hadoop ls command, it is present, so can you
tell me why Drill is reporting a missing block? You also mentioned "it could
possibly mean a problem with the name node"; I have checked, and the namenode
is running fine. We are also executing some Hive queries on the same cluster
and those are running fine, so if it were a namenode issue, I think it would
affect all queries.
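One detail worth noting here: a file showing up in `hadoop fs -ls` does not
rule out a BlockMissingException, because `ls` only consults the namenode's
metadata, while a read needs a live replica of every block on some datanode.
A hypothetical Python model (the block IDs and node names below are invented
for illustration) makes the distinction concrete:

```python
# Illustrative model: why a file can be listable yet unreadable in HDFS.
file_blocks = {
    "blk_A": ["dn3"],          # replica locations recorded by the namenode
    "blk_B": ["dn1", "dn2"],
}
live_datanodes = {"dn1", "dn2"}  # dn3 is down or its disk is unreadable

def listable(blocks):
    # `ls` needs only namenode metadata, which exists regardless of replica health.
    return len(blocks) > 0

def readable(blocks, live):
    # A read needs at least one replica of every block on a live datanode.
    return all(any(dn in live for dn in replicas) for replicas in blocks.values())

print(listable(file_blocks))                  # True
print(readable(file_blocks, live_datanodes))  # False -> read fails on blk_A
```

So the missing block points at replica availability on the datanodes, not at
the namenode being down.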





Regards,
Anup Tiwari

Re: [Drill 1.10.0/1.12.0] Query Started Taking Time + frequent one or more node lost connectivity error

Posted by Padma Penumarthy <pp...@mapr.com>.
There can be a lot of issues here.
A connection loss error can happen when ZooKeeper thinks that a node is dead
because it did not get a heartbeat from the node. That can be because the node
is busy or because you have network problems. Did anything change in your
network?
Is the data static or are you adding new data? Do you have metadata caching
enabled?
PARQUET_WRITER seems to indicate you are doing some kind of CTAS.
The block missing exception could possibly mean some problem with the name
node or bad disks on one of the nodes.
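The heartbeat behavior described above can be sketched in a few lines of
Python (a simplified model, not ZooKeeper's actual protocol; the timeout value
is an arbitrary example): the server declares a session dead once the gap since
the last heartbeat exceeds the session timeout, so a node that is merely too
busy to send heartbeats looks the same as one that has lost connectivity:

```python
# Simplified model of a ZooKeeper-style session timeout (illustrative only).
SESSION_TIMEOUT_S = 30.0

def is_session_expired(last_heartbeat_s, now_s, timeout_s=SESSION_TIMEOUT_S):
    """A session expires when no heartbeat arrives within the timeout window."""
    return (now_s - last_heartbeat_s) > timeout_s

# A node stalled by a heavy query for 45 s misses its heartbeats...
print(is_session_expired(last_heartbeat_s=0.0, now_s=45.0))   # True: declared dead
# ...while a node that pinged 5 s ago is considered alive.
print(is_session_expired(last_heartbeat_s=40.0, now_s=45.0))  # False
```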

Thanks
Padma
