Posted to user@hive.apache.org by Noam Hasson <no...@kenshoo.com> on 2015/07/02 16:05:39 UTC

Insert overwrite - Move to trash part is extremely slow

Hi All,

We are running an "insert overwrite" query with dynamic partitioning in
Hive. When it reaches the part where it deletes the existing partition
data, the log shows lines like:
Moved: 'hdfs://BICluster/user/noamh/hive/p_noamh/dw_fact_marketing_daily_by_year/date_year=2014/ksname=KS1513/000141_0' to trash at: hdfs://BICluster/user/noamh/.Trash/Current
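
For reference, the statement has roughly this shape; the target table and
partition keys come from the paths above, while the source table and
columns are hypothetical placeholders:

  SET hive.exec.dynamic.partition=true;
  SET hive.exec.dynamic.partition.mode=nonstrict;

  INSERT OVERWRITE TABLE dw_fact_marketing_daily_by_year
  PARTITION (date_year, ksname)
  SELECT clicks, cost, date_year, ksname  -- hypothetical measure columns
  FROM stg_marketing_daily;               -- hypothetical source table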

Each move command is slow, and the time scales with the number of
partitions: in tables with a small number of partitions each move takes
about a second, while in tables with a large number of partitions it
takes up to 10 seconds per move. It's important to note that this
happens only in the move phase; writing the same data into a freshly
created table, where there are no existing partitions to delete, runs
much, much faster (see the sketch below).
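
For example, redirecting the same insert into a freshly created copy of
the table, so there is nothing to move to trash, avoids the slow phase
entirely (same hypothetical source table and columns as above):

  CREATE TABLE dw_fact_marketing_daily_by_year_new
    LIKE dw_fact_marketing_daily_by_year;

  INSERT OVERWRITE TABLE dw_fact_marketing_daily_by_year_new
  PARTITION (date_year, ksname)
  SELECT clicks, cost, date_year, ksname
  FROM stg_marketing_daily;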

Running Hive with full trace logging, we examined the log and found that
for each move command Hive queries HDFS for all the partitions in the
table; in other words, each move triggers thousands of HDFS queries. For
a table with 1,000 partitions, a single insert overwrite will therefore
query HDFS about 1,000,000 times (1,000 moves x 1,000 listings each).

It seems that before deleting each partition, Hive for some reason lists
all the other partitions in the table.
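
The scale involved is easy to see with SHOW PARTITIONS; judging by the
trace, each directory listed here appears to get one getListing call per
move:

  SHOW PARTITIONS dw_fact_marketing_daily_by_year;
  -- date_year=2014/ksname=KS1513
  -- date_year=2014/ksname=KS5058
  -- ... one row per partition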

Here is a sample of the log:
2015-07-02 08:11:21,432 DEBUG [IPC Client (584199013) connection to ecprdbhdp02-namenode/10.53.210.153:8020 from noamh]: ipc.Client (Client.java:receiveRpcResponse(1065)) - IPC Client (584199013) connection to ecprdbhdp02-namenode/10.53.210.153:8020 from noamh got value #1939
2015-07-02 08:11:21,432 DEBUG [main]: ipc.ProtobufRpcEngine (ProtobufRpcEngine.java:invoke(221)) - Call: getListing took 1ms
2015-07-02 08:11:21,432 TRACE [main]: ipc.ProtobufRpcEngine (ProtobufRpcEngine.java:invoke(236)) - 1: Response <- ecprdbhdp02-namenode/10.53.210.153:8020: getListing {dirList { partialListing { fileType: IS_FILE path: "000332_0" length: 2217675 permission { perm: 511 } owner: "noamh" group: "dba" modification_time: 1435838591925 access_time: 1435838591722 block_replication: 2 blocksize: 134217728 fileId: 42896800 childrenNum: 0 } remainingEntries: 0 }}
2015-07-02 08:11:21,433 TRACE [main]: ipc.ProtobufRpcEngine (ProtobufRpcEngine.java:invoke(197)) - 1: Call -> ecprdbhdp02-namenode/10.53.210.153:8020: getListing {src: "/user/noamh/hive/p_noamh/dw_fact_marketing_daily_by_year/date_year=2014/ksname=KS5058" startAfter: "" needLocation: false}
2015-07-02 08:11:21,433 DEBUG [IPC Parameter Sending Thread #0]: ipc.Client (Client.java:run(1008)) - IPC Client (584199013) connection to ecprdbhdp02-namenode/10.53.210.153:8020 from noamh sending #1940
2015-07-02 08:11:21,434 DEBUG [IPC Client (584199013) connection to ecprdbhdp02-namenode/10.53.210.153:8020 from noamh]: ipc.Client (Client.java:receiveRpcResponse(1065)) - IPC Client (584199013) connection to ecprdbhdp02-namenode/10.53.210.153:8020 from noamh got value #1940
2015-07-02 08:11:21,434 DEBUG [main]: ipc.ProtobufRpcEngine (ProtobufRpcEngine.java:invoke(221)) - Call: getListing took 1ms
2015-07-02 08:11:21,434 TRACE [main]: ipc.ProtobufRpcEngine (ProtobufRpcEngine.java:invoke(236)) - 1: Response <- ecprdbhdp02-namenode/10.53.210.153:8020: getListing {dirList { partialListing { fileType: IS_FILE path: "000333_0" length: 189581 permission { perm: 511 } owner: "noamh" group: "dba" modification_time: 1435838592633 access_time: 1435838592558 block_replication: 2 blocksize: 134217728 fileId: 42896823 childrenNum: 0 } remainingEntries: 0 }}
2015-07-02 08:11:21,435 TRACE [main]: ipc.ProtobufRpcEngine (ProtobufRpcEngine.java:invoke(197)) - 1: Call -> ecprdbhdp02-namenode/10.53.210.153:8020: getListing {src: "/user/noamh/hive/p_noamh/dw_fact_marketing_daily_by_year/date_year=2014/ksname=KS5060" startAfter: "" needLocation: false}
2015-07-02 08:11:21,435 DEBUG [IPC Parameter Sending Thread #0]: ipc.Client (Client.java:run(1008)) - IPC Client (584199013) connection to ecprdbhdp02-namenode/10.53.210.153:8020 from noamh sending #1941
2015-07-02 08:11:21,436 DEBUG [IPC Client (584199013) connection to ecprdbhdp02-namenode/10.53.210.153:8020 from noamh]: ipc.Client (Client.java:receiveRpcResponse(1065)) - IPC Client (584199013) connection to ecprdbhdp02-namenode/10.53.210.153:8020 from noamh got value #1941
2015-07-02 08:11:21,436 DEBUG [main]: ipc.ProtobufRpcEngine (ProtobufRpcEngine.java:invoke(221)) - Call: getListing took 1ms
2015-07-02 08:11:21,436 TRACE [main]: ipc.ProtobufRpcEngine (ProtobufRpcEngine.java:invoke(236)) - 1: Response <- ecprdbhdp02-namenode/10.53.210.153:8020: getListing {dirList { partialListing { fileType: IS_FILE path: "000356_0" length: 8729 permission { perm: 511 } owner: "noamh" group: "dba" modification_time: 1435838592168 access_time: 1435838591924 block_replication: 2 blocksize: 134217728 fileId: 42896807 childrenNum: 0 } remainingEntries: 0 }}
2015-07-02 08:11:21,437 TRACE [main]: ipc.ProtobufRpcEngine (ProtobufRpcEngine.java:invoke(197)) - 1: Call -> ecprdbhdp02-namenode/10.53.210.153:8020: getListing {src: "/user/noamh/hive/p_noamh/dw_fact_marketing_daily_by_year/date_year=2014/ksname=KS5061" startAfter: "" needLocation: false}
2015-07-02 08:11:21,437 DEBUG [IPC Parameter Sending Thread #0]: ipc.Client (Client.java:run(1008)) - IPC Client (584199013) connection to ecprdbhdp02-namenode/10.53.210.153:8020 from noamh sending #1942
2015-07-02 08:11:21,438 DEBUG [IPC Client (584199013) connection to ecprdbhdp02-namenode/10.53.210.153:8020 from noamh]: ipc.Client (Client.java:receiveRpcResponse(1065)) - IPC Client (584199013) connection to ecprdbhdp02-namenode/10.53.210.153:8020 from noamh got value #1942
2015-07-02 08:11:21,438 DEBUG [main]: ipc.ProtobufRpcEngine (ProtobufRpcEngine.java:invoke(221)) - Call: getListing took 1ms
2015-07-02 08:11:21,438 TRACE [main]: ipc.ProtobufRpcEngine (ProtobufRpcEngine.java:invoke(236)) - 1: Response <- ecprdbhdp02-namenode/10.53.210.153:8020: getListing {dirList { partialListing { fileType: IS_FILE path: "000357_0" length: 5793313 permission { perm: 511 } owner: "noamh" group: "dba" modification_time: 1435838590767 access_time: 1435838589902 block_replication: 2 blocksize: 134217728 fileId: 42896735 childrenNum: 0 } remainingEntries: 0 }}
2015-07-02 08:11:21,439 TRACE [main]: ipc.ProtobufRpcEngine (ProtobufRpcEngine.java:invoke(197)) - 1: Call -> ecprdbhdp02-namenode/10.53.210.153:8020: getListing {src: "/user/noamh/hive/p_noamh/dw_fact_marketing_daily_by_year/date_year=2014/ksname=KS5062" startAfter: "" needLocation: false}
2015-07-02 08:11:21,439 DEBUG [IPC Parameter Sending Thread #0]: ipc.Client (Client.java:run(1008)) - IPC Client (584199013) connection to ecprdbhdp02-namenode/10.53.210.153:8020 from noamh sending #1943


We are working with CDH 5.3.1, Hive version 0.13.1.
If anyone has any idea how to make Hive simply delete each partition
without querying all the others each time, we would appreciate it.

Thanks,
Noam.
