Posted to dev@hive.apache.org by Palanieppan Muthiah <pa...@troo.ly> on 2016/12/09 07:52:24 UTC

Hive 2.1.0 - writes multiple copies of same data to temp location

Hi,

I am using Hive 2.1.0 on Amazon EMR. In my 'insert overwrite' job, whose
source and target tables are both in S3, I notice that two copies of the
result are created in the temp directory on S3.
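
For reference, the job is just a plain insert-overwrite between two S3-backed
tables with an aggregation, roughly like the sketch below (table and bucket
names are placeholders, not the real ones):

    -- Placeholder tables; both source and target live on S3.
    CREATE EXTERNAL TABLE s3_source (key STRING, val BIGINT)
      STORED AS ORC
      LOCATION 's3://my-bucket/source/';

    CREATE EXTERNAL TABLE s3_target (key STRING, cnt BIGINT)
      STORED AS ORC
      LOCATION 's3://my-bucket/target/';

    -- The aggregation whose result shows up twice under the temp location.
    INSERT OVERWRITE TABLE s3_target
    SELECT key, count(*) AS cnt
    FROM s3_source
    GROUP BY key;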

First, the output of the query is written to a temp directory (e.g. ext-10000)
in S3 by the MR job. The MR job then completes, but the Hive client still
doesn't terminate. Instead, I see that the entire temp directory is copied
again within S3, into another directory (e.g. tmp-ext-10000), file by file.

Is this a known issue? In my case, the query reads about 0.5 terabytes of
data, performs an aggregation, and writes the result back to S3. The second
copy is very slow and usually fails with a NoHttpResponseException from S3.

Let me know if this is a known issue, whether there are workarounds, or
whether there are config options that avoid the second copy.


Thanks,
pala

Re: Hive 2.1.0 - writes multiple copies of same data to temp location

Posted by Sergey Shelukhin <se...@hortonworks.com>.
We are addressing this in HIVE-14535, which eliminates all of the copies.
Unfortunately, it won't be finished until Jan-Feb, and it's a major change.
I think there's a more specific change somewhere that may eliminate one of
the copies; IIRC it may ship in 2.2?
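
For reference, the blobstore scratch-dir settings from HIVE-14270 (shipped in
2.2, not available in 2.1.0) look roughly like the sketch below; it is not
certain that this is the specific change referred to above:

    -- Hedged sketch, assuming the HIVE-14270 blobstore settings are present
    -- (Hive 2.2+). With use.blobstore.as.scratchdir=false, intermediate data
    -- is written to hive.exec.scratchdir (typically HDFS) and only the final
    -- result is moved to S3, avoiding the extra S3-to-S3 copy.
    SET hive.blobstore.supported.schemes=s3,s3a,s3n;
    SET hive.blobstore.use.blobstore.as.scratchdir=false;
    SET hive.exec.scratchdir=hdfs:///tmp/hive;

    INSERT OVERWRITE TABLE s3_target
    SELECT key, count(*) AS cnt
    FROM s3_source
    GROUP BY key;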
