Posted to user@hive.apache.org by Elliot West <te...@gmail.com> on 2018/08/04 11:27:48 UTC

Optimal approach for changing file format of a partitioned table

Hi,

I’m trying to simply change the format of a very large partitioned table
from JSON to ORC. I’m finding that it is unexpectedly resource intensive,
primarily due to a shuffle phase with the partition key. I end up running
out of disk space in what looks like a spill to disk in the reducers.
However, the partitioning scheme is identical on both the source and the
destination, so my expectation is a map-only job that simply re-encodes each
file.

I’m using INSERT OVERWRITE TABLE with dynamic partitioning. I suspect I
could resolve my issue by allocating more storage to the task nodes.
However, can anyone advise a more resource- and time-efficient approach?
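
For reference, the statement I’m running looks roughly like the following
(the table and column names here are simplified placeholders):

-- Placeholder sketch of the conversion query
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE events_orc PARTITION (dt)
SELECT col_a, col_b, dt
FROM events_json;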

Cheers,

Elliot.

Re: Optimal approach for changing file format of a partitioned table

Posted by Gopal Vijayaraghavan <go...@apache.org>.
A Hive version would help to preface this, because that matters here (for example, TEZ-3709 doesn't apply to hive-1.2).

> I’m trying to simply change the format of a very large partitioned table from JSON to ORC. I’m finding that it is unexpectedly resource intensive, primarily due to a shuffle phase with the partition key. I end up running out of disk space in what looks like a spill to disk in the reducers.

The "shuffle phase with the partition key" sounds like you have the dynamic sort partition enabled, which is necessary to avoid OOMs on the writer due to split generation complications (as you'll see below).

> However, the partitioning scheme is identical on both the source and the destination so my expectation is a map-only job that simply re-encodes each file.

That would've been nice to have, except the split generation in MR InputFormats will use the locality of files & split a single partition into multiple splits, then recombine them by hostname - so the splits aren't aligned with partition boundaries.

> However, can anyone advise a more resource and time efficient approach?

If you don't have enough scratch space to store the same data 2x (well, at a minimum - the shuffle merge does a complete spill for every 100 inputs), it might be helpful to do this as separate jobs (i.e. relaunch AMs) so that you can delete all the scratch storage between the partitions.

The usual chunk size I use is around 30 TB per insert (this corresponds to 7 years in my warehouse tests).

I have for-loop scripts which go over the data & generate chunked insert scripts, but they are trivial enough to rewrite for a different use-case.
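
The generated scripts boil down to one self-contained insert per date
range, something like this (table, column and date-range names here are
made-up placeholders):

-- Chunk 1: runs as its own job, so scratch space can be reclaimed afterwards
INSERT OVERWRITE TABLE events_orc PARTITION (dt)
SELECT col_a, col_b, dt
FROM events_json
WHERE dt >= '2000-01-01' AND dt < '2007-01-01';

-- Chunk 2, and so on for the remaining ranges
INSERT OVERWRITE TABLE events_orc PARTITION (dt)
SELECT col_a, col_b, dt
FROM events_json
WHERE dt >= '2007-01-01' AND dt < '2014-01-01';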

The scratch-space issue is actually tied to some assumptions in this codepath (all the way from 2007), which optimizes the spill + merge for shuffle on spinning disks (to cut down on IOPS). I hope I can rewrite it entirely with something like Apache Crail (to take advantage of NVRAM+RDMA) once there's no need for compatibility with spinning disks.

However, most of the next set of optimizations require a closer inspection of the counters from the task, cluster size and total data-size.

Cheers,
Gopal



Re: Optimal approach for changing file format of a partitioned table

Posted by Furcy Pin <pi...@gmail.com>.
Hi Elliot,

From your description of the problem, I'm assuming that you are doing an
INSERT OVERWRITE table PARTITION(p1, p2) SELECT * FROM table

or something close, like a CREATE TABLE AS ... maybe.

If this is the case, I suspect that your shuffle phase comes from dynamic
partitioning, and in particular from this option (quote from the doc):

> hive.optimize.sort.dynamic.partition
>
>    - Default Value: true in Hive 0.13.0 and 0.13.1; false in Hive 0.14.0
>      and later (HIVE-8151 <https://issues.apache.org/jira/browse/HIVE-8151>)
>    - Added In: Hive 0.13.0 with HIVE-6455
>      <https://issues.apache.org/jira/browse/HIVE-6455>
>
> When enabled, dynamic partitioning column will be globally sorted. This
> way we can keep only one record writer open for each partition value in the
> reducer thereby reducing the memory pressure on reducers.


This option was added to avoid OOM exceptions when doing dynamic
partitioned insertions; however, it has disastrous performance for table
copy operations, where a map-only phase should suffice. Disabling this
option before your query should fix the problem.
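
Something along these lines (a sketch; the table and column names are
placeholders for your own):

-- Turn off the global sort on the dynamic partition column, then run the copy;
-- with identical partitioning on both sides this should plan as a map-only job.
SET hive.optimize.sort.dynamic.partition=false;

INSERT OVERWRITE TABLE events_orc PARTITION (dt)
SELECT col_a, col_b, dt
FROM events_json;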

Also, beware that reading from and inserting to the same partitioned table
may create deadlock issues: https://issues.apache.org/jira/browse/HIVE-12258
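
In other words, keep the source and target as distinct tables (as you are
already doing for the JSON to ORC copy) and avoid patterns like this sketch,
which reads from and overwrites the same partitioned table:

-- Risky pattern (see HIVE-12258): same table as both source and target
INSERT OVERWRITE TABLE events PARTITION (dt)
SELECT col_a, col_b, dt
FROM events;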

Regards,

Furcy

