Posted to user@hive.apache.org by Cosmin Cătălin Sanda <co...@gmail.com> on 2014/01/29 00:51:28 UTC

Hive dynamic partitions generate multiple files

Hi,

I have a number of Hive jobs that run during the day. Each individual job
outputs data to Amazon S3. The Hive jobs use dynamic partitioning.

The problem is that when different jobs need to write to the same dynamic
partition, they will each generate one file.

What I would like is for the subsequent jobs to load the existing data and
merge it with the new data. Can this be achieved somehow? Is there an
option that needs to be enabled? I already set:

SET hive.merge.mapredfiles = true;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

I should mention that the query that actually outputs to S3 is an INSERT
INTO TABLE query. The Hive version is 0.8.1.
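
For illustration, the query is shaped roughly like this (the table and
column names here are placeholders, not the real ones):

INSERT INTO TABLE events PARTITION (year, month, day)
SELECT
  user_id,
  action,
  year,   -- the dynamic partition columns go last in the SELECT list
  month,
  day
FROM staging_events;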


Thank you,
Cosmin

Re: Hive dynamic partitions generate multiple files

Posted by Cosmin Cătălin Sanda <co...@gmail.com>.
Hi Andre,

I think this is indeed the direction in which I am going to go, unless
anyone else has some other ideas :)

------------------------------------
Cosmin Catalin SANDA
Software Systems Engineer
Phone: +45.27.30.60.35



Re: Hive dynamic partitions generate multiple files

Posted by Andre Araujo <ar...@pythian.com>.
Hi, Cosmin,

Functionally, the subsequent queries will work just fine (they will
return the correct results). But you're correct in saying that it's not
optimal.
If the jobs always generate very small files you might end up with a huge
number of small files, which will also have an impact on the NameNode's
memory usage.
In that case I think you could periodically "coalesce" the recent
partitions. Once a week/month you can select from the more recent
partitions and INSERT OVERWRITE, which will convert all those small files
into bigger ones.
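
As a rough sketch, with your dynamic partition settings still in place
(the table and column names below are placeholders, not real ones):

-- Rewrite last month's partitions in place; only the partitions that
-- receive rows are overwritten, each ending up with fewer, larger files
INSERT OVERWRITE TABLE events PARTITION (year, month, day)
SELECT user_id, action, year, month, day
FROM events
WHERE year = '2014' AND month = '01';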

However, if the jobs are creating files that are already around the cluster
block size, it should be fine to leave them as is.

Maybe someone else has some other ideas...


-- 
André Araújo
Big Data Consultant/Solutions Architect
The Pythian Group - Australia - www.pythian.com

Office (calls from within Australia): 1300 366 021 x1270
Office (international): +61 2 8016 7000 x270 OR +1 613 565 8696 x1270
Mobile: +61 410 323 559
Fax: +61 2 9805 0544
IM: pythianaraujo @ AIM/MSN/Y! or araujo@pythian.com @ GTalk

“Success is not about standing at the top, it's the steps you leave behind.”
— Iker Pou (rock climber)

Re: Hive dynamic partitions generate multiple files

Posted by Cosmin Cătălin Sanda <co...@gmail.com>.
Hi Andre,

The reason is that I want those partitions to go into other queries. If the
individual files are only a few MB then the performance will be
sub-optimal. As far as I understood, the individual files need to be at
least around 140MB for the map tasks to work properly.

------------------------------------
Cosmin Catalin SANDA
Software Systems Engineer
Phone: +45.27.30.60.35



Re: Hive dynamic partitions generate multiple files

Posted by Andre Araujo <ar...@pythian.com>.
Why do you need exactly one file? This is transparent to Hive, which
handles it seamlessly. Unless you have external requirements (something
else reading the files directly) it shouldn't matter.

HDFS support for file append is not a solid standard, AFAIK, and will depend
on the distribution and version you're using. In some versions file append
is not available, and the only way to add data to an existing Hive table is
to create an additional file under the table's directory in HDFS. I haven't
looked at the code, but it may be that the Hive developers chose this as the
default way of appending data so that it works with all HDFS distributions
and versions.

If you need to merge multiple files under the same partition you can select
everything from that partition and INSERT OVERWRITE the data again.
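
For a single partition, that would be something like this (placeholder
names again):

-- Read the partition back and rewrite it as the output of one job
INSERT OVERWRITE TABLE events PARTITION (year='2014', month='01', day='23')
SELECT user_id, action
FROM events
WHERE year='2014' AND month='01' AND day='23';

Note that with a static partition spec like this, the partition columns are
left out of the SELECT list.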

But again, unless you have requirements external to Hive, you shouldn't be
concerned about that.


-- 
André Araújo
Big Data Consultant/Solutions Architect
The Pythian Group - Australia - www.pythian.com

Office (calls from within Australia): 1300 366 021 x1270
Office (international): +61 2 8016 7000 x270 OR +1 613 565 8696 x1270
Mobile: +61 410 323 559
Fax: +61 2 9805 0544
IM: pythianaraujo @ AIM/MSN/Y! or araujo@pythian.com @ GTalk

“Success is not about standing at the top, it's the steps you leave behind.”
— Iker Pou (rock climber)

Re: Hive dynamic partitions generate multiple files

Posted by Cosmin Cătălin Sanda <co...@gmail.com>.
Hi Andre,

So the thing is like this: the first time the query runs, it generates one
file per dynamic partition. The next time the query runs and needs to
write to the same partition, it will generate another file instead of
merging with the existing one.

E.g.:
1. The partitioned S3 path looks like this: s3://bucket/export/2014/01/23
2. I run the query on some data and ultimately end up having a file in
the above-mentioned partition.
3. I run the same query on some other data which ends up writing to the
same partition as above, only it doesn't take the existing file from there
and merge with it; it generates a second file in the same partition.


------------------------------------
Cosmin Catalin SANDA
Software Systems Engineer
Phone: +45.27.30.60.35



Re: Hive dynamic partitions generate multiple files

Posted by Andre Araujo <ar...@pythian.com>.
Hi, Cosmin,

Have you tried using DISTRIBUTE BY to distribute the query's data by the
partitioning columns?
That way all the data for each partition should be sent to the same reducer
and should be written to a single file in each partition, I think.
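
Something along these lines, with the partition columns repeated in the
SELECT list and the DISTRIBUTE BY (placeholder table and column names):

INSERT INTO TABLE events PARTITION (year, month, day)
SELECT user_id, action, year, month, day
FROM staging_events
DISTRIBUTE BY year, month, day;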

If your data is being distributed by different criteria, you will
potentially have multiple reducers writing to the same partitions.

Andre



-- 
André Araújo
Big Data Consultant/Solutions Architect
The Pythian Group - Australia - www.pythian.com

Office (calls from within Australia): 1300 366 021 x1270
Office (international): +61 2 8016 7000 x270 OR +1 613 565 8696 x1270
Mobile: +61 410 323 559
Fax: +61 2 9805 0544
IM: pythianaraujo @ AIM/MSN/Y! or araujo@pythian.com @ GTalk

“Success is not about standing at the top, it's the steps you leave behind.”
— Iker Pou (rock climber)
