Posted to user@hive.apache.org by Shushant Arora <sh...@gmail.com> on 2014/05/05 12:03:43 UTC

Many small files vs. one big file in a Hive table

I have a Hive table that is populated from an RDBMS on a daily basis.

After the map-reduce job, each mapper writes its data into the Hive table,
which is partitioned at month level.
The issue: each day when the job runs it fetches the previous day's data, and
each mapper writes its output to a separate file. Should I merge those files
into a single one?

What should the file format be? Is a sequence file better, or plain text?
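
One common approach, sketched below with hypothetical table and column names: let Hive merge the small files itself at the end of each load, and declare the storage format when the table is created. A SequenceFile remains splittable even when block-compressed, while a gzip-compressed text file is not, which matters once single files grow large.

    -- hedged sketch: table, columns, and sizes are illustrative, not prescriptive
    SET hive.merge.mapfiles=true;                -- merge small outputs of map-only jobs
    SET hive.merge.mapredfiles=true;             -- merge small outputs of map-reduce jobs
    SET hive.merge.size.per.task=256000000;      -- target size of each merged file, in bytes
    SET hive.merge.smallfiles.avgsize=16000000;  -- merge pass triggers when avg file size falls below this

    CREATE TABLE user_events (
      user_id    BIGINT,
      event_time STRING
    )
    PARTITIONED BY (load_month STRING)
    STORED AS SEQUENCEFILE;  -- plain TEXTFILE also works, but see the note on splittability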

Re: Many small files vs. one big file in a Hive table

Posted by Shushant Arora <sh...@gmail.com>.
It's for performance optimisation. There are two requirements:

1. I am going to consume the data on a daily basis: run a query on the Hive
table, fetch today's incremental load from the RDBMS, and query against that.
2. Run a cumulative distinct-user query on the whole set.

Shall I merge the output of each mapper to reduce the number of files? If
yes, then how? Does Hadoop have an API for that?
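
One way to do the merge with no custom code, sketched against the same hypothetical user_events table as above: rewrite the partition onto itself through Hive with the merge settings enabled, so one-file-per-mapper output is consolidated into a few large files. Hadoop of this vintage also offers org.apache.hadoop.fs.FileUtil.copyMerge for raw file concatenation, but rewriting through Hive keeps the table metadata consistent.

    -- hedged sketch: rewrites one month's partition in place,
    -- producing a handful of large files instead of one per mapper
    SET hive.merge.mapfiles=true;
    SET hive.merge.mapredfiles=true;

    INSERT OVERWRITE TABLE user_events PARTITION (load_month = '2014-05')
    SELECT user_id, event_time
    FROM user_events
    WHERE load_month = '2014-05';

If the table were stored as RCFile (or, in later Hive versions, ORC), ALTER TABLE user_events PARTITION (load_month = '2014-05') CONCATENATE would merge the files more cheaply, without a full rewrite.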


On Tue, May 6, 2014 at 4:08 AM, Db-Blog <mp...@gmail.com> wrote:

> In general it is recommended to have millions of large files rather than
> billions of small files in Hadoop.
>
> Please describe your issues in detail. For example:
> - How are you planning to consume the data stored in this partitioned table?
> - Are you looking for storage and performance optimizations? Etc.
>
> Thanks
> Saurabh
>
> Sent from my iPhone, please excuse typos.
>
> > On 05-May-2014, at 3:33 pm, Shushant Arora <sh...@gmail.com> wrote:
> >
> > I have a Hive table that is populated from an RDBMS on a daily basis.
> >
> > After the map-reduce job, each mapper writes its data into the Hive table,
> > which is partitioned at month level.
> > The issue: each day when the job runs it fetches the previous day's data,
> > and each mapper writes its output to a separate file. Should I merge those
> > files into a single one?
> >
> > What should the file format be? Is a sequence file better, or plain text?
> >
> >
>

Re: Many small files vs. one big file in a Hive table

Posted by Db-Blog <mp...@gmail.com>.
In general it is recommended to have millions of large files rather than billions of small files in Hadoop.
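
The usual reasoning behind that rule of thumb: the NameNode keeps an in-memory object for every file and every block (on the order of 150 bytes apiece), so a billion small files costs hundreds of gigabytes of heap, and each small file typically becomes its own map task. A quick way to gauge one partition, assuming a default warehouse layout (the path below is hypothetical):

    -- run from the Hive CLI; prints DIR_COUNT, FILE_COUNT and CONTENT_SIZE in bytes
    dfs -count /user/hive/warehouse/user_events/load_month=2014-05;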

Please describe your issues in detail. For example:
- How are you planning to consume the data stored in this partitioned table?
- Are you looking for storage and performance optimizations? Etc.

Thanks
Saurabh

Sent from my iPhone, please excuse typos.

> On 05-May-2014, at 3:33 pm, Shushant Arora <sh...@gmail.com> wrote:
> 
> I have a Hive table that is populated from an RDBMS on a daily basis.
> 
> After the map-reduce job, each mapper writes its data into the Hive table, which is partitioned at month level.
> The issue: each day when the job runs it fetches the previous day's data, and each mapper writes its output to a separate file. Should I merge those files into a single one?
> 
> What should the file format be? Is a sequence file better, or plain text?
> 
>