Posted to user@hive.apache.org by Todd Lee <ro...@gmail.com> on 2010/07/08 02:51:04 UTC

1 big file or multiple smaller files for loading data from a database?

Hi,

I am new to Hive and Hadoop in general. I have a table in Oracle with
millions of rows, and I'd like to export it into HDFS so that I can run
some Hive queries. My first question: is it recommended to export the
entire table as a single file (possibly 5 GB), or as multiple smaller
files (say, 10 files of 500 MB each)? Also, does it matter if I put the
files under different sub-directories before I do the data load in
Hive, or does everything have to be under the same folder?

Thanks,
T

p.s. I am sorry if this post is submitted twice.
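
For reference, a minimal sketch of the Hive side of this, assuming a
hypothetical "orders" table and staging path (neither is from the
thread). Hive reads every file sitting directly in a table's directory,
so one LOAD statement covers a whole directory of files:

    -- hypothetical columns; adjust to the real Oracle schema
    CREATE TABLE orders (id BIGINT, customer STRING, amount DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- INPATH may name a directory: every file directly inside it is
    -- moved into the table's warehouse directory in one step
    LOAD DATA INPATH '/staging/orders' INTO TABLE orders;

Whether the directory holds one 5 GB file or ten 500 MB files, the
statement is the same; files nested in sub-directories, however, are
not picked up unless they are set up as partitions.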

Re: 1 big file or multiple smaller files for loading data from a database?

Posted by Edward Capriolo <ed...@gmail.com>.
On Wed, Jul 7, 2010 at 9:11 PM, Todd Lee <ro...@gmail.com> wrote:
> Thanks. But is it going to create one big file in HDFS? I am currently
> considering writing my own Cascading job for this.
> thx,
> T
>
> On Wed, Jul 7, 2010 at 6:06 PM, Sarah Sproehnle <sa...@cloudera.com> wrote:
>>
>> Hi Todd,
>>
>> Are you planning to use Sqoop to do this import?  If not, you should.
>> :)  It will do a parallel import, using MapReduce, to load the table
>> into Hadoop.  With the --hive-import option, it will also create the
>> Hive table definition.
>>
>> Cheers,
>> Sarah
>>
>> On Wed, Jul 7, 2010 at 5:51 PM, Todd Lee <ro...@gmail.com> wrote:
>> > Hi,
>> > I am new to Hive and Hadoop in general. I have a table in Oracle with
>> > millions of rows, and I'd like to export it into HDFS so that I can
>> > run some Hive queries. My first question: is it recommended to export
>> > the entire table as a single file (possibly 5 GB), or as multiple
>> > smaller files (say, 10 files of 500 MB each)? Also, does it matter if
>> > I put the files under different sub-directories before I do the data
>> > load in Hive, or does everything have to be under the same folder?
>> > Thanks,
>> > T
>> > p.s. I am sorry if this post is submitted twice.
>>
>>
>>
>> --
>> Sarah Sproehnle
>> Educational Services
>> Cloudera, Inc
>> http://www.cloudera.com/training
>
>

Hadoop does not handle many small files well. Look up "hadoop small
file problem". Performance-wise you should try to have as few files as
possible, but you should notice no difference in runtime between 1, 5,
or even 500 files when your data is as big as 5 GB.
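
For completeness, the same point from the Hive side: every file
directly under a table's directory is read as part of one table, so a
query cannot tell one 5 GB file from ten 500 MB files. A sketch with a
hypothetical path and schema:

    -- external table over an existing HDFS directory; all part files
    -- in that directory together make up the table
    CREATE EXTERNAL TABLE orders_ext (id BIGINT, customer STRING,
        amount DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/todd/orders';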

Re: 1 big file or multiple smaller files for loading data from a database?

Posted by Todd Lee <ro...@gmail.com>.
Thanks. But is it going to create one big file in HDFS? I am currently
considering writing my own Cascading job for this.

thx,
T
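
On the one-big-file question: Sqoop writes one output file per map
task, so the number of part files follows the -m/--num-mappers setting
rather than coming out as a single file. A hedged sketch (host, user,
and table are hypothetical):

    # with -m 8, expect roughly 8 part files in the target directory
    sqoop import --connect jdbc:oracle:thin:@//db.example.com:1521/ORCL \
      --username scott -P --table ORDERS -m 8
    # by default the files land under your HDFS home directory
    hadoop fs -ls ORDERS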

On Wed, Jul 7, 2010 at 6:06 PM, Sarah Sproehnle <sa...@cloudera.com> wrote:

> Hi Todd,
>
> Are you planning to use Sqoop to do this import?  If not, you should.
> :)  It will do a parallel import, using MapReduce, to load the table
> into Hadoop.  With the --hive-import option, it will also create the
> Hive table definition.
>
> Cheers,
> Sarah
>
> On Wed, Jul 7, 2010 at 5:51 PM, Todd Lee <ro...@gmail.com> wrote:
> > Hi,
> > I am new to Hive and Hadoop in general. I have a table in Oracle with
> > millions of rows, and I'd like to export it into HDFS so that I can
> > run some Hive queries. My first question: is it recommended to export
> > the entire table as a single file (possibly 5 GB), or as multiple
> > smaller files (say, 10 files of 500 MB each)? Also, does it matter if
> > I put the files under different sub-directories before I do the data
> > load in Hive, or does everything have to be under the same folder?
> > Thanks,
> > T
> > p.s. I am sorry if this post is submitted twice.
>
>
>
> --
> Sarah Sproehnle
> Educational Services
> Cloudera, Inc
> http://www.cloudera.com/training
>

Re: 1 big file or multiple smaller files for loading data from a database?

Posted by Sarah Sproehnle <sa...@cloudera.com>.
Hi Todd,

Are you planning to use Sqoop to do this import?  If not, you should.
:)  It will do a parallel import, using MapReduce, to load the table
into Hadoop.  With the --hive-import option, it will also create the
Hive table definition.

Cheers,
Sarah
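
A minimal sketch of such an import, with a hypothetical Oracle host,
user, and split column (adjust all of them to the real database):

    # parallel import straight into a Hive table; --num-mappers also
    # determines how many part files are written
    sqoop import \
      --connect jdbc:oracle:thin:@//db.example.com:1521/ORCL \
      --username scott -P \
      --table ORDERS \
      --split-by ORDER_ID \
      --hive-import \
      --num-mappers 10

Note that --split-by is only needed when the table lacks a primary key
for Sqoop to partition on.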

On Wed, Jul 7, 2010 at 5:51 PM, Todd Lee <ro...@gmail.com> wrote:
> Hi,
> I am new to Hive and Hadoop in general. I have a table in Oracle with
> millions of rows, and I'd like to export it into HDFS so that I can
> run some Hive queries. My first question: is it recommended to export
> the entire table as a single file (possibly 5 GB), or as multiple
> smaller files (say, 10 files of 500 MB each)? Also, does it matter if
> I put the files under different sub-directories before I do the data
> load in Hive, or does everything have to be under the same folder?
> Thanks,
> T
> p.s. I am sorry if this post is submitted twice.



-- 
Sarah Sproehnle
Educational Services
Cloudera, Inc
http://www.cloudera.com/training