Posted to common-user@hadoop.apache.org by edward choi <mp...@gmail.com> on 2011/03/21 02:39:14 UTC

Inserting many small files into HBase

Hi,

I'm planning to crawl thousands of news RSS feeds via MapReduce and save
each news article directly into HBase.

My concern is that Hadoop does not handle large numbers of small files
well. If I insert every single news article (each of which is obviously
small) into HBase, without separately storing it in HDFS, I might end up
with millions of files that are only a few kilobytes each.

Or does HBase automatically append each news article to a single file, so
that I would end up with only a few large files?

Ed

Re: Inserting many small files into HBase

Posted by Ted Dunning <td...@maprtech.com>.
Take a look at this:

http://wiki.apache.org/hadoop/Hbase/DesignOverview

Then read the Bigtable paper.
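The short version: HBase does not create one HDFS file per row. Writes are
buffered in a region server's MemStore and flushed into large, sorted
StoreFiles (HFiles) on HDFS, which compactions later merge, so millions of
small articles end up packed into a few large files rather than millions
of tiny ones.

Here is a minimal sketch of a single-article insert with the HBase Java
client (the table name "articles", the column family "content", and the
row key scheme are made up for illustration; the table must already
exist):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ArticleWriter {
  public static void main(String[] args) throws Exception {
    // Picks up hbase-site.xml from the classpath.
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "articles");
    try {
      // Row key: feed id plus timestamp keeps one feed's articles adjacent.
      Put put = new Put(Bytes.toBytes("feed42-20110321-0001"));
      // Each article is just a cell value; HBase packs many cells into
      // large HFiles, so no per-article file is created on HDFS.
      put.add(Bytes.toBytes("content"), Bytes.toBytes("body"),
              Bytes.toBytes("<article text>"));
      table.put(put);
    } finally {
      table.close(); // flushes any buffered writes
    }
  }
}

Very large values (many megabytes per cell) are a different story, but
kilobyte-sized articles are exactly the kind of payload HBase is built
for.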

On Sun, Mar 20, 2011 at 6:39 PM, edward choi <mp...@gmail.com> wrote:

> Hi,
>
> I'm planning to crawl thousands of news RSS feeds via MapReduce and save
> each news article directly into HBase.
>
> My concern is that Hadoop does not handle large numbers of small files
> well. If I insert every single news article (each of which is obviously
> small) into HBase, without separately storing it in HDFS, I might end up
> with millions of files that are only a few kilobytes each.
>
> Or does HBase automatically append each news article to a single file, so
> that I would end up with only a few large files?
>
> Ed
>
