Posted to user@hbase.apache.org by Daniel Jeliński <dj...@gmail.com> on 2017/04/03 09:23:46 UTC

Re: HBase as a file repository

Hi Vlad,
That looks a lot like what MOBs do today. While it could work, it seems
overly complicated compared to implementing a custom HBase client for what
HBase already offers.
Thanks,
Daniel

2017-03-31 19:25 GMT+02:00 Vladimir Rodionov <vl...@gmail.com>:

> Use HBase as a file system meta storage (index), keep files in a large
> blobs on hdfs, have periodic compaction/cleaning M/R job
> to purge deleted files. You can even keep multiple versions of files.
>
> -Vlad
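[Editor's note: Vlad's scheme above (HBase as the metadata index, file bytes packed into large HDFS blobs) can be sketched in a few lines. This is a toy illustration in Python: a plain dict stands in for the HBase index row and a bytes object for the HDFS blob; none of the names below are real HBase or HDFS APIs.]

```python
# Sketch of the index-plus-blob layout: many small files are concatenated
# into one large blob, and only (offset, length) metadata would live in
# HBase, keyed by filename. A periodic compaction job would rewrite blobs
# to drop deleted files.

def pack_files(files):
    """Concatenate file contents into one blob; return (blob, index).

    `index` maps filename -> (offset, length), i.e. what you would store
    as columns in an HBase row, while the blob itself goes to HDFS."""
    blob = bytearray()
    index = {}
    for name, data in files.items():
        index[name] = (len(blob), len(data))
        blob.extend(data)
    return bytes(blob), index

def read_file(blob, index, name):
    """Random-access read of one file out of the blob via its index entry."""
    offset, length = index[name]
    return blob[offset:offset + length]
```

Reading a file then costs one index lookup plus one positioned read on the blob, which is what makes multiple versions cheap to keep: old entries simply point at older blob ranges until compaction removes them.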
>
> On Thu, Mar 30, 2017 at 11:22 PM, Jingcheng Du <du...@gmail.com> wrote:
>
> > Hi Daniel,
> >
> > I think it is because of the memory burden on both clients and servers.
> > A large row requires a correspondingly large HFile block, which strains
> > the block cache if the data block is cached. And when scanning, both
> > region servers and clients need a lot of memory to buffer such rows.
> > As you know, HBase uses the memstore to store data before flushing it to
> > disk. Under a heavy write load, large rows lead to more flushes and
> > compactions than small ones do.
> >
> > Actually there is no hard limit in the code on the data size; you can
> > store data larger than 10MB. You can try it and see if it works for you.
> >
> > Regards,
> > Jingcheng
> >
> > 2017-03-31 12:25 GMT+08:00 Daniel Jeliński <dj...@gmail.com>:
> >
> > > Thank you Ted for your response.
> > >
> > > I have read that part of the HBase book. It never explained why objects
> > > over 10MB are no good, and did not suggest an alternative storage
> > > medium for them.
> > >
> > > I have also read this:
> > > http://hbase.apache.org/book.html#regionserver_sizing_rules_of_thumb
> > > And yet I'm trying to put 36TB on a machine. I certainly hope that the
> > > number of region servers is the only real limiter to this.
> > >
> > > I tried putting files larger than 1MB on HDFS, which has a streaming
> > > API. Datanodes started complaining about too large a number of blocks;
> > > they seem to tolerate up to 500k blocks, which means that the average
> > > block size has to be around 72MB to fully utilize the cluster and avoid
> > > datanode complaints.
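[Editor's note: the arithmetic behind the ~72MB figure checks out, assuming decimal TB/MB as in the thread; a quick sketch:]

```python
# Sanity check of the ~72MB figure: a datanode holding 36TB that tolerates
# about 500k blocks needs an average block size of roughly 36TB / 500,000
# to fill its disks without tripping the block-count warning. Decimal
# TB/MB are assumed here, matching the figure quoted in the thread; with
# binary units the number comes out closer to 75MB.

NODE_CAPACITY_BYTES = 36 * 10**12   # 36TB per datanode
MAX_BLOCKS_PER_NODE = 500_000       # observed datanode complaint threshold

avg_block_bytes = NODE_CAPACITY_BYTES // MAX_BLOCKS_PER_NODE
avg_block_mb = avg_block_bytes // 10**6
print(avg_block_mb)  # 72
```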
> > >
> > > On the surface it seems that I should conclude that HBase/HDFS is no
> > > good as a NAS replacement and move on. But then, the HBase API
> > > currently seems to be the only thing getting in my way.
> > >
> > > I checked async HBase projects, but apparently they're focused on
> > > running requests in the background rather than returning results
> > > earlier. Googling "HBase streaming" returns only references to Spark.
> > >
> > > HBase JIRA has a few apparently related issues:
> > > https://issues.apache.org/jira/browse/HBASE-17721 is pretty fresh with
> > > no development yet, and https://issues.apache.org/jira/browse/HBASE-13467
> > > seems to have died already.
> > >
> > > I captured the network traffic between the client and the region server
> > > when I requested one cell, and writing a custom client seems easy
> > > enough. Are there any reasons other than the API that justify the 10MB
> > > limit on MOBs?
> > > Thanks,
> > > Daniel
> > >
> > >
> > >
> > > 2017-03-31 0:03 GMT+02:00 Ted Yu <yu...@gmail.com>:
> > >
> > > > Have you read:
> > > > http://hbase.apache.org/book.html#hbase_mob
> > > >
> > > > In particular:
> > > >
> > > > When using MOBs, ideally your objects will be between 100KB and 10MB
> > > >
> > > > Cheers
> > > >
> > > > On Thu, Mar 30, 2017 at 1:01 PM, Daniel Jeliński <djelinski1@gmail.com>
> > > > wrote:
> > > >
> > > > > Hello,
> > > > > I'm evaluating HBase as a cheaper replacement for NAS as a file
> > > > > storage medium. To that end I have a cluster of 5 machines, 36TB of
> > > > > HDD each; I'm planning to initially store ~240 million files of size
> > > > > 1KB-100MB, total size 30TB. Currently I'm storing each file under an
> > > > > individual column, and I group related documents in the same row.
> > > > > The files from the same row will be served one at a time, but
> > > > > updated/deleted together.
> > > > >
> > > > > Loading the data into the cluster went pretty well; I enabled MOB on
> > > > > the table and have ~50 regions per machine. Writes to the table are
> > > > > done by an automated process, and the cluster's performance in that
> > > > > area is more than sufficient. On the other hand, reads are
> > > > > interactive, as the files are served to human users over HTTP.
> > > > >
> > > > > Now, HBase's Get in the Java API is an atomic operation in the sense
> > > > > that it does not complete until all data is retrieved from the
> > > > > server. It takes 100 ms to retrieve a 1MB cell (file), and only
> > > > > after retrieving it am I able to start serving it to the end user.
> > > > > For larger cells the wait time is even longer, and response times
> > > > > longer than 100 ms are bad for user experience. I would like to
> > > > > start streaming the file over HTTP as soon as possible.
> > > > >
> > > > > What's the recommended approach to avoid or reduce the delay between
> > > > > when HBase starts sending the response and when the application can
> > > > > act on it?
> > > > > Thanks,
> > > > > Daniel
> > > > >
> > > >
> > >
> >
>
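[Editor's note: one common workaround for the whole-cell Get latency Daniel describes, often used when building the kind of custom client he mentions, is to store each file as fixed-size chunk cells and fetch them one Get at a time, so the HTTP response can start after the first chunk arrives. A minimal sketch in Python, with a plain dict standing in for the HBase table; the "chunk:NNNNN" column naming is an illustrative assumption, not an HBase convention.]

```python
# Store a file as fixed-size chunks in separate cells (e.g. columns
# "chunk:00000", "chunk:00001", ...) and stream them to the HTTP client
# one fetch at a time. `store` is a plain dict keyed by (row, column),
# standing in for the HBase table.

CHUNK_SIZE = 1 * 1024 * 1024  # 1MB chunks; tune against Get latency

def write_chunks(store, row, data, chunk_size=CHUNK_SIZE):
    """Split `data` into chunk cells under `row`."""
    for i in range(0, max(len(data), 1), chunk_size):
        store[(row, f"chunk:{i // chunk_size:05d}")] = data[i:i + chunk_size]

def stream_chunks(store, row):
    """Yield chunks in order; the caller can start sending the HTTP
    response as soon as the first chunk arrives, instead of waiting
    for the entire file."""
    i = 0
    while (row, f"chunk:{i:05d}") in store:
        yield store[(row, f"chunk:{i:05d}")]
        i += 1
```

With 1MB chunks and the ~100 ms per-MB Get latency quoted above, time-to-first-byte stays roughly constant regardless of file size, at the cost of one round trip per chunk (which pipelined or batched Gets can hide).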

Re: HBase as a file repository

Posted by Ted Yu <yu...@gmail.com>.
MOB code has been evolving. For example, it has moved away from using
mapreduce jobs, making the MOB feature self-contained.

Cheers

On Mon, Apr 3, 2017 at 8:43 AM, Vladimir Rodionov <vl...@gmail.com>
wrote:

> >> That looks a lot like what MOBs do today.
>
> Not very familiar with MOB code, but
> I think there is a reason why MOB is recommended for objects less than 10MB
>
> -Vlad

Re: HBase as a file repository

Posted by Vladimir Rodionov <vl...@gmail.com>.
>> That looks a lot like what MOBs do today.

I'm not very familiar with the MOB code, but I think there is a reason why
MOB is recommended for objects smaller than 10MB.

-Vlad

On Mon, Apr 3, 2017 at 2:23 AM, Daniel Jeliński <dj...@gmail.com>
wrote:

> Hi Vlad,
> That looks a lot like what MOBs do today. While it could work, it seems
> overly complicated compared to implementing a custom HBase client for what
> HBase already offers.
> Thanks,
> Daniel