You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hbase.apache.org by Aaron Beppu <ab...@siftscience.com> on 2014/12/18 03:44:45 UTC

Efficient use of buffered writes in a post-HTablePool world?

Hi All,

TLDR; in the absence of HTablePool, if HTable instances are short-lived,
how should clients use buffered writes?

I’m working on migrating a codebase from using 0.94.6 (CDH4.4) to 0.98.6
(CDH5.2). One issue I’m confused by is how to effectively use buffered
writes now that HTablePool has been deprecated[1].

In our 0.94 code, a pathway could get a table from the pool, configure it
with table.setAutoFlush(false); and write Puts to it. Those writes would
then go to the table instance’s writeBuffer, and those writes would only be
flushed when the buffer was full, or when we were ready to close out the
pool. We were intentionally choosing to have fewer, larger writes from the
client to the cluster, and we knew we were giving up a degree of safety in
exchange (i.e. if the client dies after it’s accepted a write but before
the flush for that write occurs, the data is lost). This seems to be a
generally considered a reasonable choice (cf the HBase Book [2] SS 14.8.4)

However in the 0.98 world, without HTablePool, the endorsed pattern [3]
seems to be to create a new HTable via table =
stashedHConnection.getTable(tableName, myExecutorService). However, even if
we do table.setAutoFlush(false), because that table instance is
short-lived, its buffer never gets full. We’ll create a table instance,
write a put to it, try to close the table, and the close call will trigger
a (synchronous) flush. Thus, not having HTablePool seems like it would
cause us to have many more small writes from the client to the cluster, and
basically wipe out the advantage of turning off autoflush.

More concretely :

// Given these two helpers ...

private HTableInterface getAutoFlushTable(String tableName) throws IOException {
  // (autoflush is true by default)
  return storedConnection.getTable(tableName, executorService);
}

private HTableInterface getBufferedTable(String tableName) throws IOException {
  HTableInterface table = getAutoFlushTable(tableName);
  table.setAutoFlush(false);
  return table;
}

// it's my contention that these two methods would behave almost identically,
// except the first will hit a synchronous flush during the put call,
and the second will
// flush during the (hidden) close call on table.

private void writeAutoFlushed(Put somePut) throws IOException {
  try (HTableInterface table = getAutoFlushTable(tableName)) {
    table.put(somePut); // will do synchronous flush
  }
}

private void writeBuffered(Put somePut) throws IOException {
  try (HTableInterface table = getBufferedTable(tableName)) {
    table.put(somePut);
  } // auto-close will trigger synchronous flush
}

It seems like the only way to avoid this is to have long-lived HTable
instances, which get reused for multiple writes. However, since the actual
writes are driven from highly concurrent code, and since HTable is not
threadsafe, this would involve having a number of HTable instances, and a
control mechanism for leasing them out to individual threads safely. Except
at this point it seems like we will have recreated HTablePool, which
suggests that we’re doing something deeply wrong.

What am I missing here? Since the HTableInterface.setAutoFlush method still
exists, it must be anticipated that users will still want to buffer writes.
What’s the recommended way to actually buffer a meaningful number of
writes, from a multithreaded context, that doesn’t just amount to creating
a table pool?

Thanks in advance,
Aaron

[1] https://issues.apache.org/jira/browse/HBASE-6580
[2] http://hbase.apache.org/book/perf.writing.html
[3]
https://issues.apache.org/jira/browse/HBASE-6580?focusedCommentId=13501302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13501302

Re: Efficient use of buffered writes in a post-HTablePool world?

Posted by Solomon Duskis <sd...@gmail.com>.

@Devaraja,

Would you mind posting that on
https://issues.apache.org/jira/browse/HBASE-12728?  The HBase group is
talking about this topic on that JIRA issue.

Thanks,

-Solomon

On Wed, Dec 24, 2014 at 9:40 PM, Devaraja Swami <de...@gmail.com>
wrote:
>
> Would like to add my perspective as a user. (Thanks to Aaron Beppu for
> uncovering this hidden issue). In my applications, I have some tables for
> which I need autoflushing, and others for which I need a write buffer. Plus
> the size of the write buffer is different for different tables.
> All these seem to imply that the HBase client side will need to maintain
> and operate write buffers on a per-table basis, whether or not the
> ephemeral Table/HTableInterface instances come and go (ie., are closed).
> The question then, as Nick points out, is what entity is responsible for
> flushing the buffers. By elimination, my feeling is that this would end up
> being the Connection instance.
>
> On Fri, Dec 19, 2014 at 5:55 PM, Nick Dimiduk <nd...@gmail.com> wrote:
>
> > Could be in an API-compatible way, though semantics would change, which
> is
> > probably worse. Table keeps these methods. When setAutoFlush is used,
> write
> > buffer managed by connection is created. If multiple Table instances for
> > the same table setWriteBufferSize(), perhaps the largest value wins.
> Writes
> > across these instances all hit the same buffer. What's not clear here is
> > who owns the ExecutorService(s) that handles flushing the buffer.
> >
> > My original thought was make this a blocker of 1.0, but we've shipped
> 0.96
> > and 0.98 this way, so we have to keep API and semantics around for
> backward
> > compatibility anyway. Doesn't mean we can't so the new API better though.
> > HTablePool is still in 1.0, so this would be thinking ahead to the fancy
> > new Table-based API. If we drop these two methods from Table, we can ship
> > with a feature gap between old and new API, resolve this in 1.1. Folks
> who
> > need this kind of pooling can continue to use HTablePool with HTables.
> >
> > On Friday, December 19, 2014, Solomon Duskis <sd...@gmail.com> wrote:
> >
> > > My first thought based on this discussion was that it would require
> > moving
> > > some methods (setAutoFlush() and setWriteBufferSize()) from Table to
> > > Connection.  That would be a breaking API change.
> > >
> > > -Solomon
> > >
> > > On Fri, Dec 19, 2014 at 3:04 PM, Andrew Purtell <apurtell@apache.org
> > > <javascript:;>> wrote:
> > > >
> > > > I think it would be critical if we're contemplating something that
> > > requires
> > > > a breaking API change? Do we have that here? I'm not sure.
> > > >
> > > > On Fri, Dec 19, 2014 at 12:02 PM, Solomon Duskis <sduskis@gmail.com
> > > <javascript:;>>
> > > > wrote:
> > > > >
> > > > > Is this critical to sort out before 1.0, or is fixing this a
> post-1.0
> > > > > enhancement?
> > > > >
> > > > > -Solomon
> > > > >
> > > > > On Fri, Dec 19, 2014 at 2:19 PM, Andrew Purtell <
> apurtell@apache.org
> > > <javascript:;>>
> > > > > wrote:
> > > > > >
> > > > > > I don't like the dropped writes either. Just pointing out what we
> > > have
> > > > > now.
> > > > > > There is a gap no doubt.
> > > > > >
> > > > > > On Fri, Dec 19, 2014 at 11:16 AM, Nick Dimiduk <
> > ndimiduk@apache.org
> > > <javascript:;>>
> > > > > > wrote:
> > > > > > >
> > > > > > > Thanks for the reminder about the Multiplexer, Andrew. It
> sort-of
> > > > > solves
> > > > > > > this problem, but think it's semantics of dropping writes are
> not
> > > > > > desirable
> > > > > > > in the general case. Further, my understanding was that the new
> > > > > > connection
> > > > > > > implementation is designed to handle this kind of use-case
> (hence
> > > > > cc'ing
> > > > > > > Lars).
> > > > > > >
> > > > > > > On Fri, Dec 19, 2014 at 11:02 AM, Andrew Purtell <
> > > > apurtell@apache.org <javascript:;>>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Aaron: Please post a copy of that feedback on the JIRA,
> pretty
> > > sure
> > > > > we
> > > > > > > will
> > > > > > > > be having an improvement discussion there.
> > > > > > > >
> > > > > > > > On Fri, Dec 19, 2014 at 10:58 AM, Aaron Beppu <
> > > > > abeppu@siftscience.com <javascript:;>>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Nick : Thanks, I've created an issue [1].
> > > > > > > > >
> > > > > > > > > Pradeep : Yes, I have considered using that. However for
> the
> > > > > moment,
> > > > > > > > we've
> > > > > > > > > set it out of scope, since our migration from 0.94 -> 0.98
> is
> > > > > > already a
> > > > > > > > bit
> > > > > > > > > complicated, and we hoped to separate isolate these changes
> > by
> > > > not
> > > > > > > moving
> > > > > > > > > to the async client until after the current migration is
> > > > complete.
> > > > > > > > >
> > > > > > > > > Andrew : HTableMultiplexer does seem like it would solve
> our
> > > > > buffered
> > > > > > > > write
> > > > > > > > > problem, albeit in an awkward way -- thanks! It kind of
> seems
> > > > like
> > > > > > > HTable
> > > > > > > > > should then (if autoFlush == false) send writes to the
> > > > multiplexer,
> > > > > > > > rather
> > > > > > > > > than setting it in its own, short-lived writeBuffer. If
> > nothing
> > > > > else,
> > > > > > > > it's
> > > > > > > > > still super confusing that HTableInterface exposes
> > > setAutoFlush()
> > > > > and
> > > > > > > > > setWriteBufferSize(), given that the writeBuffer won't
> > > > meaningfully
> > > > > > > > buffer
> > > > > > > > > anything if all tables are short-lived.
> > > > > > > > >
> > > > > > > > > [1] https://issues.apache.org/jira/browse/HBASE-12728
> > > > > > > > >
> > > > > > > > > On Fri, Dec 19, 2014 at 10:31 AM, Andrew Purtell <
> > > > > > apurtell@apache.org <javascript:;>>
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > I believe HTableMultiplexer[1] is meant to stand in for
> > > > > HTablePool
> > > > > > > for
> > > > > > > > > > buffered writing. FWIW, I've not used it.
> > > > > > > > > >
> > > > > > > > > > 1:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableMultiplexer.html
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Fri, Dec 19, 2014 at 9:00 AM, Nick Dimiduk <
> > > > > ndimiduk@apache.org <javascript:;>
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi Aaron,
> > > > > > > > > > >
> > > > > > > > > > > Your analysis is spot on and I do not believe this is
> by
> > > > > design.
> > > > > > I
> > > > > > > > see
> > > > > > > > > > the
> > > > > > > > > > > write buffer is owned by the table, while I would have
> > > > expected
> > > > > > > there
> > > > > > > > > to
> > > > > > > > > > be
> > > > > > > > > > > a buffer per table all managed by the connection. I
> > suggest
> > > > you
> > > > > > > > raise a
> > > > > > > > > > > blocker ticket vs the 1.0.0 release that's just around
> > the
> > > > > corner
> > > > > > > to
> > > > > > > > > give
> > > > > > > > > > > this the attention it needs. Let me know if you're not
> > into
> > > > > > JIRA, I
> > > > > > > > can
> > > > > > > > > > > raise one on your behalf.
> > > > > > > > > > >
> > > > > > > > > > > cc Lars, Enis.
> > > > > > > > > > >
> > > > > > > > > > > Nice work Aaron.
> > > > > > > > > > > -n
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Dec 17, 2014 at 6:44 PM, Aaron Beppu <
> > > > > > > abeppu@siftscience.com <javascript:;>
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Hi All,
> > > > > > > > > > > >
> > > > > > > > > > > > TLDR; in the absence of HTablePool, if HTable
> instances
> > > are
> > > > > > > > > > short-lived,
> > > > > > > > > > > > how should clients use buffered writes?
> > > > > > > > > > > >
> > > > > > > > > > > > I’m working on migrating a codebase from using 0.94.6
> > > > > (CDH4.4)
> > > > > > to
> > > > > > > > > > 0.98.6
> > > > > > > > > > > > (CDH5.2). One issue I’m confused by is how to
> > effectively
> > > > use
> > > > > > > > > buffered
> > > > > > > > > > > > writes now that HTablePool has been deprecated[1].
> > > > > > > > > > > >
> > > > > > > > > > > > In our 0.94 code, a pathway could get a table from
> the
> > > > pool,
> > > > > > > > > configure
> > > > > > > > > > it
> > > > > > > > > > > > with table.setAutoFlush(false); and write Puts to it.
> > > Those
> > > > > > > writes
> > > > > > > > > > would
> > > > > > > > > > > > then go to the table instance’s writeBuffer, and
> those
> > > > writes
> > > > > > > would
> > > > > > > > > > only
> > > > > > > > > > > be
> > > > > > > > > > > > flushed when the buffer was full, or when we were
> ready
> > > to
> > > > > > close
> > > > > > > > out
> > > > > > > > > > the
> > > > > > > > > > > > pool. We were intentionally choosing to have fewer,
> > > larger
> > > > > > writes
> > > > > > > > > from
> > > > > > > > > > > the
> > > > > > > > > > > > client to the cluster, and we knew we were giving up
> a
> > > > degree
> > > > > > of
> > > > > > > > > safety
> > > > > > > > > > > in
> > > > > > > > > > > > exchange (i.e. if the client dies after it’s
> accepted a
> > > > write
> > > > > > but
> > > > > > > > > > before
> > > > > > > > > > > > the flush for that write occurs, the data is lost).
> > This
> > > > > seems
> > > > > > to
> > > > > > > > be
> > > > > > > > > a
> > > > > > > > > > > > generally considered a reasonable choice (cf the
> HBase
> > > Book
> > > > > [2]
> > > > > > > SS
> > > > > > > > > > > 14.8.4)
> > > > > > > > > > > >
> > > > > > > > > > > > However in the 0.98 world, without HTablePool, the
> > > endorsed
> > > > > > > pattern
> > > > > > > > > [3]
> > > > > > > > > > > > seems to be to create a new HTable via table =
> > > > > > > > > > > > stashedHConnection.getTable(tableName,
> > > myExecutorService).
> > > > > > > However,
> > > > > > > > > > even
> > > > > > > > > > > if
> > > > > > > > > > > > we do table.setAutoFlush(false), because that table
> > > > instance
> > > > > is
> > > > > > > > > > > > short-lived, its buffer never gets full. We’ll
> create a
> > > > table
> > > > > > > > > instance,
> > > > > > > > > > > > write a put to it, try to close the table, and the
> > close
> > > > call
> > > > > > > will
> > > > > > > > > > > trigger
> > > > > > > > > > > > a (synchronous) flush. Thus, not having HTablePool
> > seems
> > > > like
> > > > > > it
> > > > > > > > > would
> > > > > > > > > > > > cause us to have many more small writes from the
> client
> > > to
> > > > > the
> > > > > > > > > cluster,
> > > > > > > > > > > and
> > > > > > > > > > > > basically wipe out the advantage of turning off
> > > autoflush.
> > > > > > > > > > > >
> > > > > > > > > > > > More concretely :
> > > > > > > > > > > >
> > > > > > > > > > > > // Given these two helpers ...
> > > > > > > > > > > >
> > > > > > > > > > > > private HTableInterface getAutoFlushTable(String
> > > tableName)
> > > > > > > throws
> > > > > > > > > > > > IOException {
> > > > > > > > > > > >   // (autoflush is true by default)
> > > > > > > > > > > >   return storedConnection.getTable(tableName,
> > > > > executorService);
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > > > private HTableInterface getBufferedTable(String
> > > tableName)
> > > > > > throws
> > > > > > > > > > > > IOException {
> > > > > > > > > > > >   HTableInterface table =
> getAutoFlushTable(tableName);
> > > > > > > > > > > >   table.setAutoFlush(false);
> > > > > > > > > > > >   return table;
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > > > // it's my contention that these two methods would
> > behave
> > > > > > almost
> > > > > > > > > > > > identically,
> > > > > > > > > > > > // except the first will hit a synchronous flush
> during
> > > the
> > > > > put
> > > > > > > > call,
> > > > > > > > > > > > and the second will
> > > > > > > > > > > > // flush during the (hidden) close call on table.
> > > > > > > > > > > >
> > > > > > > > > > > > private void writeAutoFlushed(Put somePut) throws
> > > > > IOException {
> > > > > > > > > > > >   try (HTableInterface table =
> > > > getAutoFlushTable(tableName))
> > > > > {
> > > > > > > > > > > >     table.put(somePut); // will do synchronous flush
> > > > > > > > > > > >   }
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > > > private void writeBuffered(Put somePut) throws
> > > IOException
> > > > {
> > > > > > > > > > > >   try (HTableInterface table =
> > > > getBufferedTable(tableName)) {
> > > > > > > > > > > >     table.put(somePut);
> > > > > > > > > > > >   } // auto-close will trigger synchronous flush
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > > > It seems like the only way to avoid this is to have
> > > > > long-lived
> > > > > > > > HTable
> > > > > > > > > > > > instances, which get reused for multiple writes.
> > However,
> > > > > since
> > > > > > > the
> > > > > > > > > > > actual
> > > > > > > > > > > > writes are driven from highly concurrent code, and
> > since
> > > > > HTable
> > > > > > > is
> > > > > > > > > not
> > > > > > > > > > > > threadsafe, this would involve having a number of
> > HTable
> > > > > > > instances,
> > > > > > > > > > and a
> > > > > > > > > > > > control mechanism for leasing them out to individual
> > > > threads
> > > > > > > > safely.
> > > > > > > > > > > Except
> > > > > > > > > > > > at this point it seems like we will have recreated
> > > > > HTablePool,
> > > > > > > > which
> > > > > > > > > > > > suggests that we’re doing something deeply wrong.
> > > > > > > > > > > >
> > > > > > > > > > > > What am I missing here? Since the
> > > > > HTableInterface.setAutoFlush
> > > > > > > > method
> > > > > > > > > > > still
> > > > > > > > > > > > exists, it must be anticipated that users will still
> > want
> > > > to
> > > > > > > buffer
> > > > > > > > > > > writes.
> > > > > > > > > > > > What’s the recommended way to actually buffer a
> > > meaningful
> > > > > > number
> > > > > > > > of
> > > > > > > > > > > > writes, from a multithreaded context, that doesn’t
> just
> > > > > amount
> > > > > > to
> > > > > > > > > > > creating
> > > > > > > > > > > > a table pool?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks in advance,
> > > > > > > > > > > > Aaron
> > > > > > > > > > > >
> > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/HBASE-6580
> > > > > > > > > > > > [2] http://hbase.apache.org/book/perf.writing.html
> > > > > > > > > > > > [3]
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/browse/HBASE-6580?focusedCommentId=13501302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13501302
> > > > > > > > > > > > 
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Best regards,
> > > > > > > > > >
> > > > > > > > > >    - Andy
> > > > > > > > > >
> > > > > > > > > > Problems worthy of attack prove their worth by hitting
> > back.
> > > -
> > > > > Piet
> > > > > > > > Hein
> > > > > > > > > > (via Tom White)
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best regards,
> > > > > > > >
> > > > > > > >    - Andy
> > > > > > > >
> > > > > > > > Problems worthy of attack prove their worth by hitting back.
> -
> > > Piet
> > > > > > Hein
> > > > > > > > (via Tom White)
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > >
> > > > > >    - Andy
> > > > > >
> > > > > > Problems worthy of attack prove their worth by hitting back. -
> Piet
> > > > Hein
> > > > > > (via Tom White)
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > >
> > > >    - Andy
> > > >
> > > > Problems worthy of attack prove their worth by hitting back. - Piet
> > Hein
> > > > (via Tom White)
> > > >
> > >
> >
>

Re: Efficient use of buffered writes in a post-HTablePool world?

Posted by Devaraja Swami <de...@gmail.com>.

Would like to add my perspective as a user. (Thanks to Aaron Beppu for
uncovering this hidden issue). In my applications, I have some tables for
which I need autoflushing, and others for which I need a write buffer. Plus
the size of the write buffer is different for different tables.
All these seem to imply that the HBase client side will need to maintain
and operate write buffers on a per-table basis, whether or not the
ephemeral Table/HTableInterface instances come and go (ie., are closed).
The question then, as Nick points out, is what entity is responsible for
flushing the buffers. By elimination, my feeling is that this would end up
being the Connection instance.

On Fri, Dec 19, 2014 at 5:55 PM, Nick Dimiduk <nd...@gmail.com> wrote:

> Could be in an API-compatible way, though semantics would change, which is
> probably worse. Table keeps these methods. When setAutoFlush is used, write
> buffer managed by connection is created. If multiple Table instances for
> the same table setWriteBufferSize(), perhaps the largest value wins. Writes
> across these instances all hit the same buffer. What's not clear here is
> who owns the ExecutorService(s) that handles flushing the buffer.
>
> My original thought was make this a blocker of 1.0, but we've shipped 0.96
> and 0.98 this way, so we have to keep API and semantics around for backward
> compatibility anyway. Doesn't mean we can't so the new API better though.
> HTablePool is still in 1.0, so this would be thinking ahead to the fancy
> new Table-based API. If we drop these two methods from Table, we can ship
> with a feature gap between old and new API, resolve this in 1.1. Folks who
> need this kind of pooling can continue to use HTablePool with HTables.
>
> On Friday, December 19, 2014, Solomon Duskis <sd...@gmail.com> wrote:
>
> > My first thought based on this discussion was that it would require
> moving
> > some methods (setAutoFlush() and setWriteBufferSize()) from Table to
> > Connection.  That would be a breaking API change.
> >
> > -Solomon
> >
> > On Fri, Dec 19, 2014 at 3:04 PM, Andrew Purtell <apurtell@apache.org
> > <javascript:;>> wrote:
> > >
> > > I think it would be critical if we're contemplating something that
> > requires
> > > a breaking API change? Do we have that here? I'm not sure.
> > >
> > > On Fri, Dec 19, 2014 at 12:02 PM, Solomon Duskis <sduskis@gmail.com
> > <javascript:;>>
> > > wrote:
> > > >
> > > > Is this critical to sort out before 1.0, or is fixing this a post-1.0
> > > > enhancement?
> > > >
> > > > -Solomon
> > > >
> > > > On Fri, Dec 19, 2014 at 2:19 PM, Andrew Purtell <apurtell@apache.org
> > <javascript:;>>
> > > > wrote:
> > > > >
> > > > > I don't like the dropped writes either. Just pointing out what we
> > have
> > > > now.
> > > > > There is a gap no doubt.
> > > > >
> > > > > On Fri, Dec 19, 2014 at 11:16 AM, Nick Dimiduk <
> ndimiduk@apache.org
> > <javascript:;>>
> > > > > wrote:
> > > > > >
> > > > > > Thanks for the reminder about the Multiplexer, Andrew. It sort-of
> > > > solves
> > > > > > this problem, but think it's semantics of dropping writes are not
> > > > > desirable
> > > > > > in the general case. Further, my understanding was that the new
> > > > > connection
> > > > > > implementation is designed to handle this kind of use-case (hence
> > > > cc'ing
> > > > > > Lars).
> > > > > >
> > > > > > On Fri, Dec 19, 2014 at 11:02 AM, Andrew Purtell <
> > > apurtell@apache.org <javascript:;>>
> > > > > > wrote:
> > > > > > >
> > > > > > > Aaron: Please post a copy of that feedback on the JIRA, pretty
> > sure
> > > > we
> > > > > > will
> > > > > > > be having an improvement discussion there.
> > > > > > >
> > > > > > > On Fri, Dec 19, 2014 at 10:58 AM, Aaron Beppu <
> > > > abeppu@siftscience.com <javascript:;>>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Nick : Thanks, I've created an issue [1].
> > > > > > > >
> > > > > > > > Pradeep : Yes, I have considered using that. However for the
> > > > moment,
> > > > > > > we've
> > > > > > > > set it out of scope, since our migration from 0.94 -> 0.98 is
> > > > > already a
> > > > > > > bit
> > > > > > > > complicated, and we hoped to separate isolate these changes
> by
> > > not
> > > > > > moving
> > > > > > > > to the async client until after the current migration is
> > > complete.
> > > > > > > >
> > > > > > > > Andrew : HTableMultiplexer does seem like it would solve our
> > > > buffered
> > > > > > > write
> > > > > > > > problem, albeit in an awkward way -- thanks! It kind of seems
> > > like
> > > > > > HTable
> > > > > > > > should then (if autoFlush == false) send writes to the
> > > multiplexer,
> > > > > > > rather
> > > > > > > > than setting it in its own, short-lived writeBuffer. If
> nothing
> > > > else,
> > > > > > > it's
> > > > > > > > still super confusing that HTableInterface exposes
> > setAutoFlush()
> > > > and
> > > > > > > > setWriteBufferSize(), given that the writeBuffer won't
> > > meaningfully
> > > > > > > buffer
> > > > > > > > anything if all tables are short-lived.
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/HBASE-12728
> > > > > > > >
> > > > > > > > On Fri, Dec 19, 2014 at 10:31 AM, Andrew Purtell <
> > > > > apurtell@apache.org <javascript:;>>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > I believe HTableMultiplexer[1] is meant to stand in for
> > > > HTablePool
> > > > > > for
> > > > > > > > > buffered writing. FWIW, I've not used it.
> > > > > > > > >
> > > > > > > > > 1:
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableMultiplexer.html
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Fri, Dec 19, 2014 at 9:00 AM, Nick Dimiduk <
> > > > ndimiduk@apache.org <javascript:;>
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Aaron,
> > > > > > > > > >
> > > > > > > > > > Your analysis is spot on and I do not believe this is by
> > > > design.
> > > > > I
> > > > > > > see
> > > > > > > > > the
> > > > > > > > > > write buffer is owned by the table, while I would have
> > > expected
> > > > > > there
> > > > > > > > to
> > > > > > > > > be
> > > > > > > > > > a buffer per table all managed by the connection. I
> suggest
> > > you
> > > > > > > raise a
> > > > > > > > > > blocker ticket vs the 1.0.0 release that's just around
> the
> > > > corner
> > > > > > to
> > > > > > > > give
> > > > > > > > > > this the attention it needs. Let me know if you're not
> into
> > > > > JIRA, I
> > > > > > > can
> > > > > > > > > > raise one on your behalf.
> > > > > > > > > >
> > > > > > > > > > cc Lars, Enis.
> > > > > > > > > >
> > > > > > > > > > Nice work Aaron.
> > > > > > > > > > -n
> > > > > > > > > >
> > > > > > > > > > On Wed, Dec 17, 2014 at 6:44 PM, Aaron Beppu <
> > > > > > abeppu@siftscience.com <javascript:;>
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi All,
> > > > > > > > > > >
> > > > > > > > > > > TLDR; in the absence of HTablePool, if HTable instances
> > are
> > > > > > > > > short-lived,
> > > > > > > > > > > how should clients use buffered writes?
> > > > > > > > > > >
> > > > > > > > > > > I’m working on migrating a codebase from using 0.94.6
> > > > (CDH4.4)
> > > > > to
> > > > > > > > > 0.98.6
> > > > > > > > > > > (CDH5.2). One issue I’m confused by is how to
> effectively
> > > use
> > > > > > > > buffered
> > > > > > > > > > > writes now that HTablePool has been deprecated[1].
> > > > > > > > > > >
> > > > > > > > > > > In our 0.94 code, a pathway could get a table from the
> > > pool,
> > > > > > > > configure
> > > > > > > > > it
> > > > > > > > > > > with table.setAutoFlush(false); and write Puts to it.
> > Those
> > > > > > writes
> > > > > > > > > would
> > > > > > > > > > > then go to the table instance’s writeBuffer, and those
> > > writes
> > > > > > would
> > > > > > > > > only
> > > > > > > > > > be
> > > > > > > > > > > flushed when the buffer was full, or when we were ready
> > to
> > > > > close
> > > > > > > out
> > > > > > > > > the
> > > > > > > > > > > pool. We were intentionally choosing to have fewer,
> > larger
> > > > > writes
> > > > > > > > from
> > > > > > > > > > the
> > > > > > > > > > > client to the cluster, and we knew we were giving up a
> > > degree
> > > > > of
> > > > > > > > safety
> > > > > > > > > > in
> > > > > > > > > > > exchange (i.e. if the client dies after it’s accepted a
> > > write
> > > > > but
> > > > > > > > > before
> > > > > > > > > > > the flush for that write occurs, the data is lost).
> This
> > > > seems
> > > > > to
> > > > > > > be
> > > > > > > > a
> > > > > > > > > > > generally considered a reasonable choice (cf the HBase
> > Book
> > > > [2]
> > > > > > SS
> > > > > > > > > > 14.8.4)
> > > > > > > > > > >
> > > > > > > > > > > However in the 0.98 world, without HTablePool, the
> > endorsed
> > > > > > pattern
> > > > > > > > [3]
> > > > > > > > > > > seems to be to create a new HTable via table =
> > > > > > > > > > > stashedHConnection.getTable(tableName,
> > myExecutorService).
> > > > > > However,
> > > > > > > > > even
> > > > > > > > > > if
> > > > > > > > > > > we do table.setAutoFlush(false), because that table
> > > instance
> > > > is
> > > > > > > > > > > short-lived, its buffer never gets full. We’ll create a
> > > table
> > > > > > > > instance,
> > > > > > > > > > > write a put to it, try to close the table, and the
> close
> > > call
> > > > > > will
> > > > > > > > > > trigger
> > > > > > > > > > > a (synchronous) flush. Thus, not having HTablePool
> seems
> > > like
> > > > > it
> > > > > > > > would
> > > > > > > > > > > cause us to have many more small writes from the client
> > to
> > > > the
> > > > > > > > cluster,
> > > > > > > > > > and
> > > > > > > > > > > basically wipe out the advantage of turning off
> > autoflush.
> > > > > > > > > > >
> > > > > > > > > > > More concretely :
> > > > > > > > > > >
> > > > > > > > > > > // Given these two helpers ...
> > > > > > > > > > >
> > > > > > > > > > > private HTableInterface getAutoFlushTable(String
> > tableName)
> > > > > > throws
> > > > > > > > > > > IOException {
> > > > > > > > > > >   // (autoflush is true by default)
> > > > > > > > > > >   return storedConnection.getTable(tableName,
> > > > executorService);
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > > private HTableInterface getBufferedTable(String
> > tableName)
> > > > > throws
> > > > > > > > > > > IOException {
> > > > > > > > > > >   HTableInterface table = getAutoFlushTable(tableName);
> > > > > > > > > > >   table.setAutoFlush(false);
> > > > > > > > > > >   return table;
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > > // it's my contention that these two methods would
> behave
> > > > > almost
> > > > > > > > > > > identically,
> > > > > > > > > > > // except the first will hit a synchronous flush during
> > the
> > > > put
> > > > > > > call,
> > > > > > > > > > > and the second will
> > > > > > > > > > > // flush during the (hidden) close call on table.
> > > > > > > > > > >
> > > > > > > > > > > private void writeAutoFlushed(Put somePut) throws
> > > > IOException {
> > > > > > > > > > >   try (HTableInterface table =
> > > getAutoFlushTable(tableName))
> > > > {
> > > > > > > > > > >     table.put(somePut); // will do synchronous flush
> > > > > > > > > > >   }
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > > private void writeBuffered(Put somePut) throws
> > IOException
> > > {
> > > > > > > > > > >   try (HTableInterface table =
> > > getBufferedTable(tableName)) {
> > > > > > > > > > >     table.put(somePut);
> > > > > > > > > > >   } // auto-close will trigger synchronous flush
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > > It seems like the only way to avoid this is to have
> > > > long-lived
> > > > > > > HTable
> > > > > > > > > > > instances, which get reused for multiple writes.
> However,
> > > > since
> > > > > > the
> > > > > > > > > > actual
> > > > > > > > > > > writes are driven from highly concurrent code, and
> since
> > > > HTable
> > > > > > is
> > > > > > > > not
> > > > > > > > > > > threadsafe, this would involve having a number of
> HTable
> > > > > > instances,
> > > > > > > > > and a
> > > > > > > > > > > control mechanism for leasing them out to individual
> > > threads
> > > > > > > safely.
> > > > > > > > > > Except
> > > > > > > > > > > at this point it seems like we will have recreated
> > > > HTablePool,
> > > > > > > which
> > > > > > > > > > > suggests that we’re doing something deeply wrong.
> > > > > > > > > > >
> > > > > > > > > > > What am I missing here? Since the
> > > > HTableInterface.setAutoFlush
> > > > > > > method
> > > > > > > > > > still
> > > > > > > > > > > exists, it must be anticipated that users will still
> want
> > > to
> > > > > > buffer
> > > > > > > > > > writes.
> > > > > > > > > > > What’s the recommended way to actually buffer a
> > meaningful
> > > > > number
> > > > > > > of
> > > > > > > > > > > writes, from a multithreaded context, that doesn’t just
> > > > amount
> > > > > to
> > > > > > > > > > creating
> > > > > > > > > > > a table pool?
> > > > > > > > > > >
> > > > > > > > > > > Thanks in advance,
> > > > > > > > > > > Aaron
> > > > > > > > > > >
> > > > > > > > > > > [1] https://issues.apache.org/jira/browse/HBASE-6580
> > > > > > > > > > > [2] http://hbase.apache.org/book/perf.writing.html
> > > > > > > > > > > [3]
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/browse/HBASE-6580?focusedCommentId=13501302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13501302
> > > > > > > > > > > 
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best regards,
> > > > > > > > >
> > > > > > > > >    - Andy
> > > > > > > > >
> > > > > > > > > Problems worthy of attack prove their worth by hitting
> back.
> > -
> > > > Piet
> > > > > > > Hein
> > > > > > > > > (via Tom White)
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > >
> > > > > > >    - Andy
> > > > > > >
> > > > > > > Problems worthy of attack prove their worth by hitting back. -
> > Piet
> > > > > Hein
> > > > > > > (via Tom White)
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > >
> > > > >    - Andy
> > > > >
> > > > > Problems worthy of attack prove their worth by hitting back. - Piet
> > > Hein
> > > > > (via Tom White)
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best regards,
> > >
> > >    - Andy
> > >
> > > Problems worthy of attack prove their worth by hitting back. - Piet
> Hein
> > > (via Tom White)
> > >
> >
>

Re: Efficient use of buffered writes in a post-HTablePool world?

Posted by Nick Dimiduk <nd...@gmail.com>.

Could be in an API-compatible way, though semantics would change, which is
probably worse. Table keeps these methods. When setAutoFlush is used, write
buffer managed by connection is created. If multiple Table instances for
the same table setWriteBufferSize(), perhaps the largest value wins. Writes
across these instances all hit the same buffer. What's not clear here is
who owns the ExecutorService(s) that handles flushing the buffer.

My original thought was make this a blocker of 1.0, but we've shipped 0.96
and 0.98 this way, so we have to keep API and semantics around for backward
compatibility anyway. Doesn't mean we can't so the new API better though.
HTablePool is still in 1.0, so this would be thinking ahead to the fancy
new Table-based API. If we drop these two methods from Table, we can ship
with a feature gap between old and new API, resolve this in 1.1. Folks who
need this kind of pooling can continue to use HTablePool with HTables.

On Friday, December 19, 2014, Solomon Duskis <sd...@gmail.com> wrote:

> My first thought based on this discussion was that it would require moving
> some methods (setAutoFlush() and setWriteBufferSize()) from Table to
> Connection.  That would be a breaking API change.
>
> -Solomon
>
> On Fri, Dec 19, 2014 at 3:04 PM, Andrew Purtell <apurtell@apache.org
> <javascript:;>> wrote:
> >
> > I think it would be critical if we're contemplating something that
> requires
> > a breaking API change? Do we have that here? I'm not sure.
> >
> > On Fri, Dec 19, 2014 at 12:02 PM, Solomon Duskis <sduskis@gmail.com
> <javascript:;>>
> > wrote:
> > >
> > > Is this critical to sort out before 1.0, or is fixing this a post-1.0
> > > enhancement?
> > >
> > > -Solomon
> > >
> > > On Fri, Dec 19, 2014 at 2:19 PM, Andrew Purtell <apurtell@apache.org
> <javascript:;>>
> > > wrote:
> > > >
> > > > I don't like the dropped writes either. Just pointing out what we
> have
> > > now.
> > > > There is a gap no doubt.
> > > >
> > > > On Fri, Dec 19, 2014 at 11:16 AM, Nick Dimiduk <ndimiduk@apache.org
> <javascript:;>>
> > > > wrote:
> > > > >
> > > > > Thanks for the reminder about the Multiplexer, Andrew. It sort-of
> > > solves
> > > > > this problem, but think it's semantics of dropping writes are not
> > > > desirable
> > > > > in the general case. Further, my understanding was that the new
> > > > connection
> > > > > implementation is designed to handle this kind of use-case (hence
> > > cc'ing
> > > > > Lars).
> > > > >
> > > > > On Fri, Dec 19, 2014 at 11:02 AM, Andrew Purtell <
> > apurtell@apache.org <javascript:;>>
> > > > > wrote:
> > > > > >
> > > > > > Aaron: Please post a copy of that feedback on the JIRA, pretty
> sure
> > > we
> > > > > will
> > > > > > be having an improvement discussion there.
> > > > > >
> > > > > > On Fri, Dec 19, 2014 at 10:58 AM, Aaron Beppu <
> > > abeppu@siftscience.com <javascript:;>>
> > > > > > wrote:
> > > > > > >
> > > > > > > Nick : Thanks, I've created an issue [1].
> > > > > > >
> > > > > > > Pradeep : Yes, I have considered using that. However for the
> > > moment,
> > > > > > we've
> > > > > > > set it out of scope, since our migration from 0.94 -> 0.98 is
> > > > already a
> > > > > > bit
> > > > > > > complicated, and we hoped to separate isolate these changes by
> > not
> > > > > moving
> > > > > > > to the async client until after the current migration is
> > complete.
> > > > > > >
> > > > > > > Andrew : HTableMultiplexer does seem like it would solve our
> > > buffered
> > > > > > write
> > > > > > > problem, albeit in an awkward way -- thanks! It kind of seems
> > like
> > > > > HTable
> > > > > > > should then (if autoFlush == false) send writes to the
> > multiplexer,
> > > > > > rather
> > > > > > > than setting it in its own, short-lived writeBuffer. If nothing
> > > else,
> > > > > > it's
> > > > > > > still super confusing that HTableInterface exposes
> setAutoFlush()
> > > and
> > > > > > > setWriteBufferSize(), given that the writeBuffer won't
> > meaningfully
> > > > > > buffer
> > > > > > > anything if all tables are short-lived.
> > > > > > >
> > > > > > > [1] https://issues.apache.org/jira/browse/HBASE-12728
> > > > > > >
> > > > > > > On Fri, Dec 19, 2014 at 10:31 AM, Andrew Purtell <
> > > > apurtell@apache.org <javascript:;>>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > I believe HTableMultiplexer[1] is meant to stand in for
> > > HTablePool
> > > > > for
> > > > > > > > buffered writing. FWIW, I've not used it.
> > > > > > > >
> > > > > > > > 1:
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableMultiplexer.html
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Dec 19, 2014 at 9:00 AM, Nick Dimiduk <
> > > ndimiduk@apache.org <javascript:;>
> > > > >
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Hi Aaron,
> > > > > > > > >
> > > > > > > > > Your analysis is spot on and I do not believe this is by
> > > design.
> > > > I
> > > > > > see
> > > > > > > > the
> > > > > > > > > write buffer is owned by the table, while I would have
> > expected
> > > > > there
> > > > > > > to
> > > > > > > > be
> > > > > > > > > a buffer per table all managed by the connection. I suggest
> > you
> > > > > > raise a
> > > > > > > > > blocker ticket vs the 1.0.0 release that's just around the
> > > corner
> > > > > to
> > > > > > > give
> > > > > > > > > this the attention it needs. Let me know if you're not into
> > > > JIRA, I
> > > > > > can
> > > > > > > > > raise one on your behalf.
> > > > > > > > >
> > > > > > > > > cc Lars, Enis.
> > > > > > > > >
> > > > > > > > > Nice work Aaron.
> > > > > > > > > -n
> > > > > > > > >
> > > > > > > > > On Wed, Dec 17, 2014 at 6:44 PM, Aaron Beppu <
> > > > > abeppu@siftscience.com <javascript:;>
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Hi All,
> > > > > > > > > >
> > > > > > > > > > TLDR; in the absence of HTablePool, if HTable instances
> are
> > > > > > > > short-lived,
> > > > > > > > > > how should clients use buffered writes?
> > > > > > > > > >
> > > > > > > > > > I’m working on migrating a codebase from using 0.94.6
> > > (CDH4.4)
> > > > to
> > > > > > > > 0.98.6
> > > > > > > > > > (CDH5.2). One issue I’m confused by is how to effectively
> > use
> > > > > > > buffered
> > > > > > > > > > writes now that HTablePool has been deprecated[1].
> > > > > > > > > >
> > > > > > > > > > In our 0.94 code, a pathway could get a table from the
> > pool,
> > > > > > > configure
> > > > > > > > it
> > > > > > > > > > with table.setAutoFlush(false); and write Puts to it.
> Those
> > > > > writes
> > > > > > > > would
> > > > > > > > > > then go to the table instance’s writeBuffer, and those
> > writes
> > > > > would
> > > > > > > > only
> > > > > > > > > be
> > > > > > > > > > flushed when the buffer was full, or when we were ready
> to
> > > > close
> > > > > > out
> > > > > > > > the
> > > > > > > > > > pool. We were intentionally choosing to have fewer,
> larger
> > > > writes
> > > > > > > from
> > > > > > > > > the
> > > > > > > > > > client to the cluster, and we knew we were giving up a
> > degree
> > > > of
> > > > > > > safety
> > > > > > > > > in
> > > > > > > > > > exchange (i.e. if the client dies after it’s accepted a
> > write
> > > > but
> > > > > > > > before
> > > > > > > > > > the flush for that write occurs, the data is lost). This
> > > seems
> > > > to
> > > > > > be
> > > > > > > a
> > > > > > > > > > generally considered a reasonable choice (cf the HBase
> Book
> > > [2]
> > > > > SS
> > > > > > > > > 14.8.4)
> > > > > > > > > >
> > > > > > > > > > However in the 0.98 world, without HTablePool, the
> endorsed
> > > > > pattern
> > > > > > > [3]
> > > > > > > > > > seems to be to create a new HTable via table =
> > > > > > > > > > stashedHConnection.getTable(tableName,
> myExecutorService).
> > > > > However,
> > > > > > > > even
> > > > > > > > > if
> > > > > > > > > > we do table.setAutoFlush(false), because that table
> > instance
> > > is
> > > > > > > > > > short-lived, its buffer never gets full. We’ll create a
> > table
> > > > > > > instance,
> > > > > > > > > > write a put to it, try to close the table, and the close
> > call
> > > > > will
> > > > > > > > > trigger
> > > > > > > > > > a (synchronous) flush. Thus, not having HTablePool seems
> > like
> > > > it
> > > > > > > would
> > > > > > > > > > cause us to have many more small writes from the client
> to
> > > the
> > > > > > > cluster,
> > > > > > > > > and
> > > > > > > > > > basically wipe out the advantage of turning off
> autoflush.
> > > > > > > > > >
> > > > > > > > > > More concretely :
> > > > > > > > > >
> > > > > > > > > > // Given these two helpers ...
> > > > > > > > > >
> > > > > > > > > > private HTableInterface getAutoFlushTable(String
> tableName)
> > > > > throws
> > > > > > > > > > IOException {
> > > > > > > > > >   // (autoflush is true by default)
> > > > > > > > > >   return storedConnection.getTable(tableName,
> > > executorService);
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > private HTableInterface getBufferedTable(String
> tableName)
> > > > throws
> > > > > > > > > > IOException {
> > > > > > > > > >   HTableInterface table = getAutoFlushTable(tableName);
> > > > > > > > > >   table.setAutoFlush(false);
> > > > > > > > > >   return table;
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > // it's my contention that these two methods would behave
> > > > almost
> > > > > > > > > > identically,
> > > > > > > > > > // except the first will hit a synchronous flush during
> the
> > > put
> > > > > > call,
> > > > > > > > > > and the second will
> > > > > > > > > > // flush during the (hidden) close call on table.
> > > > > > > > > >
> > > > > > > > > > private void writeAutoFlushed(Put somePut) throws
> > > IOException {
> > > > > > > > > >   try (HTableInterface table =
> > getAutoFlushTable(tableName))
> > > {
> > > > > > > > > >     table.put(somePut); // will do synchronous flush
> > > > > > > > > >   }
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > private void writeBuffered(Put somePut) throws
> IOException
> > {
> > > > > > > > > >   try (HTableInterface table =
> > getBufferedTable(tableName)) {
> > > > > > > > > >     table.put(somePut);
> > > > > > > > > >   } // auto-close will trigger synchronous flush
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > It seems like the only way to avoid this is to have
> > > long-lived
> > > > > > HTable
> > > > > > > > > > instances, which get reused for multiple writes. However,
> > > since
> > > > > the
> > > > > > > > > actual
> > > > > > > > > > writes are driven from highly concurrent code, and since
> > > HTable
> > > > > is
> > > > > > > not
> > > > > > > > > > threadsafe, this would involve having a number of HTable
> > > > > instances,
> > > > > > > > and a
> > > > > > > > > > control mechanism for leasing them out to individual
> > threads
> > > > > > safely.
> > > > > > > > > Except
> > > > > > > > > > at this point it seems like we will have recreated
> > > HTablePool,
> > > > > > which
> > > > > > > > > > suggests that we’re doing something deeply wrong.
> > > > > > > > > >
> > > > > > > > > > What am I missing here? Since the
> > > HTableInterface.setAutoFlush
> > > > > > method
> > > > > > > > > still
> > > > > > > > > > exists, it must be anticipated that users will still want
> > to
> > > > > buffer
> > > > > > > > > writes.
> > > > > > > > > > What’s the recommended way to actually buffer a
> meaningful
> > > > number
> > > > > > of
> > > > > > > > > > writes, from a multithreaded context, that doesn’t just
> > > amount
> > > > to
> > > > > > > > > creating
> > > > > > > > > > a table pool?
> > > > > > > > > >
> > > > > > > > > > Thanks in advance,
> > > > > > > > > > Aaron
> > > > > > > > > >
> > > > > > > > > > [1] https://issues.apache.org/jira/browse/HBASE-6580
> > > > > > > > > > [2] http://hbase.apache.org/book/perf.writing.html
> > > > > > > > > > [3]
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/browse/HBASE-6580?focusedCommentId=13501302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13501302
> > > > > > > > > > 
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best regards,
> > > > > > > >
> > > > > > > >    - Andy
> > > > > > > >
> > > > > > > > Problems worthy of attack prove their worth by hitting back.
> -
> > > Piet
> > > > > > Hein
> > > > > > > > (via Tom White)
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > >
> > > > > >    - Andy
> > > > > >
> > > > > > Problems worthy of attack prove their worth by hitting back. -
> Piet
> > > > Hein
> > > > > > (via Tom White)
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > >
> > > >    - Andy
> > > >
> > > > Problems worthy of attack prove their worth by hitting back. - Piet
> > Hein
> > > > (via Tom White)
> > > >
> > >
> >
> >
> > --
> > Best regards,
> >
> >    - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > (via Tom White)
> >
>

Re: Efficient use of buffered writes in a post-HTablePool world?

Posted by Stack <st...@duboce.net>.

On Fri, Dec 19, 2014 at 12:20 PM, Solomon Duskis <sd...@gmail.com> wrote:

> My first thought based on this discussion was that it would require moving
> some methods (setAutoFlush() and setWriteBufferSize()) from Table to
> Connection.  That would be a breaking API change.
>
>
This will mean a bunch of Table state tracking down in Connection?  We want
to do that?

What if there were a thread-safe implementation of Table?

Otherwise, a new version of ThreadPool, one that shares an Executor and
returns a Table, cleanly documented as the salve for the problem Aaron
raises? (Lars Hofhansl should sketch the design since he is the original
hater of the old implementation -- smile -- and then one of us can do the
new implementation).

St.Ack
P.S. HTableMultiplexer is from another era so I can believe it an ill-fit.




> -Solomon
>
> On Fri, Dec 19, 2014 at 3:04 PM, Andrew Purtell <ap...@apache.org>
> wrote:
> >
> > I think it would be critical if we're contemplating something that
> requires
> > a breaking API change? Do we have that here? I'm not sure.
> >
> > On Fri, Dec 19, 2014 at 12:02 PM, Solomon Duskis <sd...@gmail.com>
> > wrote:
> > >
> > > Is this critical to sort out before 1.0, or is fixing this a post-1.0
> > > enhancement?
> > >
> > > -Solomon
> > >
> > > On Fri, Dec 19, 2014 at 2:19 PM, Andrew Purtell <ap...@apache.org>
> > > wrote:
> > > >
> > > > I don't like the dropped writes either. Just pointing out what we
> have
> > > now.
> > > > There is a gap no doubt.
> > > >
> > > > On Fri, Dec 19, 2014 at 11:16 AM, Nick Dimiduk <nd...@apache.org>
> > > > wrote:
> > > > >
> > > > > Thanks for the reminder about the Multiplexer, Andrew. It sort-of
> > > solves
> > > > > this problem, but think it's semantics of dropping writes are not
> > > > desirable
> > > > > in the general case. Further, my understanding was that the new
> > > > connection
> > > > > implementation is designed to handle this kind of use-case (hence
> > > cc'ing
> > > > > Lars).
> > > > >
> > > > > On Fri, Dec 19, 2014 at 11:02 AM, Andrew Purtell <
> > apurtell@apache.org>
> > > > > wrote:
> > > > > >
> > > > > > Aaron: Please post a copy of that feedback on the JIRA, pretty
> sure
> > > we
> > > > > will
> > > > > > be having an improvement discussion there.
> > > > > >
> > > > > > On Fri, Dec 19, 2014 at 10:58 AM, Aaron Beppu <
> > > abeppu@siftscience.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > Nick : Thanks, I've created an issue [1].
> > > > > > >
> > > > > > > Pradeep : Yes, I have considered using that. However for the
> > > moment,
> > > > > > we've
> > > > > > > set it out of scope, since our migration from 0.94 -> 0.98 is
> > > > already a
> > > > > > bit
> > > > > > > complicated, and we hoped to separate isolate these changes by
> > not
> > > > > moving
> > > > > > > to the async client until after the current migration is
> > complete.
> > > > > > >
> > > > > > > Andrew : HTableMultiplexer does seem like it would solve our
> > > buffered
> > > > > > write
> > > > > > > problem, albeit in an awkward way -- thanks! It kind of seems
> > like
> > > > > HTable
> > > > > > > should then (if autoFlush == false) send writes to the
> > multiplexer,
> > > > > > rather
> > > > > > > than setting it in its own, short-lived writeBuffer. If nothing
> > > else,
> > > > > > it's
> > > > > > > still super confusing that HTableInterface exposes
> setAutoFlush()
> > > and
> > > > > > > setWriteBufferSize(), given that the writeBuffer won't
> > meaningfully
> > > > > > buffer
> > > > > > > anything if all tables are short-lived.
> > > > > > >
> > > > > > > [1] https://issues.apache.org/jira/browse/HBASE-12728
> > > > > > >
> > > > > > > On Fri, Dec 19, 2014 at 10:31 AM, Andrew Purtell <
> > > > apurtell@apache.org>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > I believe HTableMultiplexer[1] is meant to stand in for
> > > HTablePool
> > > > > for
> > > > > > > > buffered writing. FWIW, I've not used it.
> > > > > > > >
> > > > > > > > 1:
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableMultiplexer.html
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Dec 19, 2014 at 9:00 AM, Nick Dimiduk <
> > > ndimiduk@apache.org
> > > > >
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Hi Aaron,
> > > > > > > > >
> > > > > > > > > Your analysis is spot on and I do not believe this is by
> > > design.
> > > > I
> > > > > > see
> > > > > > > > the
> > > > > > > > > write buffer is owned by the table, while I would have
> > expected
> > > > > there
> > > > > > > to
> > > > > > > > be
> > > > > > > > > a buffer per table all managed by the connection. I suggest
> > you
> > > > > > raise a
> > > > > > > > > blocker ticket vs the 1.0.0 release that's just around the
> > > corner
> > > > > to
> > > > > > > give
> > > > > > > > > this the attention it needs. Let me know if you're not into
> > > > JIRA, I
> > > > > > can
> > > > > > > > > raise one on your behalf.
> > > > > > > > >
> > > > > > > > > cc Lars, Enis.
> > > > > > > > >
> > > > > > > > > Nice work Aaron.
> > > > > > > > > -n
> > > > > > > > >
> > > > > > > > > On Wed, Dec 17, 2014 at 6:44 PM, Aaron Beppu <
> > > > > abeppu@siftscience.com
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Hi All,
> > > > > > > > > >
> > > > > > > > > > TLDR; in the absence of HTablePool, if HTable instances
> are
> > > > > > > > short-lived,
> > > > > > > > > > how should clients use buffered writes?
> > > > > > > > > >
> > > > > > > > > > I’m working on migrating a codebase from using 0.94.6
> > > (CDH4.4)
> > > > to
> > > > > > > > 0.98.6
> > > > > > > > > > (CDH5.2). One issue I’m confused by is how to effectively
> > use
> > > > > > > buffered
> > > > > > > > > > writes now that HTablePool has been deprecated[1].
> > > > > > > > > >
> > > > > > > > > > In our 0.94 code, a pathway could get a table from the
> > pool,
> > > > > > > configure
> > > > > > > > it
> > > > > > > > > > with table.setAutoFlush(false); and write Puts to it.
> Those
> > > > > writes
> > > > > > > > would
> > > > > > > > > > then go to the table instance’s writeBuffer, and those
> > writes
> > > > > would
> > > > > > > > only
> > > > > > > > > be
> > > > > > > > > > flushed when the buffer was full, or when we were ready
> to
> > > > close
> > > > > > out
> > > > > > > > the
> > > > > > > > > > pool. We were intentionally choosing to have fewer,
> larger
> > > > writes
> > > > > > > from
> > > > > > > > > the
> > > > > > > > > > client to the cluster, and we knew we were giving up a
> > degree
> > > > of
> > > > > > > safety
> > > > > > > > > in
> > > > > > > > > > exchange (i.e. if the client dies after it’s accepted a
> > write
> > > > but
> > > > > > > > before
> > > > > > > > > > the flush for that write occurs, the data is lost). This
> > > seems
> > > > to
> > > > > > be
> > > > > > > a
> > > > > > > > > > generally considered a reasonable choice (cf the HBase
> Book
> > > [2]
> > > > > SS
> > > > > > > > > 14.8.4)
> > > > > > > > > >
> > > > > > > > > > However in the 0.98 world, without HTablePool, the
> endorsed
> > > > > pattern
> > > > > > > [3]
> > > > > > > > > > seems to be to create a new HTable via table =
> > > > > > > > > > stashedHConnection.getTable(tableName,
> myExecutorService).
> > > > > However,
> > > > > > > > even
> > > > > > > > > if
> > > > > > > > > > we do table.setAutoFlush(false), because that table
> > instance
> > > is
> > > > > > > > > > short-lived, its buffer never gets full. We’ll create a
> > table
> > > > > > > instance,
> > > > > > > > > > write a put to it, try to close the table, and the close
> > call
> > > > > will
> > > > > > > > > trigger
> > > > > > > > > > a (synchronous) flush. Thus, not having HTablePool seems
> > like
> > > > it
> > > > > > > would
> > > > > > > > > > cause us to have many more small writes from the client
> to
> > > the
> > > > > > > cluster,
> > > > > > > > > and
> > > > > > > > > > basically wipe out the advantage of turning off
> autoflush.
> > > > > > > > > >
> > > > > > > > > > More concretely :
> > > > > > > > > >
> > > > > > > > > > // Given these two helpers ...
> > > > > > > > > >
> > > > > > > > > > private HTableInterface getAutoFlushTable(String
> tableName)
> > > > > throws
> > > > > > > > > > IOException {
> > > > > > > > > >   // (autoflush is true by default)
> > > > > > > > > >   return storedConnection.getTable(tableName,
> > > executorService);
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > private HTableInterface getBufferedTable(String
> tableName)
> > > > throws
> > > > > > > > > > IOException {
> > > > > > > > > >   HTableInterface table = getAutoFlushTable(tableName);
> > > > > > > > > >   table.setAutoFlush(false);
> > > > > > > > > >   return table;
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > // it's my contention that these two methods would behave
> > > > almost
> > > > > > > > > > identically,
> > > > > > > > > > // except the first will hit a synchronous flush during
> the
> > > put
> > > > > > call,
> > > > > > > > > > and the second will
> > > > > > > > > > // flush during the (hidden) close call on table.
> > > > > > > > > >
> > > > > > > > > > private void writeAutoFlushed(Put somePut) throws
> > > IOException {
> > > > > > > > > >   try (HTableInterface table =
> > getAutoFlushTable(tableName))
> > > {
> > > > > > > > > >     table.put(somePut); // will do synchronous flush
> > > > > > > > > >   }
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > private void writeBuffered(Put somePut) throws
> IOException
> > {
> > > > > > > > > >   try (HTableInterface table =
> > getBufferedTable(tableName)) {
> > > > > > > > > >     table.put(somePut);
> > > > > > > > > >   } // auto-close will trigger synchronous flush
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > It seems like the only way to avoid this is to have
> > > long-lived
> > > > > > HTable
> > > > > > > > > > instances, which get reused for multiple writes. However,
> > > since
> > > > > the
> > > > > > > > > actual
> > > > > > > > > > writes are driven from highly concurrent code, and since
> > > HTable
> > > > > is
> > > > > > > not
> > > > > > > > > > threadsafe, this would involve having a number of HTable
> > > > > instances,
> > > > > > > > and a
> > > > > > > > > > control mechanism for leasing them out to individual
> > threads
> > > > > > safely.
> > > > > > > > > Except
> > > > > > > > > > at this point it seems like we will have recreated
> > > HTablePool,
> > > > > > which
> > > > > > > > > > suggests that we’re doing something deeply wrong.
> > > > > > > > > >
> > > > > > > > > > What am I missing here? Since the
> > > HTableInterface.setAutoFlush
> > > > > > method
> > > > > > > > > still
> > > > > > > > > > exists, it must be anticipated that users will still want
> > to
> > > > > buffer
> > > > > > > > > writes.
> > > > > > > > > > What’s the recommended way to actually buffer a
> meaningful
> > > > number
> > > > > > of
> > > > > > > > > > writes, from a multithreaded context, that doesn’t just
> > > amount
> > > > to
> > > > > > > > > creating
> > > > > > > > > > a table pool?
> > > > > > > > > >
> > > > > > > > > > Thanks in advance,
> > > > > > > > > > Aaron
> > > > > > > > > >
> > > > > > > > > > [1] https://issues.apache.org/jira/browse/HBASE-6580
> > > > > > > > > > [2] http://hbase.apache.org/book/perf.writing.html
> > > > > > > > > > [3]
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/browse/HBASE-6580?focusedCommentId=13501302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13501302
> > > > > > > > > > 
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best regards,
> > > > > > > >
> > > > > > > >    - Andy
> > > > > > > >
> > > > > > > > Problems worthy of attack prove their worth by hitting back.
> -
> > > Piet
> > > > > > Hein
> > > > > > > > (via Tom White)
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > >
> > > > > >    - Andy
> > > > > >
> > > > > > Problems worthy of attack prove their worth by hitting back. -
> Piet
> > > > Hein
> > > > > > (via Tom White)
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > >
> > > >    - Andy
> > > >
> > > > Problems worthy of attack prove their worth by hitting back. - Piet
> > Hein
> > > > (via Tom White)
> > > >
> > >
> >
> >
> > --
> > Best regards,
> >
> >    - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > (via Tom White)
> >
>

Re: Efficient use of buffered writes in a post-HTablePool world?

Posted by Solomon Duskis <sd...@gmail.com>.

My first thought based on this discussion was that it would require moving
some methods (setAutoFlush() and setWriteBufferSize()) from Table to
Connection.  That would be a breaking API change.

-Solomon

On Fri, Dec 19, 2014 at 3:04 PM, Andrew Purtell <ap...@apache.org> wrote:
>
> I think it would be critical if we're contemplating something that requires
> a breaking API change? Do we have that here? I'm not sure.
>
> On Fri, Dec 19, 2014 at 12:02 PM, Solomon Duskis <sd...@gmail.com>
> wrote:
> >
> > Is this critical to sort out before 1.0, or is fixing this a post-1.0
> > enhancement?
> >
> > -Solomon
> >
> > On Fri, Dec 19, 2014 at 2:19 PM, Andrew Purtell <ap...@apache.org>
> > wrote:
> > >
> > > I don't like the dropped writes either. Just pointing out what we have
> > now.
> > > There is a gap no doubt.
> > >
> > > On Fri, Dec 19, 2014 at 11:16 AM, Nick Dimiduk <nd...@apache.org>
> > > wrote:
> > > >
> > > > Thanks for the reminder about the Multiplexer, Andrew. It sort-of
> > solves
> > > > this problem, but think it's semantics of dropping writes are not
> > > desirable
> > > > in the general case. Further, my understanding was that the new
> > > connection
> > > > implementation is designed to handle this kind of use-case (hence
> > cc'ing
> > > > Lars).
> > > >
> > > > On Fri, Dec 19, 2014 at 11:02 AM, Andrew Purtell <
> apurtell@apache.org>
> > > > wrote:
> > > > >
> > > > > Aaron: Please post a copy of that feedback on the JIRA, pretty sure
> > we
> > > > will
> > > > > be having an improvement discussion there.
> > > > >
> > > > > On Fri, Dec 19, 2014 at 10:58 AM, Aaron Beppu <
> > abeppu@siftscience.com>
> > > > > wrote:
> > > > > >
> > > > > > Nick : Thanks, I've created an issue [1].
> > > > > >
> > > > > > Pradeep : Yes, I have considered using that. However for the
> > moment,
> > > > > we've
> > > > > > set it out of scope, since our migration from 0.94 -> 0.98 is
> > > already a
> > > > > bit
> > > > > > complicated, and we hoped to separate isolate these changes by
> not
> > > > moving
> > > > > > to the async client until after the current migration is
> complete.
> > > > > >
> > > > > > Andrew : HTableMultiplexer does seem like it would solve our
> > buffered
> > > > > write
> > > > > > problem, albeit in an awkward way -- thanks! It kind of seems
> like
> > > > HTable
> > > > > > should then (if autoFlush == false) send writes to the
> multiplexer,
> > > > > rather
> > > > > > than setting it in its own, short-lived writeBuffer. If nothing
> > else,
> > > > > it's
> > > > > > still super confusing that HTableInterface exposes setAutoFlush()
> > and
> > > > > > setWriteBufferSize(), given that the writeBuffer won't
> meaningfully
> > > > > buffer
> > > > > > anything if all tables are short-lived.
> > > > > >
> > > > > > [1] https://issues.apache.org/jira/browse/HBASE-12728
> > > > > >
> > > > > > On Fri, Dec 19, 2014 at 10:31 AM, Andrew Purtell <
> > > apurtell@apache.org>
> > > > > > wrote:
> > > > > > >
> > > > > > > I believe HTableMultiplexer[1] is meant to stand in for
> > HTablePool
> > > > for
> > > > > > > buffered writing. FWIW, I've not used it.
> > > > > > >
> > > > > > > 1:
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableMultiplexer.html
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Dec 19, 2014 at 9:00 AM, Nick Dimiduk <
> > ndimiduk@apache.org
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > Hi Aaron,
> > > > > > > >
> > > > > > > > Your analysis is spot on and I do not believe this is by
> > design.
> > > I
> > > > > see
> > > > > > > the
> > > > > > > > write buffer is owned by the table, while I would have
> expected
> > > > there
> > > > > > to
> > > > > > > be
> > > > > > > > a buffer per table all managed by the connection. I suggest
> you
> > > > > raise a
> > > > > > > > blocker ticket vs the 1.0.0 release that's just around the
> > corner
> > > > to
> > > > > > give
> > > > > > > > this the attention it needs. Let me know if you're not into
> > > JIRA, I
> > > > > can
> > > > > > > > raise one on your behalf.
> > > > > > > >
> > > > > > > > cc Lars, Enis.
> > > > > > > >
> > > > > > > > Nice work Aaron.
> > > > > > > > -n
> > > > > > > >
> > > > > > > > On Wed, Dec 17, 2014 at 6:44 PM, Aaron Beppu <
> > > > abeppu@siftscience.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Hi All,
> > > > > > > > >
> > > > > > > > > TLDR; in the absence of HTablePool, if HTable instances are
> > > > > > > short-lived,
> > > > > > > > > how should clients use buffered writes?
> > > > > > > > >
> > > > > > > > > I’m working on migrating a codebase from using 0.94.6
> > (CDH4.4)
> > > to
> > > > > > > 0.98.6
> > > > > > > > > (CDH5.2). One issue I’m confused by is how to effectively
> use
> > > > > > buffered
> > > > > > > > > writes now that HTablePool has been deprecated[1].
> > > > > > > > >
> > > > > > > > > In our 0.94 code, a pathway could get a table from the
> pool,
> > > > > > configure
> > > > > > > it
> > > > > > > > > with table.setAutoFlush(false); and write Puts to it. Those
> > > > writes
> > > > > > > would
> > > > > > > > > then go to the table instance’s writeBuffer, and those
> writes
> > > > would
> > > > > > > only
> > > > > > > > be
> > > > > > > > > flushed when the buffer was full, or when we were ready to
> > > close
> > > > > out
> > > > > > > the
> > > > > > > > > pool. We were intentionally choosing to have fewer, larger
> > > writes
> > > > > > from
> > > > > > > > the
> > > > > > > > > client to the cluster, and we knew we were giving up a
> degree
> > > of
> > > > > > safety
> > > > > > > > in
> > > > > > > > > exchange (i.e. if the client dies after it’s accepted a
> write
> > > but
> > > > > > > before
> > > > > > > > > the flush for that write occurs, the data is lost). This
> > seems
> > > to
> > > > > be
> > > > > > a
> > > > > > > > > generally considered a reasonable choice (cf the HBase Book
> > [2]
> > > > SS
> > > > > > > > 14.8.4)
> > > > > > > > >
> > > > > > > > > However in the 0.98 world, without HTablePool, the endorsed
> > > > pattern
> > > > > > [3]
> > > > > > > > > seems to be to create a new HTable via table =
> > > > > > > > > stashedHConnection.getTable(tableName, myExecutorService).
> > > > However,
> > > > > > > even
> > > > > > > > if
> > > > > > > > > we do table.setAutoFlush(false), because that table
> instance
> > is
> > > > > > > > > short-lived, its buffer never gets full. We’ll create a
> table
> > > > > > instance,
> > > > > > > > > write a put to it, try to close the table, and the close
> call
> > > > will
> > > > > > > > trigger
> > > > > > > > > a (synchronous) flush. Thus, not having HTablePool seems
> like
> > > it
> > > > > > would
> > > > > > > > > cause us to have many more small writes from the client to
> > the
> > > > > > cluster,
> > > > > > > > and
> > > > > > > > > basically wipe out the advantage of turning off autoflush.
> > > > > > > > >
> > > > > > > > > More concretely :
> > > > > > > > >
> > > > > > > > > // Given these two helpers ...
> > > > > > > > >
> > > > > > > > > private HTableInterface getAutoFlushTable(String tableName)
> > > > throws
> > > > > > > > > IOException {
> > > > > > > > >   // (autoflush is true by default)
> > > > > > > > >   return storedConnection.getTable(tableName,
> > executorService);
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > private HTableInterface getBufferedTable(String tableName)
> > > throws
> > > > > > > > > IOException {
> > > > > > > > >   HTableInterface table = getAutoFlushTable(tableName);
> > > > > > > > >   table.setAutoFlush(false);
> > > > > > > > >   return table;
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > // it's my contention that these two methods would behave
> > > almost
> > > > > > > > > identically,
> > > > > > > > > // except the first will hit a synchronous flush during the
> > put
> > > > > call,
> > > > > > > > > and the second will
> > > > > > > > > // flush during the (hidden) close call on table.
> > > > > > > > >
> > > > > > > > > private void writeAutoFlushed(Put somePut) throws
> > IOException {
> > > > > > > > >   try (HTableInterface table =
> getAutoFlushTable(tableName))
> > {
> > > > > > > > >     table.put(somePut); // will do synchronous flush
> > > > > > > > >   }
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > private void writeBuffered(Put somePut) throws IOException
> {
> > > > > > > > >   try (HTableInterface table =
> getBufferedTable(tableName)) {
> > > > > > > > >     table.put(somePut);
> > > > > > > > >   } // auto-close will trigger synchronous flush
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > It seems like the only way to avoid this is to have
> > long-lived
> > > > > HTable
> > > > > > > > > instances, which get reused for multiple writes. However,
> > since
> > > > the
> > > > > > > > actual
> > > > > > > > > writes are driven from highly concurrent code, and since
> > HTable
> > > > is
> > > > > > not
> > > > > > > > > threadsafe, this would involve having a number of HTable
> > > > instances,
> > > > > > > and a
> > > > > > > > > control mechanism for leasing them out to individual
> threads
> > > > > safely.
> > > > > > > > Except
> > > > > > > > > at this point it seems like we will have recreated
> > HTablePool,
> > > > > which
> > > > > > > > > suggests that we’re doing something deeply wrong.
> > > > > > > > >
> > > > > > > > > What am I missing here? Since the
> > HTableInterface.setAutoFlush
> > > > > method
> > > > > > > > still
> > > > > > > > > exists, it must be anticipated that users will still want
> to
> > > > buffer
> > > > > > > > writes.
> > > > > > > > > What’s the recommended way to actually buffer a meaningful
> > > number
> > > > > of
> > > > > > > > > writes, from a multithreaded context, that doesn’t just
> > amount
> > > to
> > > > > > > > creating
> > > > > > > > > a table pool?
> > > > > > > > >
> > > > > > > > > Thanks in advance,
> > > > > > > > > Aaron
> > > > > > > > >
> > > > > > > > > [1] https://issues.apache.org/jira/browse/HBASE-6580
> > > > > > > > > [2] http://hbase.apache.org/book/perf.writing.html
> > > > > > > > > [3]
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/browse/HBASE-6580?focusedCommentId=13501302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13501302
> > > > > > > > > 
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > >
> > > > > > >    - Andy
> > > > > > >
> > > > > > > Problems worthy of attack prove their worth by hitting back. -
> > Piet
> > > > > Hein
> > > > > > > (via Tom White)
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > >
> > > > >    - Andy
> > > > >
> > > > > Problems worthy of attack prove their worth by hitting back. - Piet
> > > Hein
> > > > > (via Tom White)
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best regards,
> > >
> > >    - Andy
> > >
> > > Problems worthy of attack prove their worth by hitting back. - Piet
> Hein
> > > (via Tom White)
> > >
> >
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>

Re: Efficient use of buffered writes in a post-HTablePool world?

Posted by Andrew Purtell <ap...@apache.org>.

I think it would be critical if we're contemplating something that requires
a breaking API change? Do we have that here? I'm not sure.

On Fri, Dec 19, 2014 at 12:02 PM, Solomon Duskis <sd...@gmail.com> wrote:
>
> Is this critical to sort out before 1.0, or is fixing this a post-1.0
> enhancement?
>
> -Solomon
>
> On Fri, Dec 19, 2014 at 2:19 PM, Andrew Purtell <ap...@apache.org>
> wrote:
> >
> > I don't like the dropped writes either. Just pointing out what we have
> now.
> > There is a gap no doubt.
> >
> > On Fri, Dec 19, 2014 at 11:16 AM, Nick Dimiduk <nd...@apache.org>
> > wrote:
> > >
> > > Thanks for the reminder about the Multiplexer, Andrew. It sort-of
> solves
> > > this problem, but think it's semantics of dropping writes are not
> > desirable
> > > in the general case. Further, my understanding was that the new
> > connection
> > > implementation is designed to handle this kind of use-case (hence
> cc'ing
> > > Lars).
> > >
> > > On Fri, Dec 19, 2014 at 11:02 AM, Andrew Purtell <ap...@apache.org>
> > > wrote:
> > > >
> > > > Aaron: Please post a copy of that feedback on the JIRA, pretty sure
> we
> > > will
> > > > be having an improvement discussion there.
> > > >
> > > > On Fri, Dec 19, 2014 at 10:58 AM, Aaron Beppu <
> abeppu@siftscience.com>
> > > > wrote:
> > > > >
> > > > > Nick : Thanks, I've created an issue [1].
> > > > >
> > > > > Pradeep : Yes, I have considered using that. However for the
> moment,
> > > > we've
> > > > > set it out of scope, since our migration from 0.94 -> 0.98 is
> > already a
> > > > bit
> > > > > complicated, and we hoped to separate isolate these changes by not
> > > moving
> > > > > to the async client until after the current migration is complete.
> > > > >
> > > > > Andrew : HTableMultiplexer does seem like it would solve our
> buffered
> > > > write
> > > > > problem, albeit in an awkward way -- thanks! It kind of seems like
> > > HTable
> > > > > should then (if autoFlush == false) send writes to the multiplexer,
> > > > rather
> > > > > than setting it in its own, short-lived writeBuffer. If nothing
> else,
> > > > it's
> > > > > still super confusing that HTableInterface exposes setAutoFlush()
> and
> > > > > setWriteBufferSize(), given that the writeBuffer won't meaningfully
> > > > buffer
> > > > > anything if all tables are short-lived.
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/HBASE-12728
> > > > >
> > > > > On Fri, Dec 19, 2014 at 10:31 AM, Andrew Purtell <
> > apurtell@apache.org>
> > > > > wrote:
> > > > > >
> > > > > > I believe HTableMultiplexer[1] is meant to stand in for
> HTablePool
> > > for
> > > > > > buffered writing. FWIW, I've not used it.
> > > > > >
> > > > > > 1:
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableMultiplexer.html
> > > > > >
> > > > > >
> > > > > > On Fri, Dec 19, 2014 at 9:00 AM, Nick Dimiduk <
> ndimiduk@apache.org
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > Hi Aaron,
> > > > > > >
> > > > > > > Your analysis is spot on and I do not believe this is by
> design.
> > I
> > > > see
> > > > > > the
> > > > > > > write buffer is owned by the table, while I would have expected
> > > there
> > > > > to
> > > > > > be
> > > > > > > a buffer per table all managed by the connection. I suggest you
> > > > raise a
> > > > > > > blocker ticket vs the 1.0.0 release that's just around the
> corner
> > > to
> > > > > give
> > > > > > > this the attention it needs. Let me know if you're not into
> > JIRA, I
> > > > can
> > > > > > > raise one on your behalf.
> > > > > > >
> > > > > > > cc Lars, Enis.
> > > > > > >
> > > > > > > Nice work Aaron.
> > > > > > > -n
> > > > > > >
> > > > > > > On Wed, Dec 17, 2014 at 6:44 PM, Aaron Beppu <
> > > abeppu@siftscience.com
> > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Hi All,
> > > > > > > >
> > > > > > > > TLDR; in the absence of HTablePool, if HTable instances are
> > > > > > short-lived,
> > > > > > > > how should clients use buffered writes?
> > > > > > > >
> > > > > > > > I’m working on migrating a codebase from using 0.94.6
> (CDH4.4)
> > to
> > > > > > 0.98.6
> > > > > > > > (CDH5.2). One issue I’m confused by is how to effectively use
> > > > > buffered
> > > > > > > > writes now that HTablePool has been deprecated[1].
> > > > > > > >
> > > > > > > > In our 0.94 code, a pathway could get a table from the pool,
> > > > > configure
> > > > > > it
> > > > > > > > with table.setAutoFlush(false); and write Puts to it. Those
> > > writes
> > > > > > would
> > > > > > > > then go to the table instance’s writeBuffer, and those writes
> > > would
> > > > > > only
> > > > > > > be
> > > > > > > > flushed when the buffer was full, or when we were ready to
> > close
> > > > out
> > > > > > the
> > > > > > > > pool. We were intentionally choosing to have fewer, larger
> > writes
> > > > > from
> > > > > > > the
> > > > > > > > client to the cluster, and we knew we were giving up a degree
> > of
> > > > > safety
> > > > > > > in
> > > > > > > > exchange (i.e. if the client dies after it’s accepted a write
> > but
> > > > > > before
> > > > > > > > the flush for that write occurs, the data is lost). This
> seems
> > to
> > > > be
> > > > > a
> > > > > > > > generally considered a reasonable choice (cf the HBase Book
> [2]
> > > SS
> > > > > > > 14.8.4)
> > > > > > > >
> > > > > > > > However in the 0.98 world, without HTablePool, the endorsed
> > > pattern
> > > > > [3]
> > > > > > > > seems to be to create a new HTable via table =
> > > > > > > > stashedHConnection.getTable(tableName, myExecutorService).
> > > However,
> > > > > > even
> > > > > > > if
> > > > > > > > we do table.setAutoFlush(false), because that table instance
> is
> > > > > > > > short-lived, its buffer never gets full. We’ll create a table
> > > > > instance,
> > > > > > > > write a put to it, try to close the table, and the close call
> > > will
> > > > > > > trigger
> > > > > > > > a (synchronous) flush. Thus, not having HTablePool seems like
> > it
> > > > > would
> > > > > > > > cause us to have many more small writes from the client to
> the
> > > > > cluster,
> > > > > > > and
> > > > > > > > basically wipe out the advantage of turning off autoflush.
> > > > > > > >
> > > > > > > > More concretely :
> > > > > > > >
> > > > > > > > // Given these two helpers ...
> > > > > > > >
> > > > > > > > private HTableInterface getAutoFlushTable(String tableName)
> > > throws
> > > > > > > > IOException {
> > > > > > > >   // (autoflush is true by default)
> > > > > > > >   return storedConnection.getTable(tableName,
> executorService);
> > > > > > > > }
> > > > > > > >
> > > > > > > > private HTableInterface getBufferedTable(String tableName)
> > throws
> > > > > > > > IOException {
> > > > > > > >   HTableInterface table = getAutoFlushTable(tableName);
> > > > > > > >   table.setAutoFlush(false);
> > > > > > > >   return table;
> > > > > > > > }
> > > > > > > >
> > > > > > > > // it's my contention that these two methods would behave
> > almost
> > > > > > > > identically,
> > > > > > > > // except the first will hit a synchronous flush during the
> put
> > > > call,
> > > > > > > > and the second will
> > > > > > > > // flush during the (hidden) close call on table.
> > > > > > > >
> > > > > > > > private void writeAutoFlushed(Put somePut) throws
> IOException {
> > > > > > > >   try (HTableInterface table = getAutoFlushTable(tableName))
> {
> > > > > > > >     table.put(somePut); // will do synchronous flush
> > > > > > > >   }
> > > > > > > > }
> > > > > > > >
> > > > > > > > private void writeBuffered(Put somePut) throws IOException {
> > > > > > > >   try (HTableInterface table = getBufferedTable(tableName)) {
> > > > > > > >     table.put(somePut);
> > > > > > > >   } // auto-close will trigger synchronous flush
> > > > > > > > }
> > > > > > > >
> > > > > > > > It seems like the only way to avoid this is to have
> long-lived
> > > > HTable
> > > > > > > > instances, which get reused for multiple writes. However,
> since
> > > the
> > > > > > > actual
> > > > > > > > writes are driven from highly concurrent code, and since
> HTable
> > > is
> > > > > not
> > > > > > > > threadsafe, this would involve having a number of HTable
> > > instances,
> > > > > > and a
> > > > > > > > control mechanism for leasing them out to individual threads
> > > > safely.
> > > > > > > Except
> > > > > > > > at this point it seems like we will have recreated
> HTablePool,
> > > > which
> > > > > > > > suggests that we’re doing something deeply wrong.
> > > > > > > >
> > > > > > > > What am I missing here? Since the
> HTableInterface.setAutoFlush
> > > > method
> > > > > > > still
> > > > > > > > exists, it must be anticipated that users will still want to
> > > buffer
> > > > > > > writes.
> > > > > > > > What’s the recommended way to actually buffer a meaningful
> > number
> > > > of
> > > > > > > > writes, from a multithreaded context, that doesn’t just
> amount
> > to
> > > > > > > creating
> > > > > > > > a table pool?
> > > > > > > >
> > > > > > > > Thanks in advance,
> > > > > > > > Aaron
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/HBASE-6580
> > > > > > > > [2] http://hbase.apache.org/book/perf.writing.html
> > > > > > > > [3]
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/browse/HBASE-6580?focusedCommentId=13501302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13501302
> > > > > > > > 
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > >
> > > > > >    - Andy
> > > > > >
> > > > > > Problems worthy of attack prove their worth by hitting back. -
> Piet
> > > > Hein
> > > > > > (via Tom White)
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > >
> > > >    - Andy
> > > >
> > > > Problems worthy of attack prove their worth by hitting back. - Piet
> > Hein
> > > > (via Tom White)
> > > >
> > >
> >
> >
> > --
> > Best regards,
> >
> >    - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > (via Tom White)
> >
>


-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: Efficient use of buffered writes in a post-HTablePool world?

Posted by Solomon Duskis <sd...@gmail.com>.

Is this critical to sort out before 1.0, or is fixing this a post-1.0
enhancement?

-Solomon

On Fri, Dec 19, 2014 at 2:19 PM, Andrew Purtell <ap...@apache.org> wrote:
>
> I don't like the dropped writes either. Just pointing out what we have now.
> There is a gap no doubt.
>
> On Fri, Dec 19, 2014 at 11:16 AM, Nick Dimiduk <nd...@apache.org>
> wrote:
> >
> > Thanks for the reminder about the Multiplexer, Andrew. It sort-of solves
> > this problem, but think it's semantics of dropping writes are not
> desirable
> > in the general case. Further, my understanding was that the new
> connection
> > implementation is designed to handle this kind of use-case (hence cc'ing
> > Lars).
> >
> > On Fri, Dec 19, 2014 at 11:02 AM, Andrew Purtell <ap...@apache.org>
> > wrote:
> > >
> > > Aaron: Please post a copy of that feedback on the JIRA, pretty sure we
> > will
> > > be having an improvement discussion there.
> > >
> > > On Fri, Dec 19, 2014 at 10:58 AM, Aaron Beppu <ab...@siftscience.com>
> > > wrote:
> > > >
> > > > Nick : Thanks, I've created an issue [1].
> > > >
> > > > Pradeep : Yes, I have considered using that. However for the moment,
> > > we've
> > > > set it out of scope, since our migration from 0.94 -> 0.98 is
> already a
> > > bit
> > > > complicated, and we hoped to separate isolate these changes by not
> > moving
> > > > to the async client until after the current migration is complete.
> > > >
> > > > Andrew : HTableMultiplexer does seem like it would solve our buffered
> > > write
> > > > problem, albeit in an awkward way -- thanks! It kind of seems like
> > HTable
> > > > should then (if autoFlush == false) send writes to the multiplexer,
> > > rather
> > > > than setting it in its own, short-lived writeBuffer. If nothing else,
> > > it's
> > > > still super confusing that HTableInterface exposes setAutoFlush() and
> > > > setWriteBufferSize(), given that the writeBuffer won't meaningfully
> > > buffer
> > > > anything if all tables are short-lived.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/HBASE-12728
> > > >
> > > > On Fri, Dec 19, 2014 at 10:31 AM, Andrew Purtell <
> apurtell@apache.org>
> > > > wrote:
> > > > >
> > > > > I believe HTableMultiplexer[1] is meant to stand in for HTablePool
> > for
> > > > > buffered writing. FWIW, I've not used it.
> > > > >
> > > > > 1:
> > > > >
> > > > >
> > > >
> > >
> >
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableMultiplexer.html
> > > > >
> > > > >
> > > > > On Fri, Dec 19, 2014 at 9:00 AM, Nick Dimiduk <ndimiduk@apache.org
> >
> > > > wrote:
> > > > > >
> > > > > > Hi Aaron,
> > > > > >
> > > > > > Your analysis is spot on and I do not believe this is by design.
> I
> > > see
> > > > > the
> > > > > > write buffer is owned by the table, while I would have expected
> > there
> > > > to
> > > > > be
> > > > > > a buffer per table all managed by the connection. I suggest you
> > > raise a
> > > > > > blocker ticket vs the 1.0.0 release that's just around the corner
> > to
> > > > give
> > > > > > this the attention it needs. Let me know if you're not into
> JIRA, I
> > > can
> > > > > > raise one on your behalf.
> > > > > >
> > > > > > cc Lars, Enis.
> > > > > >
> > > > > > Nice work Aaron.
> > > > > > -n
> > > > > >
> > > > > > On Wed, Dec 17, 2014 at 6:44 PM, Aaron Beppu <
> > abeppu@siftscience.com
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > Hi All,
> > > > > > >
> > > > > > > TLDR; in the absence of HTablePool, if HTable instances are
> > > > > short-lived,
> > > > > > > how should clients use buffered writes?
> > > > > > >
> > > > > > > I’m working on migrating a codebase from using 0.94.6 (CDH4.4)
> to
> > > > > 0.98.6
> > > > > > > (CDH5.2). One issue I’m confused by is how to effectively use
> > > > buffered
> > > > > > > writes now that HTablePool has been deprecated[1].
> > > > > > >
> > > > > > > In our 0.94 code, a pathway could get a table from the pool,
> > > > configure
> > > > > it
> > > > > > > with table.setAutoFlush(false); and write Puts to it. Those
> > writes
> > > > > would
> > > > > > > then go to the table instance’s writeBuffer, and those writes
> > would
> > > > > only
> > > > > > be
> > > > > > > flushed when the buffer was full, or when we were ready to
> close
> > > out
> > > > > the
> > > > > > > pool. We were intentionally choosing to have fewer, larger
> writes
> > > > from
> > > > > > the
> > > > > > > client to the cluster, and we knew we were giving up a degree
> of
> > > > safety
> > > > > > in
> > > > > > > exchange (i.e. if the client dies after it’s accepted a write
> but
> > > > > before
> > > > > > > the flush for that write occurs, the data is lost). This seems
> to
> > > be
> > > > a
> > > > > > > generally considered a reasonable choice (cf the HBase Book [2]
> > SS
> > > > > > 14.8.4)
> > > > > > >
> > > > > > > However in the 0.98 world, without HTablePool, the endorsed
> > pattern
> > > > [3]
> > > > > > > seems to be to create a new HTable via table =
> > > > > > > stashedHConnection.getTable(tableName, myExecutorService).
> > However,
> > > > > even
> > > > > > if
> > > > > > > we do table.setAutoFlush(false), because that table instance is
> > > > > > > short-lived, its buffer never gets full. We’ll create a table
> > > > instance,
> > > > > > > write a put to it, try to close the table, and the close call
> > will
> > > > > > trigger
> > > > > > > a (synchronous) flush. Thus, not having HTablePool seems like
> it
> > > > would
> > > > > > > cause us to have many more small writes from the client to the
> > > > cluster,
> > > > > > and
> > > > > > > basically wipe out the advantage of turning off autoflush.
> > > > > > >
> > > > > > > More concretely :
> > > > > > >
> > > > > > > // Given these two helpers ...
> > > > > > >
> > > > > > > private HTableInterface getAutoFlushTable(String tableName)
> > throws
> > > > > > > IOException {
> > > > > > >   // (autoflush is true by default)
> > > > > > >   return storedConnection.getTable(tableName, executorService);
> > > > > > > }
> > > > > > >
> > > > > > > private HTableInterface getBufferedTable(String tableName)
> throws
> > > > > > > IOException {
> > > > > > >   HTableInterface table = getAutoFlushTable(tableName);
> > > > > > >   table.setAutoFlush(false);
> > > > > > >   return table;
> > > > > > > }
> > > > > > >
> > > > > > > // it's my contention that these two methods would behave
> almost
> > > > > > > identically,
> > > > > > > // except the first will hit a synchronous flush during the put
> > > call,
> > > > > > > and the second will
> > > > > > > // flush during the (hidden) close call on table.
> > > > > > >
> > > > > > > private void writeAutoFlushed(Put somePut) throws IOException {
> > > > > > >   try (HTableInterface table = getAutoFlushTable(tableName)) {
> > > > > > >     table.put(somePut); // will do synchronous flush
> > > > > > >   }
> > > > > > > }
> > > > > > >
> > > > > > > private void writeBuffered(Put somePut) throws IOException {
> > > > > > >   try (HTableInterface table = getBufferedTable(tableName)) {
> > > > > > >     table.put(somePut);
> > > > > > >   } // auto-close will trigger synchronous flush
> > > > > > > }
> > > > > > >
> > > > > > > It seems like the only way to avoid this is to have long-lived
> > > HTable
> > > > > > > instances, which get reused for multiple writes. However, since
> > the
> > > > > > actual
> > > > > > > writes are driven from highly concurrent code, and since HTable
> > is
> > > > not
> > > > > > > threadsafe, this would involve having a number of HTable
> > instances,
> > > > > and a
> > > > > > > control mechanism for leasing them out to individual threads
> > > safely.
> > > > > > Except
> > > > > > > at this point it seems like we will have recreated HTablePool,
> > > which
> > > > > > > suggests that we’re doing something deeply wrong.
> > > > > > >
> > > > > > > What am I missing here? Since the HTableInterface.setAutoFlush
> > > method
> > > > > > still
> > > > > > > exists, it must be anticipated that users will still want to
> > buffer
> > > > > > writes.
> > > > > > > What’s the recommended way to actually buffer a meaningful
> number
> > > of
> > > > > > > writes, from a multithreaded context, that doesn’t just amount
> to
> > > > > > creating
> > > > > > > a table pool?
> > > > > > >
> > > > > > > Thanks in advance,
> > > > > > > Aaron
> > > > > > >
> > > > > > > [1] https://issues.apache.org/jira/browse/HBASE-6580
> > > > > > > [2] http://hbase.apache.org/book/perf.writing.html
> > > > > > > [3]
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/browse/HBASE-6580?focusedCommentId=13501302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13501302
> > > > > > > 
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > >
> > > > >    - Andy
> > > > >
> > > > > Problems worthy of attack prove their worth by hitting back. - Piet
> > > Hein
> > > > > (via Tom White)
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best regards,
> > >
> > >    - Andy
> > >
> > > Problems worthy of attack prove their worth by hitting back. - Piet
> Hein
> > > (via Tom White)
> > >
> >
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>

Re: Efficient use of buffered writes in a post-HTablePool world?

Posted by Andrew Purtell <ap...@apache.org>.

I don't like the dropped writes either. Just pointing out what we have now.
There is a gap no doubt.

On Fri, Dec 19, 2014 at 11:16 AM, Nick Dimiduk <nd...@apache.org> wrote:
>
> Thanks for the reminder about the Multiplexer, Andrew. It sort-of solves
> this problem, but think it's semantics of dropping writes are not desirable
> in the general case. Further, my understanding was that the new connection
> implementation is designed to handle this kind of use-case (hence cc'ing
> Lars).
>
> On Fri, Dec 19, 2014 at 11:02 AM, Andrew Purtell <ap...@apache.org>
> wrote:
> >
> > Aaron: Please post a copy of that feedback on the JIRA, pretty sure we
> will
> > be having an improvement discussion there.
> >
> > On Fri, Dec 19, 2014 at 10:58 AM, Aaron Beppu <ab...@siftscience.com>
> > wrote:
> > >
> > > Nick : Thanks, I've created an issue [1].
> > >
> > > Pradeep : Yes, I have considered using that. However for the moment,
> > we've
> > > set it out of scope, since our migration from 0.94 -> 0.98 is already a
> > bit
> > > complicated, and we hoped to separate isolate these changes by not
> moving
> > > to the async client until after the current migration is complete.
> > >
> > > Andrew : HTableMultiplexer does seem like it would solve our buffered
> > write
> > > problem, albeit in an awkward way -- thanks! It kind of seems like
> HTable
> > > should then (if autoFlush == false) send writes to the multiplexer,
> > rather
> > > than setting it in its own, short-lived writeBuffer. If nothing else,
> > it's
> > > still super confusing that HTableInterface exposes setAutoFlush() and
> > > setWriteBufferSize(), given that the writeBuffer won't meaningfully
> > buffer
> > > anything if all tables are short-lived.
> > >
> > > [1] https://issues.apache.org/jira/browse/HBASE-12728
> > >
> > > On Fri, Dec 19, 2014 at 10:31 AM, Andrew Purtell <ap...@apache.org>
> > > wrote:
> > > >
> > > > I believe HTableMultiplexer[1] is meant to stand in for HTablePool
> for
> > > > buffered writing. FWIW, I've not used it.
> > > >
> > > > 1:
> > > >
> > > >
> > >
> >
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableMultiplexer.html
> > > >
> > > >
> > > > On Fri, Dec 19, 2014 at 9:00 AM, Nick Dimiduk <nd...@apache.org>
> > > wrote:
> > > > >
> > > > > Hi Aaron,
> > > > >
> > > > > Your analysis is spot on and I do not believe this is by design. I
> > see
> > > > the
> > > > > write buffer is owned by the table, while I would have expected
> there
> > > to
> > > > be
> > > > > a buffer per table all managed by the connection. I suggest you
> > raise a
> > > > > blocker ticket vs the 1.0.0 release that's just around the corner
> to
> > > give
> > > > > this the attention it needs. Let me know if you're not into JIRA, I
> > can
> > > > > raise one on your behalf.
> > > > >
> > > > > cc Lars, Enis.
> > > > >
> > > > > Nice work Aaron.
> > > > > -n
> > > > >
> > > > > On Wed, Dec 17, 2014 at 6:44 PM, Aaron Beppu <
> abeppu@siftscience.com
> > >
> > > > > wrote:
> > > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > TLDR; in the absence of HTablePool, if HTable instances are
> > > > short-lived,
> > > > > > how should clients use buffered writes?
> > > > > >
> > > > > > I’m working on migrating a codebase from using 0.94.6 (CDH4.4) to
> > > > 0.98.6
> > > > > > (CDH5.2). One issue I’m confused by is how to effectively use
> > > buffered
> > > > > > writes now that HTablePool has been deprecated[1].
> > > > > >
> > > > > > In our 0.94 code, a pathway could get a table from the pool,
> > > configure
> > > > it
> > > > > > with table.setAutoFlush(false); and write Puts to it. Those
> writes
> > > > would
> > > > > > then go to the table instance’s writeBuffer, and those writes
> would
> > > > only
> > > > > be
> > > > > > flushed when the buffer was full, or when we were ready to close
> > out
> > > > the
> > > > > > pool. We were intentionally choosing to have fewer, larger writes
> > > from
> > > > > the
> > > > > > client to the cluster, and we knew we were giving up a degree of
> > > safety
> > > > > in
> > > > > > exchange (i.e. if the client dies after it’s accepted a write but
> > > > before
> > > > > > the flush for that write occurs, the data is lost). This seems to
> > be
> > > a
> > > > > > generally considered a reasonable choice (cf the HBase Book [2]
> SS
> > > > > 14.8.4)
> > > > > >
> > > > > > However in the 0.98 world, without HTablePool, the endorsed
> pattern
> > > [3]
> > > > > > seems to be to create a new HTable via table =
> > > > > > stashedHConnection.getTable(tableName, myExecutorService).
> However,
> > > > even
> > > > > if
> > > > > > we do table.setAutoFlush(false), because that table instance is
> > > > > > short-lived, its buffer never gets full. We’ll create a table
> > > instance,
> > > > > > write a put to it, try to close the table, and the close call
> will
> > > > > trigger
> > > > > > a (synchronous) flush. Thus, not having HTablePool seems like it
> > > would
> > > > > > cause us to have many more small writes from the client to the
> > > cluster,
> > > > > and
> > > > > > basically wipe out the advantage of turning off autoflush.
> > > > > >
> > > > > > More concretely :
> > > > > >
> > > > > > // Given these two helpers ...
> > > > > >
> > > > > > private HTableInterface getAutoFlushTable(String tableName)
> throws
> > > > > > IOException {
> > > > > >   // (autoflush is true by default)
> > > > > >   return storedConnection.getTable(tableName, executorService);
> > > > > > }
> > > > > >
> > > > > > private HTableInterface getBufferedTable(String tableName) throws
> > > > > > IOException {
> > > > > >   HTableInterface table = getAutoFlushTable(tableName);
> > > > > >   table.setAutoFlush(false);
> > > > > >   return table;
> > > > > > }
> > > > > >
> > > > > > // it's my contention that these two methods would behave almost
> > > > > > identically,
> > > > > > // except the first will hit a synchronous flush during the put
> > call,
> > > > > > and the second will
> > > > > > // flush during the (hidden) close call on table.
> > > > > >
> > > > > > private void writeAutoFlushed(Put somePut) throws IOException {
> > > > > >   try (HTableInterface table = getAutoFlushTable(tableName)) {
> > > > > >     table.put(somePut); // will do synchronous flush
> > > > > >   }
> > > > > > }
> > > > > >
> > > > > > private void writeBuffered(Put somePut) throws IOException {
> > > > > >   try (HTableInterface table = getBufferedTable(tableName)) {
> > > > > >     table.put(somePut);
> > > > > >   } // auto-close will trigger synchronous flush
> > > > > > }
> > > > > >
> > > > > > It seems like the only way to avoid this is to have long-lived
> > HTable
> > > > > > instances, which get reused for multiple writes. However, since
> the
> > > > > actual
> > > > > > writes are driven from highly concurrent code, and since HTable
> is
> > > not
> > > > > > threadsafe, this would involve having a number of HTable
> instances,
> > > > and a
> > > > > > control mechanism for leasing them out to individual threads
> > safely.
> > > > > Except
> > > > > > at this point it seems like we will have recreated HTablePool,
> > which
> > > > > > suggests that we’re doing something deeply wrong.
> > > > > >
> > > > > > What am I missing here? Since the HTableInterface.setAutoFlush
> > method
> > > > > still
> > > > > > exists, it must be anticipated that users will still want to
> buffer
> > > > > writes.
> > > > > > What’s the recommended way to actually buffer a meaningful number
> > of
> > > > > > writes, from a multithreaded context, that doesn’t just amount to
> > > > > creating
> > > > > > a table pool?
> > > > > >
> > > > > > Thanks in advance,
> > > > > > Aaron
> > > > > >
> > > > > > [1] https://issues.apache.org/jira/browse/HBASE-6580
> > > > > > [2] http://hbase.apache.org/book/perf.writing.html
> > > > > > [3]
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/browse/HBASE-6580?focusedCommentId=13501302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13501302
> > > > > > 
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > >
> > > >    - Andy
> > > >
> > > > Problems worthy of attack prove their worth by hitting back. - Piet
> > Hein
> > > > (via Tom White)
> > > >
> > >
> >
> >
> > --
> > Best regards,
> >
> >    - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > (via Tom White)
> >
>


-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: Efficient use of buffered writes in a post-HTablePool world?

Posted by Nick Dimiduk <nd...@apache.org>.

Thanks for the reminder about the Multiplexer, Andrew. It sort-of solves
this problem, but think it's semantics of dropping writes are not desirable
in the general case. Further, my understanding was that the new connection
implementation is designed to handle this kind of use-case (hence cc'ing
Lars).

On Fri, Dec 19, 2014 at 11:02 AM, Andrew Purtell <ap...@apache.org>
wrote:
>
> Aaron: Please post a copy of that feedback on the JIRA, pretty sure we will
> be having an improvement discussion there.
>
> On Fri, Dec 19, 2014 at 10:58 AM, Aaron Beppu <ab...@siftscience.com>
> wrote:
> >
> > Nick : Thanks, I've created an issue [1].
> >
> > Pradeep : Yes, I have considered using that. However for the moment,
> we've
> > set it out of scope, since our migration from 0.94 -> 0.98 is already a
> bit
> > complicated, and we hoped to separate isolate these changes by not moving
> > to the async client until after the current migration is complete.
> >
> > Andrew : HTableMultiplexer does seem like it would solve our buffered
> write
> > problem, albeit in an awkward way -- thanks! It kind of seems like HTable
> > should then (if autoFlush == false) send writes to the multiplexer,
> rather
> > than setting it in its own, short-lived writeBuffer. If nothing else,
> it's
> > still super confusing that HTableInterface exposes setAutoFlush() and
> > setWriteBufferSize(), given that the writeBuffer won't meaningfully
> buffer
> > anything if all tables are short-lived.
> >
> > [1] https://issues.apache.org/jira/browse/HBASE-12728
> >
> > On Fri, Dec 19, 2014 at 10:31 AM, Andrew Purtell <ap...@apache.org>
> > wrote:
> > >
> > > I believe HTableMultiplexer[1] is meant to stand in for HTablePool for
> > > buffered writing. FWIW, I've not used it.
> > >
> > > 1:
> > >
> > >
> >
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableMultiplexer.html
> > >
> > >
> > > On Fri, Dec 19, 2014 at 9:00 AM, Nick Dimiduk <nd...@apache.org>
> > wrote:
> > > >
> > > > Hi Aaron,
> > > >
> > > > Your analysis is spot on and I do not believe this is by design. I
> see
> > > the
> > > > write buffer is owned by the table, while I would have expected there
> > to
> > > be
> > > > a buffer per table all managed by the connection. I suggest you
> raise a
> > > > blocker ticket vs the 1.0.0 release that's just around the corner to
> > give
> > > > this the attention it needs. Let me know if you're not into JIRA, I
> can
> > > > raise one on your behalf.
> > > >
> > > > cc Lars, Enis.
> > > >
> > > > Nice work Aaron.
> > > > -n
> > > >
> > > > On Wed, Dec 17, 2014 at 6:44 PM, Aaron Beppu <abeppu@siftscience.com
> >
> > > > wrote:
> > > > >
> > > > > Hi All,
> > > > >
> > > > > TLDR; in the absence of HTablePool, if HTable instances are
> > > short-lived,
> > > > > how should clients use buffered writes?
> > > > >
> > > > > I’m working on migrating a codebase from using 0.94.6 (CDH4.4) to
> > > 0.98.6
> > > > > (CDH5.2). One issue I’m confused by is how to effectively use
> > buffered
> > > > > writes now that HTablePool has been deprecated[1].
> > > > >
> > > > > In our 0.94 code, a pathway could get a table from the pool,
> > configure
> > > it
> > > > > with table.setAutoFlush(false); and write Puts to it. Those writes
> > > would
> > > > > then go to the table instance’s writeBuffer, and those writes would
> > > only
> > > > be
> > > > > flushed when the buffer was full, or when we were ready to close
> out
> > > the
> > > > > pool. We were intentionally choosing to have fewer, larger writes
> > from
> > > > the
> > > > > client to the cluster, and we knew we were giving up a degree of
> > safety
> > > > in
> > > > > exchange (i.e. if the client dies after it’s accepted a write but
> > > before
> > > > > the flush for that write occurs, the data is lost). This seems to
> be
> > a
> > > > > generally considered a reasonable choice (cf the HBase Book [2] SS
> > > > 14.8.4)
> > > > >
> > > > > However in the 0.98 world, without HTablePool, the endorsed pattern
> > [3]
> > > > > seems to be to create a new HTable via table =
> > > > > stashedHConnection.getTable(tableName, myExecutorService). However,
> > > even
> > > > if
> > > > > we do table.setAutoFlush(false), because that table instance is
> > > > > short-lived, its buffer never gets full. We’ll create a table
> > instance,
> > > > > write a put to it, try to close the table, and the close call will
> > > > trigger
> > > > > a (synchronous) flush. Thus, not having HTablePool seems like it
> > would
> > > > > cause us to have many more small writes from the client to the
> > cluster,
> > > > and
> > > > > basically wipe out the advantage of turning off autoflush.
> > > > >
> > > > > More concretely :
> > > > >
> > > > > // Given these two helpers ...
> > > > >
> > > > > private HTableInterface getAutoFlushTable(String tableName) throws
> > > > > IOException {
> > > > >   // (autoflush is true by default)
> > > > >   return storedConnection.getTable(tableName, executorService);
> > > > > }
> > > > >
> > > > > private HTableInterface getBufferedTable(String tableName) throws
> > > > > IOException {
> > > > >   HTableInterface table = getAutoFlushTable(tableName);
> > > > >   table.setAutoFlush(false);
> > > > >   return table;
> > > > > }
> > > > >
> > > > > // it's my contention that these two methods would behave almost
> > > > > identically,
> > > > > // except the first will hit a synchronous flush during the put
> call,
> > > > > and the second will
> > > > > // flush during the (hidden) close call on table.
> > > > >
> > > > > private void writeAutoFlushed(Put somePut) throws IOException {
> > > > >   try (HTableInterface table = getAutoFlushTable(tableName)) {
> > > > >     table.put(somePut); // will do synchronous flush
> > > > >   }
> > > > > }
> > > > >
> > > > > private void writeBuffered(Put somePut) throws IOException {
> > > > >   try (HTableInterface table = getBufferedTable(tableName)) {
> > > > >     table.put(somePut);
> > > > >   } // auto-close will trigger synchronous flush
> > > > > }
> > > > >
> > > > > It seems like the only way to avoid this is to have long-lived
> HTable
> > > > > instances, which get reused for multiple writes. However, since the
> > > > actual
> > > > > writes are driven from highly concurrent code, and since HTable is
> > not
> > > > > threadsafe, this would involve having a number of HTable instances,
> > > and a
> > > > > control mechanism for leasing them out to individual threads
> safely.
> > > > Except
> > > > > at this point it seems like we will have recreated HTablePool,
> which
> > > > > suggests that we’re doing something deeply wrong.
> > > > >
> > > > > What am I missing here? Since the HTableInterface.setAutoFlush
> method
> > > > still
> > > > > exists, it must be anticipated that users will still want to buffer
> > > > writes.
> > > > > What’s the recommended way to actually buffer a meaningful number
> of
> > > > > writes, from a multithreaded context, that doesn’t just amount to
> > > > creating
> > > > > a table pool?
> > > > >
> > > > > Thanks in advance,
> > > > > Aaron
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/HBASE-6580
> > > > > [2] http://hbase.apache.org/book/perf.writing.html
> > > > > [3]
> > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/browse/HBASE-6580?focusedCommentId=13501302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13501302
> > > > > 
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best regards,
> > >
> > >    - Andy
> > >
> > > Problems worthy of attack prove their worth by hitting back. - Piet
> Hein
> > > (via Tom White)
> > >
> >
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>

Re: Efficient use of buffered writes in a post-HTablePool world?

Posted by Andrew Purtell <ap...@apache.org>.

Aaron: Please post a copy of that feedback on the JIRA, pretty sure we will
be having an improvement discussion there.

On Fri, Dec 19, 2014 at 10:58 AM, Aaron Beppu <ab...@siftscience.com>
wrote:
>
> Nick : Thanks, I've created an issue [1].
>
> Pradeep : Yes, I have considered using that. However for the moment, we've
> set it out of scope, since our migration from 0.94 -> 0.98 is already a bit
> complicated, and we hoped to separate isolate these changes by not moving
> to the async client until after the current migration is complete.
>
> Andrew : HTableMultiplexer does seem like it would solve our buffered write
> problem, albeit in an awkward way -- thanks! It kind of seems like HTable
> should then (if autoFlush == false) send writes to the multiplexer, rather
> than setting it in its own, short-lived writeBuffer. If nothing else, it's
> still super confusing that HTableInterface exposes setAutoFlush() and
> setWriteBufferSize(), given that the writeBuffer won't meaningfully buffer
> anything if all tables are short-lived.
>
> [1] https://issues.apache.org/jira/browse/HBASE-12728
>
> On Fri, Dec 19, 2014 at 10:31 AM, Andrew Purtell <ap...@apache.org>
> wrote:
> >
> > I believe HTableMultiplexer[1] is meant to stand in for HTablePool for
> > buffered writing. FWIW, I've not used it.
> >
> > 1:
> >
> >
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableMultiplexer.html
> >
> >
> > On Fri, Dec 19, 2014 at 9:00 AM, Nick Dimiduk <nd...@apache.org>
> wrote:
> > >
> > > Hi Aaron,
> > >
> > > Your analysis is spot on and I do not believe this is by design. I see
> > the
> > > write buffer is owned by the table, while I would have expected there
> to
> > be
> > > a buffer per table all managed by the connection. I suggest you raise a
> > > blocker ticket vs the 1.0.0 release that's just around the corner to
> give
> > > this the attention it needs. Let me know if you're not into JIRA, I can
> > > raise one on your behalf.
> > >
> > > cc Lars, Enis.
> > >
> > > Nice work Aaron.
> > > -n
> > >
> > > On Wed, Dec 17, 2014 at 6:44 PM, Aaron Beppu <ab...@siftscience.com>
> > > wrote:
> > > >
> > > > Hi All,
> > > >
> > > > TLDR; in the absence of HTablePool, if HTable instances are
> > short-lived,
> > > > how should clients use buffered writes?
> > > >
> > > > I’m working on migrating a codebase from using 0.94.6 (CDH4.4) to
> > 0.98.6
> > > > (CDH5.2). One issue I’m confused by is how to effectively use
> buffered
> > > > writes now that HTablePool has been deprecated[1].
> > > >
> > > > In our 0.94 code, a pathway could get a table from the pool,
> configure
> > it
> > > > with table.setAutoFlush(false); and write Puts to it. Those writes
> > would
> > > > then go to the table instance’s writeBuffer, and those writes would
> > only
> > > be
> > > > flushed when the buffer was full, or when we were ready to close out
> > the
> > > > pool. We were intentionally choosing to have fewer, larger writes
> from
> > > the
> > > > client to the cluster, and we knew we were giving up a degree of
> safety
> > > in
> > > > exchange (i.e. if the client dies after it’s accepted a write but
> > before
> > > > the flush for that write occurs, the data is lost). This seems to be
> a
> > > > generally considered a reasonable choice (cf the HBase Book [2] SS
> > > 14.8.4)
> > > >
> > > > However in the 0.98 world, without HTablePool, the endorsed pattern
> [3]
> > > > seems to be to create a new HTable via table =
> > > > stashedHConnection.getTable(tableName, myExecutorService). However,
> > even
> > > if
> > > > we do table.setAutoFlush(false), because that table instance is
> > > > short-lived, its buffer never gets full. We’ll create a table
> instance,
> > > > write a put to it, try to close the table, and the close call will
> > > trigger
> > > > a (synchronous) flush. Thus, not having HTablePool seems like it
> would
> > > > cause us to have many more small writes from the client to the
> cluster,
> > > and
> > > > basically wipe out the advantage of turning off autoflush.
> > > >
> > > > More concretely :
> > > >
> > > > // Given these two helpers ...
> > > >
> > > > private HTableInterface getAutoFlushTable(String tableName) throws
> > > > IOException {
> > > >   // (autoflush is true by default)
> > > >   return storedConnection.getTable(tableName, executorService);
> > > > }
> > > >
> > > > private HTableInterface getBufferedTable(String tableName) throws
> > > > IOException {
> > > >   HTableInterface table = getAutoFlushTable(tableName);
> > > >   table.setAutoFlush(false);
> > > >   return table;
> > > > }
> > > >
> > > > // it's my contention that these two methods would behave almost
> > > > identically,
> > > > // except the first will hit a synchronous flush during the put call,
> > > > and the second will
> > > > // flush during the (hidden) close call on table.
> > > >
> > > > private void writeAutoFlushed(Put somePut) throws IOException {
> > > >   try (HTableInterface table = getAutoFlushTable(tableName)) {
> > > >     table.put(somePut); // will do synchronous flush
> > > >   }
> > > > }
> > > >
> > > > private void writeBuffered(Put somePut) throws IOException {
> > > >   try (HTableInterface table = getBufferedTable(tableName)) {
> > > >     table.put(somePut);
> > > >   } // auto-close will trigger synchronous flush
> > > > }
> > > >
> > > > It seems like the only way to avoid this is to have long-lived HTable
> > > > instances, which get reused for multiple writes. However, since the
> > > actual
> > > > writes are driven from highly concurrent code, and since HTable is
> not
> > > > threadsafe, this would involve having a number of HTable instances,
> > and a
> > > > control mechanism for leasing them out to individual threads safely.
> > > Except
> > > > at this point it seems like we will have recreated HTablePool, which
> > > > suggests that we’re doing something deeply wrong.
> > > >
> > > > What am I missing here? Since the HTableInterface.setAutoFlush method
> > > still
> > > > exists, it must be anticipated that users will still want to buffer
> > > writes.
> > > > What’s the recommended way to actually buffer a meaningful number of
> > > > writes, from a multithreaded context, that doesn’t just amount to
> > > creating
> > > > a table pool?
> > > >
> > > > Thanks in advance,
> > > > Aaron
> > > >
> > > > [1] https://issues.apache.org/jira/browse/HBASE-6580
> > > > [2] http://hbase.apache.org/book/perf.writing.html
> > > > [3]
> > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/browse/HBASE-6580?focusedCommentId=13501302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13501302
> > > > 
> > > >
> > >
> >
> >
> > --
> > Best regards,
> >
> >    - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > (via Tom White)
> >
>


-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: Efficient use of buffered writes in a post-HTablePool world?

Posted by Aaron Beppu <ab...@siftscience.com>.

Nick : Thanks, I've created an issue [1].

Pradeep : Yes, I have considered using that. However for the moment, we've
set it out of scope, since our migration from 0.94 -> 0.98 is already a bit
complicated, and we hoped to separate isolate these changes by not moving
to the async client until after the current migration is complete.

Andrew : HTableMultiplexer does seem like it would solve our buffered write
problem, albeit in an awkward way -- thanks! It kind of seems like HTable
should then (if autoFlush == false) send writes to the multiplexer, rather
than setting it in its own, short-lived writeBuffer. If nothing else, it's
still super confusing that HTableInterface exposes setAutoFlush() and
setWriteBufferSize(), given that the writeBuffer won't meaningfully buffer
anything if all tables are short-lived.

[1] https://issues.apache.org/jira/browse/HBASE-12728

On Fri, Dec 19, 2014 at 10:31 AM, Andrew Purtell <ap...@apache.org>
wrote:
>
> I believe HTableMultiplexer[1] is meant to stand in for HTablePool for
> buffered writing. FWIW, I've not used it.
>
> 1:
>
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableMultiplexer.html
>
>
> On Fri, Dec 19, 2014 at 9:00 AM, Nick Dimiduk <nd...@apache.org> wrote:
> >
> > Hi Aaron,
> >
> > Your analysis is spot on and I do not believe this is by design. I see
> the
> > write buffer is owned by the table, while I would have expected there to
> be
> > a buffer per table all managed by the connection. I suggest you raise a
> > blocker ticket vs the 1.0.0 release that's just around the corner to give
> > this the attention it needs. Let me know if you're not into JIRA, I can
> > raise one on your behalf.
> >
> > cc Lars, Enis.
> >
> > Nice work Aaron.
> > -n
> >
> > On Wed, Dec 17, 2014 at 6:44 PM, Aaron Beppu <ab...@siftscience.com>
> > wrote:
> > >
> > > Hi All,
> > >
> > > TLDR; in the absence of HTablePool, if HTable instances are
> short-lived,
> > > how should clients use buffered writes?
> > >
> > > I’m working on migrating a codebase from using 0.94.6 (CDH4.4) to
> 0.98.6
> > > (CDH5.2). One issue I’m confused by is how to effectively use buffered
> > > writes now that HTablePool has been deprecated[1].
> > >
> > > In our 0.94 code, a pathway could get a table from the pool, configure
> it
> > > with table.setAutoFlush(false); and write Puts to it. Those writes
> would
> > > then go to the table instance’s writeBuffer, and those writes would
> only
> > be
> > > flushed when the buffer was full, or when we were ready to close out
> the
> > > pool. We were intentionally choosing to have fewer, larger writes from
> > the
> > > client to the cluster, and we knew we were giving up a degree of safety
> > in
> > > exchange (i.e. if the client dies after it’s accepted a write but
> before
> > > the flush for that write occurs, the data is lost). This seems to be a
> > > generally considered a reasonable choice (cf the HBase Book [2] SS
> > 14.8.4)
> > >
> > > However in the 0.98 world, without HTablePool, the endorsed pattern [3]
> > > seems to be to create a new HTable via table =
> > > stashedHConnection.getTable(tableName, myExecutorService). However,
> even
> > if
> > > we do table.setAutoFlush(false), because that table instance is
> > > short-lived, its buffer never gets full. We’ll create a table instance,
> > > write a put to it, try to close the table, and the close call will
> > trigger
> > > a (synchronous) flush. Thus, not having HTablePool seems like it would
> > > cause us to have many more small writes from the client to the cluster,
> > and
> > > basically wipe out the advantage of turning off autoflush.
> > >
> > > More concretely :
> > >
> > > // Given these two helpers ...
> > >
> > > private HTableInterface getAutoFlushTable(String tableName) throws
> > > IOException {
> > >   // (autoflush is true by default)
> > >   return storedConnection.getTable(tableName, executorService);
> > > }
> > >
> > > private HTableInterface getBufferedTable(String tableName) throws
> > > IOException {
> > >   HTableInterface table = getAutoFlushTable(tableName);
> > >   table.setAutoFlush(false);
> > >   return table;
> > > }
> > >
> > > // it's my contention that these two methods would behave almost
> > > identically,
> > > // except the first will hit a synchronous flush during the put call,
> > > and the second will
> > > // flush during the (hidden) close call on table.
> > >
> > > private void writeAutoFlushed(Put somePut) throws IOException {
> > >   try (HTableInterface table = getAutoFlushTable(tableName)) {
> > >     table.put(somePut); // will do synchronous flush
> > >   }
> > > }
> > >
> > > private void writeBuffered(Put somePut) throws IOException {
> > >   try (HTableInterface table = getBufferedTable(tableName)) {
> > >     table.put(somePut);
> > >   } // auto-close will trigger synchronous flush
> > > }
> > >
> > > It seems like the only way to avoid this is to have long-lived HTable
> > > instances, which get reused for multiple writes. However, since the
> > actual
> > > writes are driven from highly concurrent code, and since HTable is not
> > > threadsafe, this would involve having a number of HTable instances,
> and a
> > > control mechanism for leasing them out to individual threads safely.
> > Except
> > > at this point it seems like we will have recreated HTablePool, which
> > > suggests that we’re doing something deeply wrong.
> > >
> > > What am I missing here? Since the HTableInterface.setAutoFlush method
> > still
> > > exists, it must be anticipated that users will still want to buffer
> > writes.
> > > What’s the recommended way to actually buffer a meaningful number of
> > > writes, from a multithreaded context, that doesn’t just amount to
> > creating
> > > a table pool?
> > >
> > > Thanks in advance,
> > > Aaron
> > >
> > > [1] https://issues.apache.org/jira/browse/HBASE-6580
> > > [2] http://hbase.apache.org/book/perf.writing.html
> > > [3]
> > >
> > >
> >
> https://issues.apache.org/jira/browse/HBASE-6580?focusedCommentId=13501302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13501302
> > > 
> > >
> >
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>

Re: Efficient use of buffered writes in a post-HTablePool world?

Posted by Andrew Purtell <ap...@apache.org>.

I believe HTableMultiplexer[1] is meant to stand in for HTablePool for
buffered writing. FWIW, I've not used it.

1:
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableMultiplexer.html


On Fri, Dec 19, 2014 at 9:00 AM, Nick Dimiduk <nd...@apache.org> wrote:
>
> Hi Aaron,
>
> Your analysis is spot on and I do not believe this is by design. I see the
> write buffer is owned by the table, while I would have expected there to be
> a buffer per table all managed by the connection. I suggest you raise a
> blocker ticket vs the 1.0.0 release that's just around the corner to give
> this the attention it needs. Let me know if you're not into JIRA, I can
> raise one on your behalf.
>
> cc Lars, Enis.
>
> Nice work Aaron.
> -n
>
> On Wed, Dec 17, 2014 at 6:44 PM, Aaron Beppu <ab...@siftscience.com>
> wrote:
> >
> > Hi All,
> >
> > TLDR; in the absence of HTablePool, if HTable instances are short-lived,
> > how should clients use buffered writes?
> >
> > I’m working on migrating a codebase from using 0.94.6 (CDH4.4) to 0.98.6
> > (CDH5.2). One issue I’m confused by is how to effectively use buffered
> > writes now that HTablePool has been deprecated[1].
> >
> > In our 0.94 code, a pathway could get a table from the pool, configure it
> > with table.setAutoFlush(false); and write Puts to it. Those writes would
> > then go to the table instance’s writeBuffer, and those writes would only
> be
> > flushed when the buffer was full, or when we were ready to close out the
> > pool. We were intentionally choosing to have fewer, larger writes from
> the
> > client to the cluster, and we knew we were giving up a degree of safety
> in
> > exchange (i.e. if the client dies after it’s accepted a write but before
> > the flush for that write occurs, the data is lost). This seems to be a
> > generally considered a reasonable choice (cf the HBase Book [2] SS
> 14.8.4)
> >
> > However in the 0.98 world, without HTablePool, the endorsed pattern [3]
> > seems to be to create a new HTable via table =
> > stashedHConnection.getTable(tableName, myExecutorService). However, even
> if
> > we do table.setAutoFlush(false), because that table instance is
> > short-lived, its buffer never gets full. We’ll create a table instance,
> > write a put to it, try to close the table, and the close call will
> trigger
> > a (synchronous) flush. Thus, not having HTablePool seems like it would
> > cause us to have many more small writes from the client to the cluster,
> and
> > basically wipe out the advantage of turning off autoflush.
> >
> > More concretely :
> >
> > // Given these two helpers ...
> >
> > private HTableInterface getAutoFlushTable(String tableName) throws
> > IOException {
> >   // (autoflush is true by default)
> >   return storedConnection.getTable(tableName, executorService);
> > }
> >
> > private HTableInterface getBufferedTable(String tableName) throws
> > IOException {
> >   HTableInterface table = getAutoFlushTable(tableName);
> >   table.setAutoFlush(false);
> >   return table;
> > }
> >
> > // it's my contention that these two methods would behave almost
> > identically,
> > // except the first will hit a synchronous flush during the put call,
> > and the second will
> > // flush during the (hidden) close call on table.
> >
> > private void writeAutoFlushed(Put somePut) throws IOException {
> >   try (HTableInterface table = getAutoFlushTable(tableName)) {
> >     table.put(somePut); // will do synchronous flush
> >   }
> > }
> >
> > private void writeBuffered(Put somePut) throws IOException {
> >   try (HTableInterface table = getBufferedTable(tableName)) {
> >     table.put(somePut);
> >   } // auto-close will trigger synchronous flush
> > }
> >
> > It seems like the only way to avoid this is to have long-lived HTable
> > instances, which get reused for multiple writes. However, since the
> actual
> > writes are driven from highly concurrent code, and since HTable is not
> > threadsafe, this would involve having a number of HTable instances, and a
> > control mechanism for leasing them out to individual threads safely.
> Except
> > at this point it seems like we will have recreated HTablePool, which
> > suggests that we’re doing something deeply wrong.
> >
> > What am I missing here? Since the HTableInterface.setAutoFlush method
> still
> > exists, it must be anticipated that users will still want to buffer
> writes.
> > What’s the recommended way to actually buffer a meaningful number of
> > writes, from a multithreaded context, that doesn’t just amount to
> creating
> > a table pool?
> >
> > Thanks in advance,
> > Aaron
> >
> > [1] https://issues.apache.org/jira/browse/HBASE-6580
> > [2] http://hbase.apache.org/book/perf.writing.html
> > [3]
> >
> >
> https://issues.apache.org/jira/browse/HBASE-6580?focusedCommentId=13501302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13501302
> > 
> >
>


-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: Efficient use of buffered writes in a post-HTablePool world?

Posted by Pradeep Gollakota <pr...@gmail.com>.

Hi Aaron,

Just out of curiosity, have you considered using asynchbase?
https://github.com/OpenTSDB/asynchbase


On Fri, Dec 19, 2014 at 9:00 AM, Nick Dimiduk <nd...@apache.org> wrote:

> Hi Aaron,
>
> Your analysis is spot on and I do not believe this is by design. I see the
> write buffer is owned by the table, while I would have expected there to be
> a buffer per table all managed by the connection. I suggest you raise a
> blocker ticket vs the 1.0.0 release that's just around the corner to give
> this the attention it needs. Let me know if you're not into JIRA, I can
> raise one on your behalf.
>
> cc Lars, Enis.
>
> Nice work Aaron.
> -n
>
> On Wed, Dec 17, 2014 at 6:44 PM, Aaron Beppu <ab...@siftscience.com>
> wrote:
> >
> > Hi All,
> >
> > TLDR; in the absence of HTablePool, if HTable instances are short-lived,
> > how should clients use buffered writes?
> >
> > I’m working on migrating a codebase from using 0.94.6 (CDH4.4) to 0.98.6
> > (CDH5.2). One issue I’m confused by is how to effectively use buffered
> > writes now that HTablePool has been deprecated[1].
> >
> > In our 0.94 code, a pathway could get a table from the pool, configure it
> > with table.setAutoFlush(false); and write Puts to it. Those writes would
> > then go to the table instance’s writeBuffer, and those writes would only
> be
> > flushed when the buffer was full, or when we were ready to close out the
> > pool. We were intentionally choosing to have fewer, larger writes from
> the
> > client to the cluster, and we knew we were giving up a degree of safety
> in
> > exchange (i.e. if the client dies after it’s accepted a write but before
> > the flush for that write occurs, the data is lost). This seems to be a
> > generally considered a reasonable choice (cf the HBase Book [2] SS
> 14.8.4)
> >
> > However in the 0.98 world, without HTablePool, the endorsed pattern [3]
> > seems to be to create a new HTable via table =
> > stashedHConnection.getTable(tableName, myExecutorService). However, even
> if
> > we do table.setAutoFlush(false), because that table instance is
> > short-lived, its buffer never gets full. We’ll create a table instance,
> > write a put to it, try to close the table, and the close call will
> trigger
> > a (synchronous) flush. Thus, not having HTablePool seems like it would
> > cause us to have many more small writes from the client to the cluster,
> and
> > basically wipe out the advantage of turning off autoflush.
> >
> > More concretely :
> >
> > // Given these two helpers ...
> >
> > private HTableInterface getAutoFlushTable(String tableName) throws
> > IOException {
> >   // (autoflush is true by default)
> >   return storedConnection.getTable(tableName, executorService);
> > }
> >
> > private HTableInterface getBufferedTable(String tableName) throws
> > IOException {
> >   HTableInterface table = getAutoFlushTable(tableName);
> >   table.setAutoFlush(false);
> >   return table;
> > }
> >
> > // it's my contention that these two methods would behave almost
> > identically,
> > // except the first will hit a synchronous flush during the put call,
> > and the second will
> > // flush during the (hidden) close call on table.
> >
> > private void writeAutoFlushed(Put somePut) throws IOException {
> >   try (HTableInterface table = getAutoFlushTable(tableName)) {
> >     table.put(somePut); // will do synchronous flush
> >   }
> > }
> >
> > private void writeBuffered(Put somePut) throws IOException {
> >   try (HTableInterface table = getBufferedTable(tableName)) {
> >     table.put(somePut);
> >   } // auto-close will trigger synchronous flush
> > }
> >
> > It seems like the only way to avoid this is to have long-lived HTable
> > instances, which get reused for multiple writes. However, since the
> actual
> > writes are driven from highly concurrent code, and since HTable is not
> > threadsafe, this would involve having a number of HTable instances, and a
> > control mechanism for leasing them out to individual threads safely.
> Except
> > at this point it seems like we will have recreated HTablePool, which
> > suggests that we’re doing something deeply wrong.
> >
> > What am I missing here? Since the HTableInterface.setAutoFlush method
> still
> > exists, it must be anticipated that users will still want to buffer
> writes.
> > What’s the recommended way to actually buffer a meaningful number of
> > writes, from a multithreaded context, that doesn’t just amount to
> creating
> > a table pool?
> >
> > Thanks in advance,
> > Aaron
> >
> > [1] https://issues.apache.org/jira/browse/HBASE-6580
> > [2] http://hbase.apache.org/book/perf.writing.html
> > [3]
> >
> >
> https://issues.apache.org/jira/browse/HBASE-6580?focusedCommentId=13501302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13501302
> > 
> >
>

Re: Efficient use of buffered writes in a post-HTablePool world?

Posted by Nick Dimiduk <nd...@apache.org>.

Hi Aaron,

Your analysis is spot on and I do not believe this is by design. I see the
write buffer is owned by the table, while I would have expected there to be
a buffer per table all managed by the connection. I suggest you raise a
blocker ticket vs the 1.0.0 release that's just around the corner to give
this the attention it needs. Let me know if you're not into JIRA, I can
raise one on your behalf.

cc Lars, Enis.

Nice work Aaron.
-n

On Wed, Dec 17, 2014 at 6:44 PM, Aaron Beppu <ab...@siftscience.com> wrote:
>
> Hi All,
>
> TLDR; in the absence of HTablePool, if HTable instances are short-lived,
> how should clients use buffered writes?
>
> I’m working on migrating a codebase from using 0.94.6 (CDH4.4) to 0.98.6
> (CDH5.2). One issue I’m confused by is how to effectively use buffered
> writes now that HTablePool has been deprecated[1].
>
> In our 0.94 code, a pathway could get a table from the pool, configure it
> with table.setAutoFlush(false); and write Puts to it. Those writes would
> then go to the table instance’s writeBuffer, and those writes would only be
> flushed when the buffer was full, or when we were ready to close out the
> pool. We were intentionally choosing to have fewer, larger writes from the
> client to the cluster, and we knew we were giving up a degree of safety in
> exchange (i.e. if the client dies after it’s accepted a write but before
> the flush for that write occurs, the data is lost). This seems to be a
> generally considered a reasonable choice (cf the HBase Book [2] SS 14.8.4)
>
> However in the 0.98 world, without HTablePool, the endorsed pattern [3]
> seems to be to create a new HTable via table =
> stashedHConnection.getTable(tableName, myExecutorService). However, even if
> we do table.setAutoFlush(false), because that table instance is
> short-lived, its buffer never gets full. We’ll create a table instance,
> write a put to it, try to close the table, and the close call will trigger
> a (synchronous) flush. Thus, not having HTablePool seems like it would
> cause us to have many more small writes from the client to the cluster, and
> basically wipe out the advantage of turning off autoflush.
>
> More concretely :
>
> // Given these two helpers ...
>
> private HTableInterface getAutoFlushTable(String tableName) throws
> IOException {
>   // (autoflush is true by default)
>   return storedConnection.getTable(tableName, executorService);
> }
>
> private HTableInterface getBufferedTable(String tableName) throws
> IOException {
>   HTableInterface table = getAutoFlushTable(tableName);
>   table.setAutoFlush(false);
>   return table;
> }
>
> // it's my contention that these two methods would behave almost
> identically,
> // except the first will hit a synchronous flush during the put call,
> and the second will
> // flush during the (hidden) close call on table.
>
> private void writeAutoFlushed(Put somePut) throws IOException {
>   try (HTableInterface table = getAutoFlushTable(tableName)) {
>     table.put(somePut); // will do synchronous flush
>   }
> }
>
> private void writeBuffered(Put somePut) throws IOException {
>   try (HTableInterface table = getBufferedTable(tableName)) {
>     table.put(somePut);
>   } // auto-close will trigger synchronous flush
> }
>
> It seems like the only way to avoid this is to have long-lived HTable
> instances, which get reused for multiple writes. However, since the actual
> writes are driven from highly concurrent code, and since HTable is not
> threadsafe, this would involve having a number of HTable instances, and a
> control mechanism for leasing them out to individual threads safely. Except
> at this point it seems like we will have recreated HTablePool, which
> suggests that we’re doing something deeply wrong.
>
> What am I missing here? Since the HTableInterface.setAutoFlush method still
> exists, it must be anticipated that users will still want to buffer writes.
> What’s the recommended way to actually buffer a meaningful number of
> writes, from a multithreaded context, that doesn’t just amount to creating
> a table pool?
>
> Thanks in advance,
> Aaron
>
> [1] https://issues.apache.org/jira/browse/HBASE-6580
> [2] http://hbase.apache.org/book/perf.writing.html
> [3]
>
> https://issues.apache.org/jira/browse/HBASE-6580?focusedCommentId=13501302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13501302
> 
>