Posted to dev@s2graph.apache.org by DO YUNG YOON <sh...@gmail.com> on 2016/04/20 03:01:23 UTC

Provide storage which use HBase's native client(not Asynchbase).

Hi All.

Since implementing a storage backend has become easier (I believe), I think it
would be good to have an HBaseStorage that uses HBase's native client.
My reasons for bringing this up are as follows.

1. In some environments, especially certain Hadoop and Spark cluster
distributions, s2core has a Guava version conflict that comes from asynchbase.
  - In many cases it is necessary to process a stream of edges and write them
into HBase directly from the streaming job.
  - Currently, there is no way to specify a version to avoid the above problem.
With the native HBase client, users will be able to select the right version
for their environment.
2. It would be fun to run a benchmark on these two clients.
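
For the Guava conflict in point 1, it can help to first confirm which jar a
class is actually loaded from at runtime. The following is a minimal sketch,
not part of s2graph: the ClassOrigin class and its locate method are
hypothetical names, and com.google.common.base.Stopwatch is used only as an
example of a real Guava class to look up.

```java
import java.security.CodeSource;

/** Reports which jar or directory a class was loaded from (hypothetical helper). */
final class ClassOrigin {

    /** Returns a human-readable description of where className was loaded from. */
    static String locate(String className) {
        try {
            // initialize=false: load the class without running its static initializers
            Class<?> cls = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                // Classes from the JDK's bootstrap loader have no code source
                return "(no code source, e.g. a JDK class)";
            }
            return source.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not loadable: " + e + ")";
        }
    }

    public static void main(String[] args) {
        // Default to one Guava class; pass other class names as arguments if needed
        String[] names = args.length > 0
                ? args
                : new String[] {"com.google.common.base.Stopwatch"};
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

Running this with the same classpath as the failing job should show whether
Guava comes from the cluster distribution or from the application's own
dependencies.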

Any feedback would be appreciated.

Best Regards.
DOYUNG YOON

Re: Provide storage which use HBase's native client(not Asynchbase).

Posted by Jong Wook Kim <jo...@nyu.edu>.
Following up on a month-old discussion:

I don’t think Asynchbase 1.7.1 supports the operation-level RPC timeout, and the related issue #11 <https://github.com/OpenTSDB/asynchbase/issues/11> has been pushed back from v1.5.0 to v1.8.0, so we cannot know when it will land unless we decide to contribute it back to asynchbase ourselves.

The custom asynchbase currently has the following three methods:

- org.hbase.async.Scanner.setRpcTimeout()
- org.hbase.async.GetRequest.setMaxResultsPerColumnFamily()
- org.hbase.async.GetRequest.setRowOffsetPerColumnFamily()

These exist only in SteamShon’s patched version, so I guess we still need to maintain our patched version separately.

To work on S2GRAPH-74, I have to solve the Guava version conflict issue, so I’ll go ahead and replace the 3 jars in s2core/lib with the shaded jar.

Jong Wook


> On May 3, 2016, at 11:38 AM, DO YUNG YOON <sh...@gmail.com> wrote:
> 
> Agree with avoiding unmanaged jars and publishing to Maven Central.
> 
> As far as I remember, we applied a custom patch to control the RPC timeout
> per request, but I guess Asynchbase 1.7.1 also supports it (not sure though).
> Let me check whether we still rely on the custom patch. If we don't need the
> custom patch, I think we should go with
> https://github.com/jongwook/asynchbase-shaded.
> 
> Thanks for your work, Jong Wook. I will update this thread after checking.
> 


Re: Provide storage which use HBase's native client(not Asynchbase).

Posted by DO YUNG YOON <sh...@gmail.com>.
Agree with avoiding unmanaged jars and publishing to Maven Central.

As far as I remember, we applied a custom patch to control the RPC timeout
per request, but I guess Asynchbase 1.7.1 also supports it (not sure though).
Let me check whether we still rely on the custom patch. If we don't need the
custom patch, I think we should go with
https://github.com/jongwook/asynchbase-shaded.

Thanks for your work, Jong Wook. I will update this thread after checking.


On Mon, May 2, 2016 at 1:18 PM Jong Wook Kim <jo...@nyu.edu> wrote:

> So I went ahead and made an asynchbase package that shades Google Guava,
> and thought it was a good chance to:
>
> - avoid pulling duplicate Netty artifacts from two organizations, io.netty
> and org.jboss.netty.
> - remove log4j-over-slf4j from the runtime dependencies: this dependency
> assumes that the application has dropped log4j in favor of slf4j and is
> using some other backend such as logback. Most Apache Hadoop/Spark
> environments unfortunately use log4j, and due to their distributed nature
> it's not easy to switch. As a library, it is better not to enforce a logging
> implementation and to let the application decide.
>
> I first started off making a Gradle project that pulls the official
> Asynchbase 1.7.1 and relocates packages:
> https://github.com/jongwook/asynchbase-shaded
>
> But then I realized that s2graph was using a custom version of Asynchbase
> for RPC timeout, etc., so I ended up making a fork of SteamShon/asynchbase
> and used maven-shade-plugin to create the shaded jar:
> https://github.com/jongwook/asynchbase
>
> Running make && make pom.xml && mvn -DskipTests=true install will produce a
> shaded jar and pom in ~/.m2, provided that protoc 2.5.0 is installed, and
> replacing the jars in s2core/lib/ with the shaded jar seems to work well.
>
> Carrying around unmanaged jars in the repository is not a good idea, so we
> need to consider publishing this version of asynchbase to Maven Central or
> to some Apache repo.
>
> Jong Wook

Re: Provide storage which use HBase's native client(not Asynchbase).

Posted by Jong Wook Kim <jo...@nyu.edu>.
So I went ahead and made an asynchbase package that shades Google Guava,
and thought it was a good chance to:

- avoid pulling duplicate Netty artifacts from two organizations, io.netty
and org.jboss.netty.
- remove log4j-over-slf4j from the runtime dependencies: this dependency
assumes that the application has dropped log4j in favor of slf4j and is
using some other backend such as logback. Most Apache Hadoop/Spark
environments unfortunately use log4j, and due to their distributed nature
it's not easy to switch. As a library, it is better not to enforce a logging
implementation and to let the application decide.

I first started off making a Gradle project that pulls the official
Asynchbase 1.7.1 and relocates packages:
https://github.com/jongwook/asynchbase-shaded

But then I realized that s2graph was using a custom version of Asynchbase
for RPC timeout, etc., so I ended up making a fork of SteamShon/asynchbase
and used maven-shade-plugin to create the shaded jar:
https://github.com/jongwook/asynchbase

Running make && make pom.xml && mvn -DskipTests=true install will produce a
shaded jar and pom in ~/.m2, provided that protoc 2.5.0 is installed, and
replacing the jars in s2core/lib/ with the shaded jar seems to work well.

Carrying around unmanaged jars in the repository is not a good idea, so we
need to consider publishing this version of asynchbase to Maven Central or
to some Apache repo.

Jong Wook



On 21 April 2016 at 10:37, DO YUNG YOON <sh...@gmail.com> wrote:

> Thanks for the suggestions, Jong Wook.
>
> I think shading would solve the version conflict problems, but I am not
> familiar with it. Jong Wook, can you contribute your knowledge on shading? I
> think we should discuss the version conflict problems regardless of whether
> we provide native client storage, so it would be better to open a separate
> thread for that. What do you guys think?
>
> As you mentioned, the native HBase client is an all-blocking API, and what
> we would most easily end up doing is wrapping the blocking API in Scala's
> Future. I was wondering whether anyone would want to use the HBase native
> client rather than asynchbase, but since s2graph's public interfaces are all
> asynchronous, I don't think there is much benefit in using the blocking
> native client.
>
> So are you guys in favor of not providing native client storage? I am +1 on
> providing it, since it is easy and it is always better to offer more options.
> What do others think?
>
> Best Regards
> DOYUNG YOON

Re: Provide storage which use HBase's native client(not Asynchbase).

Posted by DO YUNG YOON <sh...@gmail.com>.
Thanks for the suggestions, Jong Wook.

I think shading would solve the version conflict problems, but I am not
familiar with it. Jong Wook, can you contribute your knowledge on shading? I
think we should discuss the version conflict problems regardless of whether
we provide native client storage, so it would be better to open a separate
thread for that. What do you guys think?

As you mentioned, the native HBase client is an all-blocking API, and what
we would most easily end up doing is wrapping the blocking API in Scala's
Future. I was wondering whether anyone would want to use the HBase native
client rather than asynchbase, but since s2graph's public interfaces are all
asynchronous, I don't think there is much benefit in using the blocking
native client.
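
To illustrate what wrapping a blocking API in a future looks like, here is a
minimal sketch. It is written in Java with CompletableFuture rather than
Scala's Future, and BlockingStore is a hypothetical stand-in for a blocking
client, not an HBase interface. The dedicated thread pool is the part that a
blocking client forces onto the design.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Adapts a blocking lookup to an asynchronous interface (illustrative only). */
final class BlockingToAsync {

    /** Hypothetical blocking storage client; the real HBase API is not modeled here. */
    interface BlockingStore {
        String get(String rowKey);
    }

    private final BlockingStore store;
    private final ExecutorService ioPool;

    BlockingToAsync(BlockingStore store, int threads) {
        this.store = store;
        // Blocking calls occupy a thread each, so they get their own bounded pool
        this.ioPool = Executors.newFixedThreadPool(threads);
    }

    /** Runs the blocking get on the I/O pool and returns immediately with a future. */
    CompletableFuture<String> getAsync(String rowKey) {
        return CompletableFuture.supplyAsync(() -> store.get(rowKey), ioPool);
    }

    void shutdown() {
        ioPool.shutdown();
    }

    public static void main(String[] args) {
        // An in-memory lambda stands in for the blocking client
        BlockingToAsync async = new BlockingToAsync(key -> "value-for-" + key, 4);
        try {
            System.out.println(async.getAsync("row1").join());
        } finally {
            async.shutdown();
        }
    }
}
```

With this approach the number of in-flight requests is capped by the pool
size, which is one reason a natively asynchronous client can be preferable.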

So are you guys in favor of not providing native client storage? I am +1 on
providing it, since it is easy and it is always better to offer more options.
What do others think?

Best Regards
DOYUNG YOON

On Thu, Apr 21, 2016 at 1:17 PM Hyunsung Jo <hy...@gmail.com> wrote:

> It seems like Elasticsearch had similar problems:
> https://www.elastic.co/blog/to-shade-or-not-to-shade

Re: Provide storage which use HBase's native client(not Asynchbase).

Posted by Hyunsung Jo <hy...@gmail.com>.
It seems like Elasticsearch had similar problems:
https://www.elastic.co/blog/to-shade-or-not-to-shade

On Wed, Apr 20, 2016 at 10:36 AM Jong Wook Kim <jo...@nyu.edu> wrote:

> Regarding the Guava version conflict, we should be able to shade the
> dependency into another package, maybe together with the whole of
> asynchbase. Guava versions have been a PITA for many other projects too,
> and the problem is usually avoided this way.
>
> If we can avoid the Asynchbase+Guava issue by shading them, and the only
> interesting reason left to switch is the benchmark, it might not be worth
> going back to the blocking API, as it would require a whole new threading
> design.
>
> Sincerely,
> Jong Wook
>
>
> Sent from my iPhone

Re: Provide storage which use HBase's native client(not Asynchbase).

Posted by Jong Wook Kim <jo...@nyu.edu>.
Regarding the Guava version conflict, we should be able to shade the dependency into another package, maybe together with the whole of asynchbase. Guava versions have been a PITA for many other projects too, and the problem is usually avoided this way.

If we can avoid the Asynchbase+Guava issue by shading them, and the only interesting reason left to switch is the benchmark, it might not be worth going back to the blocking API, as it would require a whole new threading design.

Sincerely,
Jong Wook


Sent from my iPhone

> On Apr 19, 2016, at 9:01 PM, DO YUNG YOON <sh...@gmail.com> wrote:
> 
> Hi All.
> 
> Since implementing a storage backend has become easier (I believe), I think
> it would be good to have an HBaseStorage that uses HBase's native client.
> My reasons for bringing this up are as follows.
> 
> 1. In some environments, especially certain Hadoop and Spark cluster
> distributions, s2core has a Guava version conflict that comes from
> asynchbase.
>  - In many cases it is necessary to process a stream of edges and write them
> into HBase directly from the streaming job.
>  - Currently, there is no way to specify a version to avoid the above problem.
> With the native HBase client, users will be able to select the right version
> for their environment.
> 2. It would be fun to run a benchmark on these two clients.
> 
> Any feedback would be appreciated.
> 
> Best Regards.
> DOYUNG YOON