Posted to dev@hbase.apache.org by Stack <st...@duboce.net> on 2016/04/22 21:07:27 UTC

Re: Google Summer Of Code 2016

Congrats Talat. You are our GSoC. We'll try and be nice (smile).
St.Ack

On Fri, Mar 25, 2016 at 1:52 PM, Stack <st...@duboce.net> wrote:

> Thanks Talat... I shoved some comments up in it but looks basically sound.
> Thanks for sending it in.
> St.Ack
>
> On Fri, Mar 25, 2016 at 11:09 AM, Talat Uyarer <ta...@uyarer.com> wrote:
>
>> Hi all,
>>
>> I created my GSoC proposal for Block Encoding and Compression for the RPC
>> Layer [1]. I would appreciate it if you could review it and share your
>> comments.
>>
>> [1]
>> https://docs.google.com/document/d/10MEsmGN5UCh6m-de_nhIG5QYnDRTkmwBTLQ0CRmwOMk/edit?usp=sharing
>> [2] https://issues.apache.org/jira/browse/HBASE-15530
>>
>> Thanks
>>
>> On Tue, Mar 22, 2016 at 6:44 PM, Talat Uyarer <ta...@uyarer.com> wrote:
>> > Hi,
>> >
>> > I would appreciate you being my mentor, Stack :) As far as I know, the
>> > ASF already participates and you can sign up [1]. Last year I was a
>> > mentor; I just sent an email to private@ and mentors@community.apache.org.
>> > Would you like to check it?
>> >
>> > [1]
>> https://community.apache.org/gsoc.html#prospective-asf-mentors-read-this
>> >
>> > 2016-03-22 17:32 GMT-07:00 Enis Söztutar <en...@gmail.com>:
>> >>>
>> >>> I didn't sign up for GSOC Talat. Not sure anyone else did either. Is
>> it too
>> >>> late for us to participate now?
>> >>>
>> >>>
>> >> ASF participates in GSOC, so HBase automatically can participate AFAIK.
>> >>
>> >>
>> >>> I'd mentor you (it'd be easy-peasy -- smile) but I think I've missed
>> the
>> >>> mentor signup deadline.
>> >>>
>> >>
>> >> I did not check the deadline. If that is the case, does it mean this
>> >> year is over?
>> >>
>> >> Your list is pretty good. We can PoC with Cap'n Proto as well as gRPC.
>> >>
>> >>
>> >>>
>> >>>
>> >>> > BTW I talked with Enis Soztutar. He offered some topics for GSoC.
>> >>> > These are:
>> >>> > - He mentioned that data blocks are stored with PREFIX, FAST_DIFF,
>> >>> > etc. encodings, but these encodings can only be used in the HFile
>> >>> > context. In RPC and the WAL we use KeyValueEncoding for Cell blocks.
>> >>> > He said, "You can improve them, or use the HFile encodings in RPC
>> >>> > and the WAL." (He didn't give the issue number, but I guessed it is
>> >>> > HBASE-12883, Support block encoding based on knowing set of column
>> >>> > qualifiers up front.)
>> >>> >
>> >>>
>> >>> Sounds like a fine project (Someone was just asking about this
>> offline...)
>> >>>
>> >>>
>> >>>
>> >>> > - HBASE-14379 Replication V2
>> >>> > - HBASE-8691 High-Throughput Streaming Scan API
>> >>> > - HBASE-3529 Native Solr Indexer for HBase (he just mentioned HBase
>> >>> > -> Solr indexing; I guess it could be this issue.)
>> >>> >
>> >>> > Could you help me select a topic, or could you suggest another
>> >>> > issue?
>> >>> >
>> >>> >
>> >>> All above are good.
>> >>>
>> >>> Here are a few others, made for another context:
>> >>>
>> >>> + Become Jepsen distributed systems test tool expert: run it against
>> HBase
>> >>> and HDFS. Analyze results. E.g. see
>> >>>
>> https://www.datastax.com/dev/blog/testing-apache-cassandra-with-jepsen
>> >>> + Deep dive on hbase Compactions. Own it. Review current options both
>> the
>> >>> defaults, experimental, and the stale. Build tooling and surface
>> metrics
>> >>> that give better insight on effectiveness of compaction mechanics and
>> >>> policies. Develop tunings and alternate, new policies. For further
>> credit,
>> >>> develop master-orchestrated compaction algorithm.
>> >>> + Reimplement HBase append and increment as write-only with rollup on
>> read
>> >>> or using CRDTs (
>> >>> https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type)
>> >>> + Make the HBase Server async/event driven/SEDA moving it off its
>> current
>> >>> thread-per-request basis
>> >>> + UI: build out more pages and tabs on the HBase master exposing more
>> of
>> >>> our cluster metrics (make the master into a metrics sink). Extra
>> points for
>> >>> views, histograms, or dashboards that are both informative AND pretty
>> (D3,
>> >>> etc.). A good benchmark would be subsuming the Hannibal tool
>> >>> https://github.com/sentric/hannibal
>> >>> + Build an example application on HBase for test and illustration:
>> e.g. use
>> >>> Jimmy Lin's/The Internet Archive https://github.com/lintool/warcbase
>> to
>> >>> load common crawl regular webcrawls https://commoncrawl.org/ or, load
>> >>> hbase
>> >>> with wikipedia, the flickr dataset, or any dataset that appeals. Extra
>> >>> credit for documenting steps involved and filing issues where API is
>> >>> awkward or hard to follow.
>> >>> + Add actionable statistics to hbase internals that capture vitals
>> about
>> >>> the data being served and that we exploit responding to queries; e.g.
>> rough
>> >>> sizes of rows, column-families, columns-per-row-per-region, etc. For
>> >>> example, if a client has been stepping sequentially through the data,
>> >>> the stats would allow us to recognize this state so we could switch to
>> >>> a different scan type, one that is optimal for a sequential progression.
>> >>> + Review and redo our fundamental merge sort, the basis of our read.
>> There
>> >>> are a few techniques to try such as a "loser tree merge" (
>> >>> http://sandbox.mc.edu/~bennet/cs402/lec/losedex.html) but ideally
>> we'd
>> >>> make
>> >>> our merge sort block-based rather than Cell-based. Set yourself up in
>> a rig
>> >>> and try different Cell formats to get yourself to a cache-friendly
>> Cell
>> >>> format that maximizes instructions per cycle.
>> >>> + Our client is heavy-weight and has accumulated lots of logic over
>> time.
>> >>> E.g. it is hard to set a single timeout for a request because client
>> is
>> >>> layered each with its own running timeouts. At its core is a
>> mostly-done
>> >>> async engine. Review, and finish the async work. Rewrite where it
>> makes
>> >>> sense after analysis.
>> >>> + Our RPC is based on protobuf Service where we plugged in our own RPC
>> >>> transport. An exploratory PoC putting HBase up on grpc was done by
>> the grpc
>> >>> team. Bring this project home. Extra points if you reveal a Streaming
>> >>> Interface between Client and Server.
>> >>> + Tiering... if regions are cold, close them so they don't occupy
>> resources
>> >>> (close files, purge its data from cache...).... reopen when a request
>> comes
>> >>> in....
>> >>> + Dynamic configuration of running HBase
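The CRDT-based append/increment idea above can be illustrated with the simplest CRDT, a grow-only counter: each writer increments only its own slot, and the observed value is a rollup over all slots on read. This is a hypothetical Python sketch of the general technique, not HBase code; the `GCounter` class and node names are invented for illustration.

```python
# Sketch of a G-Counter CRDT: write-only increments, rollup on read.
# Each replica updates only its own slot; merge is a pointwise max,
# which is idempotent, commutative, and associative, so replicas
# converge without coordination.
class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node_id -> locally observed count for that node

    def increment(self, amount=1):
        # A write touches only this replica's own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self):
        # "Rollup on read": the visible value is the sum over all slots.
        return sum(self.counts.values())

    def merge(self, other):
        # Pointwise max: applying a merge twice, or in any order, is safe.
        for node, count in other.counts.items():
            if count > self.counts.get(node, 0):
                self.counts[node] = count

# Two replicas increment independently, then exchange state.
a = GCounter("rs1")
b = GCounter("rs2")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5  # both replicas converge
```

A production design would also need a decrement story (a PN-Counter pairs two G-Counters) and garbage collection of per-writer slots, but the merge-by-max core is what makes the increments commute.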
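The merge-sort idea above is easiest to picture as a k-way merge: one sorted stream per store file or memstore, combined into a single sorted view. The following is an illustrative Python sketch of the common heap-based baseline that a loser-tree variant would be benchmarked against; `merge_sorted` is an invented name, not an HBase API.

```python
import heapq

# Heap-based k-way merge: the baseline structure for merging sorted
# streams. A loser tree is an alternative tournament layout that can
# cut comparisons per pop; either way the interface is the same.
def merge_sorted(streams):
    heap = []
    iters = [iter(s) for s in streams]
    # Seed the heap with the head of each non-empty stream. The tuple
    # (value, stream_index) keeps pops stable across equal values.
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, i))
    while heap:
        value, i = heapq.heappop(heap)
        yield value
        # Refill from the stream that just yielded the smallest head.
        nxt = next(iters[i], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))

assert list(merge_sorted([[1, 4, 7], [2, 5], [3, 6]])) == [1, 2, 3, 4, 5, 6, 7]
```

The block-based variant suggested above would amortize this per-element bookkeeping over whole blocks of Cells rather than paying it on every comparison.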
>> >>>
>> >>>
>> >>> St.Ack
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> > Thanks
>> >>> > --
>> >>> > Talat UYARER
>> >>> >
>> >>>
>> >
>> >
>> >
>> > --
>> > Talat UYARER
>> > Websitesi: http://talat.uyarer.com
>> > Twitter: http://twitter.com/talatuyarer
>> > Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
>> >
>>
>>
>>
>> --
>> Talat UYARER
>> Websitesi: http://talat.uyarer.com
>> Twitter: http://twitter.com/talatuyarer
>> Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
>>
>
>

Re: Google Summer Of Code 2016

Posted by Enis Söztutar <en...@apache.org>.
Cool. Congrats.

Enis

On Fri, Apr 22, 2016 at 2:56 PM, Talat Uyarer <ta...@uyarer.com> wrote:

> Hi all,
> I am really thankful to the Apache HBase community for sharing your ideas
> and accepting my GSoC 2016 proposal. I am especially thankful to Enis for
> sharing a good idea and to Stack for volunteering to mentor my project.
>
> I am really excited to work with you :)
>
> Talat
> On Apr 22, 2016 12:22 PM, "Elliott Clark" <ec...@apache.org> wrote:
>
> > On Fri, Apr 22, 2016 at 12:07 PM, Stack <st...@duboce.net> wrote:
> >
> > > Congrats Talat. You are our GSoC. We'll try and be nice (smile).
> > >
> >
> > Congrats. That's awesome!
> >
>

Re: Google Summer Of Code 2016

Posted by Talat Uyarer <ta...@uyarer.com>.
Hi all,
I am really thankful to the Apache HBase community for sharing your ideas
and accepting my GSoC 2016 proposal. I am especially thankful to Enis for
sharing a good idea and to Stack for volunteering to mentor my project.

I am really excited to work with you :)

Talat
On Apr 22, 2016 12:22 PM, "Elliott Clark" <ec...@apache.org> wrote:

> On Fri, Apr 22, 2016 at 12:07 PM, Stack <st...@duboce.net> wrote:
>
> > Congrats Talat. You are our GSoC. We'll try and be nice (smile).
> >
>
> Congrats. That's awesome!
>

Re: Google Summer Of Code 2016

Posted by Elliott Clark <ec...@apache.org>.
On Fri, Apr 22, 2016 at 12:07 PM, Stack <st...@duboce.net> wrote:

> Congrats Talat. You are our GSoC. We'll try and be nice (smile).
>

Congrats. That's awesome!