Posted to dev@hbase.apache.org by Jonathan Hsieh <jo...@cloudera.com> on 2012/03/27 20:49:36 UTC

Notes from Nicolas + Amir from Facebook @ Cloudera

People:
Cloudera: Todd, Dave W, Shaneel M, Jonathan H, Himanshu, Greg C, Matteo B
(remote)
FB: Nicolas, Amir

druba - ubase/hstore - transaction processing, through hive-hbase
integration.

hbase team with hdfs team.
- hbase deman was with hadoop

NY - carve out hunk of HBase to work on.

Long term:
real time hive, deep integration.
- beyond just translate to MR job.
- Use in megastore.
- scan kind of halfass, for hive.
- previously point query optimization.
- analytics too long to scan table.
- doing on demand compression.

Edge cases
- finance sector
- gpu cases.

Uptime and availability.
- chaos monkey
- poll all regions

HBase 0.89 - fast region failover.
- down time down to..

Take down rack - test cases

putting data node selection in master.
- on per region basis, hash chain - so assigned secondary and tertiary.

What is Cloudera focus?

HDFS HA story
- Talking to HW -- bookies in HDFS ("public story, but ...")
- logs in hdfs.
- Standby node.
- zk flag - halfass solution. "double fails" not in scope.
- todd: 3 journal daemons, quorum for edits, pluggable journal manager
interface.

Facebook - new data infrastructure
- focus on quality, reliability, visibility.
- upping rolling restart to improve monitoring

HBase - stable depends on use case
- pushing out use cases
- ODS, (soon)
- Puma analytics
- ubase - researchy
- site integrity
- hash out cluster (generic kv store, persistent memcache), multi-tenant
cluster, "photo stuff" (haystack)
- wormhole - backup replication - on hashout cluster, master slave, cross
DC replication.

Replication - talk to Madu

HDFS hard links - on github.
- at data node layer.
- hari m - HW - hard links also. (claims working prototype)

Kannan -

pubsub,
2ndary index.
native c++ thrift client.
open sourcing folly (c++ stl)

- distributed log splitting task manager
- ordering for bulk master operations, eliminate class of problems.

Online schema changes
- high friction to change
- check column descriptor, then table, then configuration.
- tune new features for column family.

FB doesn't care about access control.
- auditing - multi tenancy case.
- specific app servers that will access - perms
- FB will do security at a higher level



-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: Notes from Nicolas + Amir from Facebook @ Cloudera

Posted by Jonathan Hsieh <jo...@cloudera.com>.

An explanation of these rough notes and how they got sent out.

Nicolas from Facebook was in the Palo Alto area from out of town, so we
invited him to drop by Cloudera to meet some of the team at Cloudera
working on HBase during one of our weekly team meetings.  Naturally, we
ended up chatting a bit about the areas our teams are working on at our
respective companies, and a bit about features that were relevant to our
customers / use cases.  The features and ideas we discussed are
already out on public JIRAs, mailing lists, or GitHub repos.

I scribbled some rough notes, and was later asked to send them out to the
attendees.  I intended to send them to our internal team's mailing list
and to the Facebook folks who visited.  Instead, I mistakenly sent it out
to the public dev list.

Jon.

On Tue, Mar 27, 2012 at 12:12 PM, Jonathan Hsieh <jo...@cloudera.com> wrote:

> sorry guys I sent notes out to the wrong list.
>
> Sent from my iPhone
>
> On Mar 27, 2012, at 11:54, Ted Yu <yu...@gmail.com> wrote:
>
> > Can someone explain the notes below ?
> >
> > bq. - hbase deman was with hadoop
> >
> > bq. - scan kind of halfass, for hive.
> >
> > bq. - finace sector
> >
> > Thanks



-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: Notes from Nicolas + Amir from Facebook @ Cloudera

Posted by Jonathan Hsieh <jo...@cloudera.com>.

sorry guys I sent notes out to the wrong list. 

Sent from my iPhone

On Mar 27, 2012, at 11:54, Ted Yu <yu...@gmail.com> wrote:

> Can someone explain the notes below ?
> 
> bq. - hbase deman was with hadoop
> 
> bq. - scan kind of halfass, for hive.
> 
> bq. - finace sector
> 
> Thanks

Re: Notes from Nicolas + Amir from Facebook @ Cloudera

Posted by Ted Yu <yu...@gmail.com>.

Can someone explain the notes below ?

bq. - hbase deman was with hadoop

bq. - scan kind of halfass, for hive.

bq. - finace sector

Thanks
