Posted to dev@cassandra.apache.org by Dikang Gu <di...@gmail.com> on 2017/10/04 20:27:19 UTC

Cassandra pluggable storage engine (update)

Hello C* developers:

In my previous email (
https://www.mail-archive.com/dev@cassandra.apache.org/msg11024.html), I
announced that Instagram was kicking off a project to make C*'s storage
engine pluggable, like other modern databases such as MySQL and MongoDB, so
that users can choose the most suitable storage engine for different
workloads, or use different features. In addition, a pluggable storage
engine architecture will improve the modularity of the system and help
increase the testability and reliability of Cassandra.

After months of development and testing, we'd like to share the work we
have done, including the first (draft) version of the C* storage engine API
and the first version of the RocksDB-based storage engine.



For the C* storage engine API, here is the draft version we proposed:
https://docs.google.com/document/d/1PxYm9oXW2jJtSDiZ-SR9O20jud_0jnA-mW7ttp2dVmk/edit.
It contains the APIs for read/write requests, streaming, and table
management. Storage-engine-related functionality, such as the data
encoding/decoding format, on-disk reads and writes, and compaction, will be
taken care of by the storage engine implementation.

Each storage engine is a class, and an instance of that class is stored in
each Keyspace instance, so all the column families within a keyspace share
one storage engine instance.

Once a storage engine instance is created, the Cassandra server issues
commands to it to perform data storage and retrieval tasks, such as opening
a column family, managing column families, and streaming.
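To make that division of labor concrete, here is a minimal Python sketch of what such an engine interface might look like. The class and method names here are hypothetical illustrations, not taken from the actual draft API document:

```python
from abc import ABC, abstractmethod
from typing import Optional


class StorageEngine(ABC):
    """Hypothetical sketch of a pluggable engine interface.

    One engine instance would be attached to a keyspace and shared
    by all of that keyspace's column families."""

    @abstractmethod
    def open_column_family(self, name: str) -> None: ...

    @abstractmethod
    def apply(self, cf: str, key: bytes, value: bytes) -> None: ...

    @abstractmethod
    def get(self, cf: str, key: bytes) -> Optional[bytes]: ...


class InMemoryEngine(StorageEngine):
    """Toy dict-backed engine, standing in for something like RocksDB."""

    def __init__(self) -> None:
        self._cfs: dict = {}

    def open_column_family(self, name: str) -> None:
        self._cfs.setdefault(name, {})

    def apply(self, cf: str, key: bytes, value: bytes) -> None:
        self._cfs[cf][key] = value

    def get(self, cf: str, key: bytes) -> Optional[bytes]:
        return self._cfs[cf].get(key)
```

Under this shape, swapping engines means constructing a different `StorageEngine` subclass for the keyspace; the server code above it does not change.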

How to configure the storage engine for different keyspaces is still open
for discussion. One proposal is to add a storage engine option to the
CREATE KEYSPACE CQL command, and potentially allow the option to be
overridden per C* node in its config file.
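For illustration, the proposed keyspace-level option might look roughly like this. The syntax is hypothetical: the option name and its exact placement are precisely what is still under discussion.

```sql
-- Hypothetical syntax: choose a storage engine when creating a keyspace
CREATE KEYSPACE instagram_data
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    AND storage_engine = 'RocksEngine';
```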

Under that API, we implemented a new storage engine based on RocksDB,
called RocksEngine. In the long term, we want to support most of C*'s
existing features in RocksEngine, and we want to build it progressively.
In the first version of RocksEngine, we support the following features:

   - Most of non-nested data types
   - Table schema
   - Point query
   - Range query
   - Mutations
   - Timestamp
   - TTL
   - Deletions/Cell tombstones
   - Streaming

We do not yet support the following features in the first version:

   - Multi-partition query
   - Nested data types
   - Counters
   - Range tombstone
   - Materialized views
   - Secondary indexes
   - SASI
   - Repair

At this moment, we've implemented the V1 features and deployed them to our
shadow cluster. Using shadow traffic from our production use cases, we saw
a ~3x drop in P99 read latency compared to our C* 2.2 production clusters.
Here are some detailed metrics:
https://docs.google.com/document/d/1DojHPteDPSphO0_N2meZ3zkmqlidRwwe_cJpsXLcp10.


So if you need features that only the existing storage engine provides,
please keep using it. If you want more predictable, lower read latency,
and the features supported by RocksEngine are enough for your use cases,
then RocksEngine could be a fit for you.

The work is 1% finished, and we want to work together with the community to
make it happen. We presented the work at NGCC last week, and we've pushed
the beta version of the pluggable storage engine to the Instagram GitHub
Cassandra repo, on the rocks_3.0 branch (
https://github.com/Instagram/cassandra/tree/rocks_3.0), which is based on
C* 3.0.12. Please feel free to play with it! You can download it and follow
the instructions (
https://github.com/Instagram/cassandra/blob/rocks_3.0/StorageEngine.md) to
try it out in your test environment; your feedback will be very valuable to
us.

Thanks
Dikang.

Re: Cassandra pluggable storage engine (update)

Posted by Dikang Gu <di...@gmail.com>.
Hi DuyHai,

Good point! At this moment, I do not see anything that really prevents us
from having one storage engine type per table; we are using one RocksDB
instance per table anyway. However, we want to do simple things first, and
it's easier for us to have a storage engine per keyspace, for both
development and our internal deployment. We can revisit this choice if
there is strong demand for a storage engine per table.

Thanks
Dikang.

On Wed, Oct 4, 2017 at 1:54 PM, DuyHai Doan <do...@gmail.com> wrote:

> Excellent docs, thanks for the update Dikang.
>
> A question about a design choice, is there any technical reason to specify
> the storage engine at keyspace level rather than table level ?
>
> It's not overly complicated to move all tables sharing the same storage
> engine into the same keyspace but then it makes tables organization
> strongly tied to technical storage engine choice rather than functional
> splitting
>
> Regards
>



-- 
Dikang

Re: Cassandra pluggable storage engine (update)

Posted by DuyHai Doan <do...@gmail.com>.
Excellent docs, thanks for the update Dikang.

A question about a design choice: is there any technical reason to specify
the storage engine at the keyspace level rather than the table level?

It's not overly complicated to move all tables sharing the same storage
engine into the same keyspace, but then table organization becomes strongly
tied to the technical storage engine choice rather than to functional
splitting.

Regards

On Wed, Oct 4, 2017 at 10:47 PM, Dikang Gu <di...@gmail.com> wrote:

> Hi Blake,
>
> Great questions!
>
> 1. Yeah, we implement the encoding algorithms, which could encode C* data
> types into byte array, and keep the same sorting order. Our implementation
> is based on the orderly lib used in HBase,
> https://github.com/ndimiduk/orderly .
> 2. Repair is not supported yet, we are still working on figure out the work
> need to be done to support repair or incremental repair.
>
> Thanks
> Dikang.
>

Re: Cassandra pluggable storage engine (update)

Posted by Dikang Gu <di...@gmail.com>.
Hi Blake,

Great questions!

1. Yeah, we implemented encoding algorithms that encode C* data types into
byte arrays while preserving their sort order. Our implementation is based
on the orderly lib used in HBase: https://github.com/ndimiduk/orderly .
2. Repair is not supported yet; we are still figuring out the work needed
to support repair or incremental repair.
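As a minimal sketch of the idea in point 1 (not the actual orderly code), an order-preserving encoding for a signed 64-bit integer can flip the sign bit of the big-endian representation, so that unsigned lexicographic byte comparison matches signed numeric order:

```python
import struct


def encode_int64(v: int) -> bytes:
    """Order-preserving encoding: big-endian bytes with the sign bit flipped.

    Flipping the sign bit moves negative values below positive ones under
    unsigned byte comparison, so the bytes sort exactly like the integers."""
    b = struct.pack(">q", v)             # signed 64-bit, big-endian
    return bytes([b[0] ^ 0x80]) + b[1:]  # flip the sign bit


# Lexicographic byte order now matches numeric order:
values = [7, -3, 0, 42, -100]
assert sorted(values, key=encode_int64) == sorted(values)
```

A byte-ordered store like RocksDB can then keep clustering order correct by comparing these encoded keys directly, without a custom comparator.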

Thanks
Dikang.

On Wed, Oct 4, 2017 at 1:39 PM, Blake Eggleston <be...@apple.com>
wrote:

> Hi Dikang,
>
> Cool stuff. 2 questions. Based on your presentation at ngcc, it seems like
> rocks db stores things in byte order. Does this mean that you have code
> that makes each of the existing types byte comparable, or is clustering
> order implementation dependent? Also, I don't see anything in the draft api
> that seems to support splitting the data set into arbitrary categories (ie
> repaired and unrepaired data living in the same token range). Is support
> for incremental repair planned for v1?
>
> Thanks,
>
> Blake
>
>


-- 
Dikang

Re: Cassandra pluggable storage engine (update)

Posted by Blake Eggleston <be...@apple.com>.
Hi Dikang,

Cool stuff. Two questions. Based on your presentation at NGCC, it seems like RocksDB stores things in byte order. Does this mean that you have code that makes each of the existing types byte-comparable, or is clustering order implementation-dependent? Also, I don't see anything in the draft API that seems to support splitting the data set into arbitrary categories (i.e. repaired and unrepaired data living in the same token range). Is support for incremental repair planned for v1?

Thanks,

Blake

