Posted to dev@spark.apache.org by Yuanjian Li <xy...@gmail.com> on 2021/04/27 09:41:08 UTC

Re: [DISCUSS] Add RocksDB StateStore

Hi all,

Following the latest comments in SPARK-34198
<https://issues.apache.org/jira/browse/SPARK-34198>, Databricks decided to
donate the commercial implementation of the RocksDBStateStore. Compared
with the original decision, there’s only one topic we want to raise again
for discussion: can we directly add the RockDBStateStoreProvider in the
sql/core module? This suggestion based on the following reasons:

   1. The RocksDBStateStore aims to solve the problems of the original
      HDFSBackedStateStore, which is built-in.

   2. End users can conveniently set the config to use the new implementation.

   3. We can set the RocksDB one as the default one in the future.
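Regarding point 2 above: if the provider lands in sql/core, opting in should be a one-line session config reusing the existing `spark.sql.streaming.stateStore.providerClass` switch. A minimal sketch, assuming the provider class name follows this proposal (the final name may differ):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: select the RocksDB-backed state store through the existing
// state store provider config. The fully qualified class name below is
// an assumption based on this proposal, not a released API.
val spark = SparkSession.builder()
  .appName("rocksdb-state-store-sketch")
  .master("local[2]")
  .config("spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
  .getOrCreate()
```

The same config key is how HDFSBackedStateStoreProvider is selected today, so no new mechanism would be needed.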


Regarding the dependency, I also checked the rocksdbjni package we would
introduce. As a JNI package
<https://repo1.maven.org/maven2/org/rocksdb/rocksdbjni/6.2.2/rocksdbjni-6.2.2.pom>,
it should not cause any dependency conflicts with Apache Spark.
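On the build side, the whole addition would be a single artifact. A build file sketch, using the version from the pom linked above (the version eventually adopted may differ):

```scala
// build.sbt sketch: the only new dependency this proposal introduces.
// rocksdbjni ships the native library inside the jar, so it brings in
// no transitive JVM dependencies that could conflict with Spark's.
libraryDependencies += "org.rocksdb" % "rocksdbjni" % "6.2.2"
```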

Any suggestions are welcome!

Best,

Yuanjian

Reynold Xin <rx...@databricks.com> wrote on Sun, Feb 14, 2021 at 6:54 AM:

> Late +1
>
>
> On Sat, Feb 13 2021 at 2:49 PM, Liang-Chi Hsieh <vi...@gmail.com> wrote:
>
>> Hi devs,
>>
>> Thanks for all the inputs. I think the overall feedback in the Spark
>> community about having the RocksDB state store as an external module is
>> positive. Let's move forward in this direction to improve Structured
>> Streaming. I will keep the JIRA SPARK-34198 updated.
>>
>> Thanks all again for the inputs and discussion.
>>
>> Liang-Chi Hsieh
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> --------------------------------------------------------------------- To
>> unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>

Re: [DISCUSS] Add RocksDB StateStore

Posted by Liang-Chi Hsieh <vi...@gmail.com>.
I am fine with the RocksDB state store as a built-in state store. Actually,
the proposal to have it as an external module was meant to avoid the
concerns raised in the previous effort.

The need to mark it as experimental doesn't necessarily mean it has to be an
external module, I think. They are two separate things. So I don't think the
risk depends much on whether it is an external module or a built-in one,
unless we make it the default state store from the beginning. If it is not
the default, and we explicitly mark it as an experimental feature, the risk
is not very different between an external module and a built-in one. A
built-in one just makes it easier for users to try it.

That said, even though the incoming RocksDB state store has been supported
for years, I think it is safer to treat it as an experimental feature first
as it lands in OSS Spark.

Anyway, I think it is okay to add the RocksDB state store among the built-in
state stores, alongside HDFSBackedStateStore.

I also feel that we could standardize on RocksDB and replace LevelDB with
it. But that is another story.


Liang-Chi


Jungtaek Lim-2 wrote
> I think adding the RocksDB state store to sql/core directly would be OK.
> Personally I also voted "either way is fine with me" regarding a RocksDB
> state store implementation in the Spark ecosystem. The overall stance
> hasn't changed, but I'd like to point out that the risk is now much lower
> than before, given that we can leverage the Databricks RocksDB state store
> implementation.
> 
> I feel there were two major reasons to add the RocksDB state store to the
> external module:
> 
> 1. stability
> 
> Databricks' RocksDB state store implementation has been supported for
> years, so it won't require more time to incubate. We may want to review it
> thoroughly to ensure the open-sourced proposal fits Apache Spark and
> remains stable, but this is much better than the previously targeted
> implementations, which may not have been tested in production for years.
> 
> That makes me think we don't have to put it into the external module and
> treat it as experimental.
> 
> 2. dependency
> 
> From Yuanjian's mail, the JNI library is the only dependency, which seems
> fine to add by default. We already have LevelDB as one of the core
> dependencies and don't worry too much about a JNI library dependency.
> Someone might eventually find outstanding benefits in replacing LevelDB
> with RocksDB, and then RocksDB could even become one of the core
> dependencies.
> 
> On Tue, Apr 27, 2021 at 6:41 PM Yuanjian Li <xy...@gmail.com> wrote:
> 
>> Hi all,
>>
>> Following the latest comments in SPARK-34198
>> <https://issues.apache.org/jira/browse/SPARK-34198>, Databricks decided
>> to donate the commercial implementation of the RocksDBStateStore.
>> Compared with the original decision, there’s only one topic we want to
>> raise again for discussion: can we directly add the
>> RocksDBStateStoreProvider to the sql/core module? This suggestion is
>> based on the following reasons:
>>
>>    1. The RocksDBStateStore aims to solve the problems of the original
>>       HDFSBackedStateStore, which is built-in.
>>
>>    2. End users can conveniently set the config to use the new
>>       implementation.
>>
>>    3. We can set the RocksDB one as the default one in the future.
>>
>>
>> Regarding the dependency, I also checked the rocksdbjni package we would
>> introduce. As a JNI package
>> <https://repo1.maven.org/maven2/org/rocksdb/rocksdbjni/6.2.2/rocksdbjni-6.2.2.pom>,
>> it should not cause any dependency conflicts with Apache Spark.
>>
>> Any suggestions are welcome!
>>
>> Best,
>>
>> Yuanjian







Re: [DISCUSS] Add RocksDB StateStore

Posted by Jungtaek Lim <ka...@gmail.com>.
I think adding the RocksDB state store to sql/core directly would be OK.
Personally I also voted "either way is fine with me" regarding a RocksDB
state store implementation in the Spark ecosystem. The overall stance hasn't
changed, but I'd like to point out that the risk is now much lower than
before, given that we can leverage the Databricks RocksDB state store
implementation.

I feel there were two major reasons to add the RocksDB state store to the
external module:

1. stability

Databricks' RocksDB state store implementation has been supported for years,
so it won't require more time to incubate. We may want to review it
thoroughly to ensure the open-sourced proposal fits Apache Spark and remains
stable, but this is much better than the previously targeted
implementations, which may not have been tested in production for years.

That makes me think we don't have to put it into the external module and
treat it as experimental.

2. dependency

From Yuanjian's mail, the JNI library is the only dependency, which seems
fine to add by default. We already have LevelDB as one of the core
dependencies and don't worry too much about a JNI library dependency.
Someone might eventually find outstanding benefits in replacing LevelDB with
RocksDB, and then RocksDB could even become one of the core dependencies.
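For context, any stateful operator exercises whichever provider is configured, so switching state stores needs no query changes. A hypothetical minimal stateful query (the `rate` source and `groupBy`/`count` are standard Spark; the rest is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").getOrCreate()
import spark.implicits._

// A stateful aggregation: the running per-key counts are exactly what
// the state store (HDFS-backed today, RocksDB-backed if configured) holds.
val counts = spark.readStream
  .format("rate")          // built-in test source: (timestamp, value) rows
  .load()
  .groupBy($"value" % 10)  // ten keys, so ten state entries per partition
  .count()

counts.writeStream
  .format("console")
  .outputMode("complete")  // valid for aggregations; emits full state
  .start()
```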

On Tue, Apr 27, 2021 at 6:41 PM Yuanjian Li <xy...@gmail.com> wrote:

> Hi all,
>
> Following the latest comments in SPARK-34198
> <https://issues.apache.org/jira/browse/SPARK-34198>, Databricks decided
> to donate the commercial implementation of the RocksDBStateStore. Compared
> with the original decision, there’s only one topic we want to raise again
> for discussion: can we directly add the RocksDBStateStoreProvider to the
> sql/core module? This suggestion is based on the following reasons:
>
>    1. The RocksDBStateStore aims to solve the problems of the original
>       HDFSBackedStateStore, which is built-in.
>
>    2. End users can conveniently set the config to use the new
>       implementation.
>
>    3. We can set the RocksDB one as the default one in the future.
>
>
> Regarding the dependency, I also checked the rocksdbjni package we would
> introduce. As a JNI package
> <https://repo1.maven.org/maven2/org/rocksdb/rocksdbjni/6.2.2/rocksdbjni-6.2.2.pom>,
> it should not cause any dependency conflicts with Apache Spark.
>
> Any suggestions are welcome!
>
> Best,
>
> Yuanjian
>
> Reynold Xin <rx...@databricks.com> wrote on Sun, Feb 14, 2021 at 6:54 AM:
>
>> Late +1
>>
>>
>> On Sat, Feb 13 2021 at 2:49 PM, Liang-Chi Hsieh <vi...@gmail.com>
>> wrote:
>>
>>> Hi devs,
>>>
>>> Thanks for all the inputs. I think the overall feedback in the Spark
>>> community about having the RocksDB state store as an external module is
>>> positive. Let's move forward in this direction to improve Structured
>>> Streaming. I will keep the JIRA SPARK-34198 updated.
>>>
>>> Thanks all again for the inputs and discussion.
>>>
>>> Liang-Chi Hsieh
>>>