You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "siying (via GitHub)" <gi...@apache.org> on 2023/10/11 22:49:25 UTC

[PR] [SPARK-45503][SS] RocksDB to use LZ4 [spark]

siying opened a new pull request, #43338:
URL: https://github.com/apache/spark/pull/43338

   ### What changes were proposed in this pull request?
   Turn to use LZ4 in RocksDB state store.
   
   ### Why are the changes needed?
   LZ4 is generally faster than Snappy. That's probably we use LZ4 in changelogs and other places by default. However, we don't change RocksDB's default of Snappy compression style. The RocksDB Team recommend LZ4 or ZSTD and the default is kept to Snappy only for backward compatible reason. We should use LZ4 instead.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Existing tests should fix it. Some benchmarks are run and we see positive improvements.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45503][SS] RocksDB to use LZ4 [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #43338:
URL: https://github.com/apache/spark/pull/43338#discussion_r1360624677


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -2001,8 +2001,10 @@ object SQLConf {
     buildConf("spark.sql.streaming.stateStore.compression.codec")
       .internal()
       .doc("The codec used to compress delta and snapshot files generated by StateStore. " +
-        "By default, Spark provides four codecs: lz4, lzf, snappy, and zstd. You can also " +
-        "use fully qualified class names to specify the codec. Default codec is lz4.")
+        "It is also applied to RocksDB State Store's RocksDB compression type if possible. By " +
+        "default, Spark provides four codecs: lz4, lzf, snappy, and zstd. You can also " +

Review Comment:
   Personally, I wouldn't recommend reusing this conf.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45503][SS] RocksDB to use LZ4 [spark]

Posted by "HeartSaVioR (via GitHub)" <gi...@apache.org>.
HeartSaVioR commented on code in PR #43338:
URL: https://github.com/apache/spark/pull/43338#discussion_r1364507582


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -2001,8 +2001,10 @@ object SQLConf {
     buildConf("spark.sql.streaming.stateStore.compression.codec")
       .internal()
       .doc("The codec used to compress delta and snapshot files generated by StateStore. " +
-        "By default, Spark provides four codecs: lz4, lzf, snappy, and zstd. You can also " +
-        "use fully qualified class names to specify the codec. Default codec is lz4.")
+        "It is also applied to RocksDB State Store's RocksDB compression type if possible. By " +
+        "default, Spark provides four codecs: lz4, lzf, snappy, and zstd. You can also " +

Review Comment:
   Yeah... let's do that. I understand we have too many knobs, but it would be harder to explain if we go with one config for two configs which are not 100% compatible. Also the available codecs are tied with RocksDB and we have to track the change and reflect to the doc. That doesn't sound to me to be trivial - we just want to delegate it to RocksDB doc, like we do for Kafka for a lot of options.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45503][SS] RocksDB to use LZ4 [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on code in PR #43338:
URL: https://github.com/apache/spark/pull/43338#discussion_r1360011693


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -2001,8 +2001,10 @@ object SQLConf {
     buildConf("spark.sql.streaming.stateStore.compression.codec")
       .internal()
       .doc("The codec used to compress delta and snapshot files generated by StateStore. " +
-        "By default, Spark provides four codecs: lz4, lzf, snappy, and zstd. You can also " +
-        "use fully qualified class names to specify the codec. Default codec is lz4.")
+        "It is also applied to RocksDB State Store's RocksDB compression type if possible. By " +
+        "default, Spark provides four codecs: lz4, lzf, snappy, and zstd. You can also " +

Review Comment:
   > ... Spark provides four codecs: lz4, lzf, snappy, and zstd.
   
   seems rocksdb does not support lzf.
   
   if decide to reuse this conf, description needs to udpate



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45503][SS] RocksDB to use LZ4 [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on PR #43338:
URL: https://github.com/apache/spark/pull/43338#issuecomment-1760030083

   Can we make it configurable?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45503][SS] Add Conf to Set RocksDB Compression [spark]

Posted by "HeartSaVioR (via GitHub)" <gi...@apache.org>.
HeartSaVioR commented on PR #43338:
URL: https://github.com/apache/spark/pull/43338#issuecomment-1777981673

   Thanks! Merging to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45503][SS] RocksDB to use LZ4 [spark]

Posted by "siying (via GitHub)" <gi...@apache.org>.
siying commented on PR #43338:
URL: https://github.com/apache/spark/pull/43338#issuecomment-1760132343

   > Can we make it configurable?
   
   It's possible, but I would avoid yet another knob if possible. If you really feel that it should be configurable, how do you think about we reuse `spark.sql.streaming.stateStore.compression.codec`? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45503][SS] RocksDB to use LZ4 [spark]

Posted by "HeartSaVioR (via GitHub)" <gi...@apache.org>.
HeartSaVioR commented on code in PR #43338:
URL: https://github.com/apache/spark/pull/43338#discussion_r1364507582


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -2001,8 +2001,10 @@ object SQLConf {
     buildConf("spark.sql.streaming.stateStore.compression.codec")
       .internal()
       .doc("The codec used to compress delta and snapshot files generated by StateStore. " +
-        "By default, Spark provides four codecs: lz4, lzf, snappy, and zstd. You can also " +
-        "use fully qualified class names to specify the codec. Default codec is lz4.")
+        "It is also applied to RocksDB State Store's RocksDB compression type if possible. By " +
+        "default, Spark provides four codecs: lz4, lzf, snappy, and zstd. You can also " +

Review Comment:
   Yeah... let's do that. I understand we have too many knobs, but it would be harder to explain if we go with one config for effectively two configs which are not 100% compatible. Also the available codecs are tied with RocksDB and we have to track the change and reflect to the doc. That doesn't sound to me to be trivial - we just want to delegate it to RocksDB doc, like we do for Kafka for a lot of options.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45503][SS] RocksDB to use LZ4 [spark]

Posted by "siying (via GitHub)" <gi...@apache.org>.
siying commented on code in PR #43338:
URL: https://github.com/apache/spark/pull/43338#discussion_r1364351163


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -2001,8 +2001,10 @@ object SQLConf {
     buildConf("spark.sql.streaming.stateStore.compression.codec")
       .internal()
       .doc("The codec used to compress delta and snapshot files generated by StateStore. " +
-        "By default, Spark provides four codecs: lz4, lzf, snappy, and zstd. You can also " +
-        "use fully qualified class names to specify the codec. Default codec is lz4.")
+        "It is also applied to RocksDB State Store's RocksDB compression type if possible. By " +
+        "default, Spark provides four codecs: lz4, lzf, snappy, and zstd. You can also " +

Review Comment:
   @HeartSaVioR I can do that if that's what you recommend.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45503][SS] RocksDB to use LZ4 [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on PR #43338:
URL: https://github.com/apache/spark/pull/43338#issuecomment-1760915017

   > > > The RocksDB Team recommend LZ4 or ZSTD ...
   > > 
   > > 
   > > Why choose lz4 instead of zstd? I suppose zstd is a more future-proofing algorithm
   > 
   > ZSTD has good compression ratio but is slower. LZ4 is the fast one with worse compression ratio (which is similar to Snappy). For Spark Structured Streaming, CPU is more of a bottleneck, rather than I/O or space, so LZ4 is a better choice.
   
   Thanks for the explanation, I do agree that lz4 consumes a little less CPU than zstd (even with the level 1), but the CPU/compression ratio is also related to the data content.
   
   > Can we make it configurable?
   
   Big agree, make it configurable so that users can switch to high speed/compression ratio algorithms based on their hardware setup (e.g. we have some clusters with small space SSD on the compute node, and it's likely to offload gzip/zstd codec from the CPU to FPGA or dedicated hardware in the future)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45503][SS] RocksDB to use LZ4 [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on PR #43338:
URL: https://github.com/apache/spark/pull/43338#issuecomment-1759050265

   Something may be beyond the PR scope: 
   
   Spark now uses lz4 for shuffle by default, but we switched it to zstd(level 1) internally for over a year, basically gaining the same e2e performance with ~50% shuffle data size reduction. Maybe it's time to migrate the default `spark.io.compression.codec` from lz4 to zstd in Spark 4.0.0 too?
   
   Also cc @dongjoon-hyun @wangyum @LuciferYang @mridulm 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45503][SS] RocksDB to use LZ4 [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on PR #43338:
URL: https://github.com/apache/spark/pull/43338#issuecomment-1759035172

   > The RocksDB Team recommend LZ4 or ZSTD ...
   
   Why choose lz4 instead of zstd? I suppose zstd is a more future-proofing algorithm


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45503][SS] Add Conf to Set RocksDB Compression [spark]

Posted by "HeartSaVioR (via GitHub)" <gi...@apache.org>.
HeartSaVioR closed pull request #43338: [SPARK-45503][SS] Add Conf to Set RocksDB Compression
URL: https://github.com/apache/spark/pull/43338


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45503][SS] RocksDB to use LZ4 [spark]

Posted by "HeartSaVioR (via GitHub)" <gi...@apache.org>.
HeartSaVioR commented on code in PR #43338:
URL: https://github.com/apache/spark/pull/43338#discussion_r1363088428


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -2001,8 +2001,10 @@ object SQLConf {
     buildConf("spark.sql.streaming.stateStore.compression.codec")
       .internal()
       .doc("The codec used to compress delta and snapshot files generated by StateStore. " +
-        "By default, Spark provides four codecs: lz4, lzf, snappy, and zstd. You can also " +
-        "use fully qualified class names to specify the codec. Default codec is lz4.")
+        "It is also applied to RocksDB State Store's RocksDB compression type if possible. By " +
+        "default, Spark provides four codecs: lz4, lzf, snappy, and zstd. You can also " +

Review Comment:
   If the supported formats do not exactly match, then probably yes, we may need another config specifically for RocksDB.
   @siying WDYT?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45503][SS] RocksDB to use LZ4 [spark]

Posted by "siying (via GitHub)" <gi...@apache.org>.
siying commented on PR #43338:
URL: https://github.com/apache/spark/pull/43338#issuecomment-1762300061

   @dongjoon-hyun here it is: https://github.com/facebook/rocksdb/wiki/Setup-Options-and-Basic-Tuning#compression . I'm updating the description.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45503][SS] Add Conf to Set RocksDB Compression [spark]

Posted by "siying (via GitHub)" <gi...@apache.org>.
siying commented on PR #43338:
URL: https://github.com/apache/spark/pull/43338#issuecomment-1777847399

   Tests all passed now. CC @HeartSaVioR 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45503][SS] RocksDB to use LZ4 [spark]

Posted by "siying (via GitHub)" <gi...@apache.org>.
siying commented on PR #43338:
URL: https://github.com/apache/spark/pull/43338#issuecomment-1759990922

   > > The RocksDB Team recommend LZ4 or ZSTD ...
   > 
   > Why choose lz4 instead of zstd? I suppose zstd is a more future-proofing algorithm
   
   ZSTD has good compression ratio but is slower. LZ4 is the fast one with worse compression ratio (which is similar to Snappy). For Spark Structured Streaming, CPU is more of a bottleneck, rather than I/O or space, so LZ4 is a better choice.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org