Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/02/25 15:14:17 UTC

[GitHub] [spark] HeartSaVioR edited a comment on issue #27694: [SPARK-30946][SS] Serde entry with UnsafeRow on FileStream(Source/Sink)Log with LZ4 compression

HeartSaVioR edited a comment on issue #27694: [SPARK-30946][SS] Serde entry with UnsafeRow on FileStream(Source/Sink)Log with LZ4 compression
URL: https://github.com/apache/spark/pull/27694#issuecomment-590903965
 
 
   Honestly I have been thinking about larger changes, like:
   
   * avoid rewriting all entries on compaction
   * support retention (ideally designed together with the above item, since we won't be able to read all existing entries during compaction)
   * use a tree structure, or at least two kinds of entries ("directory" and "file"), to significantly reduce the path strings stored in entries
   * streamline compaction - instead of loading all entries before running the next operation, iterate per entry: "load an entry -> transform/filter -> store the entry if not filtered out" (see the sketch after this list). This would help reduce driver memory usage during compaction.
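   
   A minimal sketch (not part of this patch) of the iterate-per-entry idea above, assuming a plain line-per-entry text format and hypothetical `Entry`/`parse`/`render`/`shouldRetain` helpers; the real FileStreamSourceLog/FileStreamSinkLog code paths differ:
   
   ```scala
   import java.io.{BufferedReader, BufferedWriter, InputStreamReader, OutputStreamWriter}
   import java.nio.charset.StandardCharsets
   
   import org.apache.hadoop.fs.{FileSystem, Path}
   
   object CompactSketch {
     // Hypothetical entry type standing in for the real metadata log entry classes.
     case class Entry(path: String, size: Long, action: String)
   
     // Streaming compaction sketch: read one entry at a time, filter/transform it,
     // and write survivors straight to the compact batch file, so the driver never
     // materializes the whole metadata log in memory.
     def streamingCompact(
         fs: FileSystem,
         batchFiles: Seq[Path],
         compactFile: Path,
         shouldRetain: Entry => Boolean,   // hypothetical retention predicate
         parse: String => Entry,           // hypothetical line -> entry codec
         render: Entry => String): Unit = {
       val out = new BufferedWriter(
         new OutputStreamWriter(fs.create(compactFile, true), StandardCharsets.UTF_8))
       try {
         batchFiles.foreach { file =>
           val in = new BufferedReader(
             new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))
           try {
             var line = in.readLine()
             while (line != null) {
               val entry = parse(line)
               if (shouldRetain(entry)) {  // filtered-out entries are dropped immediately
                 out.write(render(entry))
                 out.newLine()
               }
               line = in.readLine()
             }
           } finally {
             in.close()
           }
         }
       } finally {
         out.close()
       }
     }
   }
   ```
   
   The point is only that entries are written out as they are read rather than collected into a single in-memory collection first, which keeps peak driver memory proportional to one entry instead of the whole log.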
   
   but I would like to prioritize from the perspective of fewer changes with bigger impact, and make changes incrementally.
   
   This patch brings the fewest changes with a great impact on performance. The above items are orthogonal to this improvement, so they can be addressed on demand later.
