Posted to issues@spark.apache.org by "Jungtaek Lim (Jira)" <ji...@apache.org> on 2020/12/11 04:48:00 UTC

[jira] [Closed] (SPARK-30946) FileStreamSourceLog/FileStreamSinkLog: leverage UnsafeRow type to serialize/deserialize entry

     [ https://issues.apache.org/jira/browse/SPARK-30946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim closed SPARK-30946.
--------------------------------

> FileStreamSourceLog/FileStreamSinkLog: leverage UnsafeRow type to serialize/deserialize entry
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-30946
>                 URL: https://issues.apache.org/jira/browse/SPARK-30946
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.1.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>
> HDFSMetadataLog and its descendants normally use JSON serialization to serialize/deserialize entries.
> While JSON is good for backward compatibility (field addition and field deletion), it carries a bunch of overhead: every entry repeats its field names, and all data types have to be rendered as strings (at least when written to file), which works poorly for some kinds of fields, e.g. timestamps.
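>
> To illustrate, here's a minimal sketch of the current JSON-per-line scheme. It assumes json4s's Serialization (which CompactibleFileStreamLog uses) and the SinkFileStatus fields as they exist in Spark 3.x; the values are made up:
> {code:scala}
> import org.json4s.NoTypeHints
> import org.json4s.jackson.Serialization
>
> case class SinkFileStatus(
>     path: String, size: Long, isDir: Boolean, modificationTime: Long,
>     blockReplication: Int, blockSize: Long, action: String)
>
> implicit val formats = Serialization.formats(NoTypeHints)
>
> val entry = SinkFileStatus("/out/part-00000", 1024L, false,
>   1583000000000L, 1, 134217728L, "add")
>
> // Each entry repeats every field name and renders values as text;
> // compaction rewrites all such lines into a single file.
> val line = Serialization.write(entry)
> // => {"path":"/out/part-00000","size":1024,"isDir":false, ...}
> val restored = Serialization.read[SinkFileStatus](line)
> {code}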
> The overhead hits CompactibleFileStreamLog hardest, as the "compact" operation has to load all entries, apply transformation/filtering (I haven't seen any effective operation implemented there, though), and store them all together in one file. This is one of the major reasons the metadata file grows so huge and the "compact" operation incurs unacceptable latency.
> Fortunately, the entry classes for both FileStreamSourceLog (FileEntry) and FileStreamSinkLog (SinkFileStatus) haven't been modified for over 3 years; the latest modification to both classes was in 2016. We can call them "stable", which gives us more options for optimizing serde.
> One easy but pretty effective approach to optimizing serde is converting each entry to UnsafeRow and storing it the same way we do in HDFSBackedStateStoreProvider, and vice versa for reads. That mechanism has been running throughout the 2.x versions, so the approach is considered safe.
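>
> A minimal sketch of that round-trip, assuming Spark 3.x's ExpressionEncoder API (createSerializer/createDeserializer) and reusing SinkFileStatus and entry from the sketch above. The actual store format in HDFSBackedStateStoreProvider additionally prefixes each row with its size in bytes:
> {code:scala}
> import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
> import org.apache.spark.sql.catalyst.expressions.UnsafeRow
>
> val encoder = ExpressionEncoder[SinkFileStatus]()
> val toRow = encoder.createSerializer()
> val fromRow = encoder.resolveAndBind().createDeserializer()
>
> // Serialize: a fixed-layout byte array with no field names and no
> // string conversion. copy() because the serializer reuses its buffer.
> val unsafeRow = toRow(entry).asInstanceOf[UnsafeRow].copy()
> val bytes = unsafeRow.getBytes
>
> // Deserialize: point an empty UnsafeRow at the bytes and decode.
> val row = new UnsafeRow(encoder.schema.length)
> row.pointTo(bytes, bytes.length)
> val decoded = fromRow(row)
> {code}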



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org