Posted to issues@ozone.apache.org by "Stephen O'Donnell (Jira)" <ji...@apache.org> on 2022/10/28 21:47:00 UTC

[jira] [Comment Edited] (HDDS-7350) Ozone Transparent Data Compression Support

    [ https://issues.apache.org/jira/browse/HDDS-7350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625596#comment-17625596 ] 

Stephen O'Donnell edited comment on HDDS-7350 at 10/28/22 9:46 PM:
-------------------------------------------------------------------

The attached design document is little more than an overview of the proposed feature. We need to think through some parts of it in more detail. For example:

With a compressed file, can we seek to an offset and start reading there?

Ozone currently writes data in 4MB "chunks". Do we open a new compression stream for each chunk, or just one compression stream for the entire file?

Should the data in a chunk then be 4MB of compressed data, which might represent much more than 4MB of uncompressed data? Or do we keep the chunks at 4MB of uncompressed data, so that they are smaller when they are written to the datanode? That way, we know chunk 1 covers offsets 0 -> 4MB, chunk 2 covers 4MB -> 8MB, etc.

Perhaps the chunk metadata could contain the chunk's uncompressed offset in the file and its uncompressed size. That would allow seeking to a chunk boundary and starting to read a new compression stream from there.
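
To make the per-chunk idea concrete, here is a minimal sketch, not a proposal for the actual implementation. It assumes one independent compression stream per fixed-size uncompressed chunk, and it uses java.util.zip Deflater/Inflater purely as a stand-in codec, since no codec has been chosen. ChunkedCompressionSketch, ChunkInfo and the tiny 4096-byte chunk size (standing in for 4MB) are all hypothetical names and values, not existing Ozone types or settings.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical sketch: one compression stream per fixed-size uncompressed chunk. */
final class ChunkedCompressionSketch {

  /** Hypothetical chunk metadata: where the chunk sits in the uncompressed file. */
  static final class ChunkInfo {
    final long uncompressedOffset;
    final int uncompressedLength;
    final byte[] compressed; // what would actually be stored on the datanode

    ChunkInfo(long uncompressedOffset, int uncompressedLength, byte[] compressed) {
      this.uncompressedOffset = uncompressedOffset;
      this.uncompressedLength = uncompressedLength;
      this.compressed = compressed;
    }
  }

  /** Splits data into chunkSize pieces of uncompressed data and compresses each on its own. */
  static List<ChunkInfo> write(byte[] data, int chunkSize) {
    List<ChunkInfo> chunks = new ArrayList<>();
    for (int off = 0; off < data.length; off += chunkSize) {
      int len = Math.min(chunkSize, data.length - off);
      chunks.add(new ChunkInfo(off, len, deflate(data, off, len)));
    }
    return chunks;
  }

  /** Reads [offset, offset + length) by inflating only the chunks that overlap the range. */
  static byte[] read(List<ChunkInfo> chunks, int chunkSize, long offset, int length) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    long pos = offset;
    long end = offset + length;
    while (pos < end) {
      // Fixed uncompressed chunk size means the chunk index is simple arithmetic.
      int index = (int) (pos / chunkSize);
      if (index >= chunks.size()) {
        break; // past end of file
      }
      ChunkInfo c = chunks.get(index);
      byte[] plain = inflate(c);
      int from = (int) (pos - c.uncompressedOffset);
      int to = (int) Math.min(c.uncompressedLength, end - c.uncompressedOffset);
      if (from >= to) {
        break; // offset lies beyond the data in the final, short chunk
      }
      out.write(plain, from, to - from);
      pos = c.uncompressedOffset + to;
    }
    return out.toByteArray();
  }

  private static byte[] deflate(byte[] data, int off, int len) {
    Deflater deflater = new Deflater();
    deflater.setInput(data, off, len);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[8192];
    while (!deflater.finished()) {
      out.write(buf, 0, deflater.deflate(buf));
    }
    deflater.end();
    return out.toByteArray();
  }

  private static byte[] inflate(ChunkInfo c) {
    Inflater inflater = new Inflater();
    inflater.setInput(c.compressed);
    byte[] plain = new byte[c.uncompressedLength];
    int filled = 0;
    try {
      while (filled < plain.length && !inflater.finished()) {
        int n = inflater.inflate(plain, filled, plain.length - filled);
        if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
          throw new IllegalStateException("truncated or unsupported chunk");
        }
        filled += n;
      }
    } catch (DataFormatException e) {
      throw new IllegalStateException("corrupt chunk", e);
    } finally {
      inflater.end();
    }
    if (filled != plain.length) {
      throw new IllegalStateException("chunk shorter than its metadata claims");
    }
    return plain;
  }

  public static void main(String[] args) {
    byte[] data = new byte[10_000];
    for (int i = 0; i < data.length; i++) {
      data[i] = (byte) (i % 7);
    }
    List<ChunkInfo> chunks = write(data, 4096);
    byte[] got = read(chunks, 4096, 4000, 5000); // range spans three chunks
    System.out.println("match: " + Arrays.equals(got, Arrays.copyOfRange(data, 4000, 9000)));
  }
}
```

The trade-off in this sketch is that a read starting mid-chunk still has to inflate that chunk from its beginning, so the seek granularity is the chunk size. In return, the stored chunks vary in size while the uncompressed offsets stay predictable.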

EC perhaps isn't too different. We would just EC encode the compressed chunks, although a variable chunk size might cause problems for EC. Whatever we do here, we need to be sure EC can fit into the same framework, as users will surely want transparent compression on EC data too.

In EC, we implemented a kind of hierarchy to set the replication type of a key. There is a server default, a bucket-level setting and a key-level setting. That means if nothing is specified, the server default is used. If there is a bucket setting, keys inherit it but can override it if they like. If there is no bucket setting, the key-level setting still works. For consistency, we should aim to do the same thing here.
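
As a rough illustration of that precedence (key-level setting, then bucket-level setting, then server default), here is a small sketch. CompressionSettingResolver, resolve and the codec strings are hypothetical names for illustration only. They do not reflect the existing EC replication-config code or any agreed compression configuration.

```java
import java.util.Optional;

/** Hypothetical sketch of the key > bucket > server-default lookup order. */
final class CompressionSettingResolver {

  /**
   * Returns the key-level setting if present, otherwise the bucket-level
   * setting, otherwise the server default.
   */
  static <T> T resolve(Optional<T> keySetting, Optional<T> bucketSetting, T serverDefault) {
    return keySetting.orElseGet(() -> bucketSetting.orElse(serverDefault));
  }

  public static void main(String[] args) {
    String serverDefault = "none"; // placeholder value, not a real config value

    // Nothing specified anywhere: the server default applies.
    System.out.println(resolve(Optional.empty(), Optional.empty(), serverDefault));

    // Bucket setting present: keys inherit it.
    System.out.println(resolve(Optional.empty(), Optional.of("codec-a"), serverDefault));

    // Key setting present: it overrides the bucket setting.
    System.out.println(resolve(Optional.of("codec-b"), Optional.of("codec-a"), serverDefault));

    // No bucket setting: the key-level setting still works.
    System.out.println(resolve(Optional.of("codec-b"), Optional.empty(), serverDefault));
  }
}
```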



> Ozone Transparent Data Compression Support
> ------------------------------------------
>
>                 Key: HDDS-7350
>                 URL: https://issues.apache.org/jira/browse/HDDS-7350
>             Project: Apache Ozone
>          Issue Type: New Feature
>            Reporter: Kirill Sizov
>            Assignee: Kirill Sizov
>            Priority: Major
>         Attachments: compression_ozone - 2022.10.1.pdf, compression_ozone-2022.10.2.pdf
>
>
> Currently Ozone stores data uncompressed. Data such as text or similar formats may benefit from being compressed, which could save a significant amount of space and hence money.
> See the attached document for the design.


