Posted to issues@commons.apache.org by "Stefan Bodewig (JIRA)" <ji...@apache.org> on 2017/06/10 18:13:18 UTC

[jira] [Commented] (COMPRESS-403) Block and Record Size issues in TarArchiveOutputStream

    [ https://issues.apache.org/jira/browse/COMPRESS-403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045640#comment-16045640 ] 

Stefan Bodewig commented on COMPRESS-403:
-----------------------------------------

I'm not exactly sure I understand what you want to see changed :-)

One thing I read is that you'd like to see the defaults changed as you expect different defaults to provide better performance. Another thing I see is that you'd like us to use a proprietary global pax header entry in order to signal the record size we use. Is this correct?

> Block and Record Size issues in TarArchiveOutputStream
> --------------------------------------------------------
>
>                 Key: COMPRESS-403
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-403
>             Project: Commons Compress
>          Issue Type: Improvement
>          Components: Archivers
>    Affects Versions: 1.14
>            Reporter: Simon Spero
>            Priority: Minor
>
> According to the pax spec
> [§4.100.13.01|http://pubs.opengroup.org/onlinepubs/009695399/utilities/pax.html#tag_04_100_13_01]
> bq. A pax archive tape or file produced in the -x pax format shall contain a series of blocks. The physical layout of the archive shall be identical to the ustar format
> [§4.100.13.06|http://pubs.opengroup.org/onlinepubs/009695399/utilities/pax.html#tag_04_100_13_06]
> bq. A ustar archive tape or file shall contain a series of logical records. Each logical record shall be a fixed-size logical record of 512 octets.
> ...
> bq. The logical records *may* be grouped for physical I/O operations, as described under the -b blocksize and -x ustar options. Each group of logical records *may* be written with a single operation equivalent to the write() function. On magnetic tape, the result of this write *shall* be a single tape physical block. The last physical block *shall* always be the full size, so logical records after the two zero logical records *may* contain undefined data.
> bq. pax. The default blocksize for this format for character special archive files *shall* be 5120. Implementations *shall* support all blocksize values less than or equal to 32256 that are multiples of 512.
> bq. ustar. The default blocksize for this format for character special archive files *shall* be 10240. Implementations *shall* support all blocksize values less than or equal to 32256 that are multiples of 512.
> bq. Implementations are permitted to modify the block-size value based on the archive format or the device to which the archive is being written. This is to provide implementations with the opportunity to take advantage of special types of devices, and it should not be used without a great deal of consideration as it almost certainly decreases archive portability.
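> Reading the block-size constraints quoted above as code (a sketch only; the numbers come straight from the quoted text):
> {code:java}
> // Sketch: block sizes the quoted spec obliges implementations to accept.
> public class PortableBlockSize {
>     static boolean isPortableBlockSize(int blockSize) {
>         // multiples of 512, up to and including 32256
>         return blockSize > 0 && blockSize <= 32256 && blockSize % 512 == 0;
>     }
>
>     public static void main(String[] args) {
>         System.out.println(isPortableBlockSize(5120));  // pax default   -> true
>         System.out.println(isPortableBlockSize(10240)); // ustar default -> true
>         System.out.println(isPortableBlockSize(32768)); // exceeds 32256 -> false
>     }
> }
> {code}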
> The current implementation of TarArchiveOutputStream (constructor sketch below):
> # Allows the logical record size to be altered
> # Has a default block size of 10240
> # Has two separate logical-record-sized buffers, and frequently double-buffers in order to write to the wrapped OutputStream in units of a logical record rather than a physical block.
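> A minimal sketch of how both knobs are exposed in 1.14 (the file name is just an example):
> {code:java}
> import java.io.FileOutputStream;
> import java.io.OutputStream;
> import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
>
> public class TarSizeKnobs {
>     public static void main(String[] args) throws Exception {
>         OutputStream out = new FileOutputStream("example.tar"); // example name only
>         // The no-arg form defaults to 10240-byte blocks and 512-byte records;
>         // the three-argument constructor lets both be altered.
>         TarArchiveOutputStream tos = new TarArchiveOutputStream(out, 10240, 512);
>         // ... putArchiveEntry / write / closeArchiveEntry as usual ...
>         tos.finish();
>         tos.close();
>     }
> }
> {code}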
> I would hazard a guess that very few users of commons-compress are writing directly to a tape drive, where the block size is of great import. It is also not possible to guarantee that a subordinate output stream won't buffer in chunks of a different size (5120 and 10240 bytes aren't ideal for modern hard drives with 4096-byte sectors, or for filesystems like ZFS with a default recordsize of 128K).
> The main effect of the record and block sizes is the extra padding they require. For the purposes of a Java output device, the optimal block size is probably just a single record; since all implementations must handle 512-byte blocks, and must detect the block size on input (or simulate doing so), this cannot affect compatibility.
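> To make the padding cost concrete, a rough illustration (the figures are arbitrary examples):
> {code:java}
> public class PaddingCost {
>     public static void main(String[] args) {
>         long used = 3 * 512 + 2 * 512;   // entry records plus the two zero end-of-archive records
>         int blockSize = 10240;           // current default
>         long padding = (blockSize - (used % blockSize)) % blockSize;
>         System.out.println(padding);     // 7680 bytes of trailing padding
>         // With blockSize = 512 the same expression is always 0:
>         // nothing is written beyond the two zero records.
>     }
> }
> {code}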
> Fixed-length blocking in multiples of 512 bytes can be supported by wrapping the destination output stream in a modified BufferedOutputStream that does not permit flushing of partial blocks and pads on close (sketched below). This would only be used when necessary.
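> Something along these lines (the class name and details are illustrative only, not an existing class):
> {code:java}
> import java.io.FilterOutputStream;
> import java.io.IOException;
> import java.io.OutputStream;
> import java.util.Arrays;
>
> /** Sketch: buffers writes into fixed-size blocks, never flushes a partial block,
>  *  and zero-pads the final block on close. */
> class FixedBlockOutputStream extends FilterOutputStream {
>     private final byte[] block;
>     private int used;
>
>     FixedBlockOutputStream(OutputStream out, int blockSize) {
>         super(out);
>         block = new byte[blockSize];
>     }
>
>     @Override
>     public void write(int b) throws IOException {
>         block[used++] = (byte) b;
>         if (used == block.length) {
>             out.write(block, 0, used);
>             used = 0;
>         }
>     }
>
>     @Override
>     public void flush() throws IOException {
>         // Only complete blocks ever reach the underlying stream; partial data stays buffered.
>         out.flush();
>     }
>
>     @Override
>     public void close() throws IOException {
>         if (used > 0) {
>             Arrays.fill(block, used, block.length, (byte) 0); // pad the last block
>             out.write(block, 0, block.length);
>             used = 0;
>         }
>         super.close();
>     }
> }
> {code}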
>  
> If a record size of 512 bytes is being used, it could be useful to store that information in an extended header at the start of the file. That allows for in-place appending to an archive without having to read the entire archive first (as long as the original end-of-archive location is journaled to support recovery). 
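> One way to sketch that with the existing API; the header key name below is made up purely for illustration, and nothing in 1.14 writes such an entry:
> {code:java}
> import java.io.FileOutputStream;
> import java.nio.charset.StandardCharsets;
> import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
> import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
>
> public class GlobalPaxHeaderSketch {
>     public static void main(String[] args) throws Exception {
>         String key = "APACHE.COMMONS.COMPRESS.recordsize"; // hypothetical key name
>         byte[] record = paxRecord(key, "512");
>         try (TarArchiveOutputStream tos =
>                  new TarArchiveOutputStream(new FileOutputStream("example.tar"))) {
>             // typeflag 'g' = pax global extended header
>             TarArchiveEntry g = new TarArchiveEntry("pax_global_header", (byte) 'g');
>             g.setSize(record.length);
>             tos.putArchiveEntry(g);
>             tos.write(record);
>             tos.closeArchiveEntry();
>             // ... ordinary entries follow ...
>             tos.finish();
>         }
>     }
>
>     // pax record: "<len> <key>=<value>\n" where len counts the whole record, itself included.
>     private static byte[] paxRecord(String key, String value) {
>         String body = " " + key + "=" + value + "\n";
>         int len = body.getBytes(StandardCharsets.UTF_8).length;
>         int total = len;
>         while (String.valueOf(total).length() + len != total) {
>             total = String.valueOf(total).length() + len;
>         }
>         return (total + body).getBytes(StandardCharsets.UTF_8);
>     }
> }
> {code}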
> There is even an advantage for xz-compressed archives, as every block but the last can be copied without having to decompress and recompress.
> In the appending scenario, it would be useful to be able to signal to the subordinate layer to start a new block before writing the final 1024 NUL bytes. Then either a new block can be started, overwriting the end-of-archive and xz index blocks (with the saved index information written again at the end), or the block immediately preceding the end-of-archive markers can be decompressed and recompressed, which rebuilds the dictionary and index structures so that the block can be continued. That's a different issue, though.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)