Posted to user@tika.apache.org by Adam Lamar <ad...@gmail.com> on 2015/02/28 17:58:22 UTC

File extension for application/gzip

Tika users,

I've run into a small issue with Tika's mime type repository.

TikaConfig config = new TikaConfig();
MimeType mimeType = config.getMimeRepository().forName("application/gzip");

When I call mimeType.getExtension(), the returned value is ".tgz". This is
fine when the underlying file is a tar, but if I gzip a plain text
document, ".tgz" is also returned. Is this expected behavior, a bug, or a
feature?

I'd prefer instead that it return ".gz", even if the underlying file is a
tar.

Cheers,
Adam

Re: File extension for application/gzip

Posted by Adam Lamar <ad...@gmail.com>.
> Best bet would be to open a jira for this, then the change can be tracked and will have a jira id

https://issues.apache.org/jira/browse/TIKA-1563

Many thanks,
Adam

Re: File extension for application/gzip

Posted by Nick Burch <ap...@gagravarr.org>.
On Sat, 28 Feb 2015, Adam Lamar wrote:
> I'd appreciate a change of the default!

Your best bet would be to open a JIRA issue for this; then the change can
be tracked and will have a JIRA id.

> Every tgz is application/gzip, but not every application/gzip is a tgz. 
> Also, it seems to me that the parsers should be able to decompress the 
> first few bytes and check for the tar magic bytes at offset 257, if it 
> were important to differentiate between a gz and tgz on specific files 
> (if this is not already done).

The compressed file parser already does that!

That won't help for mime magic detection though, as that has to work on the
raw (and hence compressed) byte patterns.

Nick

Re: File extension for application/gzip

Posted by Adam Lamar <ad...@gmail.com>.
I'd appreciate a change of the default! Every tgz is application/gzip, 
but not every application/gzip is a tgz. Also, it seems to me that the 
parsers should be able to decompress the first few bytes and check for 
the tar magic bytes at offset 257, if it were important to differentiate 
between a gz and tgz on specific files (if this is not already done).
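
For readers following the thread, the check described above can be sketched with only the JDK's java.util.zip classes. This is a minimal illustration and not Tika's implementation: the class name GzipTarSniffer and its methods are hypothetical, and it assumes a POSIX (ustar) or GNU tar header, since older v7 tar archives carry no "ustar" marker at offset 257.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Tika.
final class GzipTarSniffer {
    // POSIX (ustar) and GNU tar headers both carry "ustar" at byte offset 257.
    private static final int TAR_MAGIC_OFFSET = 257;
    private static final byte[] TAR_MAGIC = "ustar".getBytes(StandardCharsets.US_ASCII);

    // Decompresses only the first 262 bytes and looks for the tar magic.
    static boolean looksLikeTarGz(InputStream gzipped) throws IOException {
        byte[] head = new byte[TAR_MAGIC_OFFSET + TAR_MAGIC.length];
        int filled = 0;
        try (GZIPInputStream in = new GZIPInputStream(gzipped)) {
            while (filled < head.length) {
                int n = in.read(head, filled, head.length - filled);
                if (n < 0) {
                    break; // decompressed data ended before the magic offset
                }
                filled += n;
            }
        }
        if (filled < head.length) {
            return false; // too short to contain a tar header
        }
        for (int i = 0; i < TAR_MAGIC.length; i++) {
            if (head[TAR_MAGIC_OFFSET + i] != TAR_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Convenience overload for in-memory data; input that is not gzip at all
    // surfaces as an UncheckedIOException.
    static boolean looksLikeTarGz(byte[] gzipped) {
        try {
            return looksLikeTarGz(new ByteArrayInputStream(gzipped));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Test helper: gzip a byte array in memory.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // A fake 512-byte tar header block with only the magic filled in.
        byte[] tarLike = new byte[512];
        System.arraycopy(TAR_MAGIC, 0, tarLike, TAR_MAGIC_OFFSET, TAR_MAGIC.length);
        System.out.println(looksLikeTarGz(gzip(tarLike))); // true
        System.out.println(looksLikeTarGz(gzip("plain text".getBytes(StandardCharsets.US_ASCII)))); // false
    }
}
```

Only the first 262 decompressed bytes are read, so the check stays cheap even for large archives.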

Is it possible to change this behavior with a custom-mimetypes file? I 
tried redefining application/gzip with a higher magic priority, removing 
the tgz entry, but I still saw .tgz when asking the mime repository.

Adam

On 2/28/15 8:20 PM, Nick Burch wrote:
> I wonder if it would be worth changing the default? I agree, .tgz is a 
> gzip extension, but probably not the most common / universal 


Re: File extension for application/gzip

Posted by Nick Burch <ap...@gagravarr.org>.
On Sat, 28 Feb 2015, Adam Lamar wrote:
> MimeType mimeType = config.getMimeRepository().forName("application/gzip");
>
> When I call mimeType.getExtension(), the returned value is ".tgz".

That mime type has multiple extensions defined, with .tgz just being the
first one listed.

I wonder if it would be worth changing the default? I agree, .tgz is a
gzip extension, but probably not the most common / universal one.
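
Until the default changes, a caller could pick its own extension from the full list instead of taking the first entry. The sketch below uses only the JDK: the class PreferredExtension and its choose method are hypothetical, and the list contents are illustrative rather than copied from Tika's mime type definitions. With Tika on the classpath, the list would presumably come from mimeType.getExtensions().

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika.
final class PreferredExtension {
    // Returns the preferred extension if the type declares it, otherwise the
    // first declared extension, or "" when none are declared.
    static String choose(List<String> declared, String preferred) {
        if (declared.contains(preferred)) {
            return preferred;
        }
        return declared.isEmpty() ? "" : declared.get(0);
    }

    public static void main(String[] args) {
        // Illustrative list only; assumed to mirror "multiple extensions, .tgz first".
        List<String> declared = Arrays.asList(".tgz", ".gz");
        System.out.println(choose(declared, ".gz"));  // .gz
        System.out.println(choose(declared, ".zip")); // .tgz
    }
}
```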

Nick