Posted to user@tika.apache.org by Adam Lamar <ad...@gmail.com> on 2015/02/28 17:58:22 UTC
File extension for application/gzip
Tika users,
I've run into a small issue with Tika's mime type repository.
TikaConfig config = new TikaConfig();
MimeType mimeType = config.getMimeRepository().forName("application/gzip");
When I call mimeType.getExtension(), the returned value is ".tgz". This is
fine when the underlying file is a tar, but if I gzip a plain text
document, ".tgz" is also returned. Is this expected behavior, a bug, or a
feature?
I'd prefer instead that it return ".gz", even if the underlying file is a
tar.
Cheers,
Adam
Re: File extension for application/gzip
Posted by Adam Lamar <ad...@gmail.com>.
> Best bet would be to open a jira for this, then the change can be tracked and will have a jira id
https://issues.apache.org/jira/browse/TIKA-1563
Many thanks,
Adam
Re: File extension for application/gzip
Posted by Nick Burch <ap...@gagravarr.org>.
On Sat, 28 Feb 2015, Adam Lamar wrote:
> I'd appreciate a change of the default!
Best bet would be to open a jira for this, then the change can be tracked
and will have a jira id
> Every tgz is application/gzip, but not every application/gzip is a tgz.
> Also, it seems to me that the parsers should be able to decompress the
> first few bytes and check for the tar magic bytes at offset 257, if it
> were important to differentiate between a gz and tgz on specific files
> (if this is not already done).
The compressed file parser already does that!
It won't help for mime magic detection though, as that has to work on the
raw (and hence compressed) byte patterns.
Nick
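For readers following along, the check discussed above (decompress the first few bytes, then look for the tar magic at offset 257) can be sketched with the JDK alone. This is a minimal illustration, not Tika's actual parser code: the class name GzipTarSniffer and both method names are hypothetical, and it only recognises tar headers that carry the "ustar" magic, so very old tar archives without that magic would be reported as plain gzip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Tika.
final class GzipTarSniffer {
    // Position of the "ustar" magic inside a tar header block.
    private static final int TAR_MAGIC_OFFSET = 257;
    private static final byte[] TAR_MAGIC = "ustar".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the gzip data decompresses to something with "ustar" at offset 257. */
    static boolean looksLikeTarGz(byte[] gzipped) {
        byte[] header = new byte[TAR_MAGIC_OFFSET + TAR_MAGIC.length];
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            // Decompress only as many bytes as are needed to reach the magic.
            int filled = 0;
            while (filled < header.length) {
                int n = in.read(header, filled, header.length - filled);
                if (n < 0) {
                    return false; // decompressed data is too short to hold a tar header
                }
                filled += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (int i = 0; i < TAR_MAGIC.length; i++) {
            if (header[TAR_MAGIC_OFFSET + i] != TAR_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Convenience for trying the check out: gzip a byte array in memory. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```

As noted in the reply, a check like this needs the decompressed bytes, so it cannot be expressed as a magic pattern over the raw gzip stream.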
Re: File extension for application/gzip
Posted by Adam Lamar <ad...@gmail.com>.
I'd appreciate a change of the default! Every tgz is application/gzip,
but not every application/gzip is a tgz. Also, it seems to me that the
parsers should be able to decompress the first few bytes and check for
the tar magic bytes at offset 257, if it were important to differentiate
between a gz and tgz on specific files (if this is not already done).
Is it possible to change this behavior with a custom-mimetypes file? I
tried redefining application/gzip with a higher magic priority, removing
the tgz entry, but I still saw .tgz when asking the mime repository.
Adam
On 2/28/15 8:20 PM, Nick Burch wrote:
> I wonder if it would be worth changing the default? I agree, .tgz is a
> gzip extension, but probably not the most common / universal
Re: File extension for application/gzip
Posted by Nick Burch <ap...@gagravarr.org>.
On Sat, 28 Feb 2015, Adam Lamar wrote:
> MimeType mimeType = config.getMimeRepository().forName("application/gzip");
>
> When I call mimeType.getExtension(), the returned value is ".tgz".
That mime type has multiple extensions defined, and getExtension() returns
only the first of them, which is .tgz.
I wonder if it would be worth changing the default? I agree, .tgz is a
gzip extension, but probably not the most common / universal
Nick
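Until the default changes, one possible workaround is to pick the extension yourself from the full list rather than taking the first entry. The sketch below uses only the JDK; ExtensionChooser and preferredExtension are hypothetical names, and it assumes the list of extensions is obtained elsewhere (for example from MimeType.getExtensions(), if your Tika version provides it). The sample list in the usage comment is illustrative, not the actual registry contents.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika.
final class ExtensionChooser {
    /**
     * Returns the preferred extension if the type declares it,
     * otherwise falls back to the first declared extension,
     * or "" when no extensions are declared at all.
     */
    static String preferredExtension(List<String> extensions, String preferred) {
        if (extensions == null || extensions.isEmpty()) {
            return "";
        }
        return extensions.contains(preferred) ? preferred : extensions.get(0);
    }

    public static void main(String[] args) {
        // Illustrative list with .tgz first, as described in the thread.
        List<String> declared = Arrays.asList(".tgz", ".gz");
        System.out.println(preferredExtension(declared, ".gz"));
    }
}
```

The fallback to the first entry keeps the existing behaviour for types that do not declare the preferred extension.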