Posted to user@tika.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/04/04 16:51:13 UTC

Wrong Mime-type detection

Hi,

I've got some OCR'd books in plain text format which are incorrectly marked as 
application/x-elc, probably because of the junk bytes in the head and 
sometimes tail. I've also seen some being marked as shockwave files. The 
additional problem is that the file program also marks these files as Lisp 
data.

My question is, how do you handle files that are given an incorrect mime-type? 
Are there some best practices?

Here are a few example files:

http://ia600400.us.archive.org/34/items/papersfromtortug183121922carn/papersfromtortug183121922carn_djvu.txt
http://ia700108.us.archive.org/14/items/reportofcommissi1881unit/reportofcommissi1881unit_djvu.txt
http://ia600300.us.archive.org/12/items/stuttgarterbeitr2747197779staa/stuttgarterbeitr2747197779staa_djvu.txt
http://ia600100.us.archive.org/9/items/nicolaijosephija01jacq/nicolaijosephija01jacq_djvu.txt

Cheers,
-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Wrong Mime-type detection

Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 6 Apr 2011, Markus Jelsma wrote:
> However, removing all types other than text/plain from tika-mimetypes 
> might do the trick. Will Tika then fall back to that type even if it 
> first doesn't mark as such?

Tika will fall back to application/octet-stream if it doesn't know what a 
file is.

In the case of something like text/csv extends text/plain, then if you 
remove the entry for text/csv then Tika should fall back on text/plain 
(except if the match was only occurring on the latter).
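
To illustrate the fallback described above, here is a toy Python model. It is not Tika's code or API: the registry layout, the two matching rules, and the `detect` helper are all assumptions made up for this sketch. Each type has its own rule, the most specific matching type wins, and application/octet-stream is returned when nothing matches.

```python
OCTET_STREAM = "application/octet-stream"

# Hypothetical registry: each type names its parent and has its own match rule.
# The rules are deliberately crude and only exist for this sketch.
REGISTRY = {
    "text/plain": {
        "parent": None,
        "matches": lambda data, name: b"\x00" not in data,
    },
    "text/csv": {
        "parent": "text/plain",
        "matches": lambda data, name: name.endswith(".csv"),
    },
}


def depth(registry, media_type):
    """Count how many registered ancestors a type has."""
    count = 0
    parent = registry[media_type]["parent"]
    while parent in registry:
        count += 1
        parent = registry[parent]["parent"]
    return count


def detect(registry, data, name):
    """Return the most specific registered type whose own rule matches."""
    hits = [t for t, entry in registry.items() if entry["matches"](data, name)]
    if not hits:
        return OCTET_STREAM
    return max(hits, key=lambda t: depth(registry, t))


# Dropping the text/csv entry leaves text/plain as the best remaining match.
trimmed = {t: e for t, e in REGISTRY.items() if t != "text/csv"}
```

In this model a type is only ever returned if its own rule matches, so removing a subtype helps only when the parent type would have matched the file anyway.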

Nick

Re: Wrong Mime-type detection

Posted by Markus Jelsma <ma...@openindex.io>.
 Hi Nick,

 Tika is deployed in my Nutch and it's not a simple task to adjust 
 parse-tika for the first two suggestions. However, removing all types 
 other than text/plain from tika-mimetypes might do the trick. Will Tika 
 then fall back to that type even if it first doesn't mark as such?

 Thanks,

 On Mon, 4 Apr 2011 17:22:20 +0100 (BST), Nick Burch 
 <ni...@alfresco.com> wrote:
> On Mon, 4 Apr 2011, Markus Jelsma wrote:
>> I've got some OCR'd books in plain text format which are incorrectly 
>> marked as application/x-elc, probably because of the junk bytes in the 
>> head and sometimes tail. I've also seen some being marked as shockwave 
>> files. The additional problem is that the file program also marks 
>> these files as Lisp data.
>
> One option is that if you trust the filename, you could match on just
> that. For example, you could decide that
> http://ia600400.us.archive.org/ can be trusted to get the content
> type correct on text files, and just use theirs.
>
> Another one that could work in your special case could be to detect
> on both the start of the file, and say 10% and 20% in. If you get
> octet stream for the 10% and 20% then you know the first detection is
> likely to be correct. If they both give text, then there's a fair
> chance it's actually one of your iffy text files.
>
> Finally, if you know none of your files will be of certain
> problematic types, you could try just removing them from your mime
> magic list?
>
> Nick

Re: Wrong Mime-type detection

Posted by Nick Burch <ni...@alfresco.com>.
On Mon, 4 Apr 2011, Markus Jelsma wrote:
> I've got some OCR'd books in plain text format which are incorrectly 
> marked as application/x-elc, probably because of the junk bytes in the 
> head and sometimes tail. I've also seen some being marked as shockwave 
> files. The additional problem is that the file program also marks these 
> files as Lisp data.

One option is that if you trust the filename, you could match on just 
that. For example, you could decide that http://ia600400.us.archive.org/ 
can be trusted to get the content type correct on text files, and just use 
theirs.
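
A minimal Python sketch of that first option, under stated assumptions: the `TRUSTED_HOSTS` set, the `EXTENSION_TYPES` map, and the `trusted_type` helper are hypothetical names for illustration, not part of Tika or Nutch. It uses the server's Content-Type for a trusted host, and falls back to the filename extension if no header is available.

```python
import posixpath
from urllib.parse import urlparse

# Assumption: hosts whose declared content type we choose to believe.
TRUSTED_HOSTS = {"ia600400.us.archive.org"}

# Hypothetical extension map, used only when no header is given.
EXTENSION_TYPES = {".txt": "text/plain"}


def trusted_type(url, content_type_header=None):
    """Return a media type for a trusted host, or None to request real detection."""
    parts = urlparse(url)
    if parts.hostname not in TRUSTED_HOSTS:
        return None
    if content_type_header:
        # Drop parameters such as "; charset=UTF-8".
        media_type = content_type_header.split(";", 1)[0].strip().lower()
        if media_type:
            return media_type
    extension = posixpath.splitext(parts.path)[1].lower()
    return EXTENSION_TYPES.get(extension)
```

A caller would run normal detection only when this returns None.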

Another one that could work in your special case could be to detect on 
both the start of the file, and say 10% and 20% in. If you get octet 
stream for the 10% and 20% then you know the first detection is likely to 
be correct. If they both give text, then there's a fair chance it's 
actually one of your iffy text files.
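
As a rough sketch of that sampling idea in Python: the `looks_like_text` check below is a crude stand-in for running a real detector on each sample, and the window size, junk threshold, and function names are assumptions chosen for illustration.

```python
def looks_like_text(chunk, max_junk_ratio=0.05):
    """Stand-in detector: True if few bytes are control characters."""
    if not chunk:
        return False
    junk = sum(
        1 for b in chunk
        if (b < 0x20 and b not in (0x09, 0x0A, 0x0D)) or b == 0x7F
    )
    return junk / len(chunk) <= max_junk_ratio


def resolve_type(detected_at_start, data, window=512, offsets=(0.10, 0.20)):
    """Override a non-text detection when samples further in all look like text."""
    if detected_at_start == "text/plain":
        return detected_at_start
    samples = []
    for fraction in offsets:
        start = int(len(data) * fraction)
        samples.append(data[start:start + window])
    if all(looks_like_text(sample) for sample in samples):
        # Both samples look like text: probably a text file with a junk header.
        return "text/plain"
    # Samples look binary too, so the first detection is likely right.
    return detected_at_start
```

For a file with a few junk bytes at the head followed by ordinary text, `resolve_type("application/x-elc", data)` would return text/plain, while a file that is binary throughout keeps its original type.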

Finally, if you know none of your files will be of certain problematic 
types, you could try just removing them from your mime magic list?

Nick