Posted to user@tika.apache.org by "epastoor@vt.edu" <ep...@vt.edu> on 2017/08/23 14:06:31 UTC

Detecting .bat and .cmd files

I'm trying to get Tika to detect .bat and .cmd files. Both are being detected as text/plain.

In the XML file (https://github.com/apache/tika/blob/master/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml),
.bat is listed under application/x-msdownload, yet it is detected as text/plain.

Surprisingly, .cmd is listed under text/plain. I would have expected it to be grouped with .bat.

Has anyone managed to get Tika to detect batch script files properly?

The closest thing I can find when searching for this is this unresolved ticket: https://issues.apache.org/jira/browse/TIKA-1148


When I run the tika-app jar by itself, I get the same result (text/plain) as when I do this through Java code.

> java -jar tika-app-1.16.jar -d BatchInstall.bat
Aug 23, 2017 9:40:22 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
TIFFImageWriter not loaded. tiff files will not be processed
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Aug 23, 2017 9:40:23 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
text/plain

=====================
Java version
import org.apache.tika.Tika;

private static final Tika CONTENT_TYPE_DETECTOR = new Tika();
// Detect from the uploaded bytes, using the file name as a hint
return CONTENT_TYPE_DETECTOR.detect(fileItem.get(), fileItem.getName());
// Returns text/plain


Re: Detecting .bat and .cmd files

Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 23 Aug 2017, epastoor@vt.edu wrote:
> I'm trying to get tika to detect .bat and .cmd files. Both are returning as text/plain.
>
> In the xml file, 
> (https://github.com/apache/tika/blob/master/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml) 
> bat falls under application/x-msdownload but yet it returns as 
> text/plain.

Good spot! I've raised TIKA-2445 for this. It should now be fixed: both
Windows .bat and .cmd files should now be detected as application/x-bat, which
seems to be the closest thing to a consensus MIME type for them.

Nick
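
For anyone stuck on a release that still reports text/plain for these files, one possible workaround is to post-process the detected type by file extension. The following is only a minimal sketch: the class BatchTypeFallback and its refine method are hypothetical helpers, not part of Tika, and it assumes you are happy to trust the extension whenever detection falls back to text/plain. The application/x-bat value is the type mentioned in the reply above.

```java
import java.util.Locale;

// Hypothetical helper, not part of Tika: refines a detected MIME type
// using the file extension when detection fell back to text/plain.
final class BatchTypeFallback {
    // MIME type named in the reply above for Windows batch scripts
    public static final String BATCH_TYPE = "application/x-bat";

    private BatchTypeFallback() {}

    public static String refine(String detectedType, String fileName) {
        // Only override the generic text/plain result; leave anything else alone
        if (fileName == null || !"text/plain".equals(detectedType)) {
            return detectedType;
        }
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".bat") || lower.endsWith(".cmd")) {
            return BATCH_TYPE;
        }
        return detectedType;
    }

    public static void main(String[] args) {
        System.out.println(refine("text/plain", "BatchInstall.bat")); // application/x-bat
        System.out.println(refine("text/plain", "notes.txt"));        // text/plain
        System.out.println(refine("application/pdf", "odd.bat"));     // application/pdf
    }
}
```

In the Java snippet from the original question, this would wrap the existing call, for example refine(CONTENT_TYPE_DETECTOR.detect(fileItem.get(), fileItem.getName()), fileItem.getName()). Because it relies on the extension alone, a renamed file would be misclassified, so it is a stopgap rather than real content detection.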