Posted to user@tika.apache.org by Matteo Alessandroni <sk...@apache.org> on 2018/02/05 09:30:01 UTC

Detect JSON / PDF specific mime type

I'm using Apache Tika to detect a file's MIME type from its base64 representation.
Unfortunately I don't have other info about the file (e.g. extension).

Is there anything I can do to make Tika more specific?
I'm currently using this:

Tika tika = new Tika();
tika.setMaxStringLength(-1);
String mimetype = tika.detect(Base64.decode(fileString));

and it gives me "text/plain" for JSON and PDF files, but I would like to obtain more specific information: "application/json", "application/pdf", etc.

Can you help me with that?

Thanks.


Re: Detect JSON / PDF specific mime type

Posted by Matteo Alessandroni <sk...@apache.org>.
Hi,

thank you for your answer.
I'll check the issue with PDF files.

Best regards,
Matteo

On 2018/02/05 10:12:10, Nick Burch <ap...@gagravarr.org> wrote: 
> On Mon, 5 Feb 2018, Matteo Alessandroni wrote:
> > I'm using Apache Tika to detect a file Mime Type from its base64 
> > representation. Unfortunately I don't have other info about the file 
> > (e.g. extension).
> >
> > and it gives me "text/plain" for JSON and PDF files, but I would like to 
> > obtain more specific information: "application/json", 
> > "application/pdf" etc...
> 
> You can't detect JSON files from mime magic alone. JSON doesn't have 
> anything unique at the start, just lots of different possible things which 
> also occur in other formats.
> 
> Tika can detect a PDF from the magic bytes at the start just fine. Make 
> sure you're actually decoding the base64 representation properly
> 
> Nick
> 

Re: Detect JSON / PDF specific mime type

Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 5 Feb 2018, Matteo Alessandroni wrote:
> I'm using Apache Tika to detect a file Mime Type from its base64 
> representation. Unfortunately I don't have other info about the file 
> (e.g. extension).
>
> and it gives me "text/plain" for JSON and PDF files, but I would like to 
> obtain more specific information: "application/json", 
> "application/pdf" etc...

You can't detect JSON files from mime magic alone. JSON doesn't have 
anything unique at the start, just lots of different possible things which 
also occur in other formats.
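
To make the ambiguity concrete, here is a minimal sketch in plain Java. It is not Tika code; the class name, method name and sample strings are made up for illustration. It only shows that the first significant character of a JSON document is shared with other common text formats, so a leading-byte rule cannot single out JSON.

```java
final class JsonStartAmbiguity {

    /** Returns the first non-whitespace character, or '\0' if there is none. */
    static char firstSignificantChar(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!Character.isWhitespace(c)) {
                return c;
            }
        }
        return '\0';
    }

    public static void main(String[] args) {
        // Hypothetical samples: each pair starts with the same character,
        // but only the first of each pair is JSON.
        String[][] pairs = {
            {"[1, 2, 3]", "[section]\nkey=value\n"},    // JSON array vs INI file
            {"\"hello\"", "\"name\",\"age\"\n"},        // JSON string vs quoted CSV
            {"42", "42 apples\n"},                      // JSON number vs plain text
        };
        for (String[] pair : pairs) {
            System.out.println(firstSignificantChar(pair[0])
                    + " == " + firstSignificantChar(pair[1]));
        }
    }
}
```

In each pair the leading character is identical, which is why a detector that only looks at the start of the data has to fall back to "text/plain".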

Tika can detect a PDF from the magic bytes at the start just fine. Make 
sure you're actually decoding the base64 representation properly.
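
A quick way to check the decoding step, sketched with only the JDK: decode with java.util.Base64 and look for the "%PDF-" signature that PDF files start with. The class and method names are hypothetical, and the sketch assumes the input is plain RFC 4648 base64 with no line breaks or prefix (the strict decoder throws IllegalArgumentException otherwise). This is a sanity check on your input, not how Tika implements detection.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class PdfMagicCheck {

    // PDF files begin with the ASCII signature "%PDF-".
    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Strictly decodes base64; throws IllegalArgumentException on invalid input. */
    static byte[] decode(String base64) {
        return Base64.getDecoder().decode(base64);
    }

    /** Returns true if the data begins with the PDF signature. */
    static boolean startsWithPdfMagic(byte[] data) {
        if (data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Stand-in for the real file content: just a PDF header line.
        String encoded = Base64.getEncoder()
                .encodeToString("%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII));

        // Properly decoded bytes carry the signature.
        System.out.println(startsWithPdfMagic(decode(encoded)));

        // The base64 text itself does not: it is just ASCII characters,
        // which a detector would reasonably report as plain text.
        System.out.println(
                startsWithPdfMagic(encoded.getBytes(StandardCharsets.US_ASCII)));
    }
}
```

If the decoded bytes pass this check, handing the same byte array to tika.detect(byte[]) should give "application/pdf". If they don't, the problem is likely in the decoding step rather than in Tika.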

Nick