Posted to user@tika.apache.org by Matteo Alessandroni <sk...@apache.org> on 2018/02/05 09:30:01 UTC
Detect JSON / PDF specific mime type
I'm using Apache Tika to detect a file's MIME type from its base64 representation.
Unfortunately I don't have other info about the file (e.g. extension).
Is there anything I can do to make Tika be more specific?
I'm currently using this:
Tika tika = new Tika();
tika.setMaxStringLength(-1);
String mimetype = tika.detect(Base64.decode(fileString));
and it gives me "text/plain" for JSON and PDF files, but I would like to obtain more specific information: "application/json", "application/pdf", etc.
Can you help me with that?
Thanks.
Re: Detect JSON / PDF specific mime type
Posted by Matteo Alessandroni <sk...@apache.org>.
Hi,
thank you for your answer.
I'll check the issue with PDF files.
Best regards,
Matteo
On 2018/02/05 10:12:10, Nick Burch <ap...@gagravarr.org> wrote:
> On Mon, 5 Feb 2018, Matteo Alessandroni wrote:
> > I'm using Apache Tika to detect a file's MIME type from its base64
> > representation. Unfortunately I don't have other info about the file
> > (e.g. extension).
> >
> > and it gives me "text/plain" for JSON and PDF files, but I would like to
> > obtain more specific information: "application/json",
> > "application/pdf", etc.
>
> You can't detect JSON files from MIME magic alone - JSON doesn't have
> anything unique at the start, just lots of possible different things which
> also occur in other formats too.
>
> Tika can detect a PDF from the magic bytes at the start just fine. Make
> sure you're actually decoding the base64 representation properly.
>
> Nick
>
Re: Detect JSON / PDF specific mime type
Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 5 Feb 2018, Matteo Alessandroni wrote:
> I'm using Apache Tika to detect a file's MIME type from its base64
> representation. Unfortunately I don't have other info about the file
> (e.g. extension).
>
> and it gives me "text/plain" for JSON and PDF files, but I would like to
> obtain more specific information: "application/json",
> "application/pdf", etc.
You can't detect JSON files from MIME magic alone - JSON doesn't have
anything unique at the start, just lots of possible different things which
also occur in other formats too.
Tika can detect a PDF from the magic bytes at the start just fine. Make
sure you're actually decoding the base64 representation properly.
Nick
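One way to check the decoding step on its own, before involving Tika at all, is to confirm that the decoded bytes begin with the "%PDF-" marker that PDF files normally start with. The sketch below uses only java.util.Base64 from the JDK; the class and method names (Base64MagicCheck, decode, startsWithPdfMagic) are made up for illustration and are not part of Tika. It assumes the input is plain or MIME-style base64 with no extra prefix.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not a Tika API: sanity-checks a base64 decode.
public class Base64MagicCheck {

    // PDF files normally begin with the ASCII marker "%PDF-".
    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Decode base64 text; the MIME decoder also tolerates embedded line breaks.
    public static byte[] decode(String base64Text) {
        return Base64.getMimeDecoder().decode(base64Text);
    }

    // True if the data starts with the PDF marker bytes.
    public static boolean startsWithPdfMagic(byte[] data) {
        if (data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Stand-in for the real input: the first line of a PDF, base64-encoded.
        String encoded = Base64.getEncoder()
                .encodeToString("%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII));

        // Properly decoded bytes carry the marker.
        System.out.println(startsWithPdfMagic(decode(encoded)));

        // The still-encoded text does not, so it would look like plain text.
        System.out.println(startsWithPdfMagic(
                encoded.getBytes(StandardCharsets.US_ASCII)));
    }
}
```

If the first check fails for a real PDF, the bytes reaching the detector are probably not the decoded file. Once it passes, the same byte array can be handed to Tika's detect(byte[]) overload; note that detect(String) treats its argument as a file name rather than as content.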