Posted to user@tika.apache.org by Krüger, Sven <Sv...@df-frechen.de> on 2014/06/25 15:22:52 UTC
getLanguage returns "lt" if pdf-file contains only images
Hello,
If a PDF file contains only graphics and no extractable text, getLanguage returns "lt".
Currently I can filter these cases out because the length of the extracted content is exactly 2 * metadata.get("xmpTPg:NPages"), but I don't think it is supposed to work that way.
Is there any way to get a value that indicates the probability of the detected language, or another way to get a proper result (in this case, no language)?
Regards Sven
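One possible workaround for the situation described above is to check how much real text was extracted before trusting any detected language. The following is only a minimal sketch under stated assumptions: the class name LanguageGuard, the MIN_LETTERS threshold of 20 and the pluggable detector function are all hypothetical and not part of Tika. With Tika 1.x the detector could presumably be something like s -> new LanguageIdentifier(s).getLanguage(), and LanguageIdentifier.isReasonablyCertain() may also be worth checking.

```java
import java.util.Optional;
import java.util.function.Function;

public class LanguageGuard {
    // Assumed threshold: text with fewer letters than this is treated as "no usable text".
    public static final int MIN_LETTERS = 20;

    /** Counts Unicode letters, ignoring whitespace, digits and punctuation. */
    public static long countLetters(String text) {
        return text.codePoints().filter(Character::isLetter).count();
    }

    /**
     * Runs the supplied detector only if the text contains enough letters.
     * The detector is a plain function so that this sketch has no Tika dependency.
     */
    public static Optional<String> detect(String text, Function<String, String> detector) {
        if (text == null || countLetters(text) < MIN_LETTERS) {
            return Optional.empty(); // too little text to say anything about language
        }
        return Optional.ofNullable(detector.apply(text));
    }

    public static void main(String[] args) {
        Function<String, String> standIn = s -> "lt"; // hypothetical stand-in detector
        System.out.println(detect("\n\n\n\n", standIn));
        System.out.println(detect("This page contains a full sentence of text.", standIn));
    }
}
```

Counting letters rather than raw length avoids depending on the page count, since whitespace-only output from image-only pages never reaches the detector.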
RE: getLanguage returns "lt" if pdf-file contains only images
Posted by Ken Krugler <kk...@transpac.com>.
Hi Sven,
From your email below, it seems like you get 2 characters per page - can you provide details on what those are?
Thanks,
-- Ken
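To answer the question about what the two characters per page are, the extracted content can be dumped code point by code point. This is just a small sketch using only the JDK; the class name CodePointDump is hypothetical, and the extracted string would come from whatever Tika call is already in use.

```java
public class CodePointDump {
    /** Returns one line per code point: hex value plus the Unicode name, if the JDK knows one. */
    public static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String name = Character.getName(cp); // null for unassigned code points
            sb.append(String.format("U+%04X %s", cp, name == null ? "(unassigned)" : name));
            sb.append('\n');
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Replace this sample with the content extracted from the image-only PDF.
        String extracted = "\n\n";
        System.out.print(describe(extracted));
    }
}
```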
> From: Krüger, Sven
> Sent: June 25, 2014 6:22:52am PDT
> To: user@tika.apache.org
> Subject: getLanguage returns "lt" if pdf-file contains only images
>
> Hello,
>
> if a pdf-file only contains graphics without extractable text, getLanguage returns "lt".
>
> Currently I can filter that because the length of the extracted content is 2 * metadata.get("xmpTPg:NPages") - but I don't think this is supposed to work that way.
>
> Is there any way to get a value that indicates the probability of the detected language or another way to get a proper (in this case no) language?
> Regards Sven
>
--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr