Posted to user@tika.apache.org by Krüger, Sven <Sv...@df-frechen.de> on 2014/06/25 15:22:52 UTC

getLanguage returns "lt" if pdf-file contains only images

Hello,

If a PDF file contains only graphics and no extractable text, getLanguage returns "lt".

Currently I can filter this case out because the length of the extracted content is exactly 2 * metadata.get("xmpTPg:NPages"), but I don't think it is supposed to work that way.

Is there any way to get a value that indicates the probability of the detected language, or another way to get a proper (in this case, no) language?
Regards Sven
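
A more general form of the length filter described above can be sketched as a guard that counts letters in the extracted text before trusting any detected language. This is only a sketch under assumptions: the class LanguageGuard, its method names, and the 20-letter threshold are hypothetical and not part of Tika, and the detected language is passed in as a plain string from whatever detector is in use.

```java
import java.util.Optional;

// Hypothetical helper, not part of Tika: rejects a detected language
// when the extracted text contains too few letters to be meaningful.
class LanguageGuard {

    // Counts code points that are letters; whitespace, digits and
    // punctuation are ignored.
    static int letterCount(String text) {
        return (int) text.codePoints().filter(Character::isLetter).count();
    }

    // True if the text is non-null and has at least minLetters letters.
    static boolean hasEnoughText(String text, int minLetters) {
        return text != null && letterCount(text) >= minLetters;
    }

    // Returns the detected language only if the text passes the guard;
    // otherwise returns an empty Optional, meaning "no language".
    static Optional<String> guardedLanguage(String text, String detected, int minLetters) {
        return hasEnoughText(text, minLetters)
                ? Optional.ofNullable(detected)
                : Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed example: an image-only PDF that yields two line breaks per page.
        String imageOnly = "\n\n\n\n";
        System.out.println(guardedLanguage(imageOnly, "lt", 20).isPresent());
    }
}
```

The threshold is arbitrary and would need tuning against real documents. As far as I know, the Tika 1.x LanguageIdentifier class also offers isReasonablyCertain(), which may be worth checking alongside a guard like this.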

RE: getLanguage returns "lt" if pdf-file contains only images

Posted by Ken Krugler <kk...@transpac.com>.

Hi Sven,

From your email below, it seems like you get 2 characters per page - can you provide details on what those are?

Thanks,

-- Ken
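
One way to answer the question above is to print the Unicode code points of the extracted string, which makes invisible characters such as line breaks or form feeds visible. The following is a minimal sketch; the class CodePointDump and its describe method are hypothetical names, and the sample input is an assumption rather than the actual output for Sven's file.

```java
// Hypothetical helper: renders each code point of a string as U+XXXX
// so that whitespace and control characters can be identified.
class CodePointDump {

    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed sample: a carriage return followed by a line feed.
        System.out.println(describe("\r\n"));
    }
}
```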

> From: Krüger, Sven
> Sent: June 25, 2014 6:22:52am PDT
> To: user@tika.apache.org
> Subject: getLanguage returns "lt" if pdf-file contains only images
> 
> Hello,
>  
> If a PDF file contains only graphics and no extractable text, getLanguage returns "lt".
>  
> Currently I can filter this case out because the length of the extracted content is exactly 2 * metadata.get("xmpTPg:NPages"), but I don't think it is supposed to work that way.
>  
> Is there any way to get a value that indicates the probability of the detected language, or another way to get a proper (in this case, no) language?
> Regards Sven
>  

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr