Posted to user@tika.apache.org by Wilm Schumacher <wi...@gmail.com> on 2015/03/04 23:32:19 UTC

LanguageIdentifier.isReasonablyCertain is always false

Hi,

I'm very new to Tika and have just started using it ... AND I LOVE IT!

I want to use the language detector to choose the stemmer in my full
text search engine. My plan was to use the specific stemmer (e.g.
"german2") if getLanguage returns "de". However, as getLanguage always
returns something, e.g. "lt" for the content "abc", my plan was to stem
with the specific stemmer if Tika is certain, and otherwise not to stem
at all.

However, in my first tests I found that
LanguageIdentifier.isReasonablyCertain always returns false. I found
some JIRA issues and comments about that, e.g.
https://issues.apache.org/jira/browse/TIKA-568, but no real explanation
or solution.

I used some German "lorem ipsum" => isReasonablyCertain = false.

I used the "Declaration of Human Rights" in German, as suggested in the
book "Tika in Action". isReasonablyCertain = false.

I even used the book "Tika in Action" itself ;). getLanguage = en, but
isReasonablyCertain = false.

The latter two bug me, as both texts are well written in their respective
languages and are reasonably large. Below is the code snippet I used for
testing. Is something wrong with it? Or should I ignore
isReasonablyCertain and find another way of detecting whether the
getLanguage output should be trusted? Or should I always index both
stemmed and unstemmed text, as this question is not really answerable?
Any insight is appreciated.

Best wishes,

Wilm

PS: the code snippet I used:

==

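import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.language.LanguageIdentifier;

String fileName = ...

Tika tika = new Tika();

// Extract the text content of the file.
InputStream is = new FileInputStream( fileName );
String content = tika.parseToString( is );

// Guess the language of the extracted text.
LanguageIdentifier identifier = new LanguageIdentifier( content );

System.out.println( identifier.getLanguage() );
System.out.println( identifier.isReasonablyCertain() );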
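
A minimal sketch of the fallback described in the message above: use a language-specific stemmer only when the detector reports certainty, and otherwise leave the text unstemmed. The class StemmerChooser and its choose method are hypothetical and not part of Tika. The two arguments stand in for the results of getLanguage() and isReasonablyCertain(). Only the "de" -> "german2" mapping comes from the message; the "en" entry is an assumed example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Tika.
class StemmerChooser {

    // Language code -> stemmer name. Only "de" -> "german2" is taken from
    // the message above; the "en" entry is an assumption for illustration.
    private static final Map<String, String> STEMMERS = new HashMap<>();
    static {
        STEMMERS.put("de", "german2");
        STEMMERS.put("en", "english");
    }

    // Returns the stemmer to use, or an empty Optional when the text should
    // be left unstemmed: the detector was not certain, or no stemmer is
    // registered for the detected language.
    static Optional<String> choose(String language, boolean reasonablyCertain) {
        if (!reasonablyCertain || language == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(STEMMERS.get(language));
    }
}
```

With this shape, choose("de", true) yields "german2", while an uncertain guess such as choose("lt", false) yields an empty Optional, so the caller can skip stemming or index both variants.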

Re: LanguageIdentifier.isReasonablyCertain is always false

Posted by Nick Burch <ap...@gagravarr.org>.

On Wed, 4 Mar 2015, Wilm Schumacher wrote:
> I want to use the language detector for choosing the stemming in my full 
> text search engine. My plan was to use the specific stemmer (e.g. 
> "german2") if getLanguage returns "de". However, as getLanguage always 
> returns something, e.g. "lt" for the content "abc", my plan was to stem 
> with the specific stemmer if Tika is certain, and otherwise not to stem
> at all.

Generally, short phrases are hard to identify, because too many languages
look alike over just a short bit of content. Normally you need to give it
a few kB of text.

From the javadocs of isReasonablyCertain():
WARNING: Will never return true for small amount of input texts.


> I used the "Declaration of Human Rights" in German, as suggested in the
> book "Tika in Action". isReasonablyCertain = false.
>
> I even used the book "Tika in Action" itself ;). getLanguage = en, but
> isReasonablyCertain = false.

Hmm, I would've expected those two to work.


> LanguageIdentifier identifier = new LanguageIdentifier( content );

Can you try stepping into that with a debugger, and see what distance
each of the standard language profiles gets when your text is compared
against it?

Thanks
Nick
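
To make "distance" between a text and a language profile more concrete, here is a simplified sketch of an n-gram profile comparison. It is illustrative only and does not reproduce Tika's own classes or its certainty threshold: the class name TrigramProfile, the use of character trigrams, and the Euclidean distance over relative frequencies are all assumptions chosen for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: a simplified trigram profile, not Tika's own class.
class TrigramProfile {
    private final Map<String, Integer> counts = new HashMap<>();
    private int total = 0;

    // Count every overlapping 3-character sequence of the lowercased text.
    TrigramProfile(String text) {
        String s = text.toLowerCase();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
            total++;
        }
    }

    // Euclidean distance between the relative trigram frequencies of the
    // two profiles. Identical distributions give 0.0.
    double distance(TrigramProfile other) {
        Set<String> ngrams = new HashSet<>(counts.keySet());
        ngrams.addAll(other.counts.keySet());
        double thisTotal = Math.max(total, 1);        // avoid division by zero
        double otherTotal = Math.max(other.total, 1);
        double sumOfSquares = 0.0;
        for (String ngram : ngrams) {
            double a = counts.getOrDefault(ngram, 0) / thisTotal;
            double b = other.counts.getOrDefault(ngram, 0) / otherTotal;
            sumOfSquares += (a - b) * (a - b);
        }
        return Math.sqrt(sumOfSquares);
    }
}
```

In a scheme like this, a very short input such as "abc" contributes a single trigram, so its frequency distribution is far from that of any profile built from a large corpus. That is one way to see why a distance-based certainty check is unlikely to pass for small inputs.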

RE: LanguageIdentifier.isReasonablyCertain is always false

Posted by Ken Krugler <kk...@transpac.com>.

Hi Wilm,

Sorry for the long delay in following up - I finally got around to working on the issue of language identification in Tika.

Most of the work is happening as part of https://issues.apache.org/jira/browse/TIKA-1723, which integrates a third-party language identification package (language-detector).

This will solve the issue of isReasonablyCertain() always returning false, and I've added tests to confirm :)

Regards,

-- Ken
