Posted to dev@nutch.apache.org by Jérôme Charron <je...@gmail.com> on 2005/06/29 23:36:50 UTC

LanguageIdentifier refactoring

Hi,

In my last LanguageIdentifier patch, I split the code so that the core of 
this plugin can now be viewed as a standalone lib.
I think it could be a good idea to move this language identification lib 
from Nutch to Lucene (so that it is available in Lucene), and to have the 
LanguageIdentifier plugin simply rely on this Lucene code.
What do you think about that?

Jerome

PS: Looking at Jira issues, it seems that a lot of patches 
(LanguageIdentifier for instance) have not been applied to the trunk. What is 
the reason? What is the "process" for applying a patch?


-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: LanguageIdentifier refactoring

Posted by Jérôme Charron <je...@gmail.com>.
> Mhm. I'm not so sure. The NGramProfile load/save methods are safe, they
> both use UTF-8. LanguageIdentifier.identify() seems to be safe, too -
> because it only works with Strings, which are not encoded (native
> Unicode). So, the only place where it would be problematic seems to be
> in the command-line utilities (main methods in both classes), where a
> simple change to use InputStreamReader(inputstream, encoding) would fix
> the issue...

In fact, what I see while looking at the code (correct me if I'm wrong) is 
that the Writers and Readers used by Nutch don't take care of the encoding 
(only the HtmlParser performs some encoding detection and adds some meta-data 
about the encoding).
So, my idea is simply to:
1. Move the encoding detection used in HtmlParser to a more generic place 
(ParseSegment could be a good candidate)
2. Use the encoding MetaData in all the Read/Write related methods (see the 
sketch below)
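
To illustrate point 2, here is a rough sketch of what a Read method could do 
with the encoding MetaData (just an illustration: the accessor names and the 
"UTF-8" fallback are my assumptions, not actual Nutch code):

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class EncodingAwareReaderSketch {

  // Wraps the raw content bytes in a Reader using the encoding recorded in
  // the MetaData, instead of the platform default; falls back to UTF-8.
  public static BufferedReader readerFor(byte[] rawContent, String detectedEncoding)
      throws IOException {
    String encoding = (detectedEncoding != null) ? detectedEncoding : "UTF-8";
    return new BufferedReader(
        new InputStreamReader(new ByteArrayInputStream(rawContent), encoding));
  }
}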

It seems to be a lot of work... but I think it is necessary... no?

Jerome

-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: LanguageIdentifier refactoring

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jérôme Charron wrote:

> I think this is an issue for all detection mechanisms...
> For the content-type it is the same problem: What is the right value, the 
> one provided by the protocol layer, the one provided by the extension 
> mapping, or the one provided by the detection (mime-magic)?
> 
> I think we need to change the actual process to use auto-detection 
> mechanisms (this is true at least for the code that uses the 
> language-identifier and the code that uses the mime-type identifier). 
> Instead of doing something like:
> 
> 1. Get info from protocol
> 2. If no info from protocol, get info from parsing
> 3. If no info from parsing, get info from auto-detection
> 
> We need to do something like:
> 
> 1. Get info from protocol
> 2. Get info from parsing
> 3. Get degrees of confidence from auto-detection, and check:
> 3.1 If the value extracted from the protocol has a high degree of confidence, 
> take the protocol value
> 3.2 If the value extracted from parsing has a high degree of confidence, 
> take the parsing value
> 3.3 If none has a high degree of confidence, but the auto-detection returns 
> another value with a high degree of confidence, take the auto-detection 
> value.
> 3.4 Otherwise, take a default value 

Yes, I agree.

>>* modify the identify() method to return a pair of lang code + relative
>>score (normalized to 0..1)
> 
> 
> What do you think about returning a sorted array of lang/score pairs?

Yes, that would make sense too. I've been working with a proprietary 
language detection tool (based on similar principles), and it was also 
returning a sorted array.
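
Just to make the idea concrete, the returned elements could look something 
like this (a sketch only; the class and method names are made up, not taken 
from your patch):

import java.util.Arrays;
import java.util.Comparator;

public class LangScore {
  public final String lang;   // e.g. "sv"
  public final float score;   // relative score, normalized to 0..1

  public LangScore(String lang, float score) {
    this.lang = lang;
    this.score = score;
  }

  // Sorts the candidates so that the most probable language comes first.
  public static LangScore[] sortByScore(LangScore[] candidates) {
    LangScore[] sorted = candidates.clone();
    Arrays.sort(sorted, new Comparator<LangScore>() {
      public int compare(LangScore a, LangScore b) {
        return Float.compare(b.score, a.score);
      }
    });
    return sorted;
  }
}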

> For information, there are some other issues with the language-identifier:
> I was focused on performance and precision, and now that I run it outside 
> of the "lab" and perform some tests in real life, with real documents, I 
> see a very big issue: the LanguageIdentifierPlugin is UTF-8 oriented!!!
> I discovered this issue and analyzed it yesterday: with UTF-8 encoded input 
> documents you get some very fine identification, but with another encoding 
> it is a disaster.

Mhm. I'm not so sure. The NGramProfile load/save methods are safe, they 
both use UTF-8. LanguageIdentifier.identify() seems to be safe, too - 
because it only works with Strings, which are not encoded (native 
Unicode). So, the only place where it would be problematic seems to be 
in the command-line utilities (main methods in both classes), where a 
simple change to use InputStreamReader(inputstream, encoding) would fix 
the issue...
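
For the command-line case, the kind of change I mean is roughly this (a 
sketch only; the argument handling is invented for illustration, it is not 
the actual main() of LanguageIdentifier):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class IdentifyFileSketch {
  public static void main(String[] args) throws IOException {
    String file = args[0];
    // take the encoding as an argument instead of relying on the platform default
    String encoding = (args.length > 1) ? args[1] : "UTF-8";
    BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), encoding));
    StringBuilder text = new StringBuilder();
    try {
      String line;
      while ((line = in.readLine()) != null) {
        text.append(line).append('\n');
      }
    } finally {
      in.close();
    }
    // the resulting String is native Unicode, so identify() would see the
    // correct characters whatever the file encoding was
    System.out.println("Read " + text.length() + " chars using " + encoding);
  }
}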

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: LanguageIdentifier refactoring

Posted by Jérôme Charron <je...@gmail.com>.
> I have an issue with the language detection plugin, which I'm not sure
> how to address. The plugin first tries to extract the language
> identifier from meta tags. However, meta tag values people put there are
> often completely wrong, or follow obscure pseudo-standards.
> 
> Example: there is a bunch of pages, generated by Frontpage, where the author
> apparently forgot to change the default settings. So, the meta tags say
> "en-us", while the real content of the page is in Spanish. The
> identify() method shows this clearly.


> The final value put in X-meta-lang is "en-us". Now, the question is -
> should the plugin override that value with the one from the
> auto-detection? This means that it should always run the detection
> step... Can we have more confidence in our detection mechanism than in
> the author's knowledge? Well, perhaps, if for content longer than xxx
> bytes the detection is nearly unambiguous.

I think this is an issue for all detection mechanisms...
For the content-type it is the same problem: What is the right value, the 
one provided by the protocol layer, the one provided by the extension 
mapping, or the one provided by the detection (mime-magic)?

I think we need to change the actual process to use auto-detection 
mechanisms (this is true at least for the code that uses the 
language-identifier and the code that uses the mime-type identifier). 
Instead of doing something like:

1. Get info from protocol
2. If no info from protocol, get info from parsing
3. If no info from parsing, get info from auto-detection

We need to do something like:

1. Get info from protocol
2. Get info from parsing
3. Get degrees of confidence from auto-detection, and check:
3.1 If the value extracted from the protocol has a high degree of confidence, 
take the protocol value
3.2 If the value extracted from parsing has a high degree of confidence, 
take the parsing value
3.3 If none has a high degree of confidence, but the auto-detection returns 
another value with a high degree of confidence, take the auto-detection 
value.
3.4 Otherwise, take a default value 
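
Spelled out as code, the check in step 3 could look roughly like this (just a 
sketch; the helper names and the 0.8 threshold are assumptions for 
illustration, nothing more):

import java.util.Map;

public class ValueSelectionSketch {

  static final float HIGH_CONFIDENCE = 0.8f;   // assumed "high confidence" threshold
  static final String DEFAULT_VALUE = "unknown";

  // confidences maps each candidate value (a language code, a mime-type...)
  // to the degree of confidence returned by the auto-detection, in 0..1.
  static String select(String fromProtocol, String fromParsing,
                       Map<String, Float> confidences, String bestDetected) {
    // 3.1 the protocol value is backed by a high degree of confidence
    if (fromProtocol != null && confidence(confidences, fromProtocol) >= HIGH_CONFIDENCE) {
      return fromProtocol;
    }
    // 3.2 the parsing value is backed by a high degree of confidence
    if (fromParsing != null && confidence(confidences, fromParsing) >= HIGH_CONFIDENCE) {
      return fromParsing;
    }
    // 3.3 neither is confident, but the detector proposes another value it is sure about
    if (bestDetected != null && confidence(confidences, bestDetected) >= HIGH_CONFIDENCE) {
      return bestDetected;
    }
    // 3.4 default value
    return DEFAULT_VALUE;
  }

  private static float confidence(Map<String, Float> confidences, String value) {
    Float score = confidences.get(value);
    return (score != null) ? score.floatValue() : 0.0f;
  }
}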

> Another example: for a bunch of pages in Swedish, I collected the
> following values of X-meta-lang:
> 
> (SCHEME=ISO.639-1) sv
> (SCHEME=ISO639-1) sv
> (SCHEME=RFC1766) sv-FI
> (SCHEME=Z39.53) SWE
> EN_US, SV, EN, EN_UK
> English Swedish
> English, swedish
> English,Swedish
> Other (Svenska)
> SE
> SV
> SV charset=iso-8859-1
> SV-FI
> SV; charset=iso-8859-1
> SVE
> SW
> SWE
> SWEDISH
> Sv
> Sve
> Svenska
> Swedish
> Swedish, svenska
> en, sv
> se
> se, en
> se,en,de
> se-sv
> sv
> sv, be, dk, de, fr, no, pt, ch, fi, en
> sv, dk, fi, gl, is, fo
> sv, dk, no
> sv, en
> sv, eng
> sv, eng, de
> sv, fr, eng
> sv, nl
> sv, no, de
> sv, no, en, de, dk, fi
> sv,en
> sv,en,de,fr
> sv,eng
> sv,eng,de,fr
> sv,no,fi
> sv-FI
> sv-SE
> sv-en
> sv-fi
> sv-se
> sv; Content-Language: sv
> sv_SE
> sve
> svenska
> svenska, swedish, engelska, english, norsk, norwegian, polska, polish
> sw
> swe
> swe.SPR.
> sweden
> swedish
> swedish,
> text/html; charset=sv-SE
> text/html; sv
> torp, stuga, uthyres, bed & breakfast
> In all cases the value from the detection routine was unambiguous - 
> swedish.

Yes, I recently saw this problem while analyzing my indexes... 
A first step could be to improve the Content-language / dc.language / html 
lang parsers.
(It could be done in the HTMLLanguageParser; see the sketch below.)
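
As a very rough sketch of what such an improved parser could do with the 
values listed above (the alias table and the method name are only 
illustrative, and far from exhaustive):

import java.util.HashMap;
import java.util.Map;

public class MetaLangNormalizerSketch {

  private static final Map<String, String> ALIASES = new HashMap<String, String>();
  static {
    // a few of the spellings seen above, all meaning Swedish
    ALIASES.put("sv", "sv");
    ALIASES.put("sve", "sv");
    ALIASES.put("swe", "sv");
    ALIASES.put("swedish", "sv");
    ALIASES.put("svenska", "sv");
  }

  // Returns an ISO 639-1 code for the first recognizable token, or null.
  public static String normalize(String rawMetaLang) {
    if (rawMetaLang == null) return null;
    // split on the separators seen in practice: commas, semicolons, whitespace
    String[] tokens = rawMetaLang.toLowerCase().split("[,;\\s]+");
    for (int i = 0; i < tokens.length; i++) {
      // drop a region suffix such as "sv-FI" or "sv_SE"
      String token = tokens[i].split("[-_]")[0];
      if (ALIASES.containsKey(token)) {
        return ALIASES.get(token);
      }
    }
    return null;
  }
}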

> In this light, I propose the following changes:
> 
> * modify the identify() method to return a pair of lang code + relative
> score (normalized to 0..1)

What do you think about returning a sorted array of lang/score pairs?

> * in HTMLLanguageParser we should always run
> LanguageIdentifier.identify(parse.getText())

Yes! 

For information, there are some other issues with the language-identifier:
I was focused on performance and precision, and now that I run it outside 
of the "lab" and perform some tests in real life, with real documents, I 
see a very big issue: the LanguageIdentifierPlugin is UTF-8 oriented!!!
I discovered this issue and analyzed it yesterday: with UTF-8 encoded input 
documents you get some very fine identification, but with another encoding 
it is a disaster.
Sami (I think you were the original and first coder of the 
LanguageIdentifierPlugin), do you already know about this problem? Do you 
have some ideas about solving it?
Actually, it is a very big issue, and the language-identifier cannot be 
used on a real crawl.

Thanks Andrzej for your feedback and ideas.
(I will continue to focus my work on the encoding problem, but once I can 
commit, I will implement the changes you suggest in this mail.)

In fact, there are still a lot of TODOs in the language-identifier: the more 
I work on it, the more issues I see to fix, but it is a very important 
module if we want to add multi-lingual support to Nutch.
So, I will update the Wiki pages about the language identifier in order to 
keep track of all these fixes/ideas/issues...

Best Regards

Jerome

-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: LanguageIdentifier refactoring

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jerome,

I have an issue with the language detection plugin, which I'm not sure 
how to address. The plugin first tries to extract the language 
identifier from meta tags. However, meta tag values people put there are 
often completely wrong, or follow obscure pseudo-standards.

Example: there is a bunch of pages, generated by Frontpage, where the author 
apparently forgot to change the default settings. So, the meta tags say 
"en-us", while the real content of the page is in Spanish. The 
identify() method shows this clearly.

The final value put in X-meta-lang is "en-us". Now, the question is - 
should the plugin override that value with the one from the 
auto-detection? This means that it should always run the detection 
step... Can we have more confidence in our detection mechanism than in 
the author's knowledge? Well, perhaps, if for content longer than xxx 
bytes the detection is nearly unambiguous.

Another example: for a bunch of pages in Swedish, I collected the 
following values of X-meta-lang:

(SCHEME=ISO.639-1) sv
(SCHEME=ISO639-1) sv
(SCHEME=RFC1766) sv-FI
(SCHEME=Z39.53) SWE
EN_US, SV, EN, EN_UK
English Swedish
English, swedish
English,Swedish
Other (Svenska)
SE
SV
SV charset=iso-8859-1
SV-FI
SV; charset=iso-8859-1
SVE
SW
SWE
SWEDISH
Sv
Sve
Svenska
Swedish
Swedish, svenska
en, sv
se
se, en
se,en,de
se-sv
sv
sv, be, dk, de, fr, no, pt, ch, fi, en
sv, dk, fi, gl, is, fo
sv, dk, no
sv, en
sv, eng
sv, eng, de
sv, fr, eng
sv, nl
sv, no, de
sv, no, en, de, dk, fi
sv,en
sv,en,de,fr
sv,eng
sv,eng,de,fr
sv,no,fi
sv-FI
sv-SE
sv-en
sv-fi
sv-se
sv; Content-Language: sv
sv_SE
sve
svenska
svenska, swedish, engelska, english, norsk, norwegian, polska, polish
sw
swe
swe.SPR.
sweden
swedish
swedish,
text/html; charset=sv-SE
text/html; sv
torp, stuga, uthyres, bed & breakfast


In all cases the value from the detection routine was unambiguous - swedish.

In this light, I propose the following changes:

* modify the identify() method to return a pair of lang code + relative 
score (normalized to 0..1)

* in HTMLLanguageParser we should always run 
LanguageIdentifier.identify(parse.getText())

* if the meta tag is null, we take the value from identify()

* if the value from identify() is null, we take the meta tag value.

* if the meta tag is not null and the value from identify() is not null:

	* if the content is shorter than "lang.analyze.max.length",
	  we take the meta tag value

	* else, if the meta tag and identify values are different:

		* if the score from identify() is above "certainty"
		  threshold (0.8?), we take the value from identify().

		* else, we take the meta tag value.

Similar changes would be needed in LanguageIndexingFilter.filter(), to 
handle text coming from other content types.
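
Put as code, the rules above might look roughly like this (a sketch only; the 
threshold value and the parameter names are assumptions, this is not the 
actual plugin code):

public class MetaVsDetectedSketch {

  static final float CERTAINTY_THRESHOLD = 0.8f;   // the "certainty" threshold (0.8?)

  // metaLang and detectedLang may be null; contentLength is in bytes;
  // langAnalyzeMaxLength corresponds to "lang.analyze.max.length".
  static String choose(String metaLang, String detectedLang, float detectedScore,
                       int contentLength, int langAnalyzeMaxLength) {
    if (metaLang == null) {
      return detectedLang;                  // take the value from identify()
    }
    if (detectedLang == null) {
      return metaLang;                      // take the meta tag value
    }
    if (contentLength < langAnalyzeMaxLength) {
      return metaLang;                      // too little text to trust detection
    }
    if (!metaLang.equals(detectedLang) && detectedScore >= CERTAINTY_THRESHOLD) {
      return detectedLang;                  // detection disagrees and is confident
    }
    return metaLang;                        // otherwise keep the author's value
  }
}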

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: LanguageIdentifier refactoring

Posted by Jérôme Charron <je...@gmail.com>.
> 
> I monitor your work, and as soon as you say "go" I'm ready to apply the
> patches - but I'd rather avoid doing this every couple of days. So, for
> now, I'm waiting for a more or less stable situation... ;-)

Ok Andrzej,

the last patch seems to be stable. I performed some functional tests on around 
200000 docs, and it seems to be OK.
So, "Go" ... feel free to apply the last patch. ;-)
Thanks.

Jerome
-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: LanguageIdentifier refactoring

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jérôme Charron wrote:
> Hi,
> 
> In my last LanguageIdentifier patch, I split the code so that the core of 
> this plugin can now be viewed as a standalone lib.
> I think it could be a good idea to move this language identification lib 
> from Nutch to Lucene (so that it is available in Lucene), and to have the 
> LanguageIdentifier plugin simply rely on this Lucene code.
> What do you think about that?
> 
> Jerome
> 
> PS: Looking at Jira issues, it seems that a lot of patches 
> (LanguageIdentifier for instance) have not been applied to the trunk. What 
> is the reason? What is the "process" for applying a patch?
> 
> 

I monitor your work, and as soon as you say "go" I'm ready to apply the 
patches - but I'd rather avoid doing this every couple of days. So, for 
now, I'm waiting for a more or less stable situation... ;-)


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com