Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/07/17 14:58:27 UTC

Garbage with languageidentifier

Hi,

I've found a lot of garbage produced by the language identifier, most likely 
caused by it relying on the HTTP header as the first hint for the language.

Instead of a nice tight list of ISO codes I've got an index full of garbage, 
making it impossible to select a language. The lang field now contains a mess 
including ISO codes of various kinds (nl | ned | nl-NL | nederlands | 
Nederlands | dutch | Dutch etc. etc.) and even comma-separated combinations. 
It's impossible to do a simple fq=lang:nl due to this indeterminate set of 
language identifiers. Apart from language identifiers that we as humans 
understand, the headers also contain values such as {$plugin.meta.language} | 
Weerribben zuivel | Array, or complete sentences, and even MIME types and more 
nonsense you can laugh about.
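
To make the field queryable at all, values like these would have to be 
normalized down to plain ISO 639-1 codes. A minimal sketch of such a 
normalizer (the class name and the tiny alias table are hypothetical, not 
part of Nutch):

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical normalizer for the messy values described above.
// The alias table is illustrative only, not exhaustive.
public class LangNormalizer {
    private static final Map<String, String> ALIASES = Map.of(
        "ned", "nl", "nederlands", "nl", "dutch", "nl",
        "eng", "en", "english", "en");

    public static String normalize(String raw) {
        if (raw == null) return "unknown";
        // take the first token of comma-separated combinations
        String v = raw.split(",")[0].trim().toLowerCase(Locale.ROOT);
        // strip region subtags such as nl-NL -> nl
        v = v.split("[-_]")[0];
        if (ALIASES.containsKey(v)) return ALIASES.get(v);
        // keep only plausible two-letter ISO 639-1 codes
        return v.matches("[a-z]{2}") ? v : "unknown";
    }

    public static void main(String[] args) {
        System.out.println(normalize("nl-NL"));                   // nl
        System.out.println(normalize("Nederlands"));              // nl
        System.out.println(normalize("{$plugin.meta.language}")); // unknown
    }
}
```

Of course this only papers over the symptom; it doesn't fix the untrusted 
source of the values.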

Why do we rely on the HTTP header at all? Isn't it well known that only very few 
developers and content management systems actually care about returning proper 
information in HTTP headers? The same goes for determining the content 
type, which is a similar problem in the index.

I know work is going on in Tika to improve MIME-type detection; I'm not sure 
whether the same is true for language identification. We still have to rely on the Nutch 
plugin to do this work, right? If so, I propose to make it configurable so we 
can choose whether to rely on the current behaviour or do N-gram detection 
straight away.
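
For illustration, N-gram detection works by comparing character n-grams of the 
text against per-language profiles; the real detectors (the Nutch plugin, 
Tika) use large profiles, but a toy version with trigrams and two tiny 
hand-made profiles looks like this (all names and sample texts are my own, 
purely for demonstration):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy illustration of n-gram language detection. Real detectors use
// frequency-ranked profiles built from large corpora; this version just
// counts overlapping trigrams against two tiny stopword samples.
public class TrigramGuesser {
    private final Map<String, Set<String>> profiles = new HashMap<>();

    static Set<String> trigrams(String text) {
        String t = text.toLowerCase().replaceAll("[^a-z ]", " ");
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= t.length(); i++) grams.add(t.substring(i, i + 3));
        return grams;
    }

    void train(String lang, String sample) { profiles.put(lang, trigrams(sample)); }

    String guess(String text) {
        Set<String> grams = trigrams(text);
        String best = "unknown";
        int bestScore = 0;
        for (Map.Entry<String, Set<String>> e : profiles.entrySet()) {
            Set<String> overlap = new HashSet<>(grams);
            overlap.retainAll(e.getValue());
            if (overlap.size() > bestScore) { bestScore = overlap.size(); best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        TrigramGuesser g = new TrigramGuesser();
        g.train("nl", "de het een van en dat is in te niet op voor met als maar om dan zou");
        g.train("en", "the of and to in is that it was for on are with as at be this have");
        System.out.println(g.guess("de kat zat op de mat en het was niet goed"));
        System.out.println(g.guess("the cat sat on the mat and it was good"));
    }
}
```

The point is that the answer comes from the page content itself, not from 
whatever the server chose to put in a header.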

Comments?

Thanks

Re: Garbage with languageidentifier

Posted by Ken Krugler <ke...@krugler.org>.
Hi Markus,

> The proposal is to configure the order of detection: meta,header,identifier 
> (which is the current order).

This issue of precedence also comes up when detecting charset information. From an earlier post I'd made to the Nutch list:

> See https://issues.apache.org/jira/browse/TIKA-539 for a Tika issue I'm currently working on, which has to do with the charset detection algorithm.
> 
> There's the HTML5 proposal, where the priority is
> 
> - charset from Content-Type response header
> - charset from HTML <meta http-equiv content-type> element
> - charset detected from page contents
> 
> Reinhard Schwab proposed a variation on the HTML5 approach, which makes sense to me; in my web-crawling experience, too many servers lie for us to just blindly trust the response header contents.
> 
> I've got a slight modification to Reinhard's approach, as described in a comment on the above issue:
> 
> https://issues.apache.org/jira/browse/TIKA-539?focusedCommentId=12928832&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12928832
> 
> I'm interested in comments.

See http://tools.ietf.org/html/draft-abarth-mime-sniff-03 for a writeup on how to extract charset info, which seems relevant to how to detect language as well.
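
The precedence idea itself is simple: try each source in order and take the 
first usable answer. A sketch (class name and suppliers are placeholders, not 
Tika's actual API):

```java
import java.util.Optional;
import java.util.function.Supplier;

// Sketch of precedence-based charset resolution: consult each source in
// order and take the first non-empty answer. The suppliers stand in for
// real header parsing, meta-tag parsing, and content sniffing.
public class CharsetResolver {
    @SafeVarargs
    public static String resolve(Supplier<Optional<String>>... sources) {
        for (Supplier<Optional<String>> source : sources) {
            Optional<String> cs = source.get();
            if (cs.isPresent()) return cs.get();
        }
        return "UTF-8"; // fall back to a default
    }

    public static void main(String[] args) {
        // Header absent, meta present: the meta value wins under HTML5 order.
        String cs = resolve(
            Optional::empty,                    // Content-Type response header
            () -> Optional.of("ISO-8859-1"),    // <meta http-equiv> element
            () -> Optional.of("UTF-8"));        // detected from page contents
        System.out.println(cs); // ISO-8859-1
    }
}
```

The whole debate is about which source goes first, and whether a "lying" 
source should be demoted or cross-checked against the sniffed result.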

-- Ken

On Jul 17, 2011, at 6:04am, Markus Jelsma wrote:

> 
>> [...]

--------------------------------------------
http://about.me/kkrugler
+1 530-210-6378





Re: Garbage with languageidentifier

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi Markus,

I think this is a good shout, and it is not hard to understand the points
you make. Quite clearly, good practice relating to the inclusion of accurate
and useful language information (as well as other types of information) in
HTTP headers is not a reality, and it wouldn't be suitable for us to pretend
otherwise.

One thing to note, though: I just found out yesterday that language detection
in trunk has been passed to Tika, but this is not the case with branch 1.4.
It's not my intention to put words into people's mouths; however, by the
looks of the conversation in NUTCH-657, I foresee that delegating
language identification to Tika and making branch-1.4 consistent with trunk
would be the next move. Am I correct here? Please say otherwise if this is
not the case.

If this is the plan, then is there any requirement for Nutch to have an
independent language detection plugin? If we can address why the decision
was made for trunk to rely upon Tika for language detection, then we can
justify where we stand on the comments you make. To be honest, I am seeing a
medium-sized grey area here, but that has to do with my inexperience with
the language detection plugin and with the problems you mention.

On Sun, Jul 17, 2011 at 2:04 PM, Markus Jelsma
<ma...@openindex.io>wrote:

> The proposal is to configure the order of detection: meta,header,identifier
> (which is the current order).
>
> > [...]
>



-- 
*Lewis*

Re: Garbage with languageidentifier

Posted by Markus Jelsma <ma...@openindex.io>.
The proposal is to configure the order of detection: meta,header,identifier 
(which is the current order).
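
In code, a configurable order could be as simple as splitting a 
comma-separated property and consulting each detector in turn. A sketch (the 
property value, class names, and detector bodies are hypothetical, not the 
actual Nutch plugin):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a configurable detection order such as "meta,header,identifier":
// each named source is consulted in turn until one yields a language.
public class DetectionOrder {
    private final Map<String, Function<String, String>> detectors = new LinkedHashMap<>();

    void register(String name, Function<String, String> detector) {
        detectors.put(name, detector);
    }

    String detect(String order, String doc) {
        for (String name : order.split(",")) {
            Function<String, String> d = detectors.get(name.trim());
            if (d == null) continue;
            String lang = d.apply(doc);
            if (lang != null && !lang.isEmpty()) return lang;
        }
        return "unknown";
    }

    public static void main(String[] args) {
        DetectionOrder o = new DetectionOrder();
        o.register("meta", doc -> null);            // pretend no usable meta tag
        o.register("header", doc -> "nederlands");  // untrusted header value
        o.register("identifier", doc -> "nl");      // n-gram detector result
        // The current order consults the header before the n-gram identifier:
        System.out.println(o.detect("meta,header,identifier", "..."));  // nederlands
        // Putting the identifier first avoids the garbage header value:
        System.out.println(o.detect("identifier,meta,header", "..."));  // nl
    }
}
```

So anyone who distrusts headers could simply configure 
identifier,meta,header (or identifier alone).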

> [...]