Posted to dev@nutch.apache.org by Jack Tang <hi...@gmail.com> on 2006/01/20 17:37:32 UTC

lang identifier and nutch analyzer in trunk

Hi All

I am wondering whether the Analyzer in the Nutch svn trunk is chosen by the
languageidentifier plugin or not. (I know it was in Nutch 0.7.1-dev.)

In org.apache.nutch.indexer.Indexer.class line 104

writer.addDocument((Document)((ObjectWritable)value).get());

It should be

NutchAnalyzer analyzer = AnalyzerFactory.get(doc.get("lang"));
writer.addDocument((Document)((ObjectWritable)value).get(), analyzer );

right?

One more thing: should query parsing call AnalyzerFactory as well? The query
input is multilingual too.

Regards
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: lang identifier and nutch analyzer in trunk

Posted by Jack Tang <hi...@gmail.com>.
On 1/21/06, Jack Tang <hi...@gmail.com> wrote:
> Hi All
>
> I am wondering whether the Analyzer in the Nutch svn trunk is chosen by the
> languageidentifier plugin or not. (I know it was in Nutch 0.7.1-dev.)
>
> In org.apache.nutch.indexer.Indexer.class line 104
>
> writer.addDocument((Document)((ObjectWritable)value).get());
>
> It should be
>
> NutchAnalyzer analyzer = AnalyzerFactory.get(doc.get("lang"));
> writer.addDocument((Document)((ObjectWritable)value).get(), analyzer );

Sorry, it should be

    Document doc = (Document)((ObjectWritable)value).get();
    NutchAnalyzer analyzer = AnalyzerFactory.get(doc.get("lang"));
    writer.addDocument(doc, analyzer);

> right?
>
> One more thing: should query parsing call AnalyzerFactory as well? The query
> input is multilingual too.
>
> Regards
> /Jack
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: lang identifier and nutch analyzer in trunk

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jérôme Charron wrote:
>> We're going back to the old discussion - most web pages out there either
>> don't have these tags at all, or, even if they have them, they contain wrong
>> values ... so I think this policy is not going to give the best results.
>>     
>
> Yes, I know, Andrzej; it was just to explain to Jack how it currently works.
>
>   

Ok.

>> IMHO we should always try to guess the language if we have enough text,
>> unless we can be sure that we are dealing with properly marked documents
>> (not such an uncommon case in intranets).
>>     
>
> I think we should have something like the MimeType detection:
> If meta data is found, then check that it is the correct value against
> the score for this language (statistical analysis).
> If the score is too low or no meta data is found, then perform a full
> statistical analysis.
> No?
>   
Yes :-)


-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: lang identifier and nutch analyzer in trunk

Posted by Jérôme Charron <je...@gmail.com>.
> We're going back to the old discussion - most web pages out there either
> don't have these tags at all, or, even if they have them, they contain wrong
> values ... so I think this policy is not going to give the best results.

Yes, I know, Andrzej; it was just to explain to Jack how it currently works.


> IMHO we should always try to guess the language if we have enough text,
> unless we can be sure that we are dealing with properly marked documents
> (not such an uncommon case in intranets).

I think we should have something like the MimeType detection:
If meta data is found, then check that it is the correct value against
the score for this language (statistical analysis).
If the score is too low or no meta data is found, then perform a full
statistical analysis.
No?
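
A rough sketch of that check: the declared language is kept only if its
statistical score passes a threshold; otherwise the best statistical guess is
used. The class VerifiedLangPolicy, the per-language score map and the 0.5
threshold are assumptions made up for illustration; none of them are
LanguageIdentifier APIs.

```java
import java.util.Map;

// Hypothetical sketch; scores are assumed to lie in [0, 1], higher is better.
final class VerifiedLangPolicy {

    static final double MIN_SCORE = 0.5; // assumed threshold

    // declared: language taken from meta data, or null if none was found.
    // scores: statistical score per language code for the document text.
    static String choose(String declared, Map<String, Double> scores) {
        if (declared != null) {
            Double s = scores.get(declared);
            if (s != null && s >= MIN_SCORE) {
                return declared; // meta data confirmed by the statistics
            }
        }
        // No meta data, or its score is too low: take the best statistical guess.
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        return best; // null when no scores are available
    }
}
```

With scores of 0.9 for "en" and 0.2 for "fr", a page declared as "fr" would
be treated as "en" under these assumptions.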

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: lang identifier and nutch analyzer in trunk

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jérôme Charron wrote:
>> Is it reasonable to guess the language from the target server's
>> geographical information?
>>     
>
> Yes, it could be another clue for guessing the language.
> But the problem is then to work out how to combine all these clues.
>
> For instance, the current solution is the easiest one, but certainly not the
> most efficient one:
> For HTML documents, the HTMLLanguageParser scans the document looking for
> possible indications of the content language:
> 1. html lang attribute
> 2. meta dc.language
> 3. meta http-equiv
> The first one found is assumed to be the document's language.
> Then, if no language is found, the statistical language identifier is
> used....
>   

We're going back to the old discussion - most web pages out there either
don't have these tags at all, or, even if they have them, they contain wrong
values ... so I think this policy is not going to give the best results.

IMHO we should always try to guess the language if we have enough text,
unless we can be sure that we are dealing with properly marked documents
(not such an uncommon case in intranets).

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: lang identifier and nutch analyzer in trunk

Posted by Jérôme Charron <je...@gmail.com>.
> Is it reasonable to guess the language from the target server's
> geographical information?

Yes, it could be another clue for guessing the language.
But the problem is then to work out how to combine all these clues.

For instance, the current solution is the easiest one, but certainly not the
most efficient one:
For HTML documents, the HTMLLanguageParser scans the document looking for
possible indications of the content language:
1. html lang attribute
2. meta dc.language
3. meta http-equiv
The first one found is assumed to be the document's language.
Then, if no language is found, the statistical language identifier is
used....
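
The lookup order described above can be sketched roughly as follows. This is
only an illustration, not the actual HTMLLanguageParser code: the class name
LangHintPolicy, its parameters, and the Function standing in for the
statistical identifier are all hypothetical.

```java
import java.util.Locale;
import java.util.function.Function;

// Hypothetical sketch of the "first hint wins" policy; not Nutch code.
final class LangHintPolicy {

    // Returns the first non-empty hint in priority order:
    // html lang attribute, then meta dc.language, then meta http-equiv.
    // The statistical identifier is consulted only when no hint is present.
    static String detect(String htmlLang, String dcLanguage, String httpEquiv,
                         String text, Function<String, String> statistical) {
        for (String hint : new String[] { htmlLang, dcLanguage, httpEquiv }) {
            if (hint != null && !hint.trim().isEmpty()) {
                return hint.trim().toLowerCase(Locale.ROOT);
            }
        }
        return statistical.apply(text);
    }
}
```

Note that the hint is trusted without looking at the text at all, which is
why a wrong tag leads directly to a wrong language.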

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: lang identifier and nutch analyzer in trunk

Posted by Jack Tang <hi...@gmail.com>.
Hi

Is it reasonable to guess the language from the target server's geographical
information?

/Jack

On 1/23/06, Jérôme Charron <je...@gmail.com> wrote:
> > Any plan to implement this? I mean moving the LanguageIdentifier class
> > into Nutch core.
>
> As I have already suggested on this list, I would really like to move the
> LanguageIdentifier class (and profiles) to
> an independent Lucene sub-project (and the MimeType repository too).
> I don't remember why, but there were some objections to this...
>
> Here is a short status of what I have in mind for the next improvements to the
> LanguageIdentifier / MultiLanguage support:
> * Enhance the LanguageIdentifier APIs by returning something like an ordered
> LangDetail[] array when guessing the language (each LangDetail should contain
> the language code and its score) - I have a prototype version of this on my
> disk, but I haven't taken the time to finalize it.
> * I encountered some identification problems with some specific sites
> (Blogger, for instance), and I plan to investigate this.
> * Another pending task: the analysis (and coding) of multilingual querying
> support.
>
> Regards
>
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/
>
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: lang identifier and nutch analyzer in trunk

Posted by Jérôme Charron <je...@gmail.com>.
> > I would like to decouple Lang Id from Nutch and move it into Lucene
> > contrib/ in the near future.
> > Does that sound ok?
> +1 from me.

+1 from me too
(if I can have a commit access to contrib code)

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: lang identifier and nutch analyzer in trunk

Posted by Andrzej Bialecki <ab...@getopt.org>.
ogjunk-nutch@yahoo.com wrote:
> I would like to decouple Lang Id from Nutch and move it into Lucene contrib/ in the near future.
>
> Does that sound ok?
>   

+1 from me.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: lang identifier and nutch analyzer in trunk

Posted by og...@yahoo.com.
I would like to decouple Lang Id from Nutch and move it into Lucene contrib/ in the near future.

Does that sound ok?

Otis


----- Original Message ----
From: Stefan Groschupf <sg...@media-style.com>
To: nutch-dev@lucene.apache.org
Sent: Mon 23 Jan 2006 02:55:46 PM EST
Subject: Re: lang identifier and nutch analyzer in trunk

>> As I have already suggested on this list, I would really like to move the
>> LanguageIdentifier class (and profiles) to
>> an independent Lucene sub-project (and the MimeType repository too).
>> I don't remember why, but there were some objections to this...
>>
>>
>
> I think most people agree that it would be worthwhile to un-tie
> this component from Nutch internals. The only objections were
> related not to the idea itself, but to the management aspects of
> creating a full-blown sub-project, both wrt. the initial setup
> and the continuing maintenance. An alternative solution was
> proposed (creating a contrib/ package). This would still help to
> separate the code from Nutch internals, so that it can be used in
> other projects, but it would require much less effort to set up and
> maintain.

+1. What about the Lucene sandbox, or just opening a SourceForge project
with the Apache 2 license? Then we can just use the jar.

Stefan







Re: lang identifier and nutch analyzer in trunk

Posted by Stefan Groschupf <sg...@media-style.com>.
>> As I have already suggested on this list, I would really like to move the
>> LanguageIdentifier class (and profiles) to
>> an independent Lucene sub-project (and the MimeType repository too).
>> I don't remember why, but there were some objections to this...
>>
>>
>
> I think most people agree that it would be worthwhile to un-tie
> this component from Nutch internals. The only objections were
> related not to the idea itself, but to the management aspects of
> creating a full-blown sub-project, both wrt. the initial setup
> and the continuing maintenance. An alternative solution was
> proposed (creating a contrib/ package). This would still help to
> separate the code from Nutch internals, so that it can be used in
> other projects, but it would require much less effort to set up and
> maintain.

+1. What about the Lucene sandbox, or just opening a SourceForge project
with the Apache 2 license? Then we can just use the jar.

Stefan




Re: lang identifier and nutch analyzer in trunk

Posted by Jérôme Charron <je...@gmail.com>.
> +1. Other local modifications which I use frequently:
>
> * exporting a list of supported languages,
>
> * exporting an NGramProfile of the analyzed text,
>
> * allow processing of chunks of input (i.e.
> LanguageIdentifier.identify(char[] buf, int start, int len) ) - this is
> very useful if the text to be analyzed is already present in memory, and
> the choice of sections (chunks) is made elsewhere, e.g. for documents
> with clearly outlined sections, or for multi-language documents.

Thanks for these interesting comments, Andrzej => I'll add them to my todo
list.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: lang identifier and nutch analyzer in trunk

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jérôme Charron wrote:
>> Any plan to implement this? I mean moving the LanguageIdentifier class
>> into Nutch core.
>>     
>
> As I have already suggested on this list, I would really like to move the
> LanguageIdentifier class (and profiles) to
> an independent Lucene sub-project (and the MimeType repository too).
> I don't remember why, but there were some objections to this...
>
>   

I think most people agree that it would be worthwhile to un-tie this
component from Nutch internals. The only objections were related not to
the idea itself, but to the management aspects of creating a full-blown
sub-project, both wrt. the initial setup and the continuing
maintenance. An alternative solution was proposed (creating a contrib/
package). This would still help to separate the code from Nutch
internals, so that it can be used in other projects, but it would
require much less effort to set up and maintain.

> Here is a short status of what I have in mind for the next improvements to the
> LanguageIdentifier / MultiLanguage support:
> * Enhance the LanguageIdentifier APIs by returning something like an ordered
> LangDetail[] array when guessing the language (each LangDetail should contain
> the language code and its score) - I have a prototype version of this on my
> disk, but I haven't taken the time to finalize it.
>   

+1. Other local modifications which I use frequently:

* exporting a list of supported languages,

* exporting an NGramProfile of the analyzed text,

* allow processing of chunks of input (i.e. 
LanguageIdentifier.identify(char[] buf, int start, int len) ) - this is 
very useful if the text to be analyzed is already present in memory, and 
the choice of sections (chunks) is made elsewhere, e.g. for documents 
with clearly outlined sections, or for multi-language documents.
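
A minimal sketch of how such a chunk-based call might be used for
multi-language documents. Only the identify(char[] buf, int start, int len)
shape comes from the item above; the ChunkedIdentify class, its nested
Identifier interface and the bounds offsets array are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; only the identify(char[], int, int) shape is from the thread.
final class ChunkedIdentify {

    // Stand-in for an identifier that can work on a slice of a buffer.
    interface Identifier {
        String identify(char[] buf, int start, int len);
    }

    // Identifies each section separately, without copying the buffer here.
    // bounds holds section offsets into buf: section i spans
    // bounds[i] (inclusive) to bounds[i + 1] (exclusive).
    static List<String> identifySections(Identifier id, char[] buf, int[] bounds) {
        List<String> langs = new ArrayList<>();
        for (int i = 0; i + 1 < bounds.length; i++) {
            langs.add(id.identify(buf, bounds[i], bounds[i + 1] - bounds[i]));
        }
        return langs;
    }
}
```

The caller decides where the sections start and end, which matches the idea
that the choice of chunks is made elsewhere.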

> * I encountered some identification problems with some specific sites
> (Blogger, for instance), and I plan to investigate this.
> * Another pending task: the analysis (and coding) of multilingual querying
> support.
>   

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: lang identifier and nutch analyzer in trunk

Posted by Jérôme Charron <je...@gmail.com>.
> Any plan to implement this? I mean moving the LanguageIdentifier class
> into Nutch core.

As I have already suggested on this list, I would really like to move the
LanguageIdentifier class (and profiles) to
an independent Lucene sub-project (and the MimeType repository too).
I don't remember why, but there were some objections to this...

Here is a short status of what I have in mind for the next improvements to the
LanguageIdentifier / MultiLanguage support:
* Enhance the LanguageIdentifier APIs by returning something like an ordered
LangDetail[] array when guessing the language (each LangDetail should contain
the language code and its score) - I have a prototype version of this on my
disk, but I haven't taken the time to finalize it.
* I encountered some identification problems with some specific sites
(Blogger, for instance), and I plan to investigate this.
* Another pending task: the analysis (and coding) of multilingual querying
support.
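
For illustration only, the ordered result proposed in the first item could
look something like the sketch below. It is not the prototype mentioned
above: the field names, the ranked helper and the "higher score is better"
convention are all assumptions.

```java
import java.util.Arrays;
import java.util.Map;

// Hypothetical shape for the proposed result type; not an existing Nutch class.
final class LangDetail {
    final String code;   // language code, e.g. "en"
    final double score;  // assumed: higher means a better match

    LangDetail(String code, double score) {
        this.code = code;
        this.score = score;
    }

    // Turns per-language scores into an array ordered best match first.
    static LangDetail[] ranked(Map<String, Double> scores) {
        LangDetail[] out = new LangDetail[scores.size()];
        int i = 0;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            out[i++] = new LangDetail(e.getKey(), e.getValue());
        }
        // Sort by descending score.
        Arrays.sort(out, (a, b) -> Double.compare(b.score, a.score));
        return out;
    }
}
```

A caller that only wants one answer can take element 0, while one that wants
to check a declared language can look up its position and score.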

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: lang identifier and nutch analyzer in trunk

Posted by Jack Tang <hi...@gmail.com>.
Hi Jérôme

On 1/21/06, Jérôme Charron <je...@gmail.com> wrote:
> > I am wondering whether the Analyzer in the Nutch svn trunk is chosen by the
> > languageidentifier plugin or not. (I know it was in Nutch 0.7.1-dev.)
>
> It's not really chosen by the languageidentifier, but chosen according to the
> value of the lang attribute (for now, that's right, only the
> languageidentifier adds this attribute).
>
>
> > In org.apache.nutch.indexer.Indexer.class line 104
> > writer.addDocument((Document)((ObjectWritable)value).get());
> > It should be
> > NutchAnalyzer analyzer = AnalyzerFactory.get(doc.get("lang"));
> > writer.addDocument((Document)((ObjectWritable)value).get(), analyzer );
> > right?
>
> Yes, it should.
> Thanks for noticing this.
> Merge problem?
> (I don't remember adding this in nutch-0.7 ...)
>
>
> > One more thing: should query parsing call AnalyzerFactory as well? The query
> > input is multilingual too.
>
> The query part is not yet implemented.

Any plan to implement this? I mean moving the LanguageIdentifier class
into Nutch core.

Thanks
/Jack

> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/
>
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: lang identifier and nutch analyzer in trunk

Posted by Jérôme Charron <je...@gmail.com>.
> I am wondering whether the Analyzer in the Nutch svn trunk is chosen by the
> languageidentifier plugin or not. (I know it was in Nutch 0.7.1-dev.)

It's not really chosen by the languageidentifier, but chosen according to the
value of the lang attribute (for now, that's right, only the
languageidentifier adds this attribute).
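
To illustrate the idea of picking an analyzer from the value of the lang
attribute, here is a small stand-alone sketch. It does not reproduce the
real AnalyzerFactory: the AnalyzerRegistry class is hypothetical, and the
fallback to a default analyzer for a missing or unknown lang value is an
assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry keyed by the "lang" value; not Nutch's AnalyzerFactory.
final class AnalyzerRegistry<A> {
    private final Map<String, A> byLang = new HashMap<>();
    private final A fallback; // assumed default when no language matches

    AnalyzerRegistry(A fallback) {
        this.fallback = fallback;
    }

    void register(String lang, A analyzer) {
        byLang.put(lang, analyzer);
    }

    // lang may be null when nothing set the attribute on the document.
    A get(String lang) {
        A analyzer = (lang == null) ? null : byLang.get(lang);
        return (analyzer != null) ? analyzer : fallback;
    }
}
```

Because the lookup depends only on the attribute value, any component that
sets lang would drive the choice, not just the languageidentifier.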


> In org.apache.nutch.indexer.Indexer.class line 104
> writer.addDocument((Document)((ObjectWritable)value).get());
> It should be
> NutchAnalyzer analyzer = AnalyzerFactory.get(doc.get("lang"));
> writer.addDocument((Document)((ObjectWritable)value).get(), analyzer );
> right?

Yes, it should.
Thanks for noticing this.
Merge problem?
(I don't remember adding this in nutch-0.7 ...)


> One more thing: should query parsing call AnalyzerFactory as well? The query
> input is multilingual too.

The query part is not yet implemented.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/