Posted to user@nutch.apache.org by Chee Wu <ch...@gmail.com> on 2007/01/07 10:12:28 UTC

Nutch .81: the process to add a new analyzer ?

Hi,
I am trying to add a new analyzer for Chinese, and I found the
code below in "org.apache.nutch.indexer.Indexer":

    public void write(WritableComparable key, Writable value)
        throws IOException {                  // unwrap & index doc
      Document doc = (Document)((ObjectWritable)value).get();
      NutchAnalyzer analyzer = factory.get(doc.get("lang"));
      if (LOG.isInfoEnabled()) {
        LOG.info(" Indexing [" + doc.getField("url").stringValue() + "]" +
                 " with analyzer " + analyzer +
                 " (" + doc.get("lang") + ")");
      }
      writer.addDocument(doc, analyzer);
    }
My question is about doc.get("lang"): where and how can I set the
"lang" property on the doc? I also found
http://wiki.apache.org/nutch/MultiLingualSupport on the wiki, but I am
still having trouble solving the problem. Can anyone here give me
some help? Any hint is welcome. Thanks!
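
The write() method quoted above picks an analyzer from the document's "lang" field, which is why that field has to be set before indexing. A minimal, self-contained sketch of that lookup pattern follows. The class name, the analyzer names, and the fallback to a default analyzer for a missing or unknown language are all assumptions made for illustration; none of this is Nutch code.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in sketch of a language-keyed analyzer lookup. NOT a Nutch class.
class AnalyzerLookupSketch {
    // Hypothetical registry: language code -> analyzer name.
    private final Map<String, String> byLang = new HashMap<>();
    private final String defaultAnalyzer;

    AnalyzerLookupSketch(String defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    void register(String lang, String analyzer) {
        byLang.put(lang, analyzer);
    }

    // Assumption: a missing or unregistered language falls back to the default.
    String get(String lang) {
        if (lang == null) {
            return defaultAnalyzer;
        }
        return byLang.getOrDefault(lang, defaultAnalyzer);
    }

    public static void main(String[] args) {
        AnalyzerLookupSketch factory = new AnalyzerLookupSketch("default-analyzer");
        factory.register("zh", "chinese-analyzer");
        System.out.println(factory.get("zh"));   // language was tagged
        System.out.println(factory.get(null));   // no "lang" field on the doc
    }
}
```

Under these assumptions, a document without a "lang" field is simply indexed with the default analyzer, so a new Chinese analyzer only takes effect once something tags the document.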

Re: List owner?

Posted by Sami Siren <ss...@gmail.com>.
Owner can be reached at nutch-user-owner@lucene.apache.org.

What kind of error are you experiencing (if any)?

--
 Sami Siren



List owner?

Posted by James Phillips <ja...@keypot.com>.
Can somebody tell me how to contact the owner of this list? I have tried 
on COUNTLESS occasions to remove myself using 
nutch-user-unsubscribe@lucene.apache.org but still keep on receiving 
e-mails.

Regards,

James Phillips


Re: Nutch .81: the process to add a new analyzer ?

Posted by chee wu <ch...@gmail.com>.
Yes, that suits my requirement! Thank you!

----- Original Message ----- 
From: "Sami Siren" <ss...@gmail.com>
To: <nu...@lucene.apache.org>
Sent: Sunday, January 07, 2007 10:41 PM
Subject: Re: Nutch .81: the process to add a new analyzer ?


> chee wu wrote:
>> Thanks Sami. I tried LanguageIndexingFilter, and it seems the LanguageIdentifier can't recognize Chinese now?
> 
> No, it doesn't. The list of supported languages can be checked here (*.ngp):
> http://svn.apache.org/viewvc/lucene/nutch/branches/branch-0.8/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/
> 
> You can build an ngp profile for Chinese, but I think that in the language
> identifier's current form it might not work that well.
> 
> You could also build a specialized identifier and add it as an indexing
> filter - the most basic form could just blindly set lang to Chinese if
> that suits your use case.
> 
> --
> Sami Siren
> 

Re: Nutch .81: the process to add a new analyzer ?

Posted by Sami Siren <ss...@gmail.com>.
e w wrote:
> If someone could explain the reasoning/motivation behind the original

The current n-gram identifier in Nutch works pretty well for most
Western languages. It is also a very simple and quite fast way of
identifying a document's language. However, if the charset of the
document is not detected correctly, the results are not that good.

> identification method that would be helpful. Otherwise, I'd be happy to
> contribute my pseudo-NB hack and maybe even implement the correct version.

Go ahead and attach it to JIRA. I am sure there are plenty of people
interested in such a thing.

--
 Sami Siren
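
To illustrate the charset point, here is a small sketch that decodes the same bytes with the right and the wrong charset. The class and method names are made up for the example. The text is two Chinese characters written as Unicode escapes, so the result does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: shows how a wrong charset guess changes the characters
// that any n-gram based identifier would later see.
class CharsetMismatchSketch {
    // Decode raw bytes under the given charset.
    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587";  // two Chinese characters
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8, StandardCharsets.UTF_8);
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(original));  // true
        System.out.println(wrong.length());          // 6, one char per byte
    }
}
```

With the wrong charset the two original characters turn into six unrelated Latin-1 characters, so the n-grams extracted from the text no longer resemble any profile built from correctly decoded Chinese.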


Re: Nutch .81: the process to add a new analyzer ?

Posted by e w <ep...@gmail.com>.
> You can build an ngp profile for Chinese, but I think that in the language
> identifier's current form it might not work that well.


We rewrote this plugin to use a more naive-Bayes-like identification
approach and got better results for Japanese. It wasn't proper naive Bayes,
but it did work better.

If someone could explain the reasoning/motivation behind the original
identification method, that would be helpful. Otherwise, I'd be happy to
contribute my pseudo-NB hack and maybe even implement the correct version.

-Ed
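
For context, a textbook naive Bayes classifier over character bigrams can be sketched as below. This is not the rewritten plugin mentioned in this message and not Nutch code. The class name, the bigram features, the add-one smoothing, the uniform prior, and the tiny training strings are all assumptions chosen to keep the example short.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Generic naive Bayes over character bigrams with add-one smoothing.
class NaiveBayesLangSketch {
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    private final Map<String, Integer> totals = new HashMap<>();
    private final Set<String> vocab = new HashSet<>();

    // Count every character bigram of the training text for one language.
    void train(String lang, String text) {
        Map<String, Integer> c = counts.computeIfAbsent(lang, k -> new HashMap<>());
        totals.putIfAbsent(lang, 0);
        for (int i = 0; i + 2 <= text.length(); i++) {
            String gram = text.substring(i, i + 2);
            c.merge(gram, 1, Integer::sum);
            totals.merge(lang, 1, Integer::sum);
            vocab.add(gram);
        }
    }

    // Return the language with the highest summed log-likelihood.
    // The prior is uniform, so it is left out of the score.
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            double denom = totals.get(e.getKey()) + vocab.size();
            double score = 0.0;
            for (int i = 0; i + 2 <= text.length(); i++) {
                int n = e.getValue().getOrDefault(text.substring(i, i + 2), 0);
                score += Math.log((n + 1) / denom);
            }
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesLangSketch nb = new NaiveBayesLangSketch();
        nb.train("en", "the quick brown fox jumps over the lazy dog");
        nb.train("fi", "nopea ruskea kettu hyppaa laiskan koiran yli");
        System.out.println(nb.classify("the dog"));  // en
        System.out.println(nb.classify("koira"));    // fi
    }
}
```

Because the features are plain character bigrams, the same sketch applies to text without word boundaries, which is one reason a character-level model is a natural fit for Japanese or Chinese. Real training data would of course need to be far larger than two sentences.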

Re: Nutch .81: the process to add a new analyzer ?

Posted by Sami Siren <ss...@gmail.com>.
chee wu wrote:
> Thanks Sami. I tried LanguageIndexingFilter, and it seems the LanguageIdentifier can't recognize Chinese now?

No, it doesn't. The list of supported languages can be checked here (*.ngp):
http://svn.apache.org/viewvc/lucene/nutch/branches/branch-0.8/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/

You can build an ngp profile for Chinese, but I think that in the language
identifier's current form it might not work that well.

You could also build a specialized identifier and add it as an indexing
filter - the most basic form could just blindly set lang to Chinese if
that suits your use case.

--
 Sami Siren
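
The "blindly set lang" idea can be sketched as follows. To keep the example self-contained, the document is modelled as a plain Map and the filter interface is a stand-in. A real plugin would implement Nutch's IndexingFilter extension point and add the field to the Lucene Document instead; that signature depends on the Nutch version and is not reproduced here. The language code "zh" is the ISO 639-1 code for Chinese, and all class names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for an indexing filter. NOT the Nutch IndexingFilter interface.
interface DocFilterSketch {
    Map<String, String> filter(Map<String, String> doc);
}

// Tags every document with one fixed language code, with no detection at all.
class FixedLanguageFilterSketch implements DocFilterSketch {
    private final String lang;

    FixedLanguageFilterSketch(String lang) {
        this.lang = lang;
    }

    @Override
    public Map<String, String> filter(Map<String, String> doc) {
        doc.put("lang", lang);  // overwrite whatever was there before
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("url", "http://example.com/");
        new FixedLanguageFilterSketch("zh").filter(doc);
        System.out.println(doc.get("lang"));  // zh
    }
}
```

This only makes sense for a crawl that is known to contain a single language, since every page gets the same tag regardless of its content.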



Re: Nutch .81: the process to add a new analyzer ?

Posted by chee wu <ch...@gmail.com>.
Thanks Sami. I tried LanguageIndexingFilter, and it seems the LanguageIdentifier can't recognize Chinese now?


Re: Nutch .81: the process to add a new analyzer ?

Posted by Sami Siren <ss...@gmail.com>.
Chee Wu wrote:
> Hi,
>     I am trying to add a new analyzer for Chinese, and I found the
> code below in the "org.apache.nutch.indexer.Indexer"
> 
> The question of mine is:
> For doc.get("lang"). Where and how can I  set the  "lang" property for

The lang field is put there by the language identifier plugin, if it is active.

http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/lang/LanguageIndexingFilter.html

--
 Sami Siren