Posted to user@nutch.apache.org by Savannah Beckett <sa...@yahoo.com> on 2010/08/27 01:41:59 UTC

How do I determine language of the document in Parse Filter?

Hi,
  How do I determine the language of the document inside a parse filter 
function?  I am writing my own parse filter: 


 public ParseResult filter(Content content, ParseResult parseResult,
                              HTMLMetaTags metaTags, DocumentFragment doc) 

I am trying to call doc.get("lang"), but the compiler complains that it cannot 
find the symbol for get() in the DocumentFragment interface.  


Thanks.



      

Re: How do I determine language of the document in Parse Filter?

Posted by Julien Nioche <li...@gmail.com>.
Hi,

The language is determined by the HTMLLanguageParser which is a ParseFilter
as well. You'll need to make sure that your parse filter is called after it
(have a look in nutch-default.xml for the exact name of the param). As you
can see in HTMLLanguageParser, the value is put in the parse metadata :

      parse.getData().getParseMeta().set(Metadata.LANGUAGE, lang);

Simply do something like this in your code (Nutch 1.x):

    // Look up the Parse for this URL, then read the language from its parse metadata.
    Parse parse = parseResult.get(content.getUrl());
    Metadata metadata = parse.getData().getParseMeta();
    String lang = metadata.get(Metadata.LANGUAGE);


Note that the HTMLLanguageParser simply uses the language code returned in
the HTTP header or specified in the HTML code. The statistical guessing of
the language is not done until indexing.
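
Since the value comes only from the header or the markup, it may well be
missing for some pages, so it is probably worth guarding against a null or
empty result before using it. A minimal sketch, assuming a hypothetical helper
class LangFallback and a default value of "unknown" (both names are made up
for illustration, they are not part of Nutch):

```java
public class LangFallback {

    // Hypothetical helper, not part of Nutch: returns the trimmed language
    // code, or defaultLang when no usable value was found.
    static String languageOrDefault(String lang, String defaultLang) {
        if (lang == null) {
            return defaultLang;
        }
        String trimmed = lang.trim();
        return trimmed.isEmpty() ? defaultLang : trimmed;
    }

    public static void main(String[] args) {
        // In a real parse filter the first argument would be the value read
        // from the parse metadata, i.e. metadata.get(Metadata.LANGUAGE).
        System.out.println(languageOrDefault(null, "unknown"));
        System.out.println(languageOrDefault(" en ", "unknown"));
    }
}
```

In the filter itself you would pass the result of
metadata.get(Metadata.LANGUAGE) as the first argument.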

HTH

Julien


-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
