Posted to dev@nutch.apache.org by Canan GİRGİN <ca...@gmail.com> on 2013/04/05 12:24:50 UTC

Re: FetchSchedule and Metadata

Hi,

I wrote a custom FetchSchedule:
public class CustomDefaultFetchSchedule extends DefaultFetchSchedule

Then I tried to use the metadata field, but it is always null:
ByteBuffer blang = page.getFromMetadata(new Utf8(Metadata.LANGUAGE));

For this reason I overrode the getFields method in my
CustomDefaultFetchSchedule class:

  @Override
  public Set<WebPage.Field> getFields() {
    FIELDS.addAll(super.getFields());
    FIELDS.add(WebPage.Field.METADATA);
    return FIELDS;
  }

But nothing changed; the metadata field is still null.
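
As a side note, the getFields override above adds to FIELDS on every call, which changes shared state if FIELDS is a static set. Below is a minimal sketch of the same override returning a fresh copy instead. Field, BaseSchedule and CustomSchedule are hypothetical stand-ins for WebPage.Field, DefaultFetchSchedule and my class, so that the example compiles without Nutch, and the fields listed in the base class are placeholders. It only illustrates the override pattern; it is not a confirmed fix for the null metadata, since the job still has to ask the schedule for its fields.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

public class FetchScheduleFieldsSketch {

    // Hypothetical stand-in for WebPage.Field (placeholder values only).
    enum Field { FETCH_TIME, SCORE, STATUS, METADATA }

    // Hypothetical stand-in for DefaultFetchSchedule.
    static class BaseSchedule {
        // Shared, read-only set of fields the base schedule asks for (assumed).
        private static final Set<Field> FIELDS = Collections.unmodifiableSet(
                EnumSet.of(Field.FETCH_TIME, Field.SCORE, Field.STATUS));

        public Set<Field> getFields() {
            return FIELDS;
        }
    }

    // Hypothetical stand-in for CustomDefaultFetchSchedule.
    static class CustomSchedule extends BaseSchedule {
        @Override
        public Set<Field> getFields() {
            // Copy the parent's fields so the shared set is never modified.
            Set<Field> fields = EnumSet.copyOf(super.getFields());
            fields.add(Field.METADATA);
            return fields;
        }
    }

    public static void main(String[] args) {
        // Prints the field set the custom schedule would request.
        System.out.println(new CustomSchedule().getFields());
    }
}
```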

When I add the metadata field in the GeneratorJob class, everything is fine
and the metadata field is not empty:
  static {
    FIELDS.add(WebPage.Field.FETCH_TIME);
    FIELDS.add(WebPage.Field.SCORE);
    FIELDS.add(WebPage.Field.STATUS);
    FIELDS.add(WebPage.Field.METADATA);
  }

I don't want to change Nutch's own source.
How can I use the METADATA field in a custom FetchSchedule class?


Nutch 2.1 / HBASE




On Mon, Apr 1, 2013 at 4:57 PM, Canan GİRGİN <ca...@gmail.com> wrote:

> Hi All,
>
> I want to crawl only sites whose language is XXX. I wrote a ParseFilter
> to detect the language of each site and store it in the metadata column.
> With this plugin I can prevent crawling the outlinks of sites that are not
> in language XXX, but I cannot prevent re-crawling of the main page. Is
> there any filter I can use? Is it possible with a FetchSchedule? (I need
> to use the metadata column data to filter URLs.)
>
> Note: Content-Language or Accept-Language is not suitable for my case.
>
> Nutch2.1/Hbase
>