Posted to dev@nutch.apache.org by Canan GİRGİN <ca...@gmail.com> on 2013/04/05 12:24:50 UTC
Re: FetchSchedule and Metadata
Hi,
I wrote a custom FetchSchedule:
public class CustomDefaultFetchSchedule extends DefaultFetchSchedule
Then I tried to use the metadata field, but it is always null:
ByteBuffer blang = page.getFromMetadata(new Utf8(Metadata.LANGUAGE));
For this reason I overrode the getFields method in my
custom CustomDefaultFetchSchedule class:
@Override
public Set<WebPage.Field> getFields() {
    FIELDS.addAll(super.getFields());
    FIELDS.add(WebPage.Field.METADATA);
    return FIELDS;
}
But this has no effect: the metadata field is still null.
When I add the metadata field in the GeneratorJob class, everything is fine and
the metadata field is not empty:
static {
    FIELDS.add(WebPage.Field.FETCH_TIME);
    FIELDS.add(WebPage.Field.SCORE);
    FIELDS.add(WebPage.Field.STATUS);
    FIELDS.add(WebPage.Field.METADATA);
}
I don't want to change Nutch's own source.
How can I use the METADATA field in a custom FetchSchedule class?
Nutch 2.1 / HBase
On Mon, Apr 1, 2013 at 4:57 PM, Canan GİRGİN <ca...@gmail.com> wrote:
> Hi All,
>
> I want to crawl only sites whose language is XXX. I wrote a
> ParseFilter to detect the language of each site and store it in the metadata
> column. With this plugin I can prevent crawling the outlinks of sites that are
> not in language XXX, but I cannot prevent re-crawling of the main page. Is
> there any filter I can use? Is it possible with a FetchSchedule? (I need to
> use the metadata column data for filtering URLs.)
>
> Note: Content-Language or Accept-Language is not suitable for my case.
>
> Nutch 2.1 / HBase
>