Posted to dev@nutch.apache.org by Yann Levreau <ya...@gmail.com> on 2014/04/02 17:42:17 UTC

Add Field to crawled content for indexing

Hello,

This may be the wrong place to post a request, so forgive me, but I really
need some help with Nutch 2.2.1:

I need to add a new field to be indexed by ElasticSearch.

In 1.7, we had:
The HtmlParseFilter extension with:
ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
<http://nutch.apache.org/apidocs-1.7/org/apache/nutch/parse/HtmlParseFilter.html#filter%28org.apache.nutch.protocol.Content,%20org.apache.nutch.parse.ParseResult,%20org.apache.nutch.parse.HTMLMetaTags,%20org.w3c.dom.DocumentFragment%29>

The IndexingFilter extension with:
NutchDocument filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks)
<http://nutch.apache.org/apidocs-1.7/org/apache/nutch/indexer/IndexingFilter.html#filter%28org.apache.nutch.indexer.NutchDocument,%20org.apache.nutch.parse.Parse,%20org.apache.hadoop.io.Text,%20org.apache.nutch.crawl.CrawlDatum,%20org.apache.nutch.crawl.Inlinks%29>

With these, adding a field worked fine.

In 2.2.1 we have:
The ParseFilter extension:
Parse filter(String url, WebPage page, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc)
<http://nutch.apache.org/apidocs-2.2/org/apache/nutch/parse/ParseFilter.html#filter%28java.lang.String,%20org.apache.nutch.storage.WebPage,%20org.apache.nutch.parse.Parse,%20org.apache.nutch.parse.HTMLMetaTags,%20org.w3c.dom.DocumentFragment%29>
The Parse type has no "getData()", so we can't add new metadata.

The IndexingFilter extension:
NutchDocument filter(NutchDocument doc, String url, WebPage page)
<http://nutch.apache.org/apidocs-2.2/org/apache/nutch/indexer/IndexingFilter.html#filter%28org.apache.nutch.indexer.NutchDocument,%20java.lang.String,%20org.apache.nutch.storage.WebPage%29>
There is no Parse parameter to take a field from when adding it to the NutchDocument.

So what is the new way to add a custom field to the index? Maybe I am
missing something...
Thank you very much!

Re: Add Field to crawled content for indexing

Posted by Talat Uyarer <ta...@uyarer.com>.
In addition to Sebastian's mail: 2.x has the index-metadata filter. If you
want to send any field that is in the metadata to the index, you just write
its name in the configuration.

I recommend you look at index-metadata.

Talat
On 2 Apr 2014 at 23:30, "Sebastian Nagel" <wa...@googlemail.com>
wrote:

> Hi Yann,
>
> > The Parse type has no "getData()", so we can't add new metadata.
> ...
> > So what is the new way to add a custom field to the index? Maybe I am
> > missing something...
>
> In 2.x data for custom fields can be added to the WebPage's metadata
> in ParseFilter via
>  page.putToMetadata(Utf8 key, ByteBuffer value)
> It's then read in IndexingFilter by
>  page.getFromMetadata(Utf8 key)
>
> Sebastian
>

Re: Add Field to crawled content for indexing

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Yann,

> The Parse type has no "getData()", so we can't add new metadata.
...
> So what is the new way to add a custom field to the index? Maybe I am
> missing something...

In 2.x data for custom fields can be added to the WebPage's metadata
in ParseFilter via
 page.putToMetadata(Utf8 key, ByteBuffer value)
It's then read in IndexingFilter by
 page.getFromMetadata(Utf8 key)

Sebastian
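
The round trip described above can be sketched in plain Java. This is only an illustration under stated assumptions, not code from Nutch: the class name CustomFieldSketch, the key "my_custom_field" and the HashMap standing in for the WebPage metadata are all hypothetical, and the comments only indicate where the page.putToMetadata / page.getFromMetadata calls would go in a real ParseFilter and IndexingFilter. What it does show is the String <-> ByteBuffer conversion that the two calls need.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/**
 * Stand-alone sketch of the value round trip; no Nutch classes are used.
 * The Map in main() is a hypothetical stand-in for the WebPage metadata.
 */
final class CustomFieldSketch {

    // Hypothetical metadata key, chosen only for this illustration.
    static final String KEY = "my_custom_field";

    /** Parse side: wrap the field value as UTF-8 bytes in a ByteBuffer. */
    static ByteBuffer encode(String value) {
        return ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8));
    }

    /** Index side: turn the stored bytes back into a String; null if nothing was stored. */
    static String decode(ByteBuffer stored) {
        if (stored == null) {
            return null;
        }
        // duplicate() keeps the stored buffer's position untouched, so it can be read again.
        return StandardCharsets.UTF_8.decode(stored.duplicate()).toString();
    }

    public static void main(String[] args) {
        Map<String, ByteBuffer> metadata = new HashMap<>();

        // In a ParseFilter, this is where page.putToMetadata(key, value) would be called,
        // with the key wrapped as a Utf8 (assumption: built from KEY).
        metadata.put(KEY, encode("hello from the parse filter"));

        // In an IndexingFilter, this is where page.getFromMetadata(key) would be called;
        // the decoded String would then be added to the NutchDocument.
        String value = decode(metadata.get(KEY));
        System.out.println(KEY + " = " + value);
    }
}
```

Decoding through duplicate() is a deliberate choice here: reading a ByteBuffer directly advances its position, so a second filter reading the same value would otherwise see an empty buffer.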
