Posted to dev@nutch.apache.org by Yann Levreau <ya...@gmail.com> on 2014/04/02 17:42:17 UTC
Add Field to crawled content for indexing
Hello,
Maybe this is the wrong place to post a request, so forgive me, but I really
need some help (Nutch 2.2.1):
I need to add a new field to be indexed by ElasticSearch.
In 1.7, we had:
The HtmlParseFilter extension, with:
ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
<http://nutch.apache.org/apidocs-1.7/org/apache/nutch/parse/HtmlParseFilter.html#filter%28org.apache.nutch.protocol.Content,%20org.apache.nutch.parse.ParseResult,%20org.apache.nutch.parse.HTMLMetaTags,%20org.w3c.dom.DocumentFragment%29>
The IndexingFilter extension, with:
NutchDocument filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks)
<http://nutch.apache.org/apidocs-1.7/org/apache/nutch/indexer/IndexingFilter.html#filter%28org.apache.nutch.indexer.NutchDocument,%20org.apache.nutch.parse.Parse,%20org.apache.hadoop.io.Text,%20org.apache.nutch.crawl.CrawlDatum,%20org.apache.nutch.crawl.Inlinks%29>
With these, adding a field worked fine.
In 2.2.1, we have:
The ParseFilter extension:
Parse filter(String url, WebPage page, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc)
<http://nutch.apache.org/apidocs-2.2/org/apache/nutch/parse/ParseFilter.html#filter%28java.lang.String,%20org.apache.nutch.storage.WebPage,%20org.apache.nutch.parse.Parse,%20org.apache.nutch.parse.HTMLMetaTags,%20org.w3c.dom.DocumentFragment%29>
The Parse type has no "getData()" method, so we can't add new metadata.
The IndexingFilter extension:
NutchDocument filter(NutchDocument doc, String url, WebPage page)
<http://nutch.apache.org/apidocs-2.2/org/apache/nutch/indexer/IndexingFilter.html#filter%28org.apache.nutch.indexer.NutchDocument,%20java.lang.String,%20org.apache.nutch.storage.WebPage%29>
There is no Parse parameter here from which to add a field to the NutchDocument.
So what is the new way to add a custom field to the index? Maybe I am missing
something...
Thank you very much!
Re: Add Field to crawled content for indexing
Posted by Talat Uyarer <ta...@uyarer.com>.
In addition to Sebastian's mail: 2.x has an index-metadata filter. If you want
to send any field that is in the metadata to the index, you just list its name
in the configuration.
I recommend you look at index-metadata.
Talat
On 2 Apr 2014 at 23:30, "Sebastian Nagel" <wa...@googlemail.com> wrote:
> Hi Yann,
>
> > In Parse type, we don't have "getData()" so we can't add new metadata.
> ...
> > So what is the new way to add a custom field to the index? Maybe I am missing
> > something...
>
> In 2.x data for custom fields can be added to the WebPage's metadata
> in ParseFilter via
> page.putToMetadata(Utf8 key, ByteBuffer value)
> It's then read in IndexingFilter by
> page.getFromMetadata(Utf8 key)
>
> Sebastian
>
Re: Add Field to crawled content for indexing
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Yann,
> In Parse type, we don't have "getData()" so we can't add new metadata.
...
> So what is the new way to add a custom field to the index? Maybe I am missing
> something...
In 2.x, data for custom fields can be added to the WebPage's metadata
in a ParseFilter via
    page.putToMetadata(Utf8 key, ByteBuffer value)
It is then read in an IndexingFilter with
    page.getFromMetadata(Utf8 key)
Sebastian
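A minimal sketch of the round trip described above, assuming the field value is a UTF-8 string. To keep it runnable without Nutch on the classpath, a plain Map<String, ByteBuffer> stands in for the WebPage metadata (in Nutch 2.x the keys are Utf8 and the calls are page.putToMetadata / page.getFromMetadata, as stated in the mail). The class name, the helper names encode/decode, and the key "my_field" are hypothetical and only serve the illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class MetadataRoundTrip {
    // Stand-in for the WebPage metadata (assumption: the real map is keyed by Utf8).
    static final Map<String, ByteBuffer> metadata = new HashMap<>();

    // ParseFilter side: turn the string value into the ByteBuffer that gets stored.
    static ByteBuffer encode(String value) {
        return ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8));
    }

    // IndexingFilter side: turn the stored ByteBuffer back into a string.
    static String decode(ByteBuffer buf) {
        if (buf == null) {
            return null; // the key was never set for this page
        }
        // duplicate() gives an independent position, so the stored buffer
        // can be read again later without being consumed.
        return StandardCharsets.UTF_8.decode(buf.duplicate()).toString();
    }

    public static void main(String[] args) {
        // What a ParseFilter would do via page.putToMetadata(key, value).
        metadata.put("my_field", encode("some value"));

        // What an IndexingFilter would do via page.getFromMetadata(key).
        String value = decode(metadata.get("my_field"));
        System.out.println(value); // prints: some value
    }
}
```

In an IndexingFilter, the decoded string would then be added to the NutchDocument under whatever field name the index expects. The null check matters because not every page will have passed through the ParseFilter that sets the key.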