Posted to user@nutch.apache.org by Zoltán Zvara <zo...@gmail.com> on 2017/07/26 17:18:09 UTC

After Parse extension point

Dear Community,

I am looking for the extension point that executes after parse and before update.
I would also be glad to read more about how extension points are structured and in which order they run. My first impression of Nutch is that it is heavily under-documented, or that the existing documentation is outdated. I would like to look into the details of how the plugin system works and how extension points are controlled and run by Nutch.

Best,
Zoltán

Re: After Parse extension point

Posted by Zoltán Zvara <zo...@gmail.com>.
Hi Yossi and Jorge,

Thanks for your detailed answer and guidance! I will look into the materials immediately.
We started to use Nutch 1.X extensively, and we would definitely contribute improvements back to the main code base if possible.

Zoltán
On 2017-07-27 13:15:24, Jorge Betancourt <be...@gmail.com> wrote:
[...]

Re: After Parse extension point

Posted by Jorge Betancourt <be...@gmail.com>.
Hi Zoltán,

You can take a look at [1], where you will find some documentation.
Although it says it was last updated for version 1.8, we do not change the
extension points that often. You can also take a look at the code [2]
related to the plugin subsystem. It is true that the documentation is not
ideal, but reading the code and the tests can give you a really good
overview.

You didn't mention which version of Nutch you are using. Depending on what
you're trying to do, you'll need an HtmlParseFilter (which lets you
extract information from the parsed HTML) and/or an IndexingFilter, which
lets you customize the information before the document is sent to
Solr/ES (this is probably what you want).
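
To make the split between the two filter types concrete, here is a minimal sketch of the hand-off: a parse-stage filter derives a value from the HTML and stores it as parse metadata, and an index-stage filter later copies that value into the document sent to the indexer. It deliberately does not use the Nutch API, so it runs without Nutch on the classpath. The interfaces `ParseStageFilter` and `IndexStageFilter`, the metadata key `x.anchor.count`, and the field `anchor_count` are all hypothetical stand-ins; check the Nutch API docs for the real HtmlParseFilter and IndexingFilter signatures.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Stand-in model of the parse -> index hand-off. None of the types below are
 * Nutch classes; they only mimic the roles of HtmlParseFilter and
 * IndexingFilter so the sketch runs without Nutch on the classpath.
 */
final class ParseThenIndexSketch {

    /** Hypothetical parse-stage hook: sees the raw HTML, writes parse metadata. */
    interface ParseStageFilter {
        void filter(String html, Map<String, String> parseMetadata);
    }

    /** Hypothetical index-stage hook: sees parse metadata, writes index fields. */
    interface IndexStageFilter {
        void filter(Map<String, String> parseMetadata, Map<String, String> indexDocument);
    }

    /** Parse stage: count opening anchor tags and remember the count. */
    static final class AnchorCountParseFilter implements ParseStageFilter {
        private static final Pattern ANCHOR =
                Pattern.compile("<a\\s", Pattern.CASE_INSENSITIVE);

        @Override
        public void filter(String html, Map<String, String> parseMetadata) {
            Matcher m = ANCHOR.matcher(html);
            int count = 0;
            while (m.find()) {
                count++;
            }
            parseMetadata.put("x.anchor.count", Integer.toString(count));
        }
    }

    /** Index stage: expose the stored count as a field of the indexed document. */
    static final class AnchorCountIndexFilter implements IndexStageFilter {
        @Override
        public void filter(Map<String, String> parseMetadata, Map<String, String> indexDocument) {
            String count = parseMetadata.get("x.anchor.count");
            if (count != null) {
                indexDocument.put("anchor_count", count);
            }
        }
    }

    /** Runs the parse-stage filter first, then the index-stage filter. */
    static Map<String, String> run(String html) {
        Map<String, String> parseMetadata = new LinkedHashMap<>();
        new AnchorCountParseFilter().filter(html, parseMetadata);

        Map<String, String> indexDocument = new LinkedHashMap<>();
        new AnchorCountIndexFilter().filter(parseMetadata, indexDocument);
        return indexDocument;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href='/a'>A</a> <A href='/b'>B</A></body></html>";
        System.out.println(run(html)); // prints {anchor_count=2}
    }
}
```

If the value only needs to be computed from the page, the parse-stage half is enough; the index-stage half matters when the value has to appear as a field in Solr/ES.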

I wrote a post about the IndexingFilter using a practical case some time
ago [3]. It doesn't go too deep, but it could help. If you want to look at
the code, check [4], which is the version that was merged into Nutch
master.

We always welcome new contributions: you could help improve the existing
documentation or add new documentation for the parts that are less
documented.

[1] https://wiki.apache.org/nutch/AboutPlugins
[2] https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin
[3] https://jorgelbg.wordpress.com/2014/08/30/indexing-inlinks-and-outlinks-with-nutch-1-x/
[4] https://github.com/apache/nutch/tree/master/src/plugin/mimetype-filter


On Thu, Jul 27, 2017 at 11:30 AM Yossi Tamari <yo...@pipl.com> wrote:

> [...]

RE: After Parse extension point

Posted by Yossi Tamari <yo...@pipl.com>.
Hi Zoltan,

I think what you want is an HtmlParseFilter: https://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/parse/HtmlParseFilter.html
I recommend you read https://florianhartl.com/nutch-plugin-tutorial.html and take a look at one of the included HtmlParseFilters, e.g. parsefilter-regex.
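
As a rough idea of what a regex-driven parse filter does, here is a small sketch in the spirit of parsefilter-regex: each rule pairs a field name with a regular expression, and the filter records whether the expression matched the page. This is not the plugin's code or its rule-file format, and it does not use the Nutch API; the class `RegexRuleFilterSketch`, its methods, and the field names are hypothetical, chosen only so the example runs on its own.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/**
 * Regex-driven parse filter in the spirit of parsefilter-regex. This is NOT
 * the plugin's code or rule-file format; every name here is hypothetical.
 */
final class RegexRuleFilterSketch {

    /** Field name -> pattern; insertion order is the evaluation order. */
    private final Map<String, Pattern> rules = new LinkedHashMap<>();

    /** Registers a case-insensitive rule and returns this filter for chaining. */
    RegexRuleFilterSketch addRule(String field, String regex) {
        rules.put(field, Pattern.compile(regex, Pattern.CASE_INSENSITIVE));
        return this;
    }

    /** Evaluates each rule against the page and records "true" or "false". */
    Map<String, String> filter(String html) {
        Map<String, String> parseMetadata = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            boolean found = rule.getValue().matcher(html).find();
            parseMetadata.put(rule.getKey(), Boolean.toString(found));
        }
        return parseMetadata;
    }

    public static void main(String[] args) {
        RegexRuleFilterSketch filter = new RegexRuleFilterSketch()
                .addRule("has_form", "<form\\b")
                .addRule("mentions_nutch", "\\bnutch\\b");
        System.out.println(filter.filter("<p>Apache Nutch is a crawler.</p>"));
        // prints {has_form=false, mentions_nutch=true}
    }
}
```

In a real plugin the resulting values would be written into the parse metadata of the page rather than returned as a plain map.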

If you have more specific questions, I may be able to help.

	Yossi.
