Posted to user@nutch.apache.org by Matt MacDonald <ma...@nearbyfyi.com> on 2012/09/10 03:29:34 UTC

Boilerpipe and Nutch 2.x ?

Hi,

I've been looking at the 2.x source code, JIRA and the mailing list for
information about Boilerpipe and Nutch 2.x. I can see that the
boilerpipe.jar file is included in the Tika plugin.xml file:
<library name="boilerpipe-1.1.0.jar"/>. I also see two JIRA tickets
talking about Boilerpipe in Nutch 1.6:

* https://issues.apache.org/jira/browse/NUTCH-961
* https://issues.apache.org/jira/browse/NUTCH-1233

I also see that Tika 1.1 is using Boilerpipe:
http://tika.apache.org/1.1/api/org/apache/tika/parser/html/BoilerpipeContentHandler.html

I've searched the mailing lists and code for the configuration options
I need to set up so that HTML/XHTML documents are parsed by Tika with
Boilerpipe and a specific Extractor. I have added the following to
nutch-site.xml:

<property>
  <name>tika.use_boilerpipe</name>
  <value>true</value>
</property>
<property>
  <name>tika.boilerpipe.extractor</name>
  <value>ArticleExtractor</value>
</property>

And in parse-plugins.xml I have the following:

<mimeType name="*">
  <plugin id="parse-tika" />
</mimeType>
<mimeType name="text/html">
  <plugin id="parse-tika" />
</mimeType>
<mimeType name="application/xhtml+xml">
  <plugin id="parse-tika" />
</mimeType>

When I run my crawl it isn't clear that the Tika parser is being used
for text/html and application/xhtml+xml, and when looking at the
extracted content from the pages that I am crawling I'm seeing lots of
shell/template/wrapper HTML. Questions:

1. Ideas about what I can do to confirm that the Tika parser is being used?
2. Is there a logging setting so that I know that Boilerpipe is being
used to parse the HTML/XHTML?
3. Can I change the Extractor Boilerpipe uses and if so how?
4. Any ideas about what I am missing in my configuration so that
Tika/Boilerpipe is being used to parse those documents?

Thanks,
Matt

Re: Boilerpipe and Nutch 2.x ?

Posted by Ferdy Galema <fe...@kalooga.com>.
Hi,

To be absolutely sure that only Tika is used you should also remove the
parse-html plugin from plugin.includes. Make sure all references to the
parse-html plugin are removed from the parse-plugins.xml. (Looking at your
snippet, it seems this is the case.)
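
As a rough sketch of the plugin.includes change in nutch-site.xml: the
value below is only an assumed example (the other plugin names are
illustrative, so keep whatever your setup already lists). The point is
just that parse-tika appears and parse-html does not.

<!-- Assumed example value; only the absence of parse-html matters. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-tika|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>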

I'm not really familiar with Tika itself or Boilerpipe. (Mostly I use
parse-html.)

Ferdy.
