
Posted to user@nutch.apache.org by imran khan <im...@gmail.com> on 2013/07/29 11:25:24 UTC

Nutch HTML Parsers & tika-boilerpipe configuration

Greetings,

I am trying to understand the roles of the different HTML parser plugins
(parse-html and parse-tika) in Nutch 2.2.

My plugin.includes contains "parse-(html|tika)" and my parse-plugins.xml has:

<mimeType name="*">
  <plugin id="parse-tika" />
</mimeType>

<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>

<mimeType name="application/xhtml+xml">
  <plugin id="parse-html" />
</mimeType>

Does this mean that the "parse-html" plugin is used for parsing HTML pages?
And to parse my HTML pages with Tika instead, would I simply replace it with
the "parse-tika" plugin?

And if I want to remove boilerplate text (menus, ad text, etc.) from the
'content' field in Nutch, I guess I have to use Tika with boilerpipe?

Where can I configure Nutch to use boilerpipe with Tika and its different
extractors? And is there any setting in Tika/boilerpipe that would
automatically pick the right extractor for the current HTML page?

Regards,
Imran

RE: Nutch HTML Parsers & tika-boilerpipe configuration

Posted by Markus Jelsma <ma...@openindex.io>.

Strange. Please check the logs, and perhaps restore the default settings and config files. I'm very sure it works flawlessly on a vanilla Nutch.
 

Re: Nutch HTML Parsers & tika-boilerpipe configuration

Posted by Saravanakumar Karunanithi <ak...@gmail.com>.

After applying the patch, I tried the following command:

bin/nutch parsechecker -dumpText http://indiatoday.intoday.in/story/google-unveils-android-4.3-jelly-bean-operating-system/1/296208.html

It produced the expected results, but when I run the crawler, I get an
error while parsing for ~98% of the pages.

I get the following error:

"Unable to successfully parse content URL"






-- 
Thanks & Regards,
Saravanakumar Karunanithi

RE: Nutch HTML Parsers & tika-boilerpipe configuration

Posted by Markus Jelsma <ma...@openindex.io>.

Simple: use only parse-tika and patch with NUTCH-961.
https://issues.apache.org/jira/browse/NUTCH-961

Extractor algorithms are fixed; it is not possible to preanalyze a page and select an extractor accordingly.
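
For illustration, here is a minimal sketch of the parse-plugins.xml mapping this implies. It assumes the same mimeType/plugin layout as in the original question, with the HTML types routed to parse-tika instead of parse-html. It is a fragment rather than a complete file: the enclosing root element, the aliases section, and the boilerpipe settings introduced by the NUTCH-961 patch are not shown, so take those from your own files and from the patch itself.

```xml
<!-- Fragment of parse-plugins.xml (sketch only; merge into your existing file). -->

<!-- Fallback for any MIME type without its own entry. -->
<mimeType name="*">
  <plugin id="parse-tika" />
</mimeType>

<!-- Route HTML and XHTML to parse-tika instead of parse-html. -->
<mimeType name="text/html">
  <plugin id="parse-tika" />
</mimeType>

<mimeType name="application/xhtml+xml">
  <plugin id="parse-tika" />
</mimeType>
```

With a mapping like this, plugin.includes can presumably list parse-tika alone rather than parse-(html|tika). Running bin/nutch parsechecker -dumpText against a sample URL is one way to confirm which text the parser actually extracts.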
 
 