
Posted to user@nutch.apache.org by Eyeris Rodriguez Rueda <er...@uci.cu> on 2017/11/15 13:57:45 UTC

Re: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

Hello.

I am using Tika boilerpipe with good results on approximately 2000 websites.
Rushikesh, if Tika boilerpipe is not working for you, it may be because you are not parsing documents with Tika. Please check this configuration
and let us know.

Make sure that the Tika plugin (parse-tika) is activated in the plugin.includes property, then check:

***********************************************
Use tika parser instead of parse-html.

parse-plugins.xml

<mimeType name="text/html">
  <plugin id="parse-tika" />
</mimeType>

<mimeType name="application/xhtml+xml">
  <plugin id="parse-tika" />
</mimeType>
***********************************************

***********************************************
nutch-site.xml
<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or none.
  </description>
</property>

<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
  <description>
  Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
  or CanolaExtractor.
  </description>
</property>
****************************************
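
For the plugin check mentioned first, a minimal sketch of the property in nutch-site.xml could look like this. In Nutch's configuration the property is named plugin.includes. The plugin list shown is an assumption modeled on a typical Nutch 1.x setup, not something taken from this thread, so adapt it to your crawl. What matters is that the regular expression matches parse-tika.

***********************************************
nutch-site.xml
<property>
  <name>plugin.includes</name>
  <!-- Assumed example list; the key part is that parse-tika is matched. -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
***********************************************

With both parse-html and parse-tika enabled like this, the parse-plugins.xml mapping above decides which parser handles text/html.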

----- Original message -----
From: "Markus Jelsma" <ma...@openindex.io>
To: user@nutch.apache.org
Sent: Tuesday, 14 November 2017 17:40:08
Subject: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

Hello Rushikesh - why is Boilerpipe not working for you? Are you having trouble getting it configured? It is really just a matter of setting a property value. Or does it work, but not to your satisfaction?

The Bayan solution should work, theoretically, but only with a lot of tedious manual per-site configuration.

Regards,
Markus

-----Original message-----
> From:Rushikesh K <ru...@gmail.com>
> Sent: Tuesday 14th November 2017 23:30
> To: user@nutch.apache.org
> Cc: Sebastian Nagel <wa...@googlemail.com>; betancourt.jorge@gmail.com
> Subject: Re: Removing header,Footer and left menus while crawling
> 
> Hello,
> 
> *Jorge*
> Thanks for the response, and sorry for the confusion. I am using Nutch 1.13,
> and I tried configuring Tika boilerpipe with this version, but it doesn't
> work for me. You suggested writing my own parser, but I am not a Java
> developer.
> If you or anyone in the community has a working file, it would be great if
> you could share it, because many people are facing this issue (I found that
> out when I googled).
> 
> Mark Vega
> We also tried the Bayan Group extractor plugin with Nutch 1.13, but it is
> not working either. We followed the same steps. I can share the changes if
> you want to take a look.
> 
> I appreciate your quick suggestions!
> 
> Thanks
> Rushikesh
> 
> On Tue, Nov 14, 2017 at 8:34 AM, Jorge Betancourt <
> betancourt.jorge@gmail.com> wrote:
> 
> > Hello Rushikesh,
> >
> > Are you using Nutch 1.3 or Nutch 1.13? If you're using Nutch 1.13, then you
> > could use the Tika boilerpipe implementation. In nutch-site.xml you
> > need to enable this feature with:
> >
> > <property>
> >   <name>tika.extractor</name>
> >   <value>boilerpipe</value>
> >   <description>
> >   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
> >   </description>
> > </property>
> >
> > And configure the proper extractor with
> > the tika.extractor.boilerpipe.algorithm setting.
> >
> > This is not a perfect solution, but I've used it successfully in the past.
> > Of course, your results will depend on the structure (markup) of the
> > website.
> >
> > Another option could be to implement your own parser if you need to have more
> > control over what to include/exclude from the HTML. You can take a look at
> > this issue https://issues.apache.org/jira/browse/NUTCH-585 which contains
> > some info and old patches.
> >
> > Best Regards,
> > Jorge
> >
> > On Mon, Nov 13, 2017 at 8:58 PM Rushikesh K <ru...@gmail.com>
> > wrote:
> >
> > > Hello Sebastian,
> > > We are excited to be using Nutch 1.3 (with Solr 6.4) for crawling our
> > > website, and we are happy with the search results, but we have a
> > > requirement to skip the header, footer, left menus and some other parts
> > > of the page. Can you please advise how we can exclude those parts? I
> > > tried various approaches found via Google, but nothing works for me.
> > >
> > > Thanks in advance for your help!
> > > --
> > > Regards
> > > Rushikesh M
> > > .Net Developer
> > >
> >
> 
> 
> 
> -- 
> Regards
> Rushikesh M
> .Net Developer
> 

Re: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

Posted by Rushikesh K <ru...@gmail.com>.

Hello,
Eyeris - Thanks for your response. I was able to get Tika boilerpipe
working, but now I have a different problem: some of the crawled pages
don't have the expected data.
For some pages it brings back only the *Title* and skips all the content. I
am not sure in which special cases this happens. In my case I have two
problems now:
1. When a page has an image and 1 or 2 lines of text, it doesn't pick up
those lines (the text is in a <p> tag).
2. Why is it adding the *Title* to the start of the *content*? Is there a
way not to include it?

For example, see the following image: for the first URL it came back without
any data.

[image: Inline image 1]


-- 
Regards
Rushikesh M
.Net Developer

RE: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

Posted by Markus Jelsma <ma...@openindex.io>.

You could do that, but you would need to fiddle around in TikaParser.java. Using TeeContentHandler, you can add both the normal ContentHandler and the Boilerpipe version.

 
 
-----Original message-----
> From:Michael Coffey <mc...@yahoo.com.INVALID>
> Sent: Wednesday 15th November 2017 20:34
> To: user@nutch.apache.org
> Subject: Re: [MASSMAIL]RE: Removing header,Footer and left menus while crawling
> 
> I am curious, is it possible to send boilerpipe output to Solr as a separate "plaintext" field, in addition to the usual "content" field (rather than replacing it)? If so, would someone please give an overview of how to do it?
>

Re: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

Posted by Michael Coffey <mc...@yahoo.com.INVALID>.

I am curious, is it possible to send boilerpipe output to Solr as a separate "plaintext" field, in addition to the usual "content" field (rather than replacing it)? If so, would someone please give an overview of how to do it?