Posted to user@nutch.apache.org by dogrdon <dg...@planning.org> on 2013/07/17 21:35:46 UTC

Nutch how to crawl but not index the site navigation (w/ Solr)

Admittedly this is a cross-post from Stack Overflow, but I don't know if there
are a whole lot of Nutch folks over there.

My question is about crawling HTML navigation menus without indexing the
text of those links in Solr.

I have seen some older discussions, from several years ago, about making
this an option in later development, but I am not finding anything
via searching that gives a good indication of how one might exclude site
navigation menu content from the content that Nutch indexes to Solr during a
crawl.

That is, I am seeing the navigation menu text in every document that gets
indexed, and this damages search because every document then contains the same
text. Obviously I want to keep using the site navigation for crawling,
but I don't want it indexed. Is there a best practice for accomplishing this
with Nutch? Like a way to wrap the navigation in some kind of tag, for
example?

I am new to Nutch (obviously), so I don't know the best place to do this.

Thanks very much.




--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-how-to-crawl-but-not-index-the-site-navigation-w-Solr-tp4078702.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Nutch how to crawl but not index the site navigation (w/ Solr)

Posted by Joe Zhang <sm...@gmail.com>.

This is good to know, Markus. This presents some challenges:

- In wide-spectrum crawling, it is hard to know the page structure ahead of
time.
- Even if we do, how do we specify something conditional in nutch-site.xml?


On Wed, Jul 17, 2013 at 2:10 PM, Markus Jelsma <ma...@openindex.io> wrote:

> Yes! Boilerpipe is the best open source alternative and has a working
> patch for Nutch! There are also some other open source extraction toolkits,
> but they have not been ported to Tika or do not work directly with SAX
> ContentHandlers (usable in Tika), so they would require some work there plus
> integration in Nutch.
>
> The problem with Boilerpipe is that it has different extractors, so you
> must use ArticleExtractor for article pages but CanolaExtractor for pages
> with many blocks.

RE: Nutch how to crawl but not index the site navigation (w/ Solr)

Posted by Markus Jelsma <ma...@openindex.io>.

Yes! Boilerpipe is the best open source alternative and has a working patch for Nutch! There are also some other open source extraction toolkits, but they have not been ported to Tika or do not work directly with SAX ContentHandlers (usable in Tika), so they would require some work there plus integration in Nutch.

The problem with Boilerpipe is that it has different extractors, so you must use ArticleExtractor for article pages but CanolaExtractor for pages with many blocks.
 

Re: Nutch how to crawl but not index the site navigation (w/ Solr)

Posted by dogrdon <dg...@planning.org>.

Thanks Sebastian, 

I think I will try looking into the HtmlParseFilter since we do have control
over the content we are crawling and indexing. 




Re: Nutch how to crawl but not index the site navigation (w/ Solr)

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi,

the answer depends on the use case:

1. To remove navigation from any page while crawling a lot of sites: see
NUTCH-961. The "boilerpipe" heuristics often give impressive results, but
sometimes they fail.

2. For a couple of sites that you control or know well: implement a parse
filter plugin
(http://nutch.apache.org/apidocs-1.7/org/apache/nutch/parse/HtmlParseFilter.html).
The "filter" method should then return a ParseResult with the ParseText
replaced. To cleanse the ParseText you have to construct the plain text from
the DOM anew while skipping certain navigation elements (by element name,
class name, or id).
See also:
 http://lucene.472066.n3.nabble.com/Can-Nutch-index-parse-targeted-sections-of-a-web-page-td1785541.html
 http://lucene.472066.n3.nabble.com/how-to-adjust-content-tt3584160.html
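
The DOM walk from option 2 could look roughly like the sketch below. It uses
only JDK classes, so it shows the text reconstruction and nothing else: the
class name NavSkippingText, the three skip lists, and the parseXml demo helper
are all hypothetical and not part of Nutch. In a real plugin you would
presumably walk the DocumentFragment that HtmlParseFilter.filter receives and
use the returned string to build the replacement ParseText; that wiring is
left out here.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper: rebuilds plain text from a DOM, skipping navigation subtrees. */
class NavSkippingText {

  // Assumed skip rules; adjust them to the markup of the sites you crawl.
  static final Set<String> SKIP_ELEMENTS = new HashSet<>(Arrays.asList("nav", "script", "style"));
  static final Set<String> SKIP_CLASSES = new HashSet<>(Arrays.asList("menu", "navigation"));
  static final Set<String> SKIP_IDS = new HashSet<>(Arrays.asList("nav", "sidebar"));

  /** Returns the text below root, whitespace-normalized, without skipped subtrees. */
  static String extract(Node root) {
    StringBuilder sb = new StringBuilder();
    walk(root, sb);
    return sb.toString().replaceAll("\\s+", " ").trim();
  }

  private static void walk(Node node, StringBuilder sb) {
    if (node.getNodeType() == Node.ELEMENT_NODE && isSkipped((Element) node)) {
      return; // drop the element together with everything below it
    }
    if (node.getNodeType() == Node.TEXT_NODE) {
      sb.append(node.getNodeValue()).append(' ');
    }
    for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
      walk(child, sb);
    }
  }

  private static boolean isSkipped(Element el) {
    // HTML parsers may report upper-case tag names, so compare case-insensitively.
    if (SKIP_ELEMENTS.contains(el.getTagName().toLowerCase(Locale.ROOT))) {
      return true;
    }
    // getAttribute returns "" for a missing attribute, which matches no rule.
    if (SKIP_IDS.contains(el.getAttribute("id"))) {
      return true;
    }
    for (String cls : el.getAttribute("class").trim().split("\\s+")) {
      if (SKIP_CLASSES.contains(cls)) {
        return true;
      }
    }
    return false;
  }

  /** Demo-only: parses well-formed markup with the JDK XML parser. */
  static Node parseXml(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException("could not parse demo markup", e);
    }
  }

  public static void main(String[] args) {
    String page = "<html><body><nav><a href='/a'>Home</a></nav>"
        + "<ul class='top menu'><li>About</li></ul>"
        + "<p>Actual article text</p></body></html>";
    System.out.println(extract(parseXml(page)));
  }
}
```

Because a skipped element is dropped with its whole subtree, the link text
inside the menu never reaches the output, while link extraction for crawling
is a separate step and is not touched by this sketch. A space is appended
after every text node, so inline markup can leave a stray space before
punctuation; that is usually harmless for indexing.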

Cheers,
Sebastian
