Posted to user@nutch.apache.org by A Laxmi <a....@gmail.com> on 2013/07/12 17:15:06 UTC

Nutch(2.2.1) How to extract a proper snippet text from a crawled site to display under search result?

I crawled a set of URLs using Nutch 2.2.1, with the data stored in MySQL,
and indexed it using Solr. Now that I want to display the search
results on the front end (using 'ajax-solr'), I am not sure how to display a
snippet below the title the way Google does.

When the Nutch crawler crawls a site, it grabs all the text on each page,
including the text in the banner, navigation, etc., into a field called
'text' (it used to be called 'content'). If I use that 'text'
column as the snippet on the search results page, it looks odd, because
the snippet looks something like this -

*Publications [Jump to the main content of this page]  Home Publications
Home Author's Corner All Publications Advanced Search Site Map   Search
Online Publications     Ordering printed copies. Electronic Mailing List :
Keep informed about our new publications. Technical Help : Problems or
questions with our site? *

As you can see in the sample snippet above, it shows the banner text of the
site along with navigation such as '[Jump to the main content of this page]'
and a lot of unnecessary information, rather than a description of the site.

I have to crawl sites with an unknown or poor structure over which I have no
control. How can I display a proper snippet with less garbage in the
search results (something similar to the snippet in a Google search
result)?

RE: Nutch(2.2.1) How to extract a proper snippet text from a crawled site to display under search result?

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

This is always an interesting problem. You can either buy or build your own extraction software, or be satisfied with what Boilerpipe has to offer. Tika has support for Boilerpipe, and NUTCH-961 has a patch that enables Boilerpipe for 2.x as well.

https://issues.apache.org/jira/browse/NUTCH-961

Be careful: although Boilerpipe does a good job in general, it is not an all-purpose library and sometimes does a bad job. If your pages are semi-well structured, it will usually be good enough.
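
To illustrate the kind of block-level filtering that boilerplate removal relies on, here is a minimal Python sketch. It is not Boilerpipe's actual algorithm and not part of Nutch or Tika; it is a simplified heuristic that assumes navigation and banner blocks tend to be short and made up mostly of link text. The names BlockCollector and extract_snippet and all thresholds are hypothetical and would need tuning per site.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption, not an exhaustive list).
BLOCK_TAGS = {"p", "div", "td", "li", "tr", "table", "ul", "ol", "br", "body",
              "h1", "h2", "h3", "h4", "h5", "h6",
              "section", "article", "nav", "header", "footer"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts the words inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # (text, total_words, link_words) per block
        self._words = []
        self._link_words = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current block, if it holds any text, and start a new one.
        if self._words:
            self.blocks.append(
                (" ".join(self._words), len(self._words), self._link_words))
        self._words = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return  # ignore script and style contents
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_snippet(html, min_words=10, max_link_density=0.3, max_chars=200):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = [text for text, total, linked in parser.blocks
            if total >= min_words and linked / total <= max_link_density]
    snippet = " ".join(kept)
    if len(snippet) > max_chars:
        # Cut at the last word boundary before the limit.
        snippet = snippet[:max_chars].rsplit(" ", 1)[0] + "..."
    return snippet
```

Calling extract_snippet(raw_html) on the fetched page before indexing would give a candidate value for a separate snippet field. As with Boilerpipe, a heuristic like this will misfire on some pages, for example ones whose main content is itself a list of links.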

Cheers
 