Posted to user@nutch.apache.org by "Richardson, Jacquelyn F." <fl...@ornl.gov> on 2015/03/26 16:19:52 UTC

Ignore navigation during index

Hi,

Is there a way to tell nutch to ignore the navigation or footer parts of an html page during the crawl process?  Specifically I do not want the information in the navigation or footer to be indexed.  My environment is Windows 7 with Cygwin, Java 1.7, nutch 1.9 (binary not source) and solr 4.7.

Any assistance will be greatly appreciated.

Thanks,
Jackie


Re: [MASSMAIL]RE: [MASSMAIL]RE: Ignore navigation during index

Posted by Jorge Luis Betancourt González <jl...@uci.cu>.
Glad it helped!

Regards,

----- Original Message -----
From: "Jacquelyn F. Richardson" <fl...@ornl.gov>
To: user@nutch.apache.org
Sent: Friday, March 27, 2015 11:04:11 AM
Subject: [MASSMAIL]RE: [MASSMAIL]RE: Ignore navigation during index

Hi Jorge,

I was able to do what you suggested below and with success!  Thanks so much for the help!

Jackie

-----Original Message-----
From: Jorge Luis Betancourt González [mailto:jlbetancourt@uci.cu] 
Sent: Thursday, March 26, 2015 3:01 PM
To: user@nutch.apache.org
Subject: Re: [MASSMAIL]RE: Ignore navigation during index

The patch you mention should work nicely as long as you can provide the tags that you want excluded, so for an internal intranet or a set of sites that don't change much it should do the job. The Boilerpipe technique suggested by Markus is a more general solution, as it uses a library that applies some clever techniques to distinguish what is actually content from what is "noise" in the web page. The choice is yours!

As for applying the patches, you should check out the source code for the version you're using and then apply the patch from the root of the checked-out code. This command should do the trick (the patch file attached to the issue should be downloaded first).

patch -p0 < ~/Downloads/NUTCH-1928v5.patch

Afterwards you just need to compile a new binary from the patched source following the instructions in the README file.

Regards,

----- Original Message -----
From: "Jacquelyn F. Richardson" <fl...@ornl.gov>
To: user@nutch.apache.org
Sent: Thursday, March 26, 2015 11:57:41 AM
Subject: [MASSMAIL]RE: Ignore navigation during index

Hi Markus,

Thanks for the reply.  While waiting I found this:
https://issues.apache.org/jira/browse/NUTCH-585

Are you familiar with this patch?  How does this compare with your suggestion?

There are three attachments on the page.  Which is the correct patch?

I have never applied a patch to nutch before.  Could you point me in the right direction as far as instructions for applying a patch to my environment?

Jackie

-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
Sent: Thursday, March 26, 2015 11:33 AM
To: user@nutch.apache.org
Subject: RE: Ignore navigation during index

Hello - check out NUTCH-961. It adds support for Boilerpipe to Nutch's Tika parser. It's crude but works reasonably.
https://issues.apache.org/jira/browse/NUTCH-961

Markus
 
 
-----Original message-----
> From:Richardson, Jacquelyn F. <fl...@ornl.gov>
> Sent: Thursday 26th March 2015 16:20
> To: user@nutch.apache.org
> Subject: Ignore navigation during index
> 
> Hi,
> 
> Is there a way to tell nutch to ignore the navigation or footer parts of an html page during the crawl process?  Specifically I do not want the information in the navigation or footer to be indexed.  My environment is Windows 7 with Cygwin, Java 1.7, nutch 1.9 (binary not source) and solr 4.7.
> 
> Any assistance will be greatly appreciated.
> 
> Thanks,
> Jackie
> 
> 


RE: [MASSMAIL]RE: Ignore navigation during index

Posted by "Richardson, Jacquelyn F." <fl...@ornl.gov>.
Hi Jorge,

I was able to do what you suggested below and with success!  Thanks so much for the help!

Jackie


Re: [MASSMAIL]RE: Ignore navigation during index

Posted by Jorge Luis Betancourt González <jl...@uci.cu>.
The patch you mention should work nicely as long as you can provide the tags that you want excluded, so for an internal intranet or a set of sites that don't change much it should do the job. The Boilerpipe technique suggested by Markus is a more general solution, as it uses a library that applies some clever techniques to distinguish what is actually content from what is "noise" in the web page. The choice is yours!
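
To make the tag-based idea concrete, here is a minimal Python sketch of dropping the text inside a configured set of elements while keeping everything else. It is only an illustration of the principle and is not the code from the NUTCH-585 patch, which is Java and lives inside the Nutch parser. The names ContentTextExtractor, extract_text, and EXCLUDED_TAGS are hypothetical, and the choice of "nav" and "footer" as the excluded elements is an assumption about how the site marks up its navigation and footer.

```python
from html.parser import HTMLParser

# Hypothetical exclusion list: adjust to the elements that wrap
# navigation and footer markup on the sites being crawled.
EXCLUDED_TAGS = {"nav", "footer"}


class ContentTextExtractor(HTMLParser):
    """Collects page text, skipping anything inside an excluded element."""

    def __init__(self, excluded_tags=EXCLUDED_TAGS):
        super().__init__(convert_charrefs=True)
        self.excluded_tags = set(excluded_tags)
        self.skip_depth = 0   # > 0 while inside at least one excluded element
        self.chunks = []      # text runs that survive the filter

    def handle_starttag(self, tag, attrs):
        # Only excluded elements change the depth, so unclosed inner
        # tags such as <li> cannot leave the extractor stuck in skip mode.
        if tag in self.excluded_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.excluded_tags and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def extract_text(html, excluded_tags=EXCLUDED_TAGS):
    """Return the visible text of `html` without the excluded elements."""
    parser = ContentTextExtractor(excluded_tags)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.chunks)


page = """
<html><body>
  <nav><ul><li><a href="/">Home</a><li><a href="/about">About</a></ul></nav>
  <div id="main"><h1>Report</h1><p>Quarterly results are in.</p></div>
  <footer>Copyright Example Corp</footer>
</body></html>
"""

print(extract_text(page))  # prints: Report Quarterly results are in.
```

The sketch also shows why this approach suits sites whose layout is known and stable: it only works when the unwanted blocks can be named in advance, whereas a Boilerpipe-style extractor tries to infer them from the page itself.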

As for applying the patches, you should check out the source code for the version you're using and then apply the patch from the root of the checked-out code. This command should do the trick (the patch file attached to the issue should be downloaded first).

patch -p0 < ~/Downloads/NUTCH-1928v5.patch

Afterwards you just need to compile a new binary from the patched source following the instructions in the README file.

Regards,


RE: Ignore navigation during index

Posted by "Richardson, Jacquelyn F." <fl...@ornl.gov>.
Hi Markus,

Thanks for the reply.  While waiting I found this:
https://issues.apache.org/jira/browse/NUTCH-585

Are you familiar with this patch?  How does this compare with your suggestion?

There are three attachments on the page.  Which is the correct patch?

I have never applied a patch to nutch before.  Could you point me in the right direction as far as instructions for applying a patch to my environment?

Jackie


RE: Ignore navigation during index

Posted by Markus Jelsma <ma...@openindex.io>.
Hello - check out NUTCH-961. It adds support for Boilerpipe to Nutch's Tika parser. It's crude but works reasonably.
https://issues.apache.org/jira/browse/NUTCH-961

Markus
 
 

Re: Ignore navigation during index

Posted by remi tassing <ta...@gmail.com>.
You will probably need to customize the parse-html plugin for your purpose.