Posted to user@nutch.apache.org by John Lafitte <jl...@brandextract.com> on 2014/03/18 03:44:12 UTC

Usage Scenarios

We are just starting out with Nutch and Solr, but I have a couple of issues
I can't find any answers for.

1. Some of the HTML files we index are UTF-8 and contain a BOM.  Nutch
seems to capture it and store it as the strange characters "ï»¿".  I can
fix it by removing the BOM, and indexchecker confirms that the page is then
indexed without those characters.  Is there a way to prevent this from
happening without modifying all of the HTML files that contain a BOM?

2. Often a URL gets updated and we want to recrawl/reindex that specific URL
on demand.  I see no way to do this currently without deleting the crawl
directory and starting over.  What is the proper way to handle this
situation?

These are somewhat related, because even though I can go through the files
and manually remove the BOM, I can't figure out how to have Nutch reindex
them.  We are using Nutch 1.7, but I have patched a few things and would be
happy to upgrade if that fixes any of this.

Thanks in advance for help.

Re: Usage Scenarios

Posted by John Lafitte <jl...@brandextract.com>.

Thanks Remi.  I presume I basically just need my own version of the crawl
script that uses freegen instead of generate?
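
A rough sketch of what such a script could do, written in Python. It only
builds the command lines and does not run them. The subcommand names
(freegen, fetch, parse, updatedb, solrindex) and their argument order are
recalled from the Nutch 1.x bin/nutch script and are assumptions to check
against the usage output of your version. All paths and the Solr URL below
are hypothetical.

```python
from pathlib import Path
from typing import List

NUTCH = "bin/nutch"  # assumed location of the Nutch launcher script


def freegen_command(url_dir: str, segments_dir: str) -> List[str]:
    """Command that turns a directory of URL list files into a new segment."""
    return [NUTCH, "freegen", url_dir, segments_dir]


def newest_segment(segments_dir: str) -> str:
    """Pick the most recent segment, assuming timestamp-named subdirectories."""
    names = sorted(p.name for p in Path(segments_dir).iterdir() if p.is_dir())
    if not names:
        raise FileNotFoundError(f"no segments under {segments_dir}")
    return str(Path(segments_dir) / names[-1])


def followup_commands(crawldb: str, segment: str, solr_url: str) -> List[List[str]]:
    """Fetch, parse, update the crawldb and index one segment (assumed argument order)."""
    return [
        [NUTCH, "fetch", segment],
        [NUTCH, "parse", segment],
        [NUTCH, "updatedb", crawldb, segment],
        [NUTCH, "solrindex", solr_url, crawldb, segment],
    ]


# Dry run with hypothetical paths: print the commands instead of executing them.
plan = [freegen_command("urls_to_refresh", "crawl/segments")]
plan += followup_commands(
    "crawl/crawldb",
    "crawl/segments/SEGMENT_DIR",  # placeholder; use newest_segment() after freegen
    "http://localhost:8983/solr",
)
for cmd in plan:
    print(" ".join(cmd))
```

In a real script, each command would be run in order (for example with
subprocess.run(cmd, check=True)), and newest_segment() would be called after
freegen to find the segment it created.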

For the BOM issue, I searched all over for it, but just now found that
someone has already brought it up.  So I'll try that patch out.
https://issues.apache.org/jira/browse/NUTCH-1733
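
For illustration only (this is not the NUTCH-1733 patch): the "ï»¿"
characters are what the three UTF-8 BOM bytes EF BB BF look like when they
are decoded with a single-byte charset such as ISO-8859-1. A minimal Python
sketch of the effect and of stripping the BOM before decoding, using a
made-up page:

```python
import codecs


def strip_utf8_bom(raw: bytes) -> bytes:
    """Return raw without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


# Hypothetical page content, used only for this illustration.
page = codecs.BOM_UTF8 + "<html><title>Example</title></html>".encode("utf-8")

# Decoding the BOM bytes with a single-byte charset reproduces the odd characters.
as_latin1 = page[:3].decode("latin-1")

# Either strip the bytes first, or decode with "utf-8-sig", which drops the BOM.
clean_a = strip_utf8_bom(page).decode("utf-8")
clean_b = page.decode("utf-8-sig")

print(repr(as_latin1))
print(clean_a == clean_b)
```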


On Tue, Mar 18, 2014 at 8:18 AM, remi tassing <ta...@gmail.com> wrote:

> Hi John,
>
> Try freegen for the second question:
> http://wiki.apache.org/nutch/bin/nutch_freegen
>
> Remi

Re: Usage Scenarios

Posted by remi tassing <ta...@gmail.com>.

Hi John,

Try freegen for the second question:
http://wiki.apache.org/nutch/bin/nutch_freegen

Remi
