Posted to user@nutch.apache.org by Edward Quick <ed...@hotmail.com> on 2008/09/26 16:04:39 UTC

updatedb says URL normalizing and filtering are set to false

When I run updatedb, it reports that URL normalizing and filtering are set to false. I thought they were already active? If not, could someone tell me how to switch them on, please?

Thanks,
Ed.

$ bin/nutch updatedb crawl/crawldb crawl/segments/20080926135817
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080926135817]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
CrawlDb update: done



RE: updatedb says URL normalizing and filtering are set to false

Posted by Edward Quick <ed...@hotmail.com>.


> Date: Sun, 28 Sep 2008 23:06:40 +0300
> From: dogacan@gmail.com
> To: nutch-user@lucene.apache.org
> Subject: Re: updatedb says URL normalizing and filtering are set to false
> 
> On Fri, Sep 26, 2008 at 5:04 PM, Edward Quick <ed...@hotmail.com> wrote:
> >
> > When I run updatedb, it reports that URL normalizing and filtering are set to false. I thought they were already active? If not, could someone tell me how to switch them on, please?
> >
> 
> You don't normally need to filter or normalize during updatedb, since
> all URLs should already have been filtered and normalized by other jobs
> by that point. Still, you can switch both on by passing -normalize
> -filter to updatedb.

Thanks, that is useful to know in case I want to fix the list after the crawl is done.

Ed.

> 
> > Thanks,
> > Ed.
> >
> > $ bin/nutch updatedb crawl/crawldb crawl/segments/20080926135817
> > CrawlDb update: starting
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/segments/20080926135817]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: false
> > CrawlDb update: URL filtering: false
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: done
> >
> >
> 
> 
> 
> -- 
> Doğacan Güney


Re: updatedb says URL normalizing and filtering are set to false

Posted by Doğacan Güney <do...@gmail.com>.
On Fri, Sep 26, 2008 at 5:04 PM, Edward Quick <ed...@hotmail.com> wrote:
>
> When I run updatedb, it reports that URL normalizing and filtering are set to false. I thought they were already active? If not, could someone tell me how to switch them on, please?
>

You don't normally need to filter or normalize during updatedb, since
all URLs should already have been filtered and normalized by other jobs
by that point. Still, you can switch both on by passing -normalize
-filter to updatedb.
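
For illustration, here is a minimal sketch of the command from the original post with both flags appended. The crawldb and segment paths are the ones from the post. The variable names and the check for bin/nutch are only assumptions added here, so that the snippet prints the command instead of failing when it is not run from a Nutch install directory.

```shell
# Paths copied from the original post; adjust them to your own crawl.
CRAWLDB=crawl/crawldb
SEGMENT=crawl/segments/20080926135817

# -normalize and -filter switch on URL normalizing and URL filtering
# for this updatedb run (both are off by default, as the log shows).
CMD="bin/nutch updatedb $CRAWLDB $SEGMENT -normalize -filter"

if [ -x bin/nutch ]; then
  # Run from the Nutch install directory.
  $CMD
else
  # Not in a Nutch directory: just show what would be run.
  echo "Would run: $CMD"
fi
```

With the flags in place, the "URL normalizing" and "URL filtering" lines in the updatedb output should presumably report true rather than false.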

> Thanks,
> Ed.
>
> $ bin/nutch updatedb crawl/crawldb crawl/segments/20080926135817
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080926135817]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: false
> CrawlDb update: URL filtering: false
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
>
>



-- 
Doğacan Güney