Posted to dev@nutch.apache.org by Peter Jameson <pe...@curveos.com> on 2012/02/08 18:04:49 UTC

Finding specific file types only --> *.ics files

Hi,

I'm interested in using Nutch to crawl certain websites looking for only one specific file type. In my case, I'm looking for any URL that ends in .ics.  I don't need to "parse" the .ics files; I just need to know which .ics files exist.  A list of links would be great.

Can Nutch be configured to do this?

Thanks!

Pete
pete@curveos.com



Re: Finding specific file types only --> *.ics files

Posted by Markus Jelsma <ma...@openindex.io>.
They are not filtered out by the default filters.
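
To check a rule set of your own, here is a minimal Python sketch of how a list of "+regex" / "-regex" URL filter rules can be evaluated. It assumes the first rule whose pattern is found anywhere in the URL decides the outcome, and that a URL matching no rule is rejected. The rules below are illustrative only and are not copied from any shipped Nutch configuration; parse_rules and accepts are hypothetical helper names.

```python
import re

# Illustrative rules in "+regex" / "-regex" style (an assumption, not a copy
# of any real configuration file).
RULES = r"""
# skip some common image, archive and script suffixes
-\.(gif|jpg|png|css|zip|exe|js)$
# accept anything else
+.
"""


def parse_rules(text):
    """Turn rule text into a list of (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return True if the first matching rule is an accept rule."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched: treat the URL as rejected


rules = parse_rules(RULES)
for url in ("http://example.com/cal/events.ics", "http://example.com/logo.png"):
    print(url, "accepted" if accepts(rules, url) else "rejected")
```

If a reject rule in your own configuration lists the ics suffix, removing it from that rule (or placing an accept rule for it earlier) should let those URLs through under the first-match assumption above.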

On Thursday 09 February 2012 15:18:39 Peter Jameson wrote:
> Hi Markus,  thanks for your reply!  Noob question:  how do I ensure .ics
> files are not filtered out from the crawl?  I've searched the
> configuration files, but am not sure on parameters to set.  Any help is
> greatly appreciated.  Thanks!
> 
> Sent from my iPad
> 
> On Feb 9, 2012, at 4:04 AM, "Markus Jelsma" <ma...@openindex.io> wrote:
> > Yes you can. Just crawl the websites as usual with Nutch and make sure
> > ics files are not filtered out. There will be attempts to parse the file
> > but they may fail.
> > In the end all links are in your crawlDb and then you can simply extract
> > a list of .ics urls with the old crawldbscanner tool or the new
> > crawldbreader tool.

-- 
Markus Jelsma - CTO - Openindex

Re: Finding specific file types only --> *.ics files

Posted by Peter Jameson <pe...@curveos.com>.
Hi Markus, thanks for your reply!  Noob question: how do I ensure .ics files are not filtered out of the crawl?  I've searched the configuration files, but I'm not sure which parameters to set.  Any help is greatly appreciated.  Thanks!

Sent from my iPad

On Feb 9, 2012, at 4:04 AM, "Markus Jelsma" <ma...@openindex.io> wrote:

> Yes you can. Just crawl the websites as usual with Nutch and make sure ics 
> files are not filtered out. There will be attempts to parse the file but they 
> may fail.
> In the end all links are in your crawlDb and then you can simply extract a 
> list of .ics urls with the old crawldbscanner tool or the new crawldbreader 
> tool.

Re: Finding specific file types only --> *.ics files

Posted by Markus Jelsma <ma...@openindex.io>.
Yes, you can. Just crawl the websites as usual with Nutch and make sure .ics
files are not filtered out. Nutch will attempt to parse the files, but those
attempts may fail.
In the end, all links are in your CrawlDb, and you can then simply extract a
list of .ics URLs with the old crawldbscanner tool or the new crawldbreader
tool.
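
The extraction step can be sketched in Python. This assumes the CrawlDb has already been dumped to plain text (for example with the readdb command's -dump option) and that each record starts with a line holding the URL, a tab, and further fields. That layout is an assumption, so adjust the parsing if your dump looks different. The function name ics_urls and the sample lines are made up for illustration.

```python
from urllib.parse import urlsplit

# Made-up sample lines imitating a text dump; the layout is an assumption.
SAMPLE_DUMP = [
    "http://example.com/cal/team.ics\tVersion: 7",
    "Status: 1 (db_unfetched)",
    "",
    "http://example.com/index.html\tVersion: 7",
    "Status: 2 (db_fetched)",
    "",
    "http://example.com/feed.ICS?x=1\tVersion: 7",
    "Status: 1 (db_unfetched)",
]


def ics_urls(lines, suffix=".ics"):
    """Collect unique URLs whose path ends with the given suffix."""
    seen = set()
    result = []
    for line in lines:
        # The URL is taken to be everything before the first tab.
        candidate = line.split("\t", 1)[0].strip()
        if not candidate.startswith(("http://", "https://")):
            continue  # status, metadata and blank lines
        # Compare against the path only, so query strings are ignored.
        path = urlsplit(candidate).path.lower()
        if path.endswith(suffix) and candidate not in seen:
            seen.add(candidate)
            result.append(candidate)
    return result


for url in ics_urls(SAMPLE_DUMP):
    print(url)
```

Because the function accepts any iterable of lines, an open file object for a dump file can be passed in directly instead of the sample list.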

-- 
Markus Jelsma - CTO - Openindex