Posted to dev@nutch.apache.org by Peter Jameson <pe...@curveos.com> on 2012/02/08 18:04:49 UTC
Finding specific file types only --> *.ics files
Hi,
I'm interested in using Nutch to crawl certain websites looking for only one specific file type: in my case, any URL that ends in .ics. I don't need to "parse" the .ics files; I just need to know which .ics files exist. A list of links would be great.
Can Nutch be configured to do this?
Thanks!
Pete
pete@curveos.com
Re: Finding specific file types only --> *.ics files
Posted by Markus Jelsma <ma...@openindex.io>.
They are not filtered out by the default filters.
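To check this against your own configuration, here is a minimal Python sketch of how a Nutch-style regex URL filter file is usually understood to work: each rule is a '+' or '-' followed by a regular expression, the first rule that matches anywhere in the URL decides, and a URL that matches no rule is rejected. The rule set below is an illustrative assumption, not a copy of the file shipped with any Nutch release, and parse_rules and is_accepted are hypothetical helper names. Paste in the rules from your own conf/regex-urlfilter.txt to test real URLs.

```python
import re

# Illustrative rules only (an assumption, not taken from a Nutch release).
SAMPLE_RULES = r"""
# skip non-HTTP schemes
-^(file|ftp|mailto):
# skip a few common binary and asset suffixes
-\.(gif|jpg|png|ico|css|zip|exe|js)$
# accept everything else
+.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def is_accepted(url, rules):
    """First rule that matches anywhere in the URL wins; no match means reject."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = parse_rules(SAMPLE_RULES)
    for url in ("http://example.com/cal/team.ics", "http://example.com/logo.png"):
        print(url, "accepted" if is_accepted(url, rules) else "rejected")
```

With these sample rules the .ics URL is accepted because no '-' rule lists the ics suffix. If your own file contained a line such as -\.ics$ ahead of the final '+' rule, the same URL would be rejected.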
On Thursday 09 February 2012 15:18:39 Peter Jameson wrote:
> Hi Markus, thanks for your reply! Noob question: how do I ensure .ics
> files are not filtered out from the crawl? I've searched the
> configuration files, but am not sure on parameters to set. Any help is
> greatly appreciated. Thanks!
--
Markus Jelsma - CTO - Openindex
Re: Finding specific file types only --> *.ics files
Posted by Peter Jameson <pe...@curveos.com>.
Hi Markus, thanks for your reply! Noob question: how do I ensure .ics files are not filtered out from the crawl? I've searched the configuration files, but am not sure which parameters to set. Any help is greatly appreciated. Thanks!
On Feb 9, 2012, at 4:04 AM, "Markus Jelsma" <ma...@openindex.io> wrote:
> Yes you can. Just crawl the websites as usual with Nutch and make sure ics
> files are not filtered out. There will be attempts to parse the file but they
> may fail.
> In the end all links are in your crawlDb and then you can simply extract a
> list of .ics urls with the old crawldbscanner tool or the new crawldbreader
> tool.
Re: Finding specific file types only --> *.ics files
Posted by Markus Jelsma <ma...@openindex.io>.
Yes, you can. Just crawl the websites as usual with Nutch and make sure .ics
files are not filtered out. Nutch will attempt to parse the files, but those
attempts may fail.
In the end all links are in your CrawlDb, and you can then simply extract a
list of .ics URLs with the old CrawlDBScanner tool or the new CrawlDbReader
tool.
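As a rough illustration of the extraction step, the Python sketch below filters a plain-text CrawlDb dump (for example, one written by bin/nutch readdb with the -dump option) for URLs whose path ends in .ics. It assumes that each record in the dump starts with a line whose first whitespace-separated field is the URL. That layout is an assumption, so check it against your own dump. The function name ics_urls and the sample lines are hypothetical.

```python
from urllib.parse import urlsplit


def ics_urls(dump_lines):
    """Return the unique URLs whose path ends in .ics, in first-seen order."""
    seen = set()
    result = []
    for line in dump_lines:
        # Assumption: record header lines begin with the URL itself.
        if not line.startswith(("http://", "https://")):
            continue
        url = line.split(None, 1)[0]
        # Compare the path only, so query strings do not hide the suffix.
        if urlsplit(url).path.lower().endswith(".ics") and url not in seen:
            seen.add(url)
            result.append(url)
    return result


if __name__ == "__main__":
    # Hypothetical dump excerpt, for illustration only.
    sample = [
        "http://example.com/\tVersion: 7",
        "Status: 2 (db_fetched)",
        "http://example.com/events/team.ics\tVersion: 7",
        "Status: 1 (db_unfetched)",
        "http://example.com/feed.ICS?download=1\tVersion: 7",
    ]
    for url in ics_urls(sample):
        print(url)
```

To run it on a real dump, open the dump file and pass the file object to ics_urls in place of the sample list.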
On Wednesday 08 February 2012 18:04:49 Peter Jameson wrote:
> Hi,
>
> I'm interested in using Nutch to crawl certain websites looking for only a
> specific file type, in my case I'm looking for any url that ends with a
> *.ics construct. I don't need to "parse" the ics files, I just need to
> know all the .ics files that exist. A list of links would be great.
>
> Can Nutch be configured to do this?
>
> Thanks!
>
> Pete
> pete@curveos.com
--
Markus Jelsma - CTO - Openindex