Posted to user@nutch.apache.org by Adamantios Corais <ad...@gmail.com> on 2015/03/22 15:35:50 UTC

How to configure seed and urlfilter config files in Apache Nutch

I would like to set up Nutch so that it goes through all
http://www.domain.com/classifieds/something/?pg=<page> pages, where <page> goes
from 1 to 200, and stores the URLs of the form
http://www.domain.com/classifieds/something/view/<number>/, where <number> is a
long number. Then, I would like to print out all these URLs in my terminal. I am
using Apache Nutch 1.9 and Apache Solr 4.10.4.
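
For illustration only, the seed list described above could be generated with a short Python script rather than typed by hand. This is a minimal sketch: the function name build_seed_urls and the idea of redirecting the output into a seed file such as urls/seed.txt are assumptions made for the example, not something stated in this thread.

```python
# Hypothetical helper (not part of Nutch): builds the listing-page URLs
# for pages 1..200 so they can be saved as a seed list, one URL per line.
BASE = "http://www.domain.com/classifieds/something/?pg={page}"


def build_seed_urls(first=1, last=200):
    """Return one listing-page URL per page number, both ends inclusive."""
    return [BASE.format(page=page) for page in range(first, last + 1)]


if __name__ == "__main__":
    # Print one URL per line; redirect the output to a seed file,
    # e.g. urls/seed.txt (that path is an assumption for this sketch).
    for url in build_seed_urls():
        print(url)
```

Running the script and redirecting its output to a file would give a 200-line seed list to inject.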


*// Adamantios*

Re: How to configure seed and urlfilter config files in Apache Nutch

Posted by Siddharth Shah <ia...@gmail.com>.
Hello,
I think you might need to remove the following line from
your conf/regex-urlfilter.txt; otherwise the seed URLs, which contain '?' and
'=', will be filtered out when they are injected.

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
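
As a rough illustration of why that rule matters, here is a small Python sketch that imitates how the rules are applied. It assumes that rules are tried from top to bottom, that the first matching pattern decides, and that a URL matching no rule is rejected. The final "+." rule and the narrower rule set are assumptions for the example (the narrower one is hypothetical and untested against a real crawl), and Python's re module stands in for the Java regex engine Nutch uses.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines; blank lines and '#' comments are skipped."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line[0] not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """First matching rule decides; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# The rule quoted above, followed by an assumed catch-all accept rule.
DEFAULT_RULES = parse_rules(r"""
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept anything else
+.
""")

# The same file with the query-character rule removed.
EDITED_RULES = parse_rules(r"""
# accept anything else
+.
""")

# Hypothetical narrower alternative: accept only the two URL shapes
# from the question and reject everything else.
NARROW_RULES = parse_rules(r"""
+^http://www\.domain\.com/classifieds/something/\?pg=\d+$
+^http://www\.domain\.com/classifieds/something/view/\d+/$
""")

SEED = "http://www.domain.com/classifieds/something/?pg=1"
VIEW = "http://www.domain.com/classifieds/something/view/12345/"

if __name__ == "__main__":
    for name, rules in [("default", DEFAULT_RULES),
                        ("edited", EDITED_RULES),
                        ("narrow", NARROW_RULES)]:
        print(name, accepts(rules, SEED), accepts(rules, VIEW))
```

In this sketch the seed URL is rejected only by the rule set that still contains the query-character rule, because the URL contains both '?' and '='.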

Give it a try and let me know if this works.

Thank you,
Sidharth

On Mon, Mar 23, 2015 at 3:58 PM, Adamantios Corais <
adamantios.corais@gmail.com> wrote:

> Apologies for insisting, but any help would be highly appreciated since I am
> a newbie to Apache Nutch. Thank you!
>
>
> *// Adamantios*

Re: How to configure seed and urlfilter config files in Apache Nutch

Posted by Adamantios Corais <ad...@gmail.com>.
Apologies for insisting, but any help would be highly appreciated since I am
a newbie to Apache Nutch. Thank you!


*// Adamantios*


