
Posted to user@nutch.apache.org by Arian Azin <ar...@gmail.com> on 2013/08/11 09:12:16 UTC

Nutch crawl configuration

Hi Everyone,

I'm using Nutch 1.7 to crawl the contents of a number of sites. I want it to
get 10 pages from each seed, not including pages from outlinks of the seed. Say
I want to crawl www.example1.com, and some pages there have outlinks to
www.example2.com. Here I provide example1.com as a seed, and want 10 pages
(exactly 10, unless that many don't exist) only from example1.com. (I have
100+ sites to crawl, so I can't write regexes matching every single URL.)
Also, I want pictures and videos to be excluded from the crawl results.
Could anyone please help me with what I should set? I have read the
documentation a couple of times without success.

Thanks,
Arian

Re: crawlID doesn't work?

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi Kaveh,
No, you're not missing anything.
The crawlId is not the same as the Cassandra keyspace (by default the
keyspaces are webpage for the webdb and host for the hostdb). Instead, the
crawlId can be used to generate, identify, maintain, etc. different datasets
which can all belong to the same keyspace.
If you wish to change the Cassandra keyspace name, you can modify the name
attribute of the keyspace child element in gora-cassandra_mapping.xml. This
will create a completely different keyspace which you can work with.

On Monday, August 12, 2013, kaveh minooie <ka...@plutoz.com> wrote:
> So I am using 2.x with Cassandra and the crawlID switch doesn't seem to have
> any effect on the names of the keyspaces in Cassandra. Am I missing anything?
>
> --
> Kaveh Minooie
>

-- 
*Lewis*

crawlID doesn't work?

Posted by kaveh minooie <ka...@plutoz.com>.

So I am using 2.x with Cassandra and the crawlID switch doesn't seem to have
any effect on the names of the keyspaces in Cassandra. Am I missing anything?

-- 
Kaveh Minooie

Re: Nutch crawl configuration

Posted by kaveh minooie <ka...@plutoz.com>.

To the best of my understanding, you can't really do that.

You can use regex-urlfilter.txt and/or the suffix-urlfilter to exclude the
items that you don't want to crawl, so that should take care of the
picture and video issue.
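
As a rough illustration of the filtering idea above, here is a minimal Python
sketch that generates regex-urlfilter-style rules from a seed list and checks
URLs against them. It assumes the semantics described in the comments of
Nutch's default regex-urlfilter.txt (each line starts with + or -, the first
matching pattern decides, and a URL matching nothing is ignored). The suffix
list and the helper names build_rules and accepts are made up for this
example; they are not part of Nutch. It does not address the 10-pages-per-site
limit.

```python
import re
from urllib.parse import urlparse

# Hypothetical list of picture/video suffixes to skip; extend as needed.
MEDIA_SUFFIXES = ["gif", "jpg", "jpeg", "png", "bmp", "ico",
                  "mpg", "mpeg", "mov", "avi", "mp4", "flv", "wmv"]


def build_rules(seeds):
    """Return regex-urlfilter-style lines: one '-' rule for media
    suffixes, followed by one '+' rule per seed host."""
    # List both lower and upper case, in the style of the default file.
    suffixes = "|".join(f"{s}|{s.upper()}" for s in MEDIA_SUFFIXES)
    rules = [rf"-\.({suffixes})$"]
    for seed in seeds:
        # Assumes every seed carries a scheme, e.g. http://www.example1.com/
        host = urlparse(seed).hostname
        if host.startswith("www."):
            host = host[4:]
        # Accept the bare domain and any of its subdomains.
        rules.append(rf"+^https?://([a-z0-9-]+\.)*{re.escape(host)}/")
    return rules


def accepts(rules, url):
    """Emulate the assumed filter semantics: the first rule whose
    pattern is found in the URL decides; no match means rejected."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    rules = build_rules(["http://www.example1.com/"])
    # These lines could be pasted into regex-urlfilter.txt for review.
    print("\n".join(rules))
    for url in ["http://www.example1.com/news/page.html",
                "http://www.example2.com/index.html",
                "http://www.example1.com/logo.PNG"]:
        print(url, accepts(rules, url))
```

Generating the + lines from the seed file avoids writing 100+ patterns by
hand; the printed rules would still need checking against the actual
regex-urlfilter.txt in your conf directory before use.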

But you can't really limit the number of pages that would be fetched per
site. You can do that for each fetch job, though. So if it is only
about 100 sites, you could run the fetch 100 times and only get 10 pages
each time?

On 08/11/2013 12:12 AM, Arian Azin wrote:
> Hi Everyone,
>
> I'm using Nutch 1.7 to crawl the contents of a number of sites. I want it to
> get 10 pages from each seed, not including pages from outlinks of the seed. Say
> I want to crawl www.example1.com, and some pages there have outlinks to
> www.example2.com. Here I provide example1.com as a seed, and want 10 pages
> (exactly 10, unless that many don't exist) only from example1.com. (I have
> 100+ sites to crawl, so I can't write regexes matching every single URL.)
> Also, I want pictures and videos to be excluded from the crawl results.
> Could anyone please help me with what I should set? I have read the
> documentation a couple of times without success.
>
> Thanks,
> Arian
>

-- 
Kaveh Minooie
