Posted to user@nutch.apache.org by Arian Azin <ar...@gmail.com> on 2013/08/11 09:12:16 UTC
Nutch crawl configuration
Hi Everyone,
I'm using Nutch 1.7 to crawl the contents of a number of sites. I want it to
get 10 pages from each seed, not including pages from outlinks of the seed. Say
I want to crawl www.example1.com, and some pages there have outlinks to
www.example2.com. Here I provide example1.com as a seed, and want 10 pages
(exactly 10, unless that many don't exist) only from example1.com. I have
100+ sites to crawl, so I can't set regexes matching every single URL.
Also, I want pictures and videos to be excluded from crawl results.
Could anyone please help me with what I should set? I read the
documentation a couple of times with no results.
Thanks,
Arian
Re: crawlID doesn't work?
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Kaveh,
No, you're not missing anything.
The crawlId is not the same thing as the Cassandra keyspace (by default the
keyspaces are named webpage for the webdb and host for the hostdb). Instead, the
crawlId can be used to generate, identify, maintain, etc. different datasets,
which can all belong to the same keyspace.
If you wish to change the Cassandra keyspace name, you can modify the
name attribute of the keyspace child element in gora-cassandra-mapping.xml.
This will ensure that you create a completely different keyspace which you
can work with.
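For illustration only, here is a small Python sketch of that kind of edit. The XML below is a simplified, hypothetical stand-in for the mapping file (the real file has more elements and attributes), and rename_keyspace is a made-up helper name. If other elements in the real file refer to the keyspace by name, they would presumably need the same change.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for a mapping file; the real
# gora-cassandra mapping file contains more elements and attributes.
MAPPING = """<gora-orm>
  <keyspace name="webpage">
    <family name="p"/>
  </keyspace>
</gora-orm>"""


def rename_keyspace(xml_text, old_name, new_name):
    """Return xml_text with the name attribute of matching keyspace elements changed."""
    root = ET.fromstring(xml_text)
    for keyspace in root.iter("keyspace"):
        if keyspace.get("name") == old_name:
            keyspace.set("name", new_name)
    return ET.tostring(root, encoding="unicode")
```

Calling rename_keyspace(MAPPING, "webpage", "crawl2") returns the same document with the keyspace element renamed; "crawl2" is just an example name.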
On Monday, August 12, 2013, kaveh minooie <ka...@plutoz.com> wrote:
> So I am using 2.x with Cassandra and the crawlID switch doesn't seem to have
> any effect on the name of keyspaces in Cassandra. Am I missing anything?
>
> --
> Kaveh Minooie
>
--
*Lewis*
crawlID doesn't work?
Posted by kaveh minooie <ka...@plutoz.com>.
So I am using 2.x with Cassandra and the crawlID switch doesn't seem to have
any effect on the name of keyspaces in Cassandra. Am I missing anything?
--
Kaveh Minooie
Re: Nutch crawl configuration
Posted by kaveh minooie <ka...@plutoz.com>.
To the best of my understanding, you can't really do that.
You can use regex-urlfilter.txt and/or suffix-urlfilter to exclude the
items that you don't want to crawl, so that should take care of the
picture and video issue.
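As a rough illustration of how such exclusion rules behave, here is a Python sketch that mimics the rule style of regex-urlfilter.txt: each rule starts with '-' (reject) or '+' (accept), and the first rule that matches decides. The suffix list, the case-insensitive matching, and the accepts helper are assumptions made for this example, not what ships with Nutch.

```python
import re

# Hypothetical rules in the style of regex-urlfilter.txt: a leading '-'
# rejects, a leading '+' accepts. The suffix list is an illustrative
# assumption, not the list shipped with Nutch.
RULES = [
    r"-\.(gif|jpe?g|png|bmp|ico|mp4|avi|mov|wmv|flv|mpe?g)$",
    r"+.",
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        # Case-insensitive search is a simplification chosen for this sketch.
        if re.search(pattern, url, re.IGNORECASE):
            return sign == "+"
    return False  # no rule matched: treat the URL as rejected
```

With these rules, a URL ending in .png or .mp4 is rejected and an ordinary HTML page is accepted. A URL with a query string after the file name would slip past the `$` anchor, so real rules may need to be looser.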
But you can't really limit the number of pages that would be fetched per
site; you can do that per fetch job, though. So if it is only
about 100 sites, you could run the fetch 100 times and only get 10 pages
each time?
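To make the requirement concrete, here is a small Python sketch of the selection being asked for: keep at most 10 URLs per seed host and drop anything on other hosts. It works on a plain list of URLs outside Nutch; cap_per_host and its parameters are hypothetical names for this example, not a Nutch feature.

```python
from collections import defaultdict
from urllib.parse import urlparse


def cap_per_host(urls, seed_hosts, limit=10):
    """Keep at most `limit` URLs per seed host and drop URLs on other hosts."""
    kept = defaultdict(list)
    for url in urls:
        host = urlparse(url).netloc.lower()
        # URLs on hosts that are not seeds (e.g. outlink targets) are skipped.
        if host in seed_hosts and len(kept[host]) < limit:
            kept[host].append(url)
    return dict(kept)
```

For example, cap_per_host(urls, {"www.example1.com"}) would keep the first 10 URLs on www.example1.com, in input order, and ignore any on www.example2.com.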
On 08/11/2013 12:12 AM, Arian Azin wrote:
> Hi Everyone,
>
> I'm using Nutch 1.7 to crawl the contents of a number of sites. I want it to
> get 10 pages from each seed, not including pages from outlinks of the seed. Say
> I want to crawl www.example1.com, and some pages there have outlinks to
> www.example2.com. Here I provide example1.com as a seed, and want 10 pages
> (exactly 10, unless that many don't exist) only from example1.com. I have
> 100+ sites to crawl, so I can't set regexes matching every single URL.
> Also, I want pictures and videos to be excluded from crawl results.
> Could anyone please help me with what I should set? I read the
> documentation a couple of times with no results.
>
> Thanks,
> Arian
>
--
Kaveh Minooie