Posted to dev@nutch.apache.org by Lukáš Vlček <lu...@gmail.com> on 2009/10/18 13:11:41 UTC
Niocchi - java asynchronous crawl library released
Hi,
I just noticed that Niocchi has been released recently.
http://www.niocchi.com/
Niocchi is a Java asynchronous crawl library implemented with NIO. It is
designed to crawl several thousand hosts in parallel on a single low-end
server. It is currently being used in production by
Enormo <http://www.enormo.com/> to crawl thousands of websites daily, and by
Vitalprix <http://www.vitalprix.com/>.
Regards,
Lukas
RE: Niocchi - java asynchronous crawl library released
Posted by Fuad Efendi <fu...@efendi.ca>.
> Ok, sounds cool - could you prepare a patch for the RegexURLNormalizer
> that removes this problem?
At least I can try :)
Leaving it as a plugin means I'll need to use ThreadLocal or something...
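A minimal sketch of the ThreadLocal idea mentioned above, assuming Java 8 or later. SimpleRegexNormalizer and PerThreadNormalizer are hypothetical names, and the single jsessionid rule is only an illustration; none of this reflects the actual RegexURLNormalizer code. The point is that each thread lazily gets its own normalizer instance, so normalize() needs no lock.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper: hands every thread its own normalizer instance.
final class PerThreadNormalizer {
    // Created lazily, once per thread, on the first call to get().
    private static final ThreadLocal<SimpleRegexNormalizer> LOCAL =
            ThreadLocal.withInitial(SimpleRegexNormalizer::new);

    static String normalize(String url) {
        return LOCAL.get().normalize(url);
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://example.com/a;jsessionid=ABC123?x=1"));
    }
}

// Hypothetical stand-in for a regex-based URL normalizer (not Nutch's class).
final class SimpleRegexNormalizer {
    // Illustrative rule only: strip a ";jsessionid=..." path parameter.
    private final Pattern sessionId =
            Pattern.compile(";jsessionid=[^?#]*", Pattern.CASE_INSENSITIVE);

    // Deliberately not synchronized: an instance is used by one thread only.
    String normalize(String url) {
        return sessionId.matcher(url).replaceAll("");
    }
}
```

The trade-off is memory: with N fetcher threads there are N copies of the compiled rules, in exchange for removing the shared lock.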
Re: Niocchi - java asynchronous crawl library released
Posted by Andrzej Bialecki <ab...@getopt.org>.
Fuad Efendi wrote:
> Hi Andrzej,
>
> Yes, I measured/compared (two years ago), I am actually using
> simplified rewritten code based on Nutch, with non-synchronized
> instance per thread.
This was probably based on the original Fetcher code (now
OldFetcher.java) - the new Fetcher uses threads very differently.
>
> Imagine 1024 threads, each having 100 Outlinks and trying to call
> synchronized method... total 102,400 concurrent calls to synchronized
> method (during, in average (network delays), 3-seconds frame)... I
> was even able to have 1024 concurrent threads without any performance
> impact! Also, each synchronization requires additional CPU cycles
> (500-1000) even when concurrency is small.
>
> With non-synchronized, I can't have more than 128 threads - CPU
> overloaded. It runs faster. -Fuad
Ok, sounds cool - could you prepare a patch for the RegexURLNormalizer
that removes this problem?
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
RE: Niocchi - java asynchronous crawl library released
Posted by Fuad Efendi <fu...@efendi.ca>.
Hi Andrzej,
Yes, I measured and compared (two years ago). I am actually using simplified, rewritten code based on Nutch, with a non-synchronized instance per thread.
Imagine 1024 threads, each having 100 Outlinks and trying to call a synchronized method... a total of 102,400 concurrent calls to the synchronized method (within, on average, a 3-second window because of network delays)... I was even able to have 1024 concurrent threads without any performance impact! Also, each synchronization requires additional CPU cycles (500-1000) even when concurrency is low.
With the non-synchronized version, I can't have more than 128 threads - the CPU is overloaded. It runs faster.
-Fuad
> -----Original Message-----
> From: Andrzej Bialecki [mailto:ab@getopt.org]
> Sent: October-19-09 5:47 AM
> To: nutch-dev@lucene.apache.org
> Subject: Re: Niocchi - java asynchronous crawl library released
>
> Fuad Efendi wrote:
> > Hi Andrzej,
> >
> > Real bottleneck of Nutch is RegexURLNormalizer, it is still synchronized
> singleton (shared by multiple threads). And similar synchronized plugins which
> should be probably refactored to Nutch core...
>
> It's not a singleton, but it's true that the normalize() method is
> synchronized. Did you actually measure the impact of this
> synchronization on the crawling speed? I very much doubt it outweighs
> the impact of politeness limits.
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
Re: Niocchi - java asynchronous crawl library released
Posted by Andrzej Bialecki <ab...@getopt.org>.
Fuad Efendi wrote:
> Hi Andrzej,
>
> Real bottleneck of Nutch is RegexURLNormalizer, it is still synchronized singleton (shared by multiple threads). And similar synchronized plugins which should be probably refactored to Nutch core...
It's not a singleton, but it's true that the normalize() method is
synchronized. Did you actually measure the impact of this
synchronization on the crawling speed? I very much doubt it outweighs
the impact of politeness limits.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
RE: Niocchi - java asynchronous crawl library released
Posted by Fuad Efendi <fu...@efendi.ca>.
Hi Andrzej,
The real bottleneck of Nutch is RegexURLNormalizer; it is still a synchronized singleton (shared by multiple threads). There are also similar synchronized plugins, which should probably be refactored into Nutch core...
-Fuad
> Most of
> the time the politeness limits (max rate of requests per host) are the
> bottleneck.
Re: Niocchi - java asynchronous crawl library released
Posted by Andrzej Bialecki <ab...@getopt.org>.
Lukáš Vlček wrote:
> Hi,
>
> I just noticed that Niocchi has been released recently.
> http://www.niocchi.com/
>
> Niocchi is a java asynchronous crawl library implemented with NIO. It is
> designed to crawl several thousands of hosts in parallel on a single low
> end server. It is currently being used in production by Enormo
> <http://www.enormo.com/> to crawl thousands of websites daily, and
> by Vitalprix <http://www.vitalprix.com/>.
Well, of course we should optimize our use of resources, and we could
check what this library can offer - but I doubt that optimizations on
this level would bring significant benefits in terms of increased speed
of crawling - low-level IO handling is rarely the bottleneck. Most of
the time the politeness limits (max rate of requests per host) are the
bottleneck.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
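To illustrate the politeness limit Andrzej refers to, here is a minimal sketch of a per-host gate. PolitenessGate and its reserve() method are hypothetical names, not Nutch's Fetcher code, and the clock is passed in as a parameter so the logic is easy to follow. Each host gets at most one request per minDelayMillis, no matter how many threads are asking.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-host politeness gate (illustration only).
final class PolitenessGate {
    private final long minDelayMillis;
    // Earliest time (in ms) at which each host may be contacted again.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PolitenessGate(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    // Reserves the next free slot for the host and returns how many
    // milliseconds the caller has to wait before sending the request.
    synchronized long reserve(String host, long nowMillis) {
        long earliest = Math.max(nowMillis, nextAllowed.getOrDefault(host, 0L));
        nextAllowed.put(host, earliest + minDelayMillis);
        return earliest - nowMillis;
    }

    public static void main(String[] args) {
        PolitenessGate gate = new PolitenessGate(1000);
        long now = System.currentTimeMillis();
        System.out.println("first request waits " + gate.reserve("a.example", now) + " ms");
        System.out.println("second request waits " + gate.reserve("a.example", now) + " ms");
    }
}
```

With a delay of 1 second per host, a crawl over 100 hosts cannot exceed roughly 100 requests per second in total, regardless of how cheaply the I/O layer handles each connection.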
RE: Niocchi - java asynchronous crawl library released
Posted by Fuad Efendi <fu...@efendi.ca>.
I like the architectural ideas behind Apache MINA (inspired by SEDA): for some (CPU-intensive) processing (such as parsing of content) we need a single thread per CPU core; for other (I/O-bound) work we need many more threads (waiting for a response from a network socket). It’s not just NIO...
-Fuad
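A rough sketch of the staged thread-pool split described above, using only java.util.concurrent rather than MINA itself. StagedPipeline, fetch() and parse() are hypothetical placeholders: fetch() stands in for a blocking network call and parse() for CPU-bound content parsing. The I/O stage gets a large pool, while the CPU stage gets one thread per core.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical two-stage pipeline (illustration only, not MINA or Nutch code).
final class StagedPipeline {
    private final ExecutorService ioPool;   // many threads, mostly waiting on sockets
    private final ExecutorService cpuPool;  // one thread per CPU core

    StagedPipeline(int ioThreads) {
        ioPool = Executors.newFixedThreadPool(ioThreads);
        cpuPool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
    }

    // Placeholder for a blocking network fetch.
    static String fetch(String url) {
        return "<html>" + url + "</html>";
    }

    // Placeholder for CPU-bound parsing; here it just measures the content.
    static int parse(String content) {
        return content.length();
    }

    // Fetch on the I/O pool, then hand the result to the CPU pool for parsing.
    CompletableFuture<Integer> process(String url) {
        return CompletableFuture.supplyAsync(() -> fetch(url), ioPool)
                .thenApplyAsync(StagedPipeline::parse, cpuPool);
    }

    void shutdown() {
        ioPool.shutdown();
        cpuPool.shutdown();
    }

    public static void main(String[] args) {
        StagedPipeline pipeline = new StagedPipeline(64);
        System.out.println(pipeline.process("http://example.com/").join());
        pipeline.shutdown();
    }
}
```

The pool size of 64 in main() is an arbitrary value chosen for the example; a sensible number depends on network latency and the politeness settings.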
From: Lukáš Vlček [mailto:lukas.vlcek@gmail.com]
Sent: October-18-09 7:12 AM
To: nutch-dev@lucene.apache.org; droids-dev@incubator.apache.org
Subject: Niocchi - java asynchronous crawl library released
Hi,
I just noticed that Niocchi has been released recently.
http://www.niocchi.com/
Niocchi is a Java asynchronous crawl library implemented with NIO. It is designed to crawl several thousand hosts in parallel on a single low-end server. It is currently being used in production by Enormo <http://www.enormo.com/> to crawl thousands of websites daily, and by Vitalprix <http://www.vitalprix.com/>.
Regards,
Lukas