Posted to dev@nutch.apache.org by Lukáš Vlček <lu...@gmail.com> on 2009/10/18 13:11:41 UTC

Niocchi - java asynchronous crawl library released

Hi,
I just noticed that Niocchi has been released recently.
http://www.niocchi.com/

Niocchi is a java asynchronous crawl library implemented with NIO. It is
designed to crawl several thousands of hosts in parallel on a single low end
server. It is currently being used in production by Enormo
<http://www.enormo.com/> to crawl thousands of websites daily, and by
Vitalprix <http://www.vitalprix.com/>.

Regards,
Lukas

RE: Niocchi - java asynchronous crawl library released

Posted by Fuad Efendi <fu...@efendi.ca>.
> Ok, sounds cool - could you prepare a patch for the RegexURLNormalizer
> that removes this problem?

At least I can try :)
Leaving it as a plugin means I'll need to use ThreadLocal or something...
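
A minimal sketch of the ThreadLocal idea mentioned above (the normalizer
class here is a simplified stand-in with one example rule, not the actual
Nutch plugin; configuration and scopes are omitted):

    // Sketch only: one normalizer instance per thread, so normalize() never
    // needs to be synchronized across threads.
    import java.util.regex.Pattern;

    public class PerThreadNormalizer {

        static final class SimpleRegexNormalizer {
            // Pattern is immutable and thread-safe; the Matcher created in
            // normalize() is confined to the calling thread.
            private final Pattern sessionId = Pattern.compile("(?i);jsessionid=[^?&#]*");

            String normalize(String url) {
                return sessionId.matcher(url).replaceAll("");
            }
        }

        // Each worker thread lazily gets its own instance; no lock is ever taken.
        private static final ThreadLocal<SimpleRegexNormalizer> NORMALIZER =
                ThreadLocal.withInitial(SimpleRegexNormalizer::new);

        public static String normalize(String url) {
            return NORMALIZER.get().normalize(url);
        }

        public static void main(String[] args) {
            System.out.println(normalize("http://example.com/a;jsessionid=XYZ?q=1"));
        }
    }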




Re: Niocchi - java asynchronous crawl library released

Posted by Andrzej Bialecki <ab...@getopt.org>.
Fuad Efendi wrote:
> Hi Andrzej,
> 
> Yes, I measured/compared (two years ago); I am actually using
> simplified, rewritten code based on Nutch, with a non-synchronized
> instance per thread.

This was probably based on the original Fetcher code (now 
OldFetcher.java) - the new Fetcher uses threads very differently.

> 
> Imagine 1024 threads, each having 100 Outlinks and trying to call a
> synchronized method... a total of 102,400 concurrent calls to the
> synchronized method (within, on average, a 3-second window, given
> network delays)... I was even able to run 1024 concurrent threads
> without any performance impact! Also, each synchronization costs
> additional CPU cycles (500-1000) even when contention is low.
> 
> With the non-synchronized version, I can't run more than 128 threads -
> the CPU is overloaded. It ran faster. -Fuad

Ok, sounds cool - could you prepare a patch for the RegexURLNormalizer 
that removes this problem?


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
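
For context, one plausible direction for such a patch - assuming the
synchronization exists only to protect regex matching state - is to keep the
compiled java.util.regex.Pattern objects (which are thread-safe) shared and
create a short-lived Matcher per call, so normalize() can drop the
synchronized keyword. A hypothetical sketch, not the actual Nutch code:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;

    public class UnsynchronizedRegexNormalizer {

        /** One substitution rule: a compiled pattern and its replacement. */
        private static final class Rule {
            final Pattern pattern;
            final String replacement;
            Rule(String regex, String replacement) {
                this.pattern = Pattern.compile(regex);   // compiled once, thread-safe
                this.replacement = replacement;
            }
        }

        // Populated once at startup (e.g. from the plugin's rule file) and
        // treated as read-only afterwards, so no locking is required.
        private final List<Rule> rules = new ArrayList<Rule>();

        public void addRule(String regex, String replacement) {
            rules.add(new Rule(regex, replacement));
        }

        // Not synchronized: the only per-call state is the Matcher, which
        // lives on the calling thread's stack.
        public String normalize(String url) {
            String result = url;
            for (Rule rule : rules) {
                result = rule.pattern.matcher(result).replaceAll(rule.replacement);
            }
            return result;
        }
    }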


RE: Niocchi - java asynchronous crawl library released

Posted by Fuad Efendi <fu...@efendi.ca>.
Hi Andrzej,

Yes, I measured/compared (two years ago); I am actually using simplified, rewritten code based on Nutch, with a non-synchronized instance per thread.

Imagine 1024 threads, each having 100 Outlinks and trying to call a synchronized method... a total of 102,400 concurrent calls to the synchronized method (within, on average, a 3-second window, given network delays)... I was even able to run 1024 concurrent threads without any performance impact! Also, each synchronization costs additional CPU cycles (500-1000) even when contention is low.

With the non-synchronized version, I can't run more than 128 threads - the CPU is overloaded. It ran faster.
-Fuad
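
A rough, self-contained way to see the effect being described - illustrative
only, not a rigorous benchmark; the thread count, URL count and regex are
arbitrary:

    import java.util.regex.Pattern;

    public class ContentionSketch {

        static class Normalizer {
            private final Pattern p = Pattern.compile("(?i);jsessionid=[^?&#]*");
            synchronized String normalizeShared(String url) { return p.matcher(url).replaceAll(""); }
            String normalizeLocal(String url)               { return p.matcher(url).replaceAll(""); }
        }

        // Runs the same regex work with either one shared, synchronized instance
        // or one private instance per thread, and reports the wall-clock time.
        static long run(int threads, int urlsPerThread, boolean shared) throws InterruptedException {
            final Normalizer global = new Normalizer();
            Thread[] workers = new Thread[threads];
            long start = System.nanoTime();
            for (int i = 0; i < threads; i++) {
                workers[i] = new Thread(() -> {
                    Normalizer local = shared ? global : new Normalizer();
                    for (int j = 0; j < urlsPerThread; j++) {
                        String url = "http://host/page;jsessionid=ABC?x=" + j;
                        String out = shared ? local.normalizeShared(url) : local.normalizeLocal(url);
                        if (out.isEmpty()) throw new IllegalStateException();  // keep the JIT honest
                    }
                });
                workers[i].start();
            }
            for (Thread t : workers) t.join();
            return (System.nanoTime() - start) / 1_000_000;   // milliseconds
        }

        public static void main(String[] args) throws InterruptedException {
            System.out.println("shared synchronized instance: " + run(256, 10_000, true) + " ms");
            System.out.println("one instance per thread:      " + run(256, 10_000, false) + " ms");
        }
    }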


> -----Original Message-----
> From: Andrzej Bialecki [mailto:ab@getopt.org]
> Sent: October-19-09 5:47 AM
> To: nutch-dev@lucene.apache.org
> Subject: Re: Niocchi - java asynchronous crawl library released
> 
> Fuad Efendi wrote:
> > Hi Andrzej,
> >
> > The real bottleneck in Nutch is RegexURLNormalizer: it is still a synchronized
> > singleton (shared by multiple threads). The same goes for similar synchronized
> > plugins, which should probably be refactored into the Nutch core...
> 
> It's not a singleton, but it's true that the normalize() method is
> synchronized. Did you actually measure the impact of this
> synchronization on the crawling speed? I very much doubt it outweighs
> the impact of politeness limits.
> 
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com




Re: Niocchi - java asynchronous crawl library released

Posted by Andrzej Bialecki <ab...@getopt.org>.
Fuad Efendi wrote:
> Hi Andrzej,
> 
> The real bottleneck in Nutch is RegexURLNormalizer: it is still a synchronized singleton (shared by multiple threads). The same goes for similar synchronized plugins, which should probably be refactored into the Nutch core...

It's not a singleton, but it's true that the normalize() method is 
synchronized. Did you actually measure the impact of this 
synchronization on the crawling speed? I very much doubt it outweighs 
the impact of politeness limits.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


RE: Niocchi - java asynchronous crawl library released

Posted by Fuad Efendi <fu...@efendi.ca>.
Hi Andrzej,

The real bottleneck in Nutch is RegexURLNormalizer: it is still a synchronized singleton (shared by multiple threads). The same goes for similar synchronized plugins, which should probably be refactored into the Nutch core...

-Fuad


> Most of
> the time the politeness limits (max rate of requests per host) are the
> bottleneck.



Re: Niocchi - java asynchronous crawl library released

Posted by Andrzej Bialecki <ab...@getopt.org>.
Lukáš Vlček wrote:
> Hi,
> 
> I just noticed that Niocchi has been released recently.
> http://www.niocchi.com/
> 
> Niocchi is a java asynchronous crawl library implemented with NIO. It is 
> designed to crawl several thousands of hosts in parallel on a single low 
> end server. It is currently being used in production by Enormo 
> <http://www.enormo.com/> to crawl thousands of websites daily, and 
> by Vitalprix <http://www.vitalprix.com/>.

Well, of course we should optimize our use of resources, and we could
check what this library has to offer - but I doubt that optimizations at
this level would bring a significant increase in crawling speed, because
low-level I/O handling is rarely the bottleneck. Most of the time the
politeness limits (max rate of requests per host) are the bottleneck.
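
For reference, the politeness limit boils down to a minimum delay between
requests to the same host (roughly what the fetcher.server.delay setting
expresses). A toy sketch of such a per-host gate, not the actual Fetcher
code:

    import java.util.concurrent.ConcurrentHashMap;

    public class PolitenessGate {

        private final long delayMillis;
        private final ConcurrentHashMap<String, Long> nextAllowed = new ConcurrentHashMap<String, Long>();

        public PolitenessGate(long delayMillis) {
            this.delayMillis = delayMillis;
        }

        /** Blocks the calling fetch thread until this host may be contacted again. */
        public void acquire(String host) throws InterruptedException {
            while (true) {
                long now = System.currentTimeMillis();
                Long readyAt = nextAllowed.putIfAbsent(host, now + delayMillis);
                if (readyAt == null) {
                    return;                                   // first request to this host
                }
                if (now >= readyAt && nextAllowed.replace(host, readyAt, now + delayMillis)) {
                    return;                                   // our slot; the next caller waits delayMillis
                }
                Thread.sleep(Math.max(1L, readyAt - now));    // not our turn yet; back off and retry
            }
        }

        public static void main(String[] args) throws InterruptedException {
            PolitenessGate gate = new PolitenessGate(5000);   // e.g. a 5-second per-host delay
            gate.acquire("example.com");                      // returns immediately
            gate.acquire("example.com");                      // sleeps ~5 s before returning
            System.out.println("two polite fetches to the same host completed");
        }
    }

No matter how efficient the I/O layer is, this per-host rate cap puts a
ceiling on throughput for any crawl that is not spread over many hosts.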


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


RE: Niocchi - java asynchronous crawl library released

Posted by Fuad Efendi <fu...@efendi.ca>.
I like the architectural ideas behind Apache MINA (inspired by SEDA): for some (CPU-intensive) processing, such as parsing content, we need a single thread per CPU core, while for other (I/O-bound) work we need many more threads (each waiting on a network socket). It’s not just NIO...
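
A minimal sketch of that CPU-bound / I/O-bound split using plain
java.util.concurrent - pool sizes and the fetch/parse bodies are
placeholders, not MINA or Nutch code:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class TwoStagePipeline {

        // I/O-bound stage: threads spend most of their time blocked on sockets,
        // so the pool can be much larger than the number of cores.
        private final ExecutorService fetchPool = Executors.newFixedThreadPool(512);

        // CPU-bound stage: parsing saturates a core, so one thread per core is enough.
        private final ExecutorService parsePool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

        public void crawl(final String url) {
            fetchPool.submit(() -> {
                byte[] content = fetch(url);                    // blocks on the network
                parsePool.submit(() -> parse(url, content));    // hand off to the CPU stage
            });
        }

        private byte[] fetch(String url)            { /* HTTP GET elided */ return new byte[0]; }
        private void parse(String url, byte[] body) { /* outlink extraction elided */ }

        public void shutdown() {
            fetchPool.shutdown();
            parsePool.shutdown();
        }
    }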

-Fuad

From: Lukáš Vlček [mailto:lukas.vlcek@gmail.com] 
Sent: October-18-09 7:12 AM
To: nutch-dev@lucene.apache.org; droids-dev@incubator.apache.org
Subject: Niocchi - java asynchronous crawl library released

Hi,

I just noticed that Niocchi has been released recently.
http://www.niocchi.com/

Niocchi is a java asynchronous crawl library implemented with NIO. It is designed to crawl several thousands of hosts in parallel on a single low end server. It is currently being used in production by Enormo <http://www.enormo.com/> to crawl thousands of websites daily, and by Vitalprix <http://www.vitalprix.com/>.

Regards,
Lukas

