Posted to dev@nutch.apache.org by Otis Gospodnetic <og...@yahoo.com> on 2009/12/09 23:12:10 UTC

java.net.URL synchronization

Hello,

Has anyone seen this:
http://www.supermind.org/blog/580/java-net-url-synchronization-bottleneck ?

Is this something that needs to be addressed in Nutch (and thus in Bixo, and thus in the common crawler project)?


Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

RE: java.net.URL synchronization

Posted by Fuad Efendi <fu...@efendi.ca>.

Tomcat uses its own, slightly different version of the URL class:

http://tomcat.apache.org/tomcat-5.5-doc/catalina/docs/api/index.html
URL is designed to provide public APIs for parsing and synthesizing Uniform
Resource Locators as similar as possible to the APIs of java.net.URL, but
without the ability to open a stream or connection. One of the consequences
of this is that you can construct URLs for protocols for which a
URLStreamHandler is not available (such as an "https" URL when JSSE is not
installed).



The synchronized code in java.net.URL is URLStreamHandler-related.
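
If only parsing is needed, one option inside the JDK itself is java.net.URI, which parses purely syntactically and does not go through a URLStreamHandler. A minimal sketch follows; the class name UriParts is hypothetical and not taken from Nutch, Bixo, or Tomcat, and URI is stricter than URL about which characters it accepts, so some crawled links that URL tolerates may be rejected.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of Nutch, Bixo, or Tomcat.
final class UriParts {
    private UriParts() {}

    /** Parses the string syntactically; no URLStreamHandler is looked up. */
    static URI parse(String spec) {
        // URI.create throws IllegalArgumentException for syntax it rejects.
        return URI.create(spec);
    }

    /** Returns the host component, or null if the URI has none. */
    static String hostOf(String spec) {
        return parse(spec).getHost();
    }
}
```

For example, UriParts.hostOf("https://example.com:8443/a/b?q=1") yields the host without ever needing an "https" handler to be installed.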



Fuad Efendi
+1 416-993-2060
http://www.linkedin.com/in/liferay

RE: java.net.URL synchronization

Posted by Fuad Efendi <fu...@efendi.ca>.

I checked java.net.URL; yes, Nutch and BIXO implicitly use a synchronized
Hashtable, via the getURLStreamHandler(protocol) lookup in the constructor:

    public URL(String protocol, String host, int port, String file,
               URLStreamHandler handler) throws MalformedURLException {
        ...
        if (handler == null &&
            (handler = getURLStreamHandler(protocol)) == null) {
            throw new MalformedURLException("unknown protocol: " + protocol);
        }
        ...
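
Since the lookup in the excerpt only runs when handler is null, one way to sidestep it is to pass an explicit handler. The sketch below is only an illustration of that idea: ParseOnlyUrls and its nested handler are hypothetical names, and the resulting URL objects can be parsed and inspected but cannot open connections. If a SecurityManager is installed, specifying a handler also requires the corresponding permission.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

// Hypothetical helper for illustration; not part of Nutch or Bixo.
final class ParseOnlyUrls {
    private ParseOnlyUrls() {}

    /** Handler that supports parsing only; it refuses to open connections. */
    private static final class ParseOnlyHandler extends URLStreamHandler {
        @Override
        protected URLConnection openConnection(URL u) throws IOException {
            throw new IOException("parse-only URL, cannot connect: " + u);
        }
    }

    // One shared instance; the handler itself holds no mutable state.
    private static final URLStreamHandler HANDLER = new ParseOnlyHandler();

    /** Builds a URL with an explicit handler, so no handler lookup is needed. */
    @SuppressWarnings("deprecation") // URL constructors are deprecated in recent JDKs
    static URL parse(String spec) {
        try {
            return new URL(null, spec, HANDLER);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("cannot parse: " + spec, e);
        }
    }
}
```

Because the handler does not know any protocol defaults, getPort() returns -1 when the port is not written out, and getDefaultPort() does as well.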


However, I don't think it hurts, because both architectures (at least BIXO)
run a single thread per JVM to process, for instance, Outlinks. Only the
"Fetch" part is multithreaded, and it doesn't use the URL class.


I'm not sure how Nutch generates the fetch list. If that step is multithreaded,
then a RegexUrlNormalizer shared between threads is an even bigger problem...
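
To illustrate the concern about a shared normalizer, here is a minimal sketch of one that threads can share without locking, assuming it is built on java.util.regex. It is not Nutch's RegexUrlNormalizer; the class name and the single rule are hypothetical. It relies on compiled Pattern objects being immutable and safe to share, while each call creates its own Matcher.

```java
import java.util.regex.Pattern;

// Hypothetical normalizer for illustration; not the Nutch implementation.
final class SharedRegexNormalizer {
    private SharedRegexNormalizer() {}

    // Compiled once; Pattern instances are immutable and thread-safe.
    private static final Pattern SESSION_ID =
            Pattern.compile("(?i);jsessionid=[0-9a-z]+");

    /** Strips a ;jsessionid=... path parameter; safe to call concurrently. */
    static String normalize(String url) {
        // matcher() returns a fresh Matcher per call, so no state is shared.
        return SESSION_ID.matcher(url).replaceAll("");
    }
}
```

With this shape, no synchronized block is needed around normalize, because the only shared state is the compiled pattern.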


Fuad Efendi
+1 416-993-2060
http://www.tokenizer.ca/
Data Mining, Vertical Search

