Posted to dev@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2005/12/20 14:19:14 UTC

Static initializers

Hi,

This was mentioned before: there are many places in Nutch that rely on
static initializers. This is so-so or sometimes plainly bad, depending
on the situation.

I'm facing a problem now with URLFilters. I need to run several fetchers 
inside a single VM, with different parameters such as different url 
patterns (which is handled by URLFilters). But even if I specify 
different NutchConf-s to each fetcher, the list of implementations and 
the instances of URLFilter[] in URLFilters are initialized only once, 
and this happens from the default configuration obtained through a call 
to static NutchConf.get().

I would like to change it somehow, but I'm not sure how... One way to 
solve this would be to instantiate the plugins based on a concrete 
NutchConf instance, like this:

URLFilters:

    private URLFilters(NutchConf conf) {
       // initialize plugins based on this instance of NutchConf
    }

    public static URLFilters get(NutchConf conf) {
       URLFilters res = (URLFilters)conf.get("urlfilters.key");
       if (res == null) {
          res = new URLFilters(conf);
          conf.put("urlfilters.key", res);
       }
       return res;
    }

In case you are running with a single NutchConf per JVM it doesn't change anything. In case you want to run several different configs in a single JVM this approach provides the solution. We could follow this strategy for other plugin registry facades. Comments?
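
A minimal, self-contained sketch of the per-configuration caching idea described above. The NutchConf class here is only a stand-in: its getObject/setObject cache accessors and the "urlfilter.pattern" property are assumptions made for illustration, not the API discussed in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for NutchConf: string properties plus a per-instance object cache.
class NutchConf {
  private final Map<String, String> props = new HashMap<>();
  private final Map<String, Object> objects = new HashMap<>();

  void set(String name, String value) { props.put(name, value); }
  String get(String name) { return props.get(name); }

  // Assumed cache accessors, invented for this sketch.
  synchronized Object getObject(String key) { return objects.get(key); }
  synchronized void setObject(String key, Object value) { objects.put(key, value); }
}

class URLFilters {
  private static final String CACHE_KEY = URLFilters.class.getName();
  private final String pattern;

  private URLFilters(NutchConf conf) {
    // A real implementation would instantiate the URLFilter plugins here.
    this.pattern = conf.get("urlfilter.pattern"); // hypothetical property name
  }

  // One URLFilters instance per NutchConf instance, created lazily.
  static URLFilters get(NutchConf conf) {
    synchronized (conf) {
      URLFilters res = (URLFilters) conf.getObject(CACHE_KEY);
      if (res == null) {
        res = new URLFilters(conf);
        conf.setObject(CACHE_KEY, res);
      }
      return res;
    }
  }

  boolean accepts(String url) {
    return pattern == null || url.matches(pattern);
  }
}
```

With two NutchConf instances carrying different patterns, URLFilters.get(confA) and URLFilters.get(confB) would return distinct objects, while repeated calls with the same instance return the cached one.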


-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Static initializers

Posted by ma...@provinzial.com.
Hi, 

This is what I did to make NutchConf behave less statically,
without patching any of those 195 places Stefan mentioned.

NutchConf.get() yields the current config.
OpenConf sets a new current config.
Finally, CloseConf closes this config.

But be warned about the issues with the plugin cache mentioned earlier.

Greetings, Marcel Schnippe.

import java.util.Stack;

public class NutchConf {
  //...

  public static final NutchConf DEFAULT = new NutchConf();

  /** Per-thread stack of configurations; DEFAULT is always at the bottom. */
  private static ThreadLocal threadNutchConf = new ThreadLocal() {
    protected synchronized Object initialValue() {
      Stack confs = new Stack();
      confs.push(NutchConf.DEFAULT);
      return confs;
    }
  };

  /** Return the current default configuration (see {@link #OpenConf}). */
  public static NutchConf get() {
    return (NutchConf) ((Stack) threadNutchConf.get()).peek();
  }

  /**
   * Open a new thread-specific configuration, which will be returned by
   * calls to {@link #get} until finally closed by {@link #CloseConf}.
   * @param conf a NutchConf generated with new NutchConf and
   *             {@link #addConfResource}.
   */
  public static void OpenConf(NutchConf conf) {
    Stack confs = (Stack) threadNutchConf.get();
    confs.push(conf);
  }

  /**
   * Close the configuration opened by {@link #OpenConf} and return to the
   * previous or default+site configuration.
   */
  public static void CloseConf() {
    ((Stack) threadNutchConf.get()).pop();
  }

  //...
}
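
A hedged sketch of how the OpenConf/CloseConf pair above might be used: wrapping the work in try/finally keeps the per-thread stack balanced even if the work throws. The reduced NutchConf and the FetcherTask class are hypothetical and written with generics for brevity. The guard that refuses to pop DEFAULT is an addition of this sketch, not part of the code posted above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical, reduced NutchConf; only the thread-local stack handling is shown.
class NutchConf {
  static final NutchConf DEFAULT = new NutchConf("default");

  private static final ThreadLocal<Deque<NutchConf>> CONFS =
      ThreadLocal.withInitial(() -> {
        Deque<NutchConf> stack = new ArrayDeque<>();
        stack.push(DEFAULT);
        return stack;
      });

  final String name;

  NutchConf(String name) { this.name = name; }

  static NutchConf get() { return CONFS.get().peek(); }

  static void OpenConf(NutchConf conf) { CONFS.get().push(conf); }

  static void CloseConf() {
    Deque<NutchConf> stack = CONFS.get();
    if (stack.size() > 1) {
      stack.pop(); // assumption of this sketch: DEFAULT is never popped
    }
  }
}

// Hypothetical task that runs with its own configuration on whatever thread executes it.
class FetcherTask implements Runnable {
  private final NutchConf conf;
  volatile String seen;

  FetcherTask(NutchConf conf) { this.conf = conf; }

  public void run() {
    NutchConf.OpenConf(conf);
    try {
      // Any code calling the static NutchConf.get() now sees this thread's config.
      seen = NutchConf.get().name;
    } finally {
      NutchConf.CloseConf();
    }
  }
}
```

Because the stack is thread-local, two FetcherTask instances running on different threads would not see each other's configuration.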


Re: Static initializers

Posted by Stefan Groschupf <sg...@media-style.com>.
Andrzej,
Well, I haven't finished digging into the problem, but I want to ask
some more questions.
BTW I counted 195 places that use NutchConf.get(), so this will be a
bigger patch. :)

As I mentioned, I would love to go the inversion-of-control way: not
passing NutchConf in the constructor, but making classes implement
the Configurable interface. This would make sense, for example, for
all classes that implement an extension.
But there are also classes where this makes no sense. For example, I
would suggest changing the plugin registry from a singleton to a
'normal' object. In this case I guess it makes sense to pass the
NutchConf in the constructor, since the configuration here only needs
to supply the include and exclude regexes for the plugins.
So:
>    Extension.getExtensionInstance() -> getExtensionInstance(NutchConf)
This makes sense. Here we can check whether the class implements the
Configurable interface and, if so, instantiate the object and set the
configuration.
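
A rough sketch of the check described above, assuming a Configurable interface with setConf/getConf. All types here are simplified stand-ins, and RegexFilter and PlainFilter are invented example extensions.

```java
// Hypothetical stand-ins; names follow the thread, not necessarily the real Nutch API.
class NutchConf {
  final String label;

  NutchConf(String label) { this.label = label; }
}

interface Configurable {
  void setConf(NutchConf conf);
  NutchConf getConf();
}

class Extension {
  private final Class<?> implClass;

  Extension(Class<?> implClass) { this.implClass = implClass; }

  // Instantiate the extension and inject the configuration only if it asks for one.
  Object getExtensionInstance(NutchConf conf) {
    try {
      Object instance = implClass.getDeclaredConstructor().newInstance();
      if (instance instanceof Configurable) {
        ((Configurable) instance).setConf(conf);
      }
      return instance;
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Cannot instantiate " + implClass.getName(), e);
    }
  }
}

// Example extension that wants a configuration.
class RegexFilter implements Configurable {
  private NutchConf conf;

  public void setConf(NutchConf conf) { this.conf = conf; }
  public NutchConf getConf() { return conf; }
}

// Example extension that needs no configuration at all.
class PlainFilter { }
```

Extensions that do not implement Configurable are created exactly as before, so existing plugins would keep working unchanged under this scheme.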

>    ExtensionPoint.getExtensions() -> getExtensions(NutchConf)
We don't need NutchConf here since, if I understand correctly, it is
only needed to identify the activated plugins, and that is done at
registry instantiation, which in this case takes a NutchConf as a
parameter.

>    PluginRepository.getExtensionPoint(String) -> getExtensionPoint(String, NutchConf)
We don't need it here either, since we use NutchConf only at registry
instantiation.
Otherwise we would have to build up the plugin dependency graph for
each method call.
Would you agree to have several plugin registries, possibly with
different NutchConfs, and to instantiate extensions with a NutchConf
but not to pass one when querying ExtensionPoints etc.?

> etc, etc...
>
> The way this would work would be similar to the mechanism described  
> above: if plugin instances are not created yet, they would be  
> created once (based on the current NutchConf argument), and then  
> cached in this NutchConf instance.
I guess this is difficult.
First, we have the plugin class instances; most or maybe all plugins I
know do not have a plugin class implementation. Second, we have the
extension classes, which at least do not need to implement a specific
interface from the plugin registry's point of view (only things like
the index filter interface etc.).
Caching plugin class instances makes sense, since there is actually
only one plugin class instance per plugin in the JVM. However, there
will be many instances of each extension class, since e.g. the
parser or protocol runs multithreaded.

>
> And also the plugin implementations would have to extend  
> NutchConfigured, taking NutchConf as the argument to their  
> constructors - because now the Extension.getExtensionInstance would  
> pass the current NutchConf instance to their constructors.

In general my point of view is this:
If we touch this issue anyway, I would love to do a radical
solution, since I have a different understanding of handling parameters
than collecting them in a kind of map and making the map generally
accessible.
Instead of giving every object access to the configuration object and
handling properties like a bazaar, I would prefer to handle
configuration only in the first object in the stack, which in our case
would be, for example, the indexing tool.
Then the indexing tool instantiates the plugin registry with only the
required properties, which would be part of the constructor, e.g.
pluginFolders, include and exclude regexes and the autoactivation flag.
Later the extension instances can also get some more values
injected, but in general they have no access to the configuration
object. This would first of all make things easier to test, but it also
allows much more flexibility to run several different fetchers in one
JVM.
Anyway, this would maybe be an improvement suggestion from me for Nutch
2.0 or 3.0; for now we would be some steps forward just by changing
NutchConf access to a non-static style.
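
To illustrate the "configuration only in the first object in the stack" idea, here is a small sketch. It assumes a hypothetical PluginRegistry constructor that takes exactly the four values named above, and it uses java.util.Properties in place of NutchConf. The property names are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.regex.Pattern;

// Hypothetical registry that receives only the values it needs, never the whole configuration.
class PluginRegistry {
  private final List<String> pluginFolders;
  private final Pattern include;
  private final Pattern exclude;
  private final boolean autoActivation;

  PluginRegistry(List<String> pluginFolders, String includeRegex,
                 String excludeRegex, boolean autoActivation) {
    this.pluginFolders = pluginFolders;
    this.include = Pattern.compile(includeRegex);
    this.exclude = Pattern.compile(excludeRegex);
    this.autoActivation = autoActivation;
  }

  // A plugin is active if its id matches the include regex and not the exclude regex.
  boolean isActive(String pluginId) {
    return include.matcher(pluginId).matches()
        && !exclude.matcher(pluginId).matches();
  }

  List<String> getPluginFolders() { return pluginFolders; }
  boolean isAutoActivation() { return autoActivation; }
}

// The "first object in the stack": the only place that reads the configuration.
class IndexingTool {
  final PluginRegistry registry;

  IndexingTool(Properties config) {
    // Property names below are assumptions for this sketch.
    registry = new PluginRegistry(
        Arrays.asList(config.getProperty("plugin.folders", "plugins").split(",")),
        config.getProperty("plugin.includes", ".*"),
        config.getProperty("plugin.excludes", ""),
        Boolean.parseBoolean(config.getProperty("plugin.auto-activation", "true")));
  }
}
```

Since the registry no longer touches any global state, a test can construct it directly with literal values, and two tools in one JVM can hold registries built from different settings.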


I hope to find some time in the next few days to do some experiments
and will come back with more details.

However, we should find a general agreement about the way we go,
since changing code in 195 places, and the lines that depend on them,
for nothing is not much fun.

Stefan



Re: Static initializers

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jérôme Charron wrote:

>Andrzej,
>
>How do you choose the NutchConf to use ?
>  
>

It is provided as an argument to all constructors.

>Here is a short discussion I had with Doug about a kind of dynamic NutchConf
>inside the same JVM:
>
>"... By looking at the mailing lists archives it seems that having some
>behavior depending on the document's URL is a recurring problem (for instance
>for boosting documents matching a url pattern - NUTCH-16 issue, and many
>other topics).
>So, our idea is to provide a "dynamic" nutch configuration (one that
>overrides the default, as nutch-site does) based on documents matching
>url patterns. The idea is as follows:
>  
>

Well, it's a neat idea, but it's not necessarily what I was proposing. 
My proposal could be the first step to implement this.

>1. The default configuration is, as usual, the nutch-default.xml file
>
>2. An xml file can map url regexps to other configuration
>files (that override the nutch-default):
><nutch:conf>
>  <url regexp="http://www.mydomain1.com/*">
>    <!-- A set of nutch properties that override the nutch-default for this
>domain -->
>    <property>
>        <name>property1</name>
>        <value>value1</value>
>    </property>
>    ....
>   </url>
>   ....
></nutch:conf>"
>
>What do you think about this?
>  
>

Yes, if you can specify different configs for every run, or even for 
every invocation, it's certainly possible.

>
>>Looking deeper, this is messier than I thought... Some changes would
>>be required to the plugin instantiation mechanisms, e.g.:
>>
>>    Extension.getExtensionInstance() -> getExtensionInstance(NutchConf)
>>    ExtensionPoint.getExtensions() -> getExtensions(NutchConf)
>>    PluginRepository.getExtensionPoint(String) -> getExtensionPoint(String, NutchConf)
>>
>>etc, etc...
>>
>>The way this would work would be similar to the mechanism described
>>above: if plugin instances are not created yet, they would be created
>>once (based on the current NutchConf argument), and then cached in this
>>NutchConf instance.
>>
>>And also the plugin implementations would have to extend
>>NutchConfigured, taking NutchConf as the argument to their constructors
>>- because now the Extension.getExtensionInstance would pass the current
>>NutchConf instance to their constructors.
>>    
>>
>
>That's exactly what I had in mind while speaking about a dynamic NutchConf
>with Doug.
>For me it's a +1
>The only thing I don't really like is extending NutchConfigured, but it
>is the most secure way to implement it.
>  
>

Well, it's a form of enforcing a contract for the constructors. There is 
no other way to do it in Java - you can't specify the required 
constructors in an interface. OTOH you have the NutchConfigurable 
interface, which we could use instead, but then you have to remember to 
call setConf() before you do anything else...

I'll work on this to see where it leads.
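
A small sketch of the constructor-contract point made above: an interface cannot require a (NutchConf) constructor, so the contract can only be checked reflectively at instantiation time. NutchConfigured here is a simplified stand-in, and Instantiator, GoodFilter and BadFilter are hypothetical names invented for the example.

```java
import java.lang.reflect.Constructor;

// Hypothetical stand-in for the configuration class.
class NutchConf { }

// Base class: subclasses must hand a NutchConf to super(), but Java cannot
// force them to expose a (NutchConf) constructor of their own.
abstract class NutchConfigured {
  private final NutchConf conf;

  protected NutchConfigured(NutchConf conf) { this.conf = conf; }

  public NutchConf getConf() { return conf; }
}

// Follows the convention: offers a (NutchConf) constructor.
class GoodFilter extends NutchConfigured {
  GoodFilter(NutchConf conf) { super(conf); }
}

// Compiles fine, yet breaks the convention: only a no-arg constructor.
class BadFilter extends NutchConfigured {
  BadFilter() { super(new NutchConf()); }
}

class Instantiator {
  // The (NutchConf) constructor is looked up at runtime; the compiler cannot demand it.
  static <T> T create(Class<T> type, NutchConf conf) {
    try {
      Constructor<T> ctor = type.getDeclaredConstructor(NutchConf.class);
      return ctor.newInstance(conf);
    } catch (NoSuchMethodException e) {
      throw new IllegalArgumentException(
          type.getName() + " lacks a (NutchConf) constructor", e);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Cannot instantiate " + type.getName(), e);
    }
  }
}
```

In this sketch BadFilter is rejected only when Instantiator.create is called, which is the trade-off against the setConf() approach: the constructor convention fails late but leaves no window in which an instance exists without a configuration.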




Re: Static initializers

Posted by Jérôme Charron <je...@gmail.com>.
Andrzej,

How do you choose the NutchConf to use ?
Here is a short discussion I had with Doug about a kind of dynamic NutchConf
inside the same JVM:

"... By looking at the mailing lists archives it seems that having some
behavior depending on the document's URL is a recurring problem (for instance
for boosting documents matching a url pattern - NUTCH-16 issue, and many
other topics).
So, our idea is to provide a "dynamic" nutch configuration (one that
overrides the default, as nutch-site does) based on documents matching
url patterns. The idea is as follows:

1. The default configuration is, as usual, the nutch-default.xml file

2. An xml file can map url regexps to other configuration
files (that override the nutch-default):
<nutch:conf>
  <url regexp="http://www.mydomain1.com/*">
    <!-- A set of nutch properties that override the nutch-default for this
domain -->
    <property>
        <name>property1</name>
        <value>value1</value>
    </property>
    ....
   </url>
   ....
</nutch:conf>"

What do you think about this?
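
One way the URL-to-override mapping proposed above could be resolved at runtime, sketched with java.util.Properties instead of NutchConf. UrlConfResolver and its methods are hypothetical. The sketch assumes the regexp attribute holds a full Java regular expression that must match the entire URL, and that the first matching entry wins.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;
import java.util.regex.Pattern;

// Hypothetical resolver: picks per-URL overrides on top of the default properties.
class UrlConfResolver {
  private final Properties defaults;
  // LinkedHashMap keeps the declaration order of the <url> entries.
  private final Map<Pattern, Properties> overrides = new LinkedHashMap<>();

  UrlConfResolver(Properties defaults) { this.defaults = defaults; }

  void addOverride(String urlRegex, Properties props) {
    overrides.put(Pattern.compile(urlRegex), props);
  }

  // Effective properties for a URL: first matching override wins, else the defaults.
  Properties resolve(String url) {
    for (Map.Entry<Pattern, Properties> e : overrides.entrySet()) {
      if (e.getKey().matcher(url).matches()) {
        Properties effective = new Properties(defaults); // defaults act as fallback
        effective.putAll(e.getValue());
        return effective;
      }
    }
    return defaults;
  }
}
```

Properties missing from a matching override fall back to the default values, which mirrors how nutch-site overrides nutch-default in the description above.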


> Looking deeper, this is messier than I thought... Some changes would
> be required to the plugin instantiation mechanisms, e.g.:
>
>     Extension.getExtensionInstance() -> getExtensionInstance(NutchConf)
>     ExtensionPoint.getExtensions() -> getExtensions(NutchConf)
>     PluginRepository.getExtensionPoint(String) -> getExtensionPoint(String, NutchConf)
>
> etc, etc...
>
> The way this would work would be similar to the mechanism described
> above: if plugin instances are not created yet, they would be created
> once (based on the current NutchConf argument), and then cached in this
> NutchConf instance.
>
> And also the plugin implementations would have to extend
> NutchConfigured, taking NutchConf as the argument to their constructors
> - because now the Extension.getExtensionInstance would pass the current
> NutchConf instance to their constructors.

That's exactly what I had in mind while speaking about a dynamic NutchConf
with Doug.
For me it's a +1
The only thing I don't really like is extending NutchConfigured, but it
is the most secure way to implement it.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: Static initializers

Posted by Andrzej Bialecki <ab...@getopt.org>.
Andrzej Bialecki wrote:

> URLFilters:
>
>    private URLFilters(NutchConf conf) {
>       // initialize plugins based on this instance of NutchConf
>    }
>
>    public static URLFilters get(NutchConf conf) {
>       URLFilters res = (URLFilters)conf.get("urlfilters.key");
>       if (res == null) {
>          res = new URLFilters(conf);
>          conf.put("urlfilters.key", res);
>       }
>       return res;
>    }
>

Looking deeper, this is messier than I thought... Some changes would
be required to the plugin instantiation mechanisms, e.g.:

    Extension.getExtensionInstance() -> getExtensionInstance(NutchConf)
    ExtensionPoint.getExtensions() -> getExtensions(NutchConf)
    PluginRepository.getExtensionPoint(String) -> getExtensionPoint(String, NutchConf)

etc, etc...

The way this would work would be similar to the mechanism described 
above: if plugin instances are not created yet, they would be created 
once (based on the current NutchConf argument), and then cached in this 
NutchConf instance.

And also the plugin implementations would have to extend 
NutchConfigured, taking NutchConf as the argument to their constructors 
- because now the Extension.getExtensionInstance would pass the current 
NutchConf instance to their constructors.




Re: Static initializers

Posted by Stefan Groschupf <sg...@media-style.com>.
Hi,
Right, this is a known problem that has been discussed several times; we
should start solving it. :-)
I suggest that we make the Plugin class implement the Configurable
interface. In case a plugin needs any configuration values, it will
request them from the plugin instance.
The next step would be changing the plugin registry from a singleton
to a normal object that needs to be instantiated with a Nutch
configuration object in the constructor.

In general I suggest we use an inversion-of-control style mechanism
(http://www.martinfowler.com/articles/injection.html) to solve these
kinds of problems. From my point of view this is the cleanest possible
solution, and it also allows changing e.g. configuration objects at
runtime.


Stefan


On 20.12.2005 at 14:19, Andrzej Bialecki wrote:

> Hi,
>
> This was mentioned before: there are many places in Nutch that rely  
> on static initializers. This is so-so or sometimes plainly bad,
> depending on the situation.
>
> I'm facing a problem now with URLFilters. I need to run several  
> fetchers inside a single VM, with different parameters such as  
> different url patterns (which is handled by URLFilters). But even  
> if I specify different NutchConf-s to each fetcher, the list of  
> implementations and the instances of URLFilter[] in URLFilters are  
> initialized only once, and this happens from the default  
> configuration obtained through a call to static NutchConf.get().
>
> I would like to change it somehow, but I'm not sure how... One way  
> to solve this would be to instantiate the plugins based on a  
> concrete NutchConf instance, like this:
>
> URLFilters:
>
>    private URLFilters(NutchConf conf) {
>       // initialize plugins based on this instance of NutchConf
>    }
>
>    public static URLFilters get(NutchConf conf) {
>       URLFilters res = (URLFilters)conf.get("urlfilters.key");
>       if (res == null) {
>          res = new URLFilters(conf);
>          conf.put("urlfilters.key", res);
>       }
>       return res;
>    }
>
> In case you are running with a single NutchConf per JVM it doesn't  
> change anything. In case you want to run several different configs  
> in a single JVM this approach provides the solution. We could  
> follow this strategy for other plugin registry facades. Comments?