Posted to dev@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2005/12/20 14:19:14 UTC
Static initializers
Hi,
This was mentioned before: there are many places in Nutch that rely on
static initializers. This is so-so or sometimes plainly bad, depending
on the situation.
I'm facing a problem now with URLFilters. I need to run several fetchers
inside a single VM, with different parameters such as different URL
patterns (which are handled by URLFilters). But even if I pass a
different NutchConf to each fetcher, the list of implementations and
the instances of URLFilter[] in URLFilters are initialized only once,
and this happens from the default configuration obtained through a call
to the static NutchConf.get().
I would like to change it somehow, but I'm not sure how... One way to
solve this would be to instantiate the plugins based on a concrete
NutchConf instance, like this:
URLFilters:
  private URLFilters(NutchConf conf) {
    // initialize plugins based on this instance of NutchConf
  }

  public static URLFilters get(NutchConf conf) {
    // reuse the instance cached in this NutchConf, if there is one
    URLFilters res = (URLFilters)conf.get("urlfilters.key");
    if (res == null) {
      // first request for this NutchConf: create and cache the instance
      res = new URLFilters(conf);
      conf.put("urlfilters.key", res);
    }
    return res;
  }
If you are running with a single NutchConf per JVM, this doesn't change
anything. If you want to run several different configs in a single JVM,
this approach provides a solution. We could follow the same strategy for
the other plugin registry facades. Comments?
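To make the per-configuration cache concrete, here is a minimal, self-contained sketch. It is not Nutch code: SketchConf, SketchUrlFilters, the getObject/setObject object cache, and the "urlfilter.pattern" property are all names assumed for illustration, standing in for NutchConf, URLFilters, and the conf.get/conf.put calls above.

```java
import java.util.HashMap;
import java.util.Map;

class UrlFiltersDemo {

    /** Hypothetical stand-in for NutchConf: string properties plus an object cache. */
    static final class SketchConf {
        private final Map<String, String> props = new HashMap<>();
        private final Map<String, Object> objects = new HashMap<>();

        void set(String name, String value) { props.put(name, value); }
        String get(String name) { return props.get(name); }

        // Assumed object cache; it plays the role of conf.get/conf.put in the proposal.
        Object getObject(String key) { return objects.get(key); }
        void setObject(String key, Object value) { objects.put(key, value); }
    }

    /** Facade that exists once per SketchConf instead of once per JVM. */
    static final class SketchUrlFilters {
        private static final String KEY = "urlfilters.key";
        private final String pattern;

        private SketchUrlFilters(SketchConf conf) {
            // A real facade would instantiate plugins here; the sketch reads one property.
            this.pattern = conf.get("urlfilter.pattern");
        }

        static SketchUrlFilters get(SketchConf conf) {
            // Lock on the conf so two threads sharing it do not create two instances.
            synchronized (conf) {
                SketchUrlFilters res = (SketchUrlFilters) conf.getObject(KEY);
                if (res == null) {
                    res = new SketchUrlFilters(conf);
                    conf.setObject(KEY, res);
                }
                return res;
            }
        }

        boolean accepts(String url) { return url.matches(pattern); }
    }

    public static void main(String[] args) {
        SketchConf a = new SketchConf();
        a.set("urlfilter.pattern", "http://a\\.example/.*");
        SketchConf b = new SketchConf();
        b.set("urlfilter.pattern", "http://b\\.example/.*");

        // Same conf: same cached instance. Different conf: a separate instance.
        System.out.println(SketchUrlFilters.get(a) == SketchUrlFilters.get(a)); // prints "true"
        System.out.println(SketchUrlFilters.get(a) == SketchUrlFilters.get(b)); // prints "false"
        System.out.println(SketchUrlFilters.get(b).accepts("http://b.example/x")); // prints "true"
    }
}
```

The design choice worth noting is that the cache lives in the configuration object itself, so its lifetime follows the configuration rather than the class loader.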
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Static initializers
Posted by ma...@provinzial.com.
Hi,
This is what I did to make NutchConf behave less statically,
without patching any of those 195 places Stefan mentioned.
NutchConf.get() yields the current config.
OpenConf sets a new current config.
Finally, CloseConf closes this config.
But be warned about the issues with the plugin cache mentioned earlier.
Greetings, Marcel Schnippe.
import java.util.Stack;

public class NutchConf {
  //...
  public static final NutchConf DEFAULT = new NutchConf();

  // Per-thread stack of configurations; DEFAULT is always at the bottom.
  private static ThreadLocal threadNutchConf = new ThreadLocal() {
    protected synchronized Object initialValue() {
      Stack confs = new Stack();
      confs.push(NutchConf.DEFAULT);
      return confs;
    }
  };

  /** Return the current default configuration (see {@link #OpenConf}). */
  public static NutchConf get() {
    return (NutchConf) ((Stack) threadNutchConf.get()).peek();
  }

  /**
   * Open a new thread-specific configuration, which will be returned by
   * calls to {@link #get} until finally closed by {@link #CloseConf}.
   * @param conf a NutchConf generated with new NutchConf and
   *        {@link #addConfResource}.
   */
  public static void OpenConf(NutchConf conf) {
    ((Stack) threadNutchConf.get()).push(conf);
  }

  /**
   * Close the configuration opened by {@link #OpenConf} and return to the
   * previous or default+site configuration.
   */
  public static void CloseConf() {
    Stack confs = (Stack) threadNutchConf.get();
    // Keep DEFAULT on the stack so get() never sees an empty stack.
    if (confs.size() > 1) {
      confs.pop();
    }
  }
  //...
}
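As a rough illustration of how such a thread-local stack behaves, here is a self-contained sketch. It is not the patch above: ConfStackDemo, Conf, open, close, and current are assumed names, and it uses generics and ArrayDeque rather than the raw Stack. It also shows one consequence that matters for multithreaded fetchers: a thread started while a configuration is open does not see it, because each thread starts with its own stack.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class ConfStackDemo {

    /** Hypothetical stand-in for NutchConf; only carries a name. */
    static final class Conf {
        final String name;
        Conf(String name) { this.name = name; }
    }

    static final Conf DEFAULT = new Conf("default");

    // Each thread lazily gets its own stack with DEFAULT at the bottom.
    private static final ThreadLocal<Deque<Conf>> STACK = ThreadLocal.withInitial(() -> {
        Deque<Conf> stack = new ArrayDeque<>();
        stack.push(DEFAULT);
        return stack;
    });

    /** Configuration currently in effect for the calling thread. */
    static Conf current() { return STACK.get().peek(); }

    /** Make conf the current configuration for the calling thread. */
    static void open(Conf conf) { STACK.get().push(conf); }

    /** Return to the previous configuration; DEFAULT is never removed. */
    static void close() {
        Deque<Conf> stack = STACK.get();
        if (stack.size() > 1) {
            stack.pop();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        open(new Conf("fetcher-A"));
        try {
            System.out.println(current().name); // prints "fetcher-A"

            // A new thread has its own stack and therefore sees only DEFAULT.
            Thread worker = new Thread(() -> System.out.println(current().name)); // prints "default"
            worker.start();
            worker.join();
        } finally {
            close(); // always restore, even if the body throws
        }
        System.out.println(current().name); // prints "default"
    }
}
```

Pairing open with close in a try/finally block keeps the stack balanced when the code in between fails.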
Re: Static initializers
Posted by Stefan Groschupf <sg...@media-style.com>.
Andrzej,
well, I'm not done digging into the problem yet, but I want to ask some
more questions.
BTW, I counted 195 places that use NutchConf.get(), so this will be a
bigger patch. :)
As I mentioned, I would love to go the inversion-of-control way: not
passing NutchConf in the constructor, but making classes implement
the Configurable interface. This would be sensible, for example, for
all classes that implement an extension.
But there are also classes where this makes no sense. For example, I
would suggest changing the plugin registry from a singleton to a
'normal' object. In that case I guess it makes sense to pass the
NutchConf in the constructor, since the configuration here only needs
to provide the include and exclude regexes for the plugins.
So:
> Extension.getExtensionInstance() -> getExtensionInstance(NutchConf)
This makes sense: here we can check whether the class implements the
Configurable interface and, if so, instantiate the object and set the
configuration.
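A minimal sketch of that check, assuming a Configurable interface with a single setConf method. The classes here (Conf, RegexFilter, PlainExtension) and the "filter.pattern" property are hypothetical stand-ins for illustration, not the actual Nutch plugin classes.

```java
import java.util.HashMap;
import java.util.Map;

class ExtensionInstanceDemo {

    /** Hypothetical stand-in for NutchConf. */
    static final class Conf {
        private final Map<String, String> props = new HashMap<>();
        void set(String name, String value) { props.put(name, value); }
        String get(String name) { return props.get(name); }
    }

    /** Assumed shape of the Configurable interface. */
    interface Configurable {
        void setConf(Conf conf);
    }

    /** An extension that wants configuration. */
    static final class RegexFilter implements Configurable {
        private String pattern;
        public RegexFilter() { }
        @Override public void setConf(Conf conf) { pattern = conf.get("filter.pattern"); }
        String pattern() { return pattern; }
    }

    /** An extension that needs no configuration at all. */
    static final class PlainExtension {
        public PlainExtension() { }
    }

    /** Instantiate by class name, then inject the conf only if the class asks for it. */
    static Object getExtensionInstance(String className, Conf conf) {
        try {
            Object instance = Class.forName(className).getDeclaredConstructor().newInstance();
            if (instance instanceof Configurable) {
                ((Configurable) instance).setConf(conf);
            }
            return instance;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        Conf conf = new Conf();
        conf.set("filter.pattern", "^http://.*");

        RegexFilter filter = (RegexFilter) getExtensionInstance(RegexFilter.class.getName(), conf);
        System.out.println(filter.pattern()); // prints "^http://.*"

        Object plain = getExtensionInstance(PlainExtension.class.getName(), conf);
        System.out.println(plain instanceof Configurable); // prints "false"
    }
}
```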
> ExtensionPoint.getExtensions() -> getExtensions(NutchConf)
We don't need NutchConf here. If I understand correctly, it is only
needed to identify the activated plugins, and that happens at
registry instantiation, which in this case takes a NutchConf as a parameter.
> PluginRepository.getExtensionPoint(String) -> getExtensionPoint
> (String, NutchConf)
We don't need it here either, since we use NutchConf at registry
instantiation.
The alternative would be having to build the plugin
dependency graph on every method call.
Would you agree to having several plugin registries, possibly with
different NutchConfs, where extensions are instantiated with a NutchConf
but querying ExtensionPoints etc. does not need one?
> etc, etc...
>
> The way this would work would be similar to the mechanism described
> above: if plugin instances are not created yet, they would be
> created once (based on the current NutchConf argument), and then
> cached in this NutchConf instance.
I guess this is difficult.
First, we have the plugin class instances; most or maybe all plugins I
know of do not have a plugin class implementation. Second, we have the
extension classes, which do not need to implement any specific
interface from the plugin registry's point of view (only things
like the index filter interface etc.).
Caching plugin class instances makes sense, since there is actually
only one plugin class instance per plugin in the JVM. However, there
will be many instances of each extension class, since e.g. the
parser or protocol runs multithreaded.
>
> And also the plugin implementations would have to extend
> NutchConfigured, taking NutchConf as the argument to their
> constructors - because now the Extension.getExtensionInstance would
> pass the current NutchConf instance to their constructors.
In general, my point of view is this:
If we touch this issue anyway, I would love to go for a radical
solution, since my understanding of handling parameters differs from
collecting them in a kind of map and making that map generally accessible.
Instead of giving every object access to the configuration object and
handling properties like a bazaar, I would prefer to handle configuration
only in the first object in the stack, which in our case would be, for
example, the indexing tool.
Then the indexing tool instantiates the plugin registry with only the
required properties, which would be constructor parameters, e.g.
pluginFolders, the include and exclude regexes, and the auto-activation flag.
Later, the extension instances can also get some more values
injected, but in general they have no access to the configuration object.
This would first of all make things more testable, but it would also allow
much more flexibility to run several different fetchers in one
JVM.
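A rough sketch of what such a registry could look like. This is an illustration under assumptions, not the Nutch plugin system: the PluginRegistry class here is hypothetical, it takes a list of available plugin ids in place of pluginFolders, and it leaves out the auto-activation flag. The point is only that the registry never sees a configuration object, so two differently configured registries can live side by side.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

class InjectedRegistryDemo {

    /** Hypothetical registry that receives only the values it needs. */
    static final class PluginRegistry {
        private final List<String> active = new ArrayList<>();

        PluginRegistry(List<String> availablePluginIds, String includeRegex, String excludeRegex) {
            Pattern include = Pattern.compile(includeRegex);
            Pattern exclude = Pattern.compile(excludeRegex);
            for (String id : availablePluginIds) {
                // A plugin is active if it is included and not excluded.
                if (include.matcher(id).matches() && !exclude.matcher(id).matches()) {
                    active.add(id);
                }
            }
        }

        List<String> activePlugins() { return Collections.unmodifiableList(active); }
    }

    public static void main(String[] args) {
        List<String> available =
            Arrays.asList("protocol-http", "protocol-ftp", "urlfilter-regex", "parse-html");

        // The tool at the top of the stack decides the values; the registry just receives them.
        PluginRegistry fetcherA = new PluginRegistry(available, "protocol-.*|urlfilter-regex", "protocol-ftp");
        PluginRegistry fetcherB = new PluginRegistry(available, ".*", "urlfilter-.*");

        System.out.println(fetcherA.activePlugins()); // prints "[protocol-http, urlfilter-regex]"
        System.out.println(fetcherB.activePlugins()); // prints "[protocol-http, protocol-ftp, parse-html]"
    }
}
```

Because the constructor arguments are plain values, a unit test can build a registry without any configuration files.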
Anyway, this would maybe be an improvement suggestion from me for Nutch
2.0 or 3.0; for now, we would be some steps forward just by changing
NutchConf access to a non-static style.
I hope to find some time in the next few days to do some experiments
and will come back with more details.
However, we should reach a general agreement about the way to go,
since changing code in 195 places, plus the lines that depend on them, for
nothing is not much fun.
Stefan
Re: Static initializers
Posted by Andrzej Bialecki <ab...@getopt.org>.
Jérôme Charron wrote:
>Andrzej,
>
>How do you choose the NutchConf to use ?
>
>
It is provided as an argument to all constructors.
>Here is a short discussion I had with Doug about a kind of dynamic NutchConf
>inside the same JVM:
>
>"... Looking at the mailing list archives, it seems that making
>behavior depend on a document's URL is a recurring problem (for instance,
>boosting documents that match a URL pattern, as in the NUTCH-16 issue, and many
>other topics).
>So our idea is to provide a "dynamic" Nutch configuration
>(one that overrides the default one, as nutch-site does) based on the URL
>patterns that documents match. The idea is as follows:
>
>
Well, it's a neat idea, but it's not necessarily what I was proposing.
My proposal could be the first step to implement this.
>1. The default configuration is, as usual, the nutch-default.xml file
>
>2. An XML file can map URL regexps to other configuration
>files (that override nutch-default):
><nutch:conf>
>  <url regexp="http://www.mydomain1.com/*">
>    <!-- A set of nutch properties that override the nutch-default for this
>         domain -->
>    <property>
>      <name>property1</name>
>      <value>value1</value>
>    </property>
>    ....
>  </url>
>  ....
></nutch:conf>"
>
>What do you think about this?
>
>
Yes, if you can specify different configs for every run, or even for
every invocation, it's certainly possible.
>
>>Looking deeper, this is messier than I thought... Some changes would
>>be required to the plugin instantiation mechanisms, e.g.:
>>
>> Extension.getExtensionInstance() -> getExtensionInstance(NutchConf)
>> ExtensionPoint.getExtensions() -> getExtensions(NutchConf)
>> PluginRepository.getExtensionPoint(String) ->
>>getExtensionPoint(String, NutchConf)
>>
>>etc, etc...
>>
>>The way this would work would be similar to the mechanism described
>>above: if plugin instances are not created yet, they would be created
>>once (based on the current NutchConf argument), and then cached in this
>>NutchConf instance.
>>
>>And also the plugin implementations would have to extend
>>NutchConfigured, taking NutchConf as the argument to their constructors
>>- because now the Extension.getExtensionInstance would pass the current
>>NutchConf instance to their constructors.
>>
>>
>
>That's exactly what I had in mind while speaking about a dynamic NutchConf
>with Doug.
>For me it's a +1
>The only thing I don't really like is extending NutchConfigured, but it
>is the most secure way to implement it.
>
>
Well, it's a form of enforcing a contract for the constructors. There is
no other way to do it in Java - you can't specify the required
constructors in an interface. OTOH you have the NutchConfigurable
interface, which we could use instead, but then you have to remember to
call setConf() before you do anything else...
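To illustrate the trade-off, here is a small sketch under assumptions: Conf and Configured are hypothetical stand-ins for NutchConf and NutchConfigured. A base class whose only constructor takes a Conf forces every subclass to obtain one at compile time, while the reflective lookup of the (Conf) constructor can only fail at run time if a class does not provide it.

```java
import java.lang.reflect.Constructor;

class ConstructorContractDemo {

    /** Hypothetical stand-in for NutchConf. */
    static final class Conf {
        final String name;
        Conf(String name) { this.name = name; }
    }

    /** Stand-in for NutchConfigured: subclasses cannot be built without a Conf. */
    static abstract class Configured {
        private final Conf conf;
        protected Configured(Conf conf) { this.conf = conf; }
        Conf getConf() { return conf; }
    }

    /** Follows the convention: a public constructor taking a Conf. */
    static final class GoodFilter extends Configured {
        public GoodFilter(Conf conf) { super(conf); }
    }

    /** Breaks the convention; the compiler cannot catch this. */
    static final class NoArgFilter {
        public NoArgFilter() { }
    }

    /** Look up the (Conf) constructor reflectively and call it. */
    static <T> T newInstance(Class<T> cls, Conf conf) {
        try {
            Constructor<T> ctor = cls.getConstructor(Conf.class);
            return ctor.newInstance(conf);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(cls.getName() + " has no public (Conf) constructor", e);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + cls.getName(), e);
        }
    }

    public static void main(String[] args) {
        Conf conf = new Conf("fetcher-A");
        GoodFilter filter = newInstance(GoodFilter.class, conf);
        System.out.println(filter.getConf().name); // prints "fetcher-A"

        try {
            newInstance(NoArgFilter.class, conf);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected"); // prints "rejected"
        }
    }
}
```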
I'll work on this to see where it leads.
--
Best regards,
Andrzej Bialecki <><
Re: Static initializers
Posted by Jérôme Charron <je...@gmail.com>.
Andrzej,
How do you choose the NutchConf to use ?
Here is a short discussion I had with Doug about a kind of dynamic NutchConf
inside the same JVM:
"... Looking at the mailing list archives, it seems that making
behavior depend on a document's URL is a recurring problem (for instance,
boosting documents that match a URL pattern, as in the NUTCH-16 issue, and many
other topics).
So our idea is to provide a "dynamic" Nutch configuration
(one that overrides the default one, as nutch-site does) based on the URL
patterns that documents match. The idea is as follows:
1. The default configuration is, as usual, the nutch-default.xml file
2. An XML file can map URL regexps to other configuration
files (that override nutch-default):
<nutch:conf>
  <url regexp="http://www.mydomain1.com/*">
    <!-- A set of nutch properties that override the nutch-default for this
         domain -->
    <property>
      <name>property1</name>
      <value>value1</value>
    </property>
    ....
  </url>
  ....
</nutch:conf>"
What do you think about this?
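One way the lookup for the mapping described above could work is sketched below. This is an illustration under assumptions, not an existing Nutch feature: Rule and resolve are hypothetical names, and the rules are built in code rather than parsed from the XML file. Note that the sketch writes the pattern as http://www\.mydomain1\.com/.* because, read as a Java regular expression, the trailing /* in the sample would only match repeated slashes.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class UrlConfDemo {

    /** One <url regexp="..."> entry: a pattern plus the properties it overrides. */
    static final class Rule {
        final Pattern urlPattern;
        final Map<String, String> overrides;

        Rule(String regexp, Map<String, String> overrides) {
            this.urlPattern = Pattern.compile(regexp);
            this.overrides = overrides;
        }
    }

    /** Start from the defaults and apply every rule whose pattern matches the URL. */
    static Map<String, String> resolve(Map<String, String> defaults, List<Rule> rules, String url) {
        Map<String, String> effective = new HashMap<>(defaults);
        for (Rule rule : rules) {
            if (rule.urlPattern.matcher(url).matches()) {
                effective.putAll(rule.overrides); // later rules win over earlier ones
            }
        }
        return effective;
    }

    public static void main(String[] args) {
        Map<String, String> defaults = new HashMap<>();
        defaults.put("property1", "default1");

        List<Rule> rules = Collections.singletonList(
            new Rule("http://www\\.mydomain1\\.com/.*",
                     Collections.singletonMap("property1", "value1")));

        System.out.println(resolve(defaults, rules, "http://www.mydomain1.com/page").get("property1")); // prints "value1"
        System.out.println(resolve(defaults, rules, "http://other.example/").get("property1"));         // prints "default1"
    }
}
```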
> Looking deeper, this is messier than I thought... Some changes would
> be required to the plugin instantiation mechanisms, e.g.:
>
> Extension.getExtensionInstance() -> getExtensionInstance(NutchConf)
> ExtensionPoint.getExtensions() -> getExtensions(NutchConf)
> PluginRepository.getExtensionPoint(String) ->
> getExtensionPoint(String, NutchConf)
>
> etc, etc...
>
> The way this would work would be similar to the mechanism described
> above: if plugin instances are not created yet, they would be created
> once (based on the current NutchConf argument), and then cached in this
> NutchConf instance.
>
> And also the plugin implementations would have to extend
> NutchConfigured, taking NutchConf as the argument to their constructors
> - because now the Extension.getExtensionInstance would pass the current
> NutchConf instance to their constructors.
That's exactly what I had in mind while speaking about a dynamic NutchConf
with Doug.
For me it's a +1
The only thing I don't really like is extending NutchConfigured, but it
is the most secure way to implement it.
Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Re: Static initializers
Posted by Andrzej Bialecki <ab...@getopt.org>.
Andrzej Bialecki wrote:
> URLFilters:
>
> private URLFilters(NutchConf conf) {
> // initialize plugins based on this instance of NutchConf
> }
>
> public static URLFilters get(NutchConf conf) {
> URLFilters res = (URLFilters)conf.get("urlfilters.key");
> if (res == null) {
> res = new URLFilters(conf);
> conf.put("urlfilters.key", res);
> }
> return res;
> }
>
Looking deeper, this is messier than I thought... Some changes would
be required to the plugin instantiation mechanisms, e.g.:
Extension.getExtensionInstance() -> getExtensionInstance(NutchConf)
ExtensionPoint.getExtensions() -> getExtensions(NutchConf)
PluginRepository.getExtensionPoint(String) ->
getExtensionPoint(String, NutchConf)
etc, etc...
The way this would work would be similar to the mechanism described
above: if plugin instances are not created yet, they would be created
once (based on the current NutchConf argument), and then cached in this
NutchConf instance.
And also the plugin implementations would have to extend
NutchConfigured, taking NutchConf as the argument to their constructors
- because now the Extension.getExtensionInstance would pass the current
NutchConf instance to their constructors.
--
Best regards,
Andrzej Bialecki <><
Re: Static initializers
Posted by Stefan Groschupf <sg...@media-style.com>.
Hi,
right, this is a known problem that has been discussed several times; we
should start solving it. :-)
I suggest that we make the plugin class implement the Configurable
interface. If a plugin needs any configuration values, it will
request them from the plugin instance.
The next step would be changing the plugin registry from a singleton
to a normal object that needs to be instantiated with a Nutch
configuration object in the constructor.
In general, I suggest we use an inversion-of-control style mechanism
(http://www.martinfowler.com/articles/injection.html) to solve these
kinds of problems. From my point of view this is the cleanest possible
solution, and it also allows changing e.g. configuration objects at
runtime.
Stefan