Posted to user@nutch.apache.org by Breno Faria <br...@intrafind.de> on 2015/06/01 09:28:03 UTC

Re: Deduplication -- custom Signature

Hi Lewis,

Thanks for your prompt answer!

> Nice! Out of curiosity can you share what your use case is? I would be really interested to hear more as I am interested in domain modeling and deduplication is part of this.

We are indexing several domains for a specific project, and these may contain duplicated content (e.g. PDF files). The users of the system come from different organisations and wonder why some content does not appear under certain domains. It's a usability issue (with a political aftertaste).
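A minimal sketch of what "domain aware" can mean here, assuming the signature is simply an MD5 digest over the host name plus the raw content bytes. The class name DomainAwareSignatureSketch and its calculate(String, byte[]) helper are hypothetical and use only the JDK; this is not the actual implementation discussed in this thread. Inside Nutch, equivalent logic would presumably live in the calculate(Content, Parse) method of a Signature subclass. The sketch also uses the full host name rather than a registered domain.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Locale;

public class DomainAwareSignatureSketch {

    /**
     * Hypothetical helper (not a Nutch API): MD5 over host + separator + content,
     * so identical content on different hosts yields different signatures.
     */
    static byte[] calculate(String url, byte[] content) {
        String host = URI.create(url).getHost();
        if (host == null) {
            host = ""; // no host in the URL: fall back to a content-only digest
        }
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            md5.update(host.toLowerCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8));
            md5.update((byte) 0); // separator between host and content
            md5.update(content);
            return md5.digest();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] pdf = "same PDF bytes".getBytes(StandardCharsets.UTF_8);
        byte[] a = calculate("http://a.example.org/doc.pdf", pdf);
        byte[] a2 = calculate("http://a.example.org/copy/doc.pdf", pdf);
        byte[] b = calculate("http://b.example.org/doc.pdf", pdf);
        System.out.println("same host, same content equal: " + Arrays.equals(a, a2));
        System.out.println("other host, same content equal: " + Arrays.equals(a, b));
    }
}
```

With a signature of this kind, copies of a file within one host still collapse to a single entry during deduplication, while one copy per host is kept.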

> Absolutely yes. I just need to know more about the structure of the Signature implementation. Did you extend Signature [0]? If so then it will be packaged along with the Nutch .job jar and can be invoked via the following property [1] db.signature.class

Yes, I extended Signature, and I'm also able to use it through the db.signature.class property if I pack the class into its own jar and put it into nutch/lib. I would much rather include it in our existing plugin jar, though. I'm not sure what you mean by ".job jar". We have been developing our plugin outside of Nutch and placing the corresponding jars into a plugin directory together with the plugin.xml. Is there any "magic" happening regarding the classpath when the plugin is built with ant inside Nutch? Is there a naming convention for the plugin name and the corresponding jar? Do they have to match?

The reason for developing our plugin outside of Nutch and decoupling the build environment is to make Nutch updates easier. That way we can simply download the binary release and overlay our plugin. I realize now that this seems to differ somewhat from the usual way of writing plugins for Nutch.

Best regards,


Breno Faria
Software Architect – Text Analytics
Intrafind Software AG
Tel:      +49 (89) 3090446-26 
Web:    http://www.intrafind.de

-----Original Message-----
From: Lewis John Mcgibbney [mailto:lewis.mcgibbney@gmail.com]
Sent: Sunday, 31 May 2015 20:47
To: user@nutch.apache.org
Subject: Re: Deduplication -- custom Signature

Hi Breno,

On Sun, May 31, 2015 at 12:30 AM, <us...@nutch.apache.org> wrote:

> I've implemented a custom domain aware Signature to be used in the
> deduplication phase.

> Is there a way of including this custom class in a plugin jar?


Absolutely yes. I just need to know more about the structure of the Signature implementation. Did you extend Signature [0]? If so then it will be packaged along with the Nutch .job jar and can be invoked via the following property [1] db.signature.class

[0]
http://nutch.apache.org/apidocs/apidocs-1.10/index.html?org/apache/nutch/crawl/Signature.html
[1]
https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L650-L656



> Nutch is not finding the class, and I suspect the cause is that
> the plugins are loaded by separate class loaders.

You may be better off invoking this as part of the 'core' codebase instead of as a plugin implementation. Please let us know, based on the above.
Thanks
Lewis