Posted to user@nutch.apache.org by "Toth, Attila" <At...@momentum.com> on 2012/09/27 18:20:47 UTC

Is SFTP supported / working?

Hi,

Has anyone been able to use SFTP with Nutch 2.0?

* I have enabled the out-of-the-box SFTP plugin via the plugin.includes property in nutch-site.xml
* I have added the appropriate line to prefix-urlfilter.txt
* I have configured Nutch to accept everything in regex-urlfilter.txt
* I am trying to inject a single sftp:// URL into a clean HBase / Nutch / Solr setup

I consider my setup working properly otherwise since I am able to inject / generate / fetch / parse / etc. a sample of 1,000 URLs from the DMOZ Open Directory (similar to the Nutch 1.x tutorial).

Here is the output of the inject command:

InjectorJob: starting
InjectorJob: urlDir: ***censored***
Skipping sftp://***censored***/:java.net.MalformedURLException: unknown protocol: sftp
InjectorJob: finished

Here is the related snippet from the log file with TRACE level:

2012-09-27 11:21:50,874 DEBUG plugin.PluginRepository - parsing: /home/totha/development/apache-nutch-2.0/plugins/protocol-sftp/plugin.xml
2012-09-27 11:21:50,875 DEBUG plugin.PluginRepository - plugin: id=protocol-sftp name=Sftp Protocol Plug-in version=1.0.0 provider=nutch.orgclass=null
2012-09-27 11:21:50,875 DEBUG plugin.PluginRepository - impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.sftp.Sftp
...
2012-09-27 11:21:50,880 INFO  plugin.PluginRepository - Registered Plugins:
2012-09-27 11:21:50,881 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2012-09-27 11:21:50,881 INFO  plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
2012-09-27 11:21:50,881 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2012-09-27 11:21:50,881 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2012-09-27 11:21:50,881 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
2012-09-27 11:21:50,881 INFO  plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
2012-09-27 11:21:50,881 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
2012-09-27 11:21:50,881 INFO  plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
2012-09-27 11:21:50,881 INFO  plugin.PluginRepository -         Sftp Protocol Plug-in (protocol-sftp)
2012-09-27 11:21:50,881 INFO  plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
2012-09-27 11:21:50,881 INFO  plugin.PluginRepository -         Tika Parser Plug-in (parse-tika)
2012-09-27 11:21:50,881 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2012-09-27 11:21:50,881 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2012-09-27 11:21:50,881 INFO  plugin.PluginRepository -         Anchor Indexing Filter (index-anchor)
2012-09-27 11:21:50,881 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)

Thanks.
IMPORTANT NOTICE: This message, including attachments, may be confidential or legally privileged and is for the intended recipient(s) only. Unauthorized distribution, copying or disclosure is strictly prohibited. By accepting email communications that may contain your personal information, you are deemed to consent to its transmission. Please delete this email if obtained in error and email confirmation to sender.

RE: Is SFTP supported / working?

Posted by "Toth, Attila" <At...@momentum.com>.
Hm, I did not see any problems during the build process...

I have double-checked the build setup for the SFTP plugin but did not find any issues - everything looks fine as far as I am concerned. I clean-built the whole project with extra attention to possible issues but it came out clean (the javadoc task complained about some missing references but it does not matter).

The jsch package is fetched and used, and it also shows up in the distribution.

I believe Nutch actually recognizes the SFTP plugin at startup (according to its log, at least); it just does not get associated with the actual requests...
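
For reference, here is a minimal sketch (not Nutch code) of where a message like this can come from. The "unknown protocol: sftp" text in the inject output matches what java.net.URL reports when the JVM has no stream handler registered for a scheme. Assuming the injector builds a java.net.URL for each seed, which I have not verified in the Nutch 2.0 source, the same failure can be reproduced outside Nutch. The class name UnknownProtocolCheck, the helper tryParse and the host sftp.example.org are made up for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UnknownProtocolCheck {

    /**
     * Hypothetical helper: returns null if java.net.URL accepts the string,
     * otherwise the message of the MalformedURLException it throws.
     */
    static String tryParse(String spec) {
        try {
            // java.net.URL looks up a stream handler for the scheme here;
            // this is plain JDK behaviour, independent of any Nutch plugin.
            new URL(spec);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // sftp.example.org is a placeholder host, not taken from the original post.
        String[] specs = {"http://example.org/", "sftp://sftp.example.org/"};
        for (String spec : specs) {
            String error = tryParse(spec);
            System.out.println(spec + " -> " + (error == null ? "accepted" : error));
        }
    }
}
```

On a stock JVM with no extra protocol handlers installed, the http URL should be accepted and the sftp one rejected. If that matches what happens inside the injector, the plugin being listed under "Registered Plugins" would not be enough on its own, since the URL is refused before any protocol plugin is asked to fetch it.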



________________________________________
From: Lewis John Mcgibbney [lewis.mcgibbney@gmail.com]
Sent: Thursday, September 27, 2012 12:41 PM
To: user@nutch.apache.org
Subject: Re: Is SFTP supported / working?


Re: Is SFTP supported / working?

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,

AFAIK this plugin has not been used extensively with Nutch 2.x. However,
here are some of my early observations which should help get it working.

1. The plugin's plugin.xml and Java source reference code from the jsch
package [0], so you will need to grab that and make it available...
please see below.
2. Have a look at the parse-tika plugin [1] for how plugin-specific
dependencies can be fetched from Maven Central. This is not
particularly complicated and you should be able to get it working
pretty easily.

If you are able to get things working then great, please submit a
patch to the Nutch Jira if possible. In the meantime I've opened a
ticket [2] to track the progress; if you could attach it there, that
would be excellent.

Lewis

[0] http://www.jcraft.com/jsch/
[1] https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/parse-tika/ivy.xml
[2] https://issues.apache.org/jira/browse/NUTCH-1474
