Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2017/11/05 20:49:00 UTC

[jira] [Commented] (NUTCH-2453) FTP protocol seems to have issues running multithreaded

    [ https://issues.apache.org/jira/browse/NUTCH-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239719#comment-16239719 ] 

Sebastian Nagel commented on NUTCH-2453:
----------------------------------------

Hi [~hiran], protocol-ftp is one of the oldest plugins (and now one of the rarely used ones). It may well not be thread-safe. Thanks for reporting this issue!
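
To illustrate what "not thread-safe" could mean here, the following is a minimal sketch, not the plugin's actual code. It assumes the failures come from several fetcher threads sharing one stateful client object, which the report does not confirm. FakeClient, fetch and distinctClientsUsed are hypothetical names; FakeClient merely stands in for a connection-holding FTP client. The sketch gives each thread its own client via ThreadLocal, so one thread's disconnect cannot close the connection another thread is still using.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class PerThreadClientSketch {

    /** Hypothetical stand-in for a stateful, non-thread-safe FTP client. */
    static final class FakeClient {
        private boolean connected;

        void connect() { connected = true; }

        void disconnect() { connected = false; }

        String retrieve(String url) {
            // If another thread had disconnected a shared instance in the
            // meantime, this is where a "Socket closed" style failure would surface.
            if (!connected) {
                throw new IllegalStateException("Socket closed");
            }
            return "fetched " + url;
        }
    }

    // One client per thread instead of a single instance shared by all threads.
    private static final ThreadLocal<FakeClient> CLIENT =
            ThreadLocal.withInitial(FakeClient::new);

    /** Connects, retrieves and disconnects using the calling thread's own client. */
    static String fetch(String url) {
        FakeClient client = CLIENT.get();
        client.connect();
        try {
            return client.retrieve(url);
        } finally {
            client.disconnect();
        }
    }

    /** Runs one fetch per thread and counts how many distinct clients were used. */
    static int distinctClientsUsed(int threadCount) {
        Set<FakeClient> seen = ConcurrentHashMap.newKeySet();
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threadCount; i++) {
            String url = "ftp://nas/file-" + i;
            Thread worker = new Thread(() -> {
                fetch(url);
                seen.add(CLIENT.get()); // the same instance fetch() just used
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(fetch("ftp://nas/a.txt"));
        System.out.println(distinctClientsUsed(4));
    }
}
```

Whether a per-thread client, a per-request client, or synchronization around a shared one fits the plugin best would depend on how it manages connections; the sketch only shows the isolation idea.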

> FTP protocol seems to have issues running multithreaded
> -------------------------------------------------------
>
>                 Key: NUTCH-2453
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2453
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>    Affects Versions: 1.13
>         Environment: Ubuntu 16.04.3 LTS
> OpenJDK 1.8.0_131
> nutch 1.14-SNAPSHOT
> Synology RS816
>            Reporter: Hiran Chaudhuri
>
> I tried running Nutch against my Synology NAS. As Nutch does not include an SMB protocol plugin, I turned on the FTP service on the NAS and configured Nutch to crawl ftp://nas. To increase crawl speed, I also set fetcher.threads.per.queue=10 in nutch-site.xml.
> As some files could not be downloaded and I could not see a useful error message, I changed the method org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Text, CrawlDatum) so that it not only returns the protocol status but also sends the full exception and stack trace to the logs:
> {{ } catch (Exception e) {
>     LOG.warn("Could not get {}", url, e);
>     return new ProtocolOutput(null, new ProtocolStatus(e));
>   }
> }}
> With this change in place, the logs showed messages such as:
> {{2017-10-25 22:52:54,699 WARN  org.apache.nutch.protocol.ftp.Ftp - ftp.client.login() failed: nas/192.168.178.43
> 2017-10-25 22:52:54,718 WARN  org.apache.nutch.protocol.ftp.Ftp - Error:
> java.net.SocketException: Socket closed
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>         at java.net.SocketInputStream.read(SocketInputStream.java:171)
>         at java.net.SocketInputStream.read(SocketInputStream.java:141)
>         at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
>         at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
>         at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>         at java.io.InputStreamReader.read(InputStreamReader.java:184)
>         at java.io.BufferedReader.fill(BufferedReader.java:161)
>         at java.io.BufferedReader.read(BufferedReader.java:182)
>         at org.apache.commons.net.io.CRLFLineReader.readLine(CRLFLineReader.java:58)
>         at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:310)
>         at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:290)
>         at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:479)
>         at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:552)
>         at org.apache.commons.net.ftp.FTP.user(FTP.java:698)
>         at org.apache.nutch.protocol.ftp.Client.login(Client.java:294)
>         at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:190)
>         at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:132)
>         at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
> 2017-10-25 22:52:54,721 WARN  org.apache.nutch.protocol.ftp.Ftp - Could not get ftp://nas/silver-sda2/home/hiran/Desktop/Segelclub.txt~
> org.apache.nutch.protocol.ftp.FtpException: java.net.SocketException: Socket closed
>         at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:308)
>         at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:132)
>         at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
> Caused by: java.net.SocketException: Socket closed
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>         at java.net.SocketInputStream.read(SocketInputStream.java:171)
>         at java.net.SocketInputStream.read(SocketInputStream.java:141)
>         at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
>         at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
>         at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>         at java.io.InputStreamReader.read(InputStreamReader.java:184)
>         at java.io.BufferedReader.fill(BufferedReader.java:161)
>         at java.io.BufferedReader.read(BufferedReader.java:182)
>         at org.apache.commons.net.io.CRLFLineReader.readLine(CRLFLineReader.java:58)
>         at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:310)
>         at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:290)
>         at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:479)
>         at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:552)
>         at org.apache.commons.net.ftp.FTP.user(FTP.java:698)
>         at org.apache.nutch.protocol.ftp.Client.login(Client.java:294)
>         at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:190)
>         ... 2 more
> 2017-10-25 22:52:54,730 WARN  org.apache.nutch.protocol.ftp.Ftp - Could not get ftp://nas/silver-sda2/home/hiran/svn/glib-2.2.3/tests/cxx-test.C
> org.apache.nutch.protocol.ftp.FtpException: java.net.SocketException: Socket closed
>         at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:308)
>         at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:132)
>         at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
> Caused by: java.net.SocketException: Socket closed
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>         at java.net.SocketInputStream.read(SocketInputStream.java:171)
>         at java.net.SocketInputStream.read(SocketInputStream.java:141)
>         at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
>         at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
>         at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>         at java.io.InputStreamReader.read(InputStreamReader.java:184)
>         at java.io.BufferedReader.fill(BufferedReader.java:161)
>         at java.io.BufferedReader.read(BufferedReader.java:182)
>         at org.apache.commons.net.io.CRLFLineReader.readLine(CRLFLineReader.java:58)
>         at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:310)
>         at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:290)
>         at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:479)
>         at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:552)
>         at org.apache.commons.net.ftp.FTP.user(FTP.java:698)
>         at org.apache.nutch.protocol.ftp.Client.login(Client.java:294)
>         at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:190)
>         ... 2 more
> 2017-10-25 22:52:54,734 WARN  org.apache.nutch.protocol.ftp.Ftp - Could not get ftp://nas/MediaPC/usr/include/asm-generic/shmparam.h
> org.apache.nutch.protocol.ftp.FtpException: java.net.SocketException: Socket is not connected
>         at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:308)
>         at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:132)
>         at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
> Caused by: java.net.SocketException: Socket is not connected
>         at java.net.Socket.getInputStream(Socket.java:905)
>         at org.apache.commons.net.SocketClient._connectAction_(SocketClient.java:143)
>         at org.apache.commons.net.ftp.FTP._connectAction_(FTP.java:374)
>         at org.apache.commons.net.SocketClient.connect(SocketClient.java:172)
>         at org.apache.commons.net.SocketClient.connect(SocketClient.java:266)
>         at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:175)
>         ... 2 more
> 2017-10-25 22:52:54,744 WARN  org.apache.nutch.protocol.ftp.Ftp - Could not get ftp://nas/MediaPC/home/hiran/.compiz/
> org.apache.nutch.protocol.ftp.FtpError: Ftp Error: 500
>         at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:151)
>         at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
> }}
> Please note that all these problems vanished when I configured fetcher.threads.per.queue back to 1 (the default value).


