Posted to user@storm.apache.org by Volker Janz <vo...@innogames.com> on 2015/04/29 14:51:02 UTC
hdfs-bolt write/sync problems
Hi,
we are using the storm-hdfs bolt (0.9.4) to write data from Kafka to
Hadoop (2.5.0-cdh5.2.0).
This works fine for us but we discovered some unexpected behavior:
Our bolt uses the TimedRotationPolicy to rotate finished files from one
location within HDFS to another. Unfortunately, there are some files
that remain within the "writing" location and do not get rotated, as the
following list shows (I performed this command today and our rotation
policy is set to 180 seconds):
hadoop fs -ls /tmp/storm-events/valid/collecting | grep "\-25"
-rw-r--r-- 3 storm storm 20512704 2015-04-25 12:41 /tmp/storm-events/valid/collecting/events_hdfs-bolt-valid-16-2-1429965520003.txt
-rw-r--r-- 3 storm storm 5559950 2015-04-25 12:32 /tmp/storm-events/valid/collecting/events_hdfs-bolt-valid-16-270-1429965058462.txt
-rw-r--r-- 3 storm storm 4174336 2015-04-25 00:00 /tmp/storm-events/valid/collecting/events_hdfs-bolt-valid-16-769-1429916336332.txt
-rw-r--r-- 3 storm storm 125230972 2015-04-25 12:43 /tmp/storm-events/valid/collecting/events_hdfs-bolt-valid-19-0-1429965627846.txt
-rw-r--r-- 3 storm storm 115531743 2015-04-25 12:45 /tmp/storm-events/valid/collecting/events_hdfs-bolt-valid-19-0-1429965816167.txt
-rw-r--r-- 3 storm storm 106212613 2015-04-25 12:48 /tmp/storm-events/valid/collecting/events_hdfs-bolt-valid-19-0-1429965953513.txt
-rw-r--r-- 3 storm storm 25599779 2015-04-25 12:39 /tmp/storm-events/valid/collecting/events_hdfs-bolt-valid-19-1042-1429965476558.txt
-rw-r--r-- 3 storm storm 20513134 2015-04-25 12:41 /tmp/storm-events/valid/collecting/events_hdfs-bolt-valid-21-2-1429965520003.txt
-rw-r--r-- 3 storm storm 5556055 2015-04-25 12:32 /tmp/storm-events/valid/collecting/events_hdfs-bolt-valid-21-270-1429965058462.txt
-rw-r--r-- 3 storm storm 4171264 2015-04-25 00:00 /tmp/storm-events/valid/collecting/events_hdfs-bolt-valid-21-769-1429916336335.txt
If you check those files with "hadoop fsck -openforwrite", there are no
open file handles.
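For context, the bolt is wired up roughly like this. This is a sketch,
not our exact topology code: the namenode URL, field delimiter, sync
policy, and destination path are placeholders; only the 180-second
TimedRotationPolicy and the "collecting" path match what is described
above:

import org.apache.storm.hdfs.bolt.HdfsBolt;
import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.bolt.rotation.TimedRotationPolicy;
import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;
import org.apache.storm.hdfs.common.rotation.MoveFileAction;

// Files are written under the "collecting" path; on each rotation the
// finished file is moved to its final location by the MoveFileAction.
HdfsBolt hdfsBolt = new HdfsBolt()
        .withFsUrl("hdfs://namenode:8020") // placeholder
        .withFileNameFormat(new DefaultFileNameFormat()
                .withPath("/tmp/storm-events/valid/collecting")
                .withPrefix("events_")
                .withExtension(".txt"))
        .withRecordFormat(new DelimitedRecordFormat()
                .withFieldDelimiter("\t")) // placeholder delimiter
        .withSyncPolicy(new CountSyncPolicy(1000)) // placeholder
        // rotate every 180 seconds, as described above
        .withRotationPolicy(new TimedRotationPolicy(
                180.0f, TimedRotationPolicy.TimeUnit.SECONDS))
        .addRotationAction(new MoveFileAction()
                .toDestination("/tmp/storm-events/valid/done")); // placeholder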
Now, if we have a look at the Nimbus UI, there are a lot of failed
tuples (but only on specific workers). The worker logs gave an
explanation of those failures:
tail -f worker-6704.log
2015-04-29T11:31:58.337+0000 o.a.s.h.b.HdfsBolt [WARN] write/sync failed.
org.apache.hadoop.ipc.RemoteException: java.lang.ArrayIndexOutOfBoundsException
    at org.apache.hadoop.ipc.Client.call(Client.java:1347) ~[stormjar.jar:0.1.0]
    at org.apache.hadoop.ipc.Client.call(Client.java:1300) ~[stormjar.jar:0.1.0]
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) ~[stormjar.jar:0.1.0]
    at com.sun.proxy.$Proxy8.updatePipeline(Unknown Source) ~[na:na]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_31]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_31]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_31]
    at java.lang.reflect.Method.invoke(Method.java:483) ~[na:1.8.0_31]
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) ~[stormjar.jar:0.1.0]
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) ~[stormjar.jar:0.1.0]
    at com.sun.proxy.$Proxy8.updatePipeline(Unknown Source) ~[na:na]
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.updatePipeline(ClientNamenodeProtocolTranslatorPB.java:791) ~[stormjar.jar:0.1.0]
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1047) ~[stormjar.jar:0.1.0]
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:823) ~[stormjar.jar:0.1.0]
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:475) ~[stormjar.jar:0.1.0]
So it seems that the hdfs-bolt still holds an instance of
FSDataOutputStream pointing to one of those files, but as soon as it
tries to write to or rotate it, this exception occurs. I also had a look
at the hdfs-bolt implementation to find the exact handling of such
problems (https://github.com/ptgoetz/storm-hdfs):
src/main/java/org/apache/storm/hdfs/bolt/HdfsBolt.java:89-118
@Override
public void execute(Tuple tuple) {
    try {
        [... write and/or rotate ...]
    } catch (IOException e) {
        LOG.warn("write/sync failed.", e);
        this.collector.fail(tuple);
    }
}
This handling just fails the tuple but keeps the corrupt
FSDataOutputStream instance. Therefore, those hdfs-bolt instances will
keep failing for every subsequent tuple. Of course, this does not result
in data loss, because the tuple gets replayed and might be handled by a
working instance, but it still causes some trouble :-).
Since the exception is not rethrown, we cannot handle this issue in
our implementation. One solution might be to adjust the exception
handling within the hdfs-bolt to renew the FSDataOutputStream instance
in case of an IOException - and still fail the tuple, of course. This
might be useful for other cases and users as well.
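As a rough sketch of what we have in mind (assuming helpers along the
lines of the bolt's closeOutputFile()/createOutputFile() methods; treat
the names and the exact recovery steps as illustrative, not as a tested
patch):

@Override
public void execute(Tuple tuple) {
    try {
        [... write and/or rotate ...]
    } catch (IOException e) {
        LOG.warn("write/sync failed.", e);
        this.collector.fail(tuple);
        // Proposed addition: drop the (possibly corrupt) stream and
        // open a fresh file, so this bolt instance can recover instead
        // of failing every subsequent tuple.
        try {
            closeOutputFile(); // best effort; the stream may already be broken
        } catch (IOException closeFailure) {
            LOG.warn("could not close corrupt output stream.", closeFailure);
        }
        try {
            createOutputFile(); // start a new file for subsequent tuples
        } catch (IOException reopenFailure) {
            LOG.error("could not reopen output file.", reopenFailure);
        }
    }
}

The failed tuple would still be replayed, but subsequent tuples would
hit a fresh stream instead of the broken one.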
The question now is whether some of you have discovered a similar
problem and whether our proposed solution makes sense.
Thanks a lot and best wishes
Volker
Re: hdfs-bolt write/sync problems
Posted by 马哲超 <ma...@gmail.com>.
The problem may be on the Hadoop side, e.g. wrong permissions.