Posted to mapreduce-user@hadoop.apache.org by Elliot West <te...@gmail.com> on 2016/04/27 15:43:52 UTC

DistCp CRC failure modes

Hello,

We are using DistCp V2 to replicate data between two HDFS file systems. We
were working on the assumption that we could rely on CRC checks to ensure
that the data was replicated correctly. However, after examining the DistCp
source code it seems that there are edge cases where the CRCs could differ
and yet the copy succeeds even when we are not skipping CRC checks.
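
For context, we run copies along these lines (paths are illustrative), and
we do not pass DistCp's -skipcrccheck option, so CRC comparison should be
in effect:

    hadoop distcp hdfs://source-nn:8020/data/events hdfs://target-nn:8020/data/events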

I'm wondering whether this is by design and, if so, what the reasoning
behind it is. If this is a bug, I'd like to raise an issue to fix it. If it
is by design, I'd like to propose the introduction of an option for
stricter CRC checks.

The code in question is contained in the method:

org.apache.hadoop.tools.util.DistCpUtils#checksumsAreEqual(...)

which can be seen here:

https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/DistCpUtils.java#L457


Specifically this code block suggests that if there is a failure when
trying to read the source or target checksum then the method will return
'true', implying that the check succeeded. In actual fact we just failed to
obtain the checksum and could perform no check.

    try {
      sourceChecksum = sourceChecksum != null ? sourceChecksum : sourceFS
          .getFileChecksum(source);
      targetChecksum = targetFS.getFileChecksum(target);
    } catch (IOException e) {
      LOG.error("Unable to retrieve checksum for " + source + " or " +
target, e);
    }
    return (sourceChecksum == null || targetChecksum == null ||
            sourceChecksum.equals(targetChecksum));

Ideally I'd like to be able to configure a check that requires both the
source and target CRCs to be retrieved and compared, and if either
retrieval fails for any reason, an exception is thrown. I do appreciate
that some FileSystems cannot return CRCs, but these could still be handled
correctly as they would simply return null rather than throw an exception
(I assume).
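
To make this concrete, here is a minimal sketch of the stricter variant I
have in mind; the class name and the strict-mode wiring are hypothetical,
and this is only an illustration under my assumptions above, not a patch:

    import java.io.IOException;

    import org.apache.hadoop.fs.FileChecksum;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public final class StrictChecksums {

      /**
       * Like DistCpUtils#checksumsAreEqual, except that a failure to
       * retrieve either checksum propagates as an IOException instead of
       * being logged and treated as a successful comparison.
       */
      public static boolean checksumsAreEqual(FileSystem sourceFS,
          Path source, FileChecksum sourceChecksum, FileSystem targetFS,
          Path target) throws IOException {
        // No try/catch: if either retrieval throws, the comparison fails
        // loudly rather than reporting success.
        if (sourceChecksum == null) {
          sourceChecksum = sourceFS.getFileChecksum(source);
        }
        FileChecksum targetChecksum = targetFS.getFileChecksum(target);
        // A null checksum means the FileSystem simply does not provide
        // CRCs (FileSystem#getFileChecksum returns null by default); that
        // is not a retrieval failure, so it still counts as a pass.
        return (sourceChecksum == null || targetChecksum == null
            || sourceChecksum.equals(targetChecksum));
      }

      private StrictChecksums() {
      }
    }

This could sit behind a new DistCp option, with the current lenient
behaviour remaining the default.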

I'd appreciate any thoughts on this matter.

Elliot.

Re: DistCp CRC failure modes

Posted by Akira AJISAKA <aj...@oss.nttdata.co.jp>.
(Added hdfs-dev ML)

Thanks Elliot for reporting this issue.

I think this is not by design, so we should fix it.
Would you file a JIRA for this issue?
https://issues.apache.org/jira/browse/HDFS/

If you don't have time to do so, I'll file it on your behalf.

Regards,
Akira

On 4/27/16 22:43, Elliot West wrote:
> Hello,
>
> We are using DistCp V2 to replicate data between two HDFS file systems.
> We were working on the assumption that we could rely on CRC checks to
> ensure that the data was replicated correctly. However, after examining
> the DistCp source code it seems that there are edge cases where the CRCs
> could differ and yet the copy succeeds even when we are not skipping CRC
> checks.
>
> I'm wondering whether this is by design and, if so, what the reasoning
> behind it is. If this is a bug, I'd like to raise an issue to fix it. If
> it is by design, I'd like to propose the introduction of an option for
> stricter CRC checks.
>
> The code in question is contained in the method:
>
>     org.apache.hadoop.tools.util.DistCpUtils#checksumsAreEqual(...)
>
> which can be seen here:
>
>     https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/DistCpUtils.java#L457
>
>
> Specifically this code block suggests that if there is a failure when
> trying to read the source or target checksum then the method will return
> 'true', implying that the check succeeded. In actual fact we just failed
> to obtain the checksum and could perform no check.
>
>      try {
>        sourceChecksum = sourceChecksum != null ? sourceChecksum : sourceFS
>            .getFileChecksum(source);
>        targetChecksum = targetFS.getFileChecksum(target);
>      } catch (IOException e) {
>        LOG.error("Unable to retrieve checksum for " + source + " or " +
>            target, e);
>      }
>      return (sourceChecksum == null || targetChecksum == null ||
>              sourceChecksum.equals(targetChecksum));
>
> Ideally I'd like to be able to configure a check that requires both the
> source and target CRCs to be retrieved and compared, and if either
> retrieval fails for any reason, an exception is thrown. I do appreciate
> that some FileSystems cannot return CRCs, but these could still be
> handled correctly as they would simply return null rather than throw an
> exception (I assume).
>
> I'd appreciate any thoughts on this matter.
>
> Elliot.



Re: DistCp CRC failure modes

Posted by Akira AJISAKA <aj...@oss.nttdata.co.jp>.
Thank you, Elliot!

On 4/28/16 03:40, Elliot West wrote:
> I've raised this as an issue:
>
> https://issues.apache.org/jira/browse/HDFS-10338
>
> On Wednesday, 27 April 2016, Elliot West <teabot@gmail.com> wrote:
>
>     Hello,
>
>     We are using DistCp V2 to replicate data between two HDFS file
>     systems. We were working on the assumption that we could rely on CRC
>     checks to ensure that the data was replicated correctly. However,
>     after examining the DistCp source code it seems that there are edge
>     cases where the CRCs could differ and yet the copy succeeds even
>     when we are not skipping CRC checks.
>
>     I'm wondering whether this is by design and, if so, what the
>     reasoning behind it is. If this is a bug, I'd like to raise an issue
>     to fix it. If it is by design, I'd like to propose the introduction
>     of an option for stricter CRC checks.
>
>     The code in question is contained in the method:
>
>         org.apache.hadoop.tools.util.DistCpUtils#checksumsAreEqual(...)
>
>     which can be seen here:
>
>         https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/DistCpUtils.java#L457
>
>
>     Specifically this code block suggests that if there is a failure
>     when trying to read the source or target checksum then the method
>     will return 'true', implying that the check succeeded. In actual
>     fact we just failed to obtain the checksum and could perform no check.
>
>          try {
>            sourceChecksum = sourceChecksum != null ? sourceChecksum :
>                sourceFS.getFileChecksum(source);
>            targetChecksum = targetFS.getFileChecksum(target);
>          } catch (IOException e) {
>            LOG.error("Unable to retrieve checksum for " + source +
>                " or " + target, e);
>          }
>          return (sourceChecksum == null || targetChecksum == null ||
>                  sourceChecksum.equals(targetChecksum));
>
>     Ideally I'd like to be able to configure a check that requires both
>     the source and target CRCs to be retrieved and compared, and if
>     either retrieval fails for any reason, an exception is thrown. I do
>     appreciate that some FileSystems cannot return CRCs, but these could
>     still be handled correctly as they would simply return null rather
>     than throw an exception (I assume).
>
>     I'd appreciate any thoughts on this matter.
>
>     Elliot.
>



Re: DistCp CRC failure modes

Posted by Elliot West <te...@gmail.com>.
I've raised this as an issue:

https://issues.apache.org/jira/browse/HDFS-10338

On Wednesday, 27 April 2016, Elliot West <te...@gmail.com> wrote:

> Hello,
>
> We are using DistCp V2 to replicate data between two HDFS file systems. We
> were working on the assumption that we could rely on CRC checks to ensure
> that the data was replicated correctly. However, after examining the DistCp
> source code it seems that there are edge cases where the CRCs could differ
> and yet the copy succeeds even when we are not skipping CRC checks.
>
> I'm wondering whether this is by design and, if so, what the reasoning
> behind it is. If this is a bug, I'd like to raise an issue to fix it. If
> it is by design, I'd like to propose the introduction of an option for
> stricter CRC checks.
>
> The code in question is contained in the method:
>
> org.apache.hadoop.tools.util.DistCpUtils#checksumsAreEqual(...)
>
> which can be seen here:
>
>
> https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/DistCpUtils.java#L457
>
>
> Specifically this code block suggests that if there is a failure when
> trying to read the source or target checksum then the method will return
> 'true', implying that the check succeeded. In actual fact we just failed to
> obtain the checksum and could perform no check.
>
>     try {
>       sourceChecksum = sourceChecksum != null ? sourceChecksum : sourceFS
>           .getFileChecksum(source);
>       targetChecksum = targetFS.getFileChecksum(target);
>     } catch (IOException e) {
>       LOG.error("Unable to retrieve checksum for " + source + " or " +
>           target, e);
>     }
>     return (sourceChecksum == null || targetChecksum == null ||
>             sourceChecksum.equals(targetChecksum));
>
> Ideally I'd like to be able to configure a check that requires both the
> source and target CRCs to be retrieved and compared, and if either
> retrieval fails for any reason, an exception is thrown. I do appreciate
> that some FileSystems cannot return CRCs, but these could still be
> handled correctly as they would simply return null rather than throw an
> exception (I assume).
>
> I'd appreciate any thoughts on this matter.
>
> Elliot.
>
