Posted to common-user@hadoop.apache.org by Andrey Pankov <ap...@iponweb.net> on 2008/09/04 15:51:48 UTC
Compare data on HDFS side
Hello,
Does anyone know whether it is possible to compare data on HDFS without
copying it to a local box? To find the differences between local text
files I can use the diff command. If the files are on HDFS, I have to
fetch them to the local box first and only then run diff. Copying files
to the local fs is a bit annoying and could be problematic when the
files are huge, say 2-5 GB.
Thanks in advance.
--
Andrey Pankov
Re: Compare data on HDFS side
Posted by Karl Anderson <kr...@monkey.org>.
On 4-Sep-08, at 6:51 AM, Andrey Pankov wrote:
> Hello,
>
> Does anyone know whether it is possible to compare data on HDFS without
> copying it to a local box? To find the differences between local text
> files I can use the diff command. If the files are on HDFS, I have to
> fetch them to the local box first and only then run diff. Copying files
> to the local fs is a bit annoying and could be problematic when the
> files are huge, say 2-5 GB.
You could always do this as a MapReduce job. "diff --brief" is
trivial; actually finding the diffs is left as an exercise for the
reader :) I'm currently doing a line-oriented diff of two files where
the order of the lines is unimportant, so I just have my reducer flag
lines that show up an odd number of times.
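
A rough sketch of that odd-count idea, simulated in plain Python rather
than written against any Hadoop API. The function names (map_lines,
reduce_odd, odd_lines) are made up for illustration, and the in-memory
dictionary stands in for the shuffle that the framework would do between
the map and reduce phases:

```python
from collections import defaultdict
from itertools import chain


def map_lines(lines):
    """Map step: emit (line, 1) for every input line, newline stripped."""
    for line in lines:
        yield line.rstrip("\n"), 1


def reduce_odd(pairs):
    """Group by line, sum the counts, keep lines with an odd total."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return sorted(key for key, total in counts.items() if total % 2 == 1)


def odd_lines(file_a, file_b):
    """Order-insensitive diff: lines seen an odd number of times overall."""
    return reduce_odd(chain(map_lines(file_a), map_lines(file_b)))


if __name__ == "__main__":
    a = ["alpha\n", "beta\n", "gamma\n"]
    b = ["gamma\n", "alpha\n", "delta\n"]
    print(odd_lines(a, b))  # prints ['beta', 'delta']
```

Note the limitation of counting parity only: a line that appears twice in
one file and not at all in the other has an even total and is not
flagged. Tagging each emitted line with its source file and comparing
per-file counts in the reducer would be one way around that.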
Karl Anderson
kra@monkey.org
http://monkey.org/~kra