Posted to common-user@hadoop.apache.org by Andrey Pankov <ap...@iponweb.net> on 2008/09/04 15:51:48 UTC

Compare data on HDFS side

Hello,

Does anyone know whether it is possible to compare data on HDFS
without copying it to a local box? To find the differences between
local text files I can use the diff command. If the files are on HDFS,
I have to fetch them to the local box first and only then run diff.
Copying files to the local fs is a bit annoying and could be
problematic when the files are huge, say 2-5 GB.

Thanks in advance.

-- 
Andrey Pankov

Re: Compare data on HDFS side

Posted by Karl Anderson <kr...@monkey.org>.
On 4-Sep-08, at 6:51 AM, Andrey Pankov wrote:

> Hello,
>
> Does anyone know whether it is possible to compare data on HDFS
> without copying it to a local box? To find the differences between
> local text files I can use the diff command. If the files are on HDFS,
> I have to fetch them to the local box first and only then run diff.
> Copying files to the local fs is a bit annoying and could be
> problematic when the files are huge, say 2-5 GB.

You could always do this as a MapReduce job. The equivalent of
"diff --brief" (only reporting whether the files differ) is trivial;
actually finding the diffs is left as an exercise for the reader :)
I'm currently doing a line-oriented diff of two files where the order
of the lines is unimportant, so I just have my reducer flag lines that
show up an odd number of times.
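
A minimal sketch of that odd-count idea, run locally in plain Python
rather than on Hadoop. The function names (map_lines, reduce_odd,
unordered_line_diff) are hypothetical and only illustrate the
map/shuffle/reduce steps; this is not the actual job and uses no
Hadoop API.

```python
from collections import defaultdict
from itertools import chain


def map_lines(lines):
    """Mapper: emit (line, 1) for every input line, regardless of source file."""
    for line in lines:
        yield line.rstrip("\n"), 1


def reduce_odd(pairs):
    """Shuffle and reduce: sum counts per line, keep lines with an odd total."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return sorted(key for key, total in counts.items() if total % 2 == 1)


def unordered_line_diff(lines_a, lines_b):
    """Order-insensitive diff: lines seen an odd number of times across both inputs."""
    return reduce_odd(chain(map_lines(lines_a), map_lines(lines_b)))


if __name__ == "__main__":
    a = ["alpha\n", "beta\n", "gamma\n"]
    b = ["gamma\n", "alpha\n", "delta\n"]
    print(unordered_line_diff(a, b))  # ['beta', 'delta']
```

Note that a line occurring twice in one file and not at all in the
other has an even count and would not be flagged, so this approach
suits files whose lines are unique within each file.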


Karl Anderson
kra@monkey.org
http://monkey.org/~kra