Posted to common-user@hadoop.apache.org by Tamir Kamara <ta...@gmail.com> on 2009/03/15 15:24:36 UTC

Compare Files

Hi,

I have 2 files in this format:
file1: (source, target)
file2: (source)

I would like to write a MapReduce job that outputs all records in file1 whose
source isn't in file2. Example:
file1:
1,2
2,9
3,5

file2:
2
7

outcome:
1,2
3,5

Could you help me with this?

Re: Compare Files

Posted by Owen O'Malley <ow...@gmail.com>.
On Sun, Mar 15, 2009 at 7:24 AM, Tamir Kamara <ta...@gmail.com> wrote:

> Hi,
>
> I have 2 files in this format:
> file1: (source, target)
> file2: (source)


I would map file 1 like:

(source, 2), (target)

I would map file 2 like:

(source, 1), (null)

Set a partitioner and grouping comparator to compare only the first part of
the key, and sort on the full key so that the file 2 record, if any, comes
first. On each call to reduce, if the first value is null then the source is
in file 2, so discard the rest; otherwise emit them.

-- Owen
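
A minimal local sketch of the approach above, in plain Python rather than as a real Hadoop job. The function names (map_file1, map_file2, reduce_group, anti_join) are made up for illustration, and the sort plus groupby only imitate what the framework's sort and grouping comparator would do. It assumes the reducer keeps a group only when its first value is not the null marker from file 2, which is what yields the outcome asked for in the question.

```python
from itertools import groupby


def map_file1(line):
    # file1 record "source,target" -> key (source, 2), value target
    source, target = line.strip().split(",")
    return (source, 2), target


def map_file2(line):
    # file2 record "source" -> key (source, 1), value None (the null marker)
    return (line.strip(), 1), None


def reduce_group(source, values):
    # Values arrive ordered by tag, so a file2 marker (None) comes first if present.
    values = iter(values)
    first = next(values)
    if first is None:
        return  # source appears in file2: discard the whole group
    yield f"{source},{first}"
    for target in values:
        yield f"{source},{target}"


def anti_join(file1_lines, file2_lines):
    pairs = [map_file1(l) for l in file1_lines] + [map_file2(l) for l in file2_lines]
    # Stand-in for the shuffle: sort on the full composite key (source, tag) ...
    pairs.sort(key=lambda kv: kv[0])
    out = []
    # ... but group on the first part of the key only (the grouping comparator).
    for source, group in groupby(pairs, key=lambda kv: kv[0][0]):
        out.extend(reduce_group(source, (value for _, value in group)))
    return out


if __name__ == "__main__":
    for record in anti_join(["1,2", "2,9", "3,5"], ["2", "7"]):
        print(record)
```

Because the file 2 marker sorts ahead of the file 1 values, the reducer can decide after looking at a single value and never has to buffer the group.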

Re: Compare Files

Posted by Tarandeep Singh <ta...@gmail.com>.
Map: output key/value pairs as (source, file_num):
1,1
2,1
3,1
2,2
7,2

Reduce: (1, [1]), (2, [1,2]), (3, [1]), (7, [2])
Output only those keys whose list of values does not contain file 2:
1
3

-Taran
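
A rough local sketch of this tagging idea, again in plain Python instead of a real Hadoop job. The function names are hypothetical. One assumption goes beyond the message above: the file 1 value carries the target next to the file number, so the reducer can print whole records (1,2 and 3,5) rather than only the keys 1 and 3.

```python
from collections import defaultdict


def map_records(file1_lines, file2_lines):
    # file1 "source,target" -> (source, (1, target)); keeping target is an assumption
    for line in file1_lines:
        source, target = line.strip().split(",")
        yield source, (1, target)
    # file2 "source" -> (source, (2, None))
    for line in file2_lines:
        yield line.strip(), (2, None)


def shuffle(pairs):
    # Stand-in for the framework's shuffle: collect all values per key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_records(groups):
    for source in sorted(groups):
        values = groups[source]
        # Skip any key whose value list contains a file 2 tag.
        if any(file_num == 2 for file_num, _ in values):
            continue
        for _, target in values:
            yield f"{source},{target}"


def records_not_in_file2(file1_lines, file2_lines):
    return list(reduce_records(shuffle(map_records(file1_lines, file2_lines))))


if __name__ == "__main__":
    for record in records_not_in_file2(["1,2", "2,9", "3,5"], ["2", "7"]):
        print(record)
```

A reducer written this way has to look at every value for a key before it can emit anything, so it may need to buffer the file 1 values when a source has many records.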
