Posted to common-user@hadoop.apache.org by Joseph Stein <cr...@gmail.com> on 2010/03/25 19:09:07 UTC

DeDuplication Techniques

I have been researching ways to handle de-duping data while running a
map/reduce program (so as to not re-calculate/re-aggregate data that
we have seen before [possibly months before]).

The data sets we have are littered with repeats of data from mobile
devices which continue to come in over time (so we may see duplicates
of data re-posted months after it was originally posted...)

I have 2 ways so far I can go about it (one way I do in production
without Hadoop) and am interested to see if others have faced/solved
this in Hadoop/HDFS and what their experience might be.

1) handle my own hash filter (where I continually store and look up a
hash (MD5, Bloom, whatever) of the data I am aggregating on to check
whether it exists already).  We do this now without Hadoop; perhaps a
variant can be ported into HDFS as a map task, reducing the results to
files and storing the hash table (maybe in Hive or something, dunno yet)
2) push the data into Cassandra (our NoSQL solution of choice) and let
that hash/map system do it for us.  As I get more into Hadoop, looking
at HBase is tempting, but then it's just one more thing to learn.

I would really like to not have to reinvent a wheel here and would even
contribute if something is going on, as it is a use case in our work
effort.

Thanx in advance =8^)  Apologies, I posted this on common-dev yesterday
by accident (so this is not repost spam but appropriate for this
list)

Cheers.

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/

Re: DeDuplication Techniques

Posted by Owen O'Malley <om...@apache.org>.
On Mar 25, 2010, at 11:09 AM, Joseph Stein wrote:

> I have been researching ways to handle de-dupping data while running a
> map/reduce program (so as to not re-calculate/re-aggregate data that
> we have seen before[possibly months before]).

So roughly, your problem is that you have large amounts of historic  
data and you need to merge in the current month. The best solution  
that I've seen looks like:

Keep your historic data sorted by md5.

Run a MapReduce job to sort your new data into md5 order. Note that
you want a total order, but because the md5's are evenly distributed
across the key space this is easy. Basically, you pick a number of
reduces (e.g. 256) and then use the top N bits of the MD5 to pick your
reduce. Since this job is only processing your new data, it is very fast.
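
A minimal sketch of such a partitioner (an illustration, not code from
this thread), assuming the sort key is the raw 16-byte MD5 carried in a
BytesWritable and using the newer mapreduce API; the class name is made up:

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each record to a reducer by the top bits of its raw MD5 key, so
// each reducer's output covers a contiguous, evenly sized MD5 range
// (with exactly 256 reduces the first byte alone selects the reducer).
public class Md5RangePartitioner extends Partitioner<BytesWritable, Writable> {
  @Override
  public int getPartition(BytesWritable md5Key, Writable value, int numReduces) {
    int topByte = md5Key.getBytes()[0] & 0xFF;   // most significant MD5 byte
    return topByte * numReduces / 256;           // scale into [0, numReduces)
  }
}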

Next you do a map-side join where each input split consists of an md5
range. The RecordReader reads from the historic and new datasets,
merging them as it goes. (You can use the map-side join library for
this.) Your map does the merge of the new and old. This is a map-only
job, so it also is very fast.
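
For the map-side join step, a rough configuration sketch using the old
0.20-era "mapred" join library. Both inputs must already be sorted and
identically partitioned by MD5; the paths and class name are placeholders:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class DedupMergeJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(DedupMergeJob.class);
    // CompositeInputFormat merges the two sorted datasets split-by-split
    // and hands the mapper joined TupleWritable values.
    conf.setInputFormat(CompositeInputFormat.class);
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "outer", SequenceFileInputFormat.class,
        new Path("/warehouse/historic/md5-sorted"),
        new Path("/warehouse/incoming/md5-sorted")));
    conf.setNumReduceTasks(0);   // map-only merge, as described above
    // ... set the mapper/output format and submit with JobClient.runJob(conf)
  }
}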

Of course if the new data is small enough you can read all of the new  
input in each of the maps and just keep (and sort in ram) the new  
records that are in the right range and do the merge from ram. This  
lets you avoid the step where you sort the new data. This kind of  
merge optimization is where Pig and Hive hide lots of the details from  
the developer.

-- Owen

Re: DeDuplication Techniques

Posted by Joseph Stein <cr...@gmail.com>.
Thanks everyone.

I think we are going to go with HBase and use the hash map structures
there to keep uniqueness (table.exists(value)) on our values, and see
how it goes.

Appreciate the insights. I do like (being a developer) the extended
sort and join ideas in MR and will likely use them for other things.
It just seems like a lot of work to get to the point of executing our
business logic (time/resources, etc).

As the old saying goes, "sometimes you only need chopsticks to catch a fly" =8^)

Thanks again!!!

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/

On Fri, Mar 26, 2010 at 3:26 PM, Jeyendran Balakrishnan
<jb...@docomolabs-usa.com> wrote:
> Joe,
>
> This is what I use for a related problem, using pure HDFS [no HBase]:
>
> 1. Run a one-time map-reduce job where you input your current historical file of hashes [say it is of the format <hash-key, hash-value> in some kind of flat file] using IdentityMapper and the output of your custom reducer is <key, value> is <hash-key, hash-value> or maybe even <hash-key, dummy value> to save space. The important thing is to use MapFileOutputFormat for the reducer output instead of the typical SequenceFileOutputFormat. Now you have a single look-up table which you use for efficient lookup using your hash keys.
> Note down the HDFS path of where you stored this mapfile, call it dedupMapFile.
>
> 2. In your incremental data update job, pass the HDFS path of dedupMapFile to your conf, then open the mapfile in your reducer configure(), store the reference to the mapfile in the class, and close it in close().
> Inside your reduce(), use the mapfile reference to lookup your hashkey; if there is a hit, it is a dup.
>
> 3. Also, for your reducer in 2. above, you can use a multiple output format custom format, in which one of the outputs is your current output, and the other is a new dedup output sequencefile which is in the same key-value format as the dedupMapFile. So in the reduce() if the current key value is a dup, discard it, else output to both your regular output, and the new dedup output.
>
> 4. After each incremental update job, run a new map reduce job [IdentityMapper and IdentityReducer] to merge the new dedup file with your old dedupMapFile, resulting in the updated dedupMapFile.
>
> Some comments:
> * I didn't read your approach too closely, so I suspect you might be doing something essentially like this already.
> * All this stuff is basically what HBase does for free, where your dedupMapFile is now a HBase table, and you don't have to run Step 4, since you can just write new [non-duplicate] hash-keys to the HBase table in Step 3, and in Step 2, you just use table.exists(hash-key) to check if it is a dup. You still need Step 1 to populate the table with your historical data.
>
> Hope this helps....
>
> Cheers,
> jp
>
>
> -----Original Message-----
> From: Joseph Stein [mailto:cryptcom@gmail.com]
> Sent: Thursday, March 25, 2010 11:35 AM
> To: common-user@hadoop.apache.org
> Subject: Re: DeDuplication Techniques
>
> The thing is I have to check historic data (meaning data I have
> already aggregated against) so I basically need to hold and read from
> a file of hashes.
>
> So within the current data set yes this would work but I then have to
> open a file, loop through the value, see it is not there.
>
> If it is there then throw it out, if not there add it to the end.
>
> To me this opening a file checking for dups is a map/reduce task in itself.
>
> What I was thinking is having my mapper take the data I wasn to
> validate as unique.  I then loop through the files filters.  each data
> point has a key that then allows me to get the file that has it's
> data. e.g. a part of the data partions the hash of the data so each
> file holds.  So my map job takes the data and breaks it into the
> key/value pair (the key allows me to look up my filter file).
>
> When it gets to the reducer... the key is the file I open up, I then
> open the file... loop through it... if it is there throw the data
> away.  if it is not there then add the hash of my data to the filter
> file and then output (as the reduce output) the value of the unique.
>
> This output of the unique is then the data I aggregate on which also
> updated my historic filter so the next job (5 minutes later) see it,
> etc.
>
> On Thu, Mar 25, 2010 at 2:25 PM, Mark Kerzner <ma...@gmail.com> wrote:
>> Joe,
>>
>> what about this approach:
>>
>> using hashmap values as your keys in MR maps. Since they are sorted by keys,
>> in reducer you will get all duplicates together, so that you can loop
>> through them. As the simplest solution, you just take the first one.
>>
>> Sincerely,
>> Mark
>>
>> On Thu, Mar 25, 2010 at 1:09 PM, Joseph Stein <cr...@gmail.com> wrote:
>>
>>> I have been researching ways to handle de-dupping data while running a
>>> map/reduce program (so as to not re-calculate/re-aggregate data that
>>> we have seen before[possibly months before]).
>>>
>>> The data sets we have are littered with repeats of data from mobile
>>> devices which continue to come in over time (so we may see duplicates
>>> of data re-posted months after it originally posted...)
>>>
>>> I have 2 ways so far I can go about it (one way I do in production
>>> without Hadoop) and interested to see if others have faced/solved this
>>> in Hadoop/HDFS and what their experience might be.
>>>
>>> 1) handle my own hash filter (where I continually store and look up a
>>> hash (MD5, bloom, whatever) of the data I am aggregating on as
>>> existing already).  We do this now without Hadoop perhaps a variant
>>> can be ported into HDFS as map task, reducing the results to files and
>>> restoring the hash table (maybe in Hive or something, dunno yet)
>>> 2) push the data into Cassandra (our NoSQL solution of choice) and let
>>> that hash/map system do it for us.   As I get more into Hadoop looking
>>> at HBas is tempting but then just one more thing to learn.
>>>
>>> I would really like to not have to reinvent a wheel here and even
>>> contribute if something is going on as it is a use case in our work
>>> effort.
>>>
>>> Thanx in advance =8^)  Apologize I posted this on common dev yesterday
>>> by accident (so this is not a repost spam but appropriate for this
>>> list)
>>>
>>> Cheers.
>>>
>>> /*
>>> Joe Stein
>>> http://www.linkedin.com/in/charmalloc
>>> */
>>>
>>
>
>
>
> --
> /*
> Joe Stein
> http://www.linkedin.com/in/charmalloc
> */
>

RE: DeDuplication Techniques

Posted by Jeyendran Balakrishnan <jb...@docomolabs-usa.com>.
Joe,

This is what I use for a related problem, using pure HDFS [no HBase]:

1. Run a one-time map-reduce job where you input your current historical file of hashes [say it is of the format <hash-key, hash-value> in some kind of flat file] using IdentityMapper, and the output of your custom reducer is <hash-key, hash-value> or maybe even <hash-key, dummy value> to save space. The important thing is to use MapFileOutputFormat for the reducer output instead of the typical SequenceFileOutputFormat. Now you have a single look-up table which you can use for efficient lookups by hash key.
Note down the HDFS path where you stored this mapfile; call it dedupMapFile.

2. In your incremental data update job, pass the HDFS path of dedupMapFile in your conf, then open the mapfile in your reducer's configure(), store the reference to the mapfile in the class, and close it in close().
Inside your reduce(), use the mapfile reference to look up your hash key; if there is a hit, it is a dup. (A sketch of this reducer follows after step 4.)

3. Also, for your reducer in step 2 above, you can use a custom multiple-output format, in which one of the outputs is your current output, and the other is a new dedup output sequencefile in the same key-value format as the dedupMapFile. So in reduce(), if the current key is a dup, discard it; else output it to both your regular output and the new dedup output.

4. After each incremental update job, run a new map-reduce job [IdentityMapper and IdentityReducer] to merge the new dedup file with your old dedupMapFile, resulting in the updated dedupMapFile.
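
A hedged sketch of the reducer described in step 2, using the old "mapred"
API. The key/value types, the "dedup.mapfile.path" property name, and the
class name are assumptions for illustration only:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class DedupLookupReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private MapFile.Reader dedupReader;
  private final Text dummy = new Text();

  @Override
  public void configure(JobConf job) {
    try {
      // Opened once per task, as described in step 2.
      dedupReader = new MapFile.Reader(FileSystem.get(job),
          job.get("dedup.mapfile.path"), job);
    } catch (IOException e) {
      throw new RuntimeException("could not open dedupMapFile", e);
    }
  }

  public void reduce(Text hashKey, Iterator<Text> values,
                     OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    // A hit in the historical MapFile means this hash key is a dup.
    if (dedupReader.get(hashKey, dummy) == null) {
      out.collect(hashKey, values.next());   // keep only unseen records
    }
  }

  @Override
  public void close() throws IOException {
    dedupReader.close();
  }
}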

Some comments:
* I didn't read your approach too closely, so I suspect you might be doing something essentially like this already.
* All this stuff is basically what HBase does for free, where your dedupMapFile is now an HBase table, and you don't have to run Step 4, since you can just write new [non-duplicate] hash keys to the HBase table in Step 3, and in Step 2 you just use table.exists(hash-key) to check if it is a dup. You still need Step 1 to populate the table with your historical data.

Hope this helps....

Cheers,
jp


-----Original Message-----
From: Joseph Stein [mailto:cryptcom@gmail.com] 
Sent: Thursday, March 25, 2010 11:35 AM
To: common-user@hadoop.apache.org
Subject: Re: DeDuplication Techniques

The thing is I have to check historic data (meaning data I have
already aggregated against) so I basically need to hold and read from
a file of hashes.

So within the current data set yes this would work but I then have to
open a file, loop through the value, see it is not there.

If it is there then throw it out, if not there add it to the end.

To me this opening a file checking for dups is a map/reduce task in itself.

What I was thinking is having my mapper take the data I wasn to
validate as unique.  I then loop through the files filters.  each data
point has a key that then allows me to get the file that has it's
data. e.g. a part of the data partions the hash of the data so each
file holds.  So my map job takes the data and breaks it into the
key/value pair (the key allows me to look up my filter file).

When it gets to the reducer... the key is the file I open up, I then
open the file... loop through it... if it is there throw the data
away.  if it is not there then add the hash of my data to the filter
file and then output (as the reduce output) the value of the unique.

This output of the unique is then the data I aggregate on which also
updated my historic filter so the next job (5 minutes later) see it,
etc.

On Thu, Mar 25, 2010 at 2:25 PM, Mark Kerzner <ma...@gmail.com> wrote:
> Joe,
>
> what about this approach:
>
> using hashmap values as your keys in MR maps. Since they are sorted by keys,
> in reducer you will get all duplicates together, so that you can loop
> through them. As the simplest solution, you just take the first one.
>
> Sincerely,
> Mark
>
> On Thu, Mar 25, 2010 at 1:09 PM, Joseph Stein <cr...@gmail.com> wrote:
>
>> I have been researching ways to handle de-dupping data while running a
>> map/reduce program (so as to not re-calculate/re-aggregate data that
>> we have seen before[possibly months before]).
>>
>> The data sets we have are littered with repeats of data from mobile
>> devices which continue to come in over time (so we may see duplicates
>> of data re-posted months after it originally posted...)
>>
>> I have 2 ways so far I can go about it (one way I do in production
>> without Hadoop) and interested to see if others have faced/solved this
>> in Hadoop/HDFS and what their experience might be.
>>
>> 1) handle my own hash filter (where I continually store and look up a
>> hash (MD5, bloom, whatever) of the data I am aggregating on as
>> existing already).  We do this now without Hadoop perhaps a variant
>> can be ported into HDFS as map task, reducing the results to files and
>> restoring the hash table (maybe in Hive or something, dunno yet)
>> 2) push the data into Cassandra (our NoSQL solution of choice) and let
>> that hash/map system do it for us.   As I get more into Hadoop looking
>> at HBas is tempting but then just one more thing to learn.
>>
>> I would really like to not have to reinvent a wheel here and even
>> contribute if something is going on as it is a use case in our work
>> effort.
>>
>> Thanx in advance =8^)  Apologize I posted this on common dev yesterday
>> by accident (so this is not a repost spam but appropriate for this
>> list)
>>
>> Cheers.
>>
>> /*
>> Joe Stein
>> http://www.linkedin.com/in/charmalloc
>> */
>>
>



-- 
/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/

RE: DeDuplication Techniques

Posted by Michael Segel <mi...@hotmail.com>.
Joe,

You know you mentioned HBase, but have you thought about it?

This is actually a simple thing to do in HBase, because in your map/reduce you hash your key and then check to see if HBase already has that key. (You make your HBase connection in setup() so you don't constantly open/close connections... [Sorry, it's a common mistake...])
Actually, reading from HDFS and writing to HBase is pretty simple in the 20.x stuff.

Also, I tend to like SHA-1 hashes for the keys. There's enough documentation to show that a SHA-1 hash is almost guaranteed to be unique.

There are some variations on this, but it's pretty fast.

Of course I'm assuming that you don't want to dedup against just the current job, but may want to dedupe against prior jobs as well?
Even if it's for the same job and the table is temporary, it may be faster...
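
An illustrative sketch of that map-side HBase check, against the 0.20-era
client API. The table name "dedup", the column family/qualifier, and the
SHA-1 helper are made up for the example:

import java.io.IOException;
import java.security.MessageDigest;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HBaseDedupMapper extends Mapper<LongWritable, Text, Text, Text> {

  private HTable dedupTable;

  @Override
  protected void setup(Context context) throws IOException {
    // One connection per task, made in setup() -- not per record.
    dedupTable = new HTable(
        new HBaseConfiguration(context.getConfiguration()), "dedup");
  }

  @Override
  protected void map(LongWritable offset, Text record, Context context)
      throws IOException, InterruptedException {
    String hash = sha1Hex(record.toString());
    byte[] rowKey = Bytes.toBytes(hash);
    if (!dedupTable.exists(new Get(rowKey))) {
      // First time this record has been seen: remember it, pass it through.
      Put marker = new Put(rowKey);   // assumes a column family "d" exists
      marker.add(Bytes.toBytes("d"), Bytes.toBytes("seen"), Bytes.toBytes(1L));
      dedupTable.put(marker);
      context.write(new Text(hash), record);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    dedupTable.flushCommits();   // make sure buffered puts reach the table
  }

  private static String sha1Hex(String s) {
    try {
      byte[] d = MessageDigest.getInstance("SHA-1").digest(s.getBytes("UTF-8"));
      StringBuilder hex = new StringBuilder();
      for (byte b : d) hex.append(String.format("%02x", b));
      return hex.toString();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}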

-Mike

> Date: Thu, 25 Mar 2010 14:35:23 -0400
> Subject: Re: DeDuplication Techniques
> From: cryptcom@gmail.com
> To: common-user@hadoop.apache.org
> 
> The thing is I have to check historic data (meaning data I have
> already aggregated against) so I basically need to hold and read from
> a file of hashes.
> 
> So within the current data set yes this would work but I then have to
> open a file, loop through the value, see it is not there.
> 
> If it is there then throw it out, if not there add it to the end.
> 
> To me this opening a file checking for dups is a map/reduce task in itself.
> 
> What I was thinking is having my mapper take the data I wasn to
> validate as unique.  I then loop through the files filters.  each data
> point has a key that then allows me to get the file that has it's
> data. e.g. a part of the data partions the hash of the data so each
> file holds.  So my map job takes the data and breaks it into the
> key/value pair (the key allows me to look up my filter file).
> 
> When it gets to the reducer... the key is the file I open up, I then
> open the file... loop through it... if it is there throw the data
> away.  if it is not there then add the hash of my data to the filter
> file and then output (as the reduce output) the value of the unique.
> 
> This output of the unique is then the data I aggregate on which also
> updated my historic filter so the next job (5 minutes later) see it,
> etc.
> 
> On Thu, Mar 25, 2010 at 2:25 PM, Mark Kerzner <ma...@gmail.com> wrote:
> > Joe,
> >
> > what about this approach:
> >
> > using hashmap values as your keys in MR maps. Since they are sorted by keys,
> > in reducer you will get all duplicates together, so that you can loop
> > through them. As the simplest solution, you just take the first one.
> >
> > Sincerely,
> > Mark
> >
> > On Thu, Mar 25, 2010 at 1:09 PM, Joseph Stein <cr...@gmail.com> wrote:
> >
> >> I have been researching ways to handle de-dupping data while running a
> >> map/reduce program (so as to not re-calculate/re-aggregate data that
> >> we have seen before[possibly months before]).
> >>
> >> The data sets we have are littered with repeats of data from mobile
> >> devices which continue to come in over time (so we may see duplicates
> >> of data re-posted months after it originally posted...)
> >>
> >> I have 2 ways so far I can go about it (one way I do in production
> >> without Hadoop) and interested to see if others have faced/solved this
> >> in Hadoop/HDFS and what their experience might be.
> >>
> >> 1) handle my own hash filter (where I continually store and look up a
> >> hash (MD5, bloom, whatever) of the data I am aggregating on as
> >> existing already).  We do this now without Hadoop perhaps a variant
> >> can be ported into HDFS as map task, reducing the results to files and
> >> restoring the hash table (maybe in Hive or something, dunno yet)
> >> 2) push the data into Cassandra (our NoSQL solution of choice) and let
> >> that hash/map system do it for us.   As I get more into Hadoop looking
> >> at HBas is tempting but then just one more thing to learn.
> >>
> >> I would really like to not have to reinvent a wheel here and even
> >> contribute if something is going on as it is a use case in our work
> >> effort.
> >>
> >> Thanx in advance =8^)  Apologize I posted this on common dev yesterday
> >> by accident (so this is not a repost spam but appropriate for this
> >> list)
> >>
> >> Cheers.
> >>
> >> /*
> >> Joe Stein
> >> http://www.linkedin.com/in/charmalloc
> >> */
> >>
> >
> 
> 
> 
> -- 
> /*
> Joe Stein
> http://www.linkedin.com/in/charmalloc
> */
 		 	   		  

Re: DeDuplication Techniques

Posted by Mark Kerzner <ma...@gmail.com>.
Joe,

your approach would work, whether you use files to keep the old data, or a
database. However, it feels like a mix of new and old technologies. It just
does not feel right to open a file to do just one comparison, and close it
again. Even if you keep it open and do searches there, and even if you
optimize with some quick binary searches, you are still not using Hadoop at
this point.

What if you made your deduping a follow-up Hadoop job, where you open all
historical results at once, and treat this as one large Hadoop job - large
in the sense that it reads more data than the current dataset - but the
processing requirements here are not huge. Then you delegate the processing
side of it back to Hadoop, with its sort which is already built in.

Another advantage of this approach is that you may have different
requirements for deduping at the next stage, so if your Hadoop deduping
is separate, you can accommodate that. For example, in my case I may have
to dedupe within specific document groups, or across the board, and you
may have something similar.

I will be curious to know which approach you finally select.

Mark

On Thu, Mar 25, 2010 at 1:35 PM, Joseph Stein <cr...@gmail.com> wrote:

> The thing is I have to check historic data (meaning data I have
> already aggregated against) so I basically need to hold and read from
> a file of hashes.
>
> So within the current data set yes this would work but I then have to
> open a file, loop through the value, see it is not there.
>
> If it is there then throw it out, if not there add it to the end.
>
> To me this opening a file checking for dups is a map/reduce task in itself.
>
> What I was thinking is having my mapper take the data I wasn to
> validate as unique.  I then loop through the files filters.  each data
> point has a key that then allows me to get the file that has it's
> data. e.g. a part of the data partions the hash of the data so each
> file holds.  So my map job takes the data and breaks it into the
> key/value pair (the key allows me to look up my filter file).
>
> When it gets to the reducer... the key is the file I open up, I then
> open the file... loop through it... if it is there throw the data
> away.  if it is not there then add the hash of my data to the filter
> file and then output (as the reduce output) the value of the unique.
>
> This output of the unique is then the data I aggregate on which also
> updated my historic filter so the next job (5 minutes later) see it,
> etc.
>
> On Thu, Mar 25, 2010 at 2:25 PM, Mark Kerzner <ma...@gmail.com>
> wrote:
> > Joe,
> >
> > what about this approach:
> >
> > using hashmap values as your keys in MR maps. Since they are sorted by
> keys,
> > in reducer you will get all duplicates together, so that you can loop
> > through them. As the simplest solution, you just take the first one.
> >
> > Sincerely,
> > Mark
> >
> > On Thu, Mar 25, 2010 at 1:09 PM, Joseph Stein <cr...@gmail.com>
> wrote:
> >
> >> I have been researching ways to handle de-dupping data while running a
> >> map/reduce program (so as to not re-calculate/re-aggregate data that
> >> we have seen before[possibly months before]).
> >>
> >> The data sets we have are littered with repeats of data from mobile
> >> devices which continue to come in over time (so we may see duplicates
> >> of data re-posted months after it originally posted...)
> >>
> >> I have 2 ways so far I can go about it (one way I do in production
> >> without Hadoop) and interested to see if others have faced/solved this
> >> in Hadoop/HDFS and what their experience might be.
> >>
> >> 1) handle my own hash filter (where I continually store and look up a
> >> hash (MD5, bloom, whatever) of the data I am aggregating on as
> >> existing already).  We do this now without Hadoop perhaps a variant
> >> can be ported into HDFS as map task, reducing the results to files and
> >> restoring the hash table (maybe in Hive or something, dunno yet)
> >> 2) push the data into Cassandra (our NoSQL solution of choice) and let
> >> that hash/map system do it for us.   As I get more into Hadoop looking
> >> at HBas is tempting but then just one more thing to learn.
> >>
> >> I would really like to not have to reinvent a wheel here and even
> >> contribute if something is going on as it is a use case in our work
> >> effort.
> >>
> >> Thanx in advance =8^)  Apologize I posted this on common dev yesterday
> >> by accident (so this is not a repost spam but appropriate for this
> >> list)
> >>
> >> Cheers.
> >>
> >> /*
> >> Joe Stein
> >> http://www.linkedin.com/in/charmalloc
> >> */
> >>
> >
>
>
>
> --
> /*
> Joe Stein
> http://www.linkedin.com/in/charmalloc
> */
>

Re: DeDuplication Techniques

Posted by "Ankur C. Goel" <ga...@yahoo-inc.com>.
Yes, that is the next logical step in performance optimization :-)
When you have no historical data to begin with, it makes little difference.
However, as the volume of historical data grows with time, the gains become
more evident.

As for sort performance, only the new data needs sorting, since the
historical data is already kept sorted, so performance shouldn't be a problem.

So the steps become like this.

First M/R job -> Sort new data.
Second Map-only  job -> Merge sorted historical data with sorted new data.

-@nkur

On 3/26/10 3:35 PM, "Andrew Klochkov" <di...@gmail.com> wrote:

Would it be a good optimization to have historical data (stored in HDFS)
sorted by the primary key, and also sort new data before joining? I guess in
this case join can be performed more effective (in a InputFormat
implementation), avoiding sort/shuffle/copy-to-reducer part.
The only double I have is how fast can data be sorted, wouldn't it kill all
the optimization.

On Fri, Mar 26, 2010 at 7:20 AM, Ankur C. Goel <ga...@yahoo-inc.com> wrote:

> The kind of need to you specified is quite common in ETL style of
> processing. The fastest and most efficient way to do this is when you have
> all your historical data in HDFS itself. In this case you can do a LEFT
> outer join between the two datasets (assuming new data is your left
> relation) in map-reduce without querying a database or any other persistent
> store. Then you would keep only the records which have fields from the left
> relation and NOT the right relation (historical data).
>
> A join can be easily implemented in map-reduce using the secondary sort
> trick. Basically you can specify different mappers for different input data
> in the same M/R job and in each mapper tag the record key with relation ids
> (0, 1...). This makes sure that records from one relation for matching key
> appear before the record of other relation in reducer. You then cache them
> in memory and do a cross of this with each record of the new relation you
> see.
> This might sound more complicated then it really is. Hadoop has sample code
> under examples for secondary sort but no code for join.
>
> Another option is to use a high level languages like Pig or HIVE that
> provide join operations and also expose extensions to take care of skew in
> data i.e data getting divided unevenly die to few keys having majority of
> records. This is the simplest and quickest (in terms of developer
> productivity) IMO.
>
> Regards
> -@nkur
>
>
> On 3/26/10 12:05 AM, "Joseph Stein" <cr...@gmail.com> wrote:
>
> The thing is I have to check historic data (meaning data I have
> already aggregated against) so I basically need to hold and read from
> a file of hashes.
>
> So within the current data set yes this would work but I then have to
> open a file, loop through the value, see it is not there.
>
> If it is there then throw it out, if not there add it to the end.
>
> To me this opening a file checking for dups is a map/reduce task in itself.
>
> What I was thinking is having my mapper take the data I wasn to
> validate as unique.  I then loop through the files filters.  each data
> point has a key that then allows me to get the file that has it's
> data. e.g. a part of the data partions the hash of the data so each
> file holds.  So my map job takes the data and breaks it into the
> key/value pair (the key allows me to look up my filter file).
>
> When it gets to the reducer... the key is the file I open up, I then
> open the file... loop through it... if it is there throw the data
> away.  if it is not there then add the hash of my data to the filter
> file and then output (as the reduce output) the value of the unique.
>
> This output of the unique is then the data I aggregate on which also
> updated my historic filter so the next job (5 minutes later) see it,
> etc.
>
> On Thu, Mar 25, 2010 at 2:25 PM, Mark Kerzner <ma...@gmail.com>
> wrote:
> > Joe,
> >
> > what about this approach:
> >
> > using hashmap values as your keys in MR maps. Since they are sorted by
> keys,
> > in reducer you will get all duplicates together, so that you can loop
> > through them. As the simplest solution, you just take the first one.
> >
> > Sincerely,
> > Mark
> >
> > On Thu, Mar 25, 2010 at 1:09 PM, Joseph Stein <cr...@gmail.com>
> wrote:
> >
> >> I have been researching ways to handle de-dupping data while running a
> >> map/reduce program (so as to not re-calculate/re-aggregate data that
> >> we have seen before[possibly months before]).
> >>
> >> The data sets we have are littered with repeats of data from mobile
> >> devices which continue to come in over time (so we may see duplicates
> >> of data re-posted months after it originally posted...)
> >>
> >> I have 2 ways so far I can go about it (one way I do in production
> >> without Hadoop) and interested to see if others have faced/solved this
> >> in Hadoop/HDFS and what their experience might be.
> >>
> >> 1) handle my own hash filter (where I continually store and look up a
> >> hash (MD5, bloom, whatever) of the data I am aggregating on as
> >> existing already).  We do this now without Hadoop perhaps a variant
> >> can be ported into HDFS as map task, reducing the results to files and
> >> restoring the hash table (maybe in Hive or something, dunno yet)
> >> 2) push the data into Cassandra (our NoSQL solution of choice) and let
> >> that hash/map system do it for us.   As I get more into Hadoop looking
> >> at HBas is tempting but then just one more thing to learn.
> >>
> >> I would really like to not have to reinvent a wheel here and even
> >> contribute if something is going on as it is a use case in our work
> >> effort.
> >>
> >> Thanx in advance =8^)  Apologize I posted this on common dev yesterday
> >> by accident (so this is not a repost spam but appropriate for this
> >> list)
> >>
> >> Cheers.
> >>
> >> /*
> >> Joe Stein
> >> http://www.linkedin.com/in/charmalloc
> >> */
> >>
> >
>
>
>
> --
> /*
> Joe Stein
> http://www.linkedin.com/in/charmalloc
> */
>
>


--
Andrew Klochkov


Re: DeDuplication Techniques

Posted by Andrew Klochkov <di...@gmail.com>.
Would it be a good optimization to have the historical data (stored in HDFS)
sorted by the primary key, and also sort new data before joining? I guess in
this case the join can be performed more efficiently (in an InputFormat
implementation), avoiding the sort/shuffle/copy-to-reducer part.
The only doubt I have is how fast the new data can be sorted - wouldn't that
kill the whole optimization?

On Fri, Mar 26, 2010 at 7:20 AM, Ankur C. Goel <ga...@yahoo-inc.com> wrote:

> The kind of need to you specified is quite common in ETL style of
> processing. The fastest and most efficient way to do this is when you have
> all your historical data in HDFS itself. In this case you can do a LEFT
> outer join between the two datasets (assuming new data is your left
> relation) in map-reduce without querying a database or any other persistent
> store. Then you would keep only the records which have fields from the left
> relation and NOT the right relation (historical data).
>
> A join can be easily implemented in map-reduce using the secondary sort
> trick. Basically you can specify different mappers for different input data
> in the same M/R job and in each mapper tag the record key with relation ids
> (0, 1...). This makes sure that records from one relation for matching key
> appear before the record of other relation in reducer. You then cache them
> in memory and do a cross of this with each record of the new relation you
> see.
> This might sound more complicated then it really is. Hadoop has sample code
> under examples for secondary sort but no code for join.
>
> Another option is to use a high level languages like Pig or HIVE that
> provide join operations and also expose extensions to take care of skew in
> data i.e data getting divided unevenly die to few keys having majority of
> records. This is the simplest and quickest (in terms of developer
> productivity) IMO.
>
> Regards
> -@nkur
>
>
> On 3/26/10 12:05 AM, "Joseph Stein" <cr...@gmail.com> wrote:
>
> The thing is I have to check historic data (meaning data I have
> already aggregated against) so I basically need to hold and read from
> a file of hashes.
>
> So within the current data set yes this would work but I then have to
> open a file, loop through the value, see it is not there.
>
> If it is there then throw it out, if not there add it to the end.
>
> To me this opening a file checking for dups is a map/reduce task in itself.
>
> What I was thinking is having my mapper take the data I wasn to
> validate as unique.  I then loop through the files filters.  each data
> point has a key that then allows me to get the file that has it's
> data. e.g. a part of the data partions the hash of the data so each
> file holds.  So my map job takes the data and breaks it into the
> key/value pair (the key allows me to look up my filter file).
>
> When it gets to the reducer... the key is the file I open up, I then
> open the file... loop through it... if it is there throw the data
> away.  if it is not there then add the hash of my data to the filter
> file and then output (as the reduce output) the value of the unique.
>
> This output of the unique is then the data I aggregate on which also
> updated my historic filter so the next job (5 minutes later) see it,
> etc.
>
> On Thu, Mar 25, 2010 at 2:25 PM, Mark Kerzner <ma...@gmail.com>
> wrote:
> > Joe,
> >
> > what about this approach:
> >
> > using hashmap values as your keys in MR maps. Since they are sorted by
> keys,
> > in reducer you will get all duplicates together, so that you can loop
> > through them. As the simplest solution, you just take the first one.
> >
> > Sincerely,
> > Mark
> >
> > On Thu, Mar 25, 2010 at 1:09 PM, Joseph Stein <cr...@gmail.com>
> wrote:
> >
> >> I have been researching ways to handle de-dupping data while running a
> >> map/reduce program (so as to not re-calculate/re-aggregate data that
> >> we have seen before[possibly months before]).
> >>
> >> The data sets we have are littered with repeats of data from mobile
> >> devices which continue to come in over time (so we may see duplicates
> >> of data re-posted months after it originally posted...)
> >>
> >> I have 2 ways so far I can go about it (one way I do in production
> >> without Hadoop) and interested to see if others have faced/solved this
> >> in Hadoop/HDFS and what their experience might be.
> >>
> >> 1) handle my own hash filter (where I continually store and look up a
> >> hash (MD5, bloom, whatever) of the data I am aggregating on as
> >> existing already).  We do this now without Hadoop perhaps a variant
> >> can be ported into HDFS as map task, reducing the results to files and
> >> restoring the hash table (maybe in Hive or something, dunno yet)
> >> 2) push the data into Cassandra (our NoSQL solution of choice) and let
> >> that hash/map system do it for us.   As I get more into Hadoop looking
> >> at HBas is tempting but then just one more thing to learn.
> >>
> >> I would really like to not have to reinvent a wheel here and even
> >> contribute if something is going on as it is a use case in our work
> >> effort.
> >>
> >> Thanx in advance =8^)  Apologize I posted this on common dev yesterday
> >> by accident (so this is not a repost spam but appropriate for this
> >> list)
> >>
> >> Cheers.
> >>
> >> /*
> >> Joe Stein
> >> http://www.linkedin.com/in/charmalloc
> >> */
> >>
> >
>
>
>
> --
> /*
> Joe Stein
> http://www.linkedin.com/in/charmalloc
> */
>
>


-- 
Andrew Klochkov

Re: DeDuplication Techniques

Posted by "Ankur C. Goel" <ga...@yahoo-inc.com>.
The kind of need you specified is quite common in ETL-style processing. The fastest and most efficient way to do this is when you have all your historical data in HDFS itself. In this case you can do a LEFT OUTER join between the two datasets (with the new data as your left relation) in map-reduce, without querying a database or any other persistent store. Then you keep only the records which have fields from the left relation and NOT the right relation (the historical data).

A join can be easily implemented in map-reduce using the secondary sort trick. Basically, you can specify different mappers for different input data in the same M/R job, and in each mapper tag the record key with a relation id (0, 1...). This makes sure that, for a matching key, records from one relation appear before the records of the other relation in the reducer. You then cache them in memory and do a cross of them with each record of the new relation you see.
This might sound more complicated than it really is. Hadoop has sample code under examples for secondary sort, but no code for the join.
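
A simplified illustration of that reduce-side left outer join, old "mapred"
API. It tags values rather than keys and skips the secondary-sort ordering,
which is fine here because for dedup the reducer only needs to know whether
the historic side contains the hash key. Paths, tag strings, and the
hash<TAB>payload record layout are assumptions, not anything from the thread:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class LeftOuterDedup {

  // Historic side only needs to assert "this hash has been seen before".
  public static class HistoricMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable off, Text line,
        OutputCollector<Text, Text> out, Reporter r) throws IOException {
      String hash = line.toString().split("\t", 2)[0];
      out.collect(new Text(hash), new Text("H"));         // presence marker
    }
  }

  // New side carries the payload, tagged so the reducer can tell it apart.
  public static class NewDataMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable off, Text line,
        OutputCollector<Text, Text> out, Reporter r) throws IOException {
      String hash = line.toString().split("\t", 2)[0];
      out.collect(new Text(hash), new Text("N" + line));  // keep the record
    }
  }

  public static class JoinReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text hash, Iterator<Text> tagged,
        OutputCollector<Text, Text> out, Reporter r) throws IOException {
      boolean seenBefore = false;
      List<String> fresh = new ArrayList<String>();
      while (tagged.hasNext()) {
        String v = tagged.next().toString();
        if (v.startsWith("H")) seenBefore = true;
        else fresh.add(v.substring(1));
      }
      if (!seenBefore && !fresh.isEmpty()) {
        out.collect(hash, new Text(fresh.get(0)));        // first copy wins
      }
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(LeftOuterDedup.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setReducerClass(JoinReducer.class);
    MultipleInputs.addInputPath(conf, new Path("/data/historic"),
        TextInputFormat.class, HistoricMapper.class);
    MultipleInputs.addInputPath(conf, new Path("/data/new"),
        TextInputFormat.class, NewDataMapper.class);
    FileOutputFormat.setOutputPath(conf, new Path("/data/deduped"));
    JobClient.runJob(conf);
  }
}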

Another option is to use a high-level language like Pig or Hive that provides join operations and also exposes extensions to take care of skew in the data, i.e. data getting divided unevenly due to a few keys having the majority of records. This is the simplest and quickest option (in terms of developer productivity) IMO.

Regards
-@nkur


On 3/26/10 12:05 AM, "Joseph Stein" <cr...@gmail.com> wrote:

The thing is I have to check historic data (meaning data I have
already aggregated against) so I basically need to hold and read from
a file of hashes.

So within the current data set yes this would work but I then have to
open a file, loop through the value, see it is not there.

If it is there then throw it out, if not there add it to the end.

To me this opening a file checking for dups is a map/reduce task in itself.

What I was thinking is having my mapper take the data I wasn to
validate as unique.  I then loop through the files filters.  each data
point has a key that then allows me to get the file that has it's
data. e.g. a part of the data partions the hash of the data so each
file holds.  So my map job takes the data and breaks it into the
key/value pair (the key allows me to look up my filter file).

When it gets to the reducer... the key is the file I open up, I then
open the file... loop through it... if it is there throw the data
away.  if it is not there then add the hash of my data to the filter
file and then output (as the reduce output) the value of the unique.

This output of the unique is then the data I aggregate on which also
updated my historic filter so the next job (5 minutes later) see it,
etc.

On Thu, Mar 25, 2010 at 2:25 PM, Mark Kerzner <ma...@gmail.com> wrote:
> Joe,
>
> what about this approach:
>
> using hashmap values as your keys in MR maps. Since they are sorted by keys,
> in reducer you will get all duplicates together, so that you can loop
> through them. As the simplest solution, you just take the first one.
>
> Sincerely,
> Mark
>
> On Thu, Mar 25, 2010 at 1:09 PM, Joseph Stein <cr...@gmail.com> wrote:
>
>> I have been researching ways to handle de-dupping data while running a
>> map/reduce program (so as to not re-calculate/re-aggregate data that
>> we have seen before[possibly months before]).
>>
>> The data sets we have are littered with repeats of data from mobile
>> devices which continue to come in over time (so we may see duplicates
>> of data re-posted months after it originally posted...)
>>
>> I have 2 ways so far I can go about it (one way I do in production
>> without Hadoop) and interested to see if others have faced/solved this
>> in Hadoop/HDFS and what their experience might be.
>>
>> 1) handle my own hash filter (where I continually store and look up a
>> hash (MD5, bloom, whatever) of the data I am aggregating on as
>> existing already).  We do this now without Hadoop perhaps a variant
>> can be ported into HDFS as map task, reducing the results to files and
>> restoring the hash table (maybe in Hive or something, dunno yet)
>> 2) push the data into Cassandra (our NoSQL solution of choice) and let
>> that hash/map system do it for us.   As I get more into Hadoop looking
>> at HBas is tempting but then just one more thing to learn.
>>
>> I would really like to not have to reinvent a wheel here and even
>> contribute if something is going on as it is a use case in our work
>> effort.
>>
>> Thanx in advance =8^)  Apologize I posted this on common dev yesterday
>> by accident (so this is not a repost spam but appropriate for this
>> list)
>>
>> Cheers.
>>
>> /*
>> Joe Stein
>> http://www.linkedin.com/in/charmalloc
>> */
>>
>



--
/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/


Re: DeDuplication Techniques

Posted by Joseph Stein <cr...@gmail.com>.
The thing is I have to check historic data (meaning data I have
already aggregated against), so I basically need to hold and read from
a file of hashes.

So within the current data set, yes, this would work, but I then have to
open a file, loop through the values, and see that each one is not there.

If it is there then throw it out; if it is not there, add it to the end.

To me this opening of a file and checking for dups is a map/reduce task
in itself.

What I was thinking is having my mapper take the data I want to
validate as unique.  I then loop through the filter files.  Each data
point has a key that then allows me to get the file that has its
data, e.g. part of the data partitions the hash of the data so each
file holds a slice of the hashes.  So my map job takes the data and
breaks it into the key/value pair (the key allows me to look up my
filter file).

When it gets to the reducer... the key is the file I open up.  I then
open the file... loop through it... if the record is there I throw the
data away; if it is not there then I add the hash of my data to the
filter file and then output (as the reduce output) the value of the
unique record.

This output of unique records is then the data I aggregate on, which
also updates my historic filter so the next job (5 minutes later) sees
it, etc.

On Thu, Mar 25, 2010 at 2:25 PM, Mark Kerzner <ma...@gmail.com> wrote:
> Joe,
>
> what about this approach:
>
> using hashmap values as your keys in MR maps. Since they are sorted by keys,
> in reducer you will get all duplicates together, so that you can loop
> through them. As the simplest solution, you just take the first one.
>
> Sincerely,
> Mark
>
> On Thu, Mar 25, 2010 at 1:09 PM, Joseph Stein <cr...@gmail.com> wrote:
>
>> I have been researching ways to handle de-dupping data while running a
>> map/reduce program (so as to not re-calculate/re-aggregate data that
>> we have seen before[possibly months before]).
>>
>> The data sets we have are littered with repeats of data from mobile
>> devices which continue to come in over time (so we may see duplicates
>> of data re-posted months after it originally posted...)
>>
>> I have 2 ways so far I can go about it (one way I do in production
>> without Hadoop) and interested to see if others have faced/solved this
>> in Hadoop/HDFS and what their experience might be.
>>
>> 1) handle my own hash filter (where I continually store and look up a
>> hash (MD5, bloom, whatever) of the data I am aggregating on as
>> existing already).  We do this now without Hadoop perhaps a variant
>> can be ported into HDFS as map task, reducing the results to files and
>> restoring the hash table (maybe in Hive or something, dunno yet)
>> 2) push the data into Cassandra (our NoSQL solution of choice) and let
>> that hash/map system do it for us.   As I get more into Hadoop looking
>> at HBas is tempting but then just one more thing to learn.
>>
>> I would really like to not have to reinvent a wheel here and even
>> contribute if something is going on as it is a use case in our work
>> effort.
>>
>> Thanx in advance =8^)  Apologize I posted this on common dev yesterday
>> by accident (so this is not a repost spam but appropriate for this
>> list)
>>
>> Cheers.
>>
>> /*
>> Joe Stein
>> http://www.linkedin.com/in/charmalloc
>> */
>>
>



-- 
/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/

Re: DeDuplication Techniques

Posted by Mark Kerzner <ma...@gmail.com>.
Joe,

what about this approach:

use the hash values as your keys in the MR maps. Since map output is sorted
by key, in the reducer you will get all duplicates together, so you can loop
through them. As the simplest solution, you just take the first one.
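
A minimal sketch of this idea in the newer "mapreduce" API: the map keys
each record by a hash of itself, and the reducer keeps the first copy of
each group. Note this dedups only within the current dataset (the point
about historic data raised elsewhere in the thread still stands); the hash
choice and types are illustrative:

import java.io.IOException;
import java.security.MessageDigest;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class HashDedup {

  public static class HashMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable off, Text record, Context ctx)
        throws IOException, InterruptedException {
      String hex;
      try {
        byte[] d = MessageDigest.getInstance("MD5")
            .digest(record.toString().getBytes("UTF-8"));
        StringBuilder sb = new StringBuilder();
        for (byte b : d) sb.append(String.format("%02x", b));
        hex = sb.toString();
      } catch (Exception e) {
        throw new IOException(e);
      }
      // Identical records share a hash, so they meet at the same reducer.
      ctx.write(new Text(hex), record);
    }
  }

  public static class FirstOneReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text hash, Iterable<Text> copies, Context ctx)
        throws IOException, InterruptedException {
      // All records with the same hash arrive together; keep the first copy.
      ctx.write(hash, copies.iterator().next());
    }
  }
}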

Sincerely,
Mark

On Thu, Mar 25, 2010 at 1:09 PM, Joseph Stein <cr...@gmail.com> wrote:

> I have been researching ways to handle de-dupping data while running a
> map/reduce program (so as to not re-calculate/re-aggregate data that
> we have seen before[possibly months before]).
>
> The data sets we have are littered with repeats of data from mobile
> devices which continue to come in over time (so we may see duplicates
> of data re-posted months after it originally posted...)
>
> I have 2 ways so far I can go about it (one way I do in production
> without Hadoop) and interested to see if others have faced/solved this
> in Hadoop/HDFS and what their experience might be.
>
> 1) handle my own hash filter (where I continually store and look up a
> hash (MD5, bloom, whatever) of the data I am aggregating on as
> existing already).  We do this now without Hadoop perhaps a variant
> can be ported into HDFS as map task, reducing the results to files and
> restoring the hash table (maybe in Hive or something, dunno yet)
> 2) push the data into Cassandra (our NoSQL solution of choice) and let
> that hash/map system do it for us.   As I get more into Hadoop looking
> at HBas is tempting but then just one more thing to learn.
>
> I would really like to not have to reinvent a wheel here and even
> contribute if something is going on as it is a use case in our work
> effort.
>
> Thanx in advance =8^)  Apologize I posted this on common dev yesterday
> by accident (so this is not a repost spam but appropriate for this
> list)
>
> Cheers.
>
> /*
> Joe Stein
> http://www.linkedin.com/in/charmalloc
> */
>