Posted to hdfs-user@hadoop.apache.org by Lior Schachter <li...@gmail.com> on 2013/08/01 17:21:52 UTC

Efficient join operation

Hi all,
I have a fairly simple M/R scenario: join subscriber records (50 M
records / 20 TB) with subscriber events (1 B records / 5 TB). The goal is to
update the subscriber records with the incoming events.

A few possible solutions:
1. Reduce-side join. In the map, emit the subscriber id as the key. Each
reduce call will then get the subscriber record plus its events, and
application code can update the record using the events.
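The reduce-side join logic can be sketched in plain Java (Hadoop boilerplate omitted; the "R:"/"E:" source tags, ids, and field values are hypothetical). The mapper tags each value with its source and emits the subscriber id as key; the shuffle groups by id; the reducer merges.

```java
import java.util.*;

// Plain-Java sketch of a reduce-side join. In the real job the mapper
// emits (subscriberId, taggedValue) and the framework does the grouping.
public class ReduceSideJoinSketch {
    // Group tagged (key, value) pairs by subscriber id, as the shuffle would.
    static Map<String, List<String>> shuffle(List<String[]> emitted) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : emitted)
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        return grouped;
    }

    // "Reduce": pick out the record, then apply every event to it.
    static String reduce(List<String> values) {
        String record = null;
        List<String> events = new ArrayList<>();
        for (String v : values) {
            if (v.startsWith("R:")) record = v.substring(2); // the subscriber record
            else events.add(v.substring(2));                 // an event for that id
        }
        return record + "+" + String.join(",", events);      // "update" is illustrative
    }

    public static void main(String[] args) {
        List<String[]> emitted = Arrays.asList(
            new String[]{"sub1", "R:rec1"},
            new String[]{"sub1", "E:login"},
            new String[]{"sub1", "E:logout"});
        for (Map.Entry<String, List<String>> e : shuffle(emitted).entrySet())
            System.out.println(e.getKey() + " -> " + reduce(e.getValue()));
        // prints: sub1 -> rec1+login,logout
    }
}
```

Note that with one record and many events per key, a secondary sort (so the record arrives first in the reducer's value list) avoids buffering all events in memory.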

2. Arrange the records/events into smaller files (by subscriber id modulo)
and invoke a separate M/R job per pair, e.g. records_m_1 * events_m_1,
records_m_2 * events_m_2. Same logic as in 1, but now working on smaller,
better-correlated files.
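The bucketing step above can be sketched as follows (the bucket count of 4 and the ids are arbitrary illustrative choices); the key point is that both inputs are split with the same function, so bucket i of records only ever needs to join with bucket i of events.

```java
import java.util.*;

// Sketch of splitting subscriber ids into modulo buckets so that
// records_m_i and events_m_i can be joined by an independent M/R job.
public class ModuloBucketing {
    static Map<Integer, List<Long>> split(long[] subscriberIds, int numBuckets) {
        Map<Integer, List<Long>> buckets = new TreeMap<>();
        for (long id : subscriberIds)
            buckets.computeIfAbsent((int) (id % numBuckets), b -> new ArrayList<>())
                   .add(id);
        return buckets;
    }

    public static void main(String[] args) {
        // 4 buckets is a hypothetical choice; in practice it would be
        // sized to the cluster and the per-job memory budget.
        System.out.println(split(new long[]{7, 12, 15, 100}, 4));
        // prints: {0=[12, 100], 3=[7, 15]}
    }
}
```

In a real Hadoop job the same effect is usually achieved with a custom Partitioner rather than a separate preprocessing pass.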

3. Use the Composite Join pattern, which performs the join on the map side,
saving an additional read/write pass. On the other hand, it requires the
events files to be sorted and identically partitioned (hence another M/R job).
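What the map-side composite join buys, once both inputs are sorted by subscriber id, is a single streaming merge with no shuffle. A minimal sketch of that merge (field names and values are hypothetical; Hadoop's CompositeInputFormat does this per input split):

```java
import java.util.*;

// Sketch of a map-side merge join over two inputs already sorted by
// subscriber id -- the precondition the composite join imposes.
public class MergeJoinSketch {
    // Each row is {subscriberId, payload}.
    static List<String> join(String[][] records, String[][] events) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < records.length && j < events.length) {
            int cmp = records[i][0].compareTo(events[j][0]);
            if (cmp < 0) i++;       // record with no events: advance records
            else if (cmp > 0) j++;  // event with no record: advance events
            else {                  // ids match: emit the joined pair
                out.add(records[i][0] + ":" + records[i][1] + "+" + events[j][1]);
                j++;                // one record may match many events
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] records = {{"a", "recA"}, {"b", "recB"}};
        String[][] events  = {{"a", "e1"}, {"a", "e2"}, {"b", "e3"}};
        System.out.println(join(records, events));
        // prints: [a:recA+e1, a:recA+e2, b:recB+e3]
    }
}
```

The trade-off in the text follows directly: the merge itself is cheap and sequential, but the sort/partition pass that establishes its precondition costs roughly one extra M/R job over the 5 TB of events.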

I'd appreciate the forum's feedback on these 3 alternatives.

Thanks,
Lior