Posted to common-user@hadoop.apache.org by Tamir Kamara <ta...@gmail.com> on 2009/03/24 12:33:07 UTC

Join Variation

Hi,

We need to implement a join with a BETWEEN operator instead of an equals.
What we are trying to do is look up each key in a search file, where the key
falls between two fields of a search-file record, like this:

main file (ip, a, b):
(80, zz, yy)
(125, vv, bb)

search file (from-ip, to-ip, d, e):
(52, 75, xxx, yyy)
(78, 98, aaa, bbb)
(99, 115, xxx, ddd)
(125, 130, hhh, aaa)
(150, 162, qqq, sss)

the outcome should be in the form (ip, a, b, d, e):
(80, zz, yy, aaa, bbb)
(125, vv, bb, hhh, aaa)

We could convert the IP ranges in the search file to individual single-IP
records and then do a regular join, but the number of single IPs is huge, so
this is probably not a good way.
What would be a good course for doing this in Hadoop?


Thanks,
Tamir
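
To pin down the join semantics being asked for, here is a minimal
single-machine sketch in plain Java (no Hadoop); the class and method names
are illustrative, and the IPs are kept as plain integers as in the example
above:

    import java.util.Map;
    import java.util.TreeMap;

    public class RangeLookupSketch {
        // search file rows, keyed by from-ip: from-ip -> { to-ip, d, e }
        private static final TreeMap<Integer, String[]> RANGES =
            new TreeMap<Integer, String[]>();

        // returns { d, e } if ip falls inside some [from-ip, to-ip] range, else null
        static String[] lookup(int ip) {
            Map.Entry<Integer, String[]> e = RANGES.floorEntry(ip); // greatest from-ip <= ip
            if (e == null) {
                return null;
            }
            String[] row = e.getValue();
            int toIp = Integer.parseInt(row[0]);
            return ip <= toIp ? new String[] { row[1], row[2] } : null;
        }

        public static void main(String[] args) {
            RANGES.put(52,  new String[] { "75",  "xxx", "yyy" });
            RANGES.put(78,  new String[] { "98",  "aaa", "bbb" });
            RANGES.put(99,  new String[] { "115", "xxx", "ddd" });
            RANGES.put(125, new String[] { "130", "hhh", "aaa" });
            RANGES.put(150, new String[] { "162", "qqq", "sss" });
            System.out.println(java.util.Arrays.toString(lookup(80)));  // [aaa, bbb]
            System.out.println(java.util.Arrays.toString(lookup(125))); // [hhh, aaa]
        }
    }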

Re: Join Variation

Posted by jason hadoop <ja...@gmail.com>.
It will probably be available in a week or so, as draft one isn't quite finished :)


Re: Join Variation

Posted by Stefan Podkowinski <sp...@gmail.com>.
... and it is not yet available as an alpha book chapter. Any chance of uploading it?


Re: Join Variation

Posted by jason hadoop <ja...@gmail.com>.
Just for fun, chapter 9 in my book is a work-through of solving this class
of problem.



Re: Join Variation

Posted by jason hadoop <ja...@gmail.com>.
For the classic map/reduce job, you have 3 requirements.

1) a comparator that provides the keys in IP-address order, such that all
keys in one of your ranges are contiguous when sorted with the comparator;
2) a partitioner that ensures that all keys that should be together end up
in the same partition;
3) an output value grouping comparator that considers all keys in a
specified range equal.

The comparator only sorts by the first part of the key; the search file has
a two-part key (begin/end), while the input data has just a one-part key.
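
A minimal sketch of such a key type and its sort order (the names RangeKey
etc. are illustrative, not from the original post); one detail left implicit
above is that, on a tie in the first part, the two-part range key must sort
ahead of the one-part key so it can open the group:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class RangeKey implements WritableComparable<RangeKey> {
        long ip;        // the one-part key, or the start of a range
        long endIp;     // the end of a range; meaningful only when hasEnd is true
        boolean hasEnd; // true for two-part (search file) keys

        public void write(DataOutput out) throws IOException {
            out.writeLong(ip);
            out.writeBoolean(hasEnd);
            out.writeLong(endIp);
        }

        public void readFields(DataInput in) throws IOException {
            ip = in.readLong();
            hasEnd = in.readBoolean();
            endIp = in.readLong();
        }

        // sort by the first part only; on a tie, the range key goes first
        public int compareTo(RangeKey o) {
            if (ip != o.ip) {
                return ip < o.ip ? -1 : 1;
            }
            if (hasEnd == o.hasEnd) {
                return 0;
            }
            return hasEnd ? -1 : 1;
        }
    }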

A partitioner that knew ahead of time the group sets in your search set, the
way the TeraSort example works, would be ideal:
i.e., it builds an index of ranges from your search set so that the ranges
get roughly evenly split between your reduces.
This requires a pass over the search file to write out a summary file, which
is then loaded by the partitioner.
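
A sketch of that TeraSort-style partitioner (illustrative names again;
reading the summary file itself is elided):

    import java.util.Arrays;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class RangePartitioner implements Partitioner<RangeKey, Text> {
        // sorted first-part boundaries, aligned to range edges from the summary
        private long[] splitPoints;

        public void configure(JobConf job) {
            // load splitPoints from the precomputed summary file (elided);
            // an empty array sends everything to partition 0
            splitPoints = new long[0];
        }

        public int getPartition(RangeKey key, Text value, int numPartitions) {
            int i = Arrays.binarySearch(splitPoints, key.ip);
            if (i < 0) {
                i = -i - 1; // not an exact boundary: use the insertion point
            }
            return Math.min(i, numPartitions - 1);
        }
    }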

The output value grouping comparator will get the keys in order of the first
token. It will define the start of a group by the presence of a two-part
key, and consider the group ended when either another two-part key appears
or the key value is larger than the second part of the starting key. This
does require that the grouping comparator maintain state.
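
A sketch of that stateful grouping comparator (registered via
JobConf.setOutputValueGroupingComparator; the raw-bytes compare overload is
omitted for brevity, and the names remain illustrative):

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    public class RangeGroupingComparator extends WritableComparator {
        private long groupStart = Long.MIN_VALUE;
        private long groupEnd   = Long.MIN_VALUE;

        public RangeGroupingComparator() {
            super(RangeKey.class, true); // true: instantiate keys so compare() gets objects
        }

        public int compare(WritableComparable a, WritableComparable b) {
            RangeKey next = (RangeKey) b;
            if (next.hasEnd) {
                // a two-part key always starts a new group and records its bounds
                groupStart = next.ip;
                groupEnd   = next.endIp;
                return -1;
            }
            // a one-part key stays in the group while it falls inside the open range
            return (next.ip >= groupStart && next.ip <= groupEnd) ? 0 : -1;
        }
    }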

At this point, your reduce will be called with the first key in the
key-equivalence group of (3), along with the values of all of the keys in
that group.

In your map, any address that is not in a range of interest is simply not
passed to output.collect.
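
As a sketch, the map then reduces to a parse-and-filter (the range check is
a stub here; in a real job it would consult the same summary the partitioner
loads, and the parsing is naive):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class RangeJoinMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, RangeKey, Text> {

        public void map(LongWritable offset, Text line,
                        OutputCollector<RangeKey, Text> output, Reporter reporter)
                throws IOException {
            RangeKey key = new RangeKey();
            key.ip = Long.parseLong(
                line.toString().replace("(", "").split(",")[0].trim()); // naive parse of "(ip, a, b)"
            if (inSomeRange(key.ip)) { // drop addresses outside every range
                output.collect(key, line);
            }
        }

        private boolean inSomeRange(long ip) {
            return true; // stub: a real job would check the range summary here
        }
    }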

For the map-side join code, you have to define a comparator on the key type
that captures your definition of equivalence and ordering, and call
WritableComparator.define(Key.class, comparatorInstance) - note that
define() takes a comparator instance, not a class - to force the join code
to use your comparator.
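
For example, with the illustrative RangeKey from above, a minimal
registration might look like:

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    public class RangeKeyComparator extends WritableComparator {
        public RangeKeyComparator() {
            super(RangeKey.class, true);
        }

        public int compare(WritableComparable a, WritableComparable b) {
            return ((RangeKey) a).compareTo((RangeKey) b); // first-part ordering
        }

        static {
            // register the custom ordering so the join code picks it up
            WritableComparator.define(RangeKey.class, new RangeKeyComparator());
        }
    }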

For tables with duplicate keys (per the key comparator) in a map-side join,
your map function will receive a row for every permutation of the duplicate
keys: if one table has a,1; a,2 and another table has a,3; a,4, your map
will receive 4 rows: a,1,3; a,1,4; a,2,3; a,2,4.



Re: Join Variation

Posted by Tamir Kamara <ta...@gmail.com>.
Thanks to all who replied.

Stefan -
I can't see how converting IP ranges to network masks would help, because
different ranges can have the same network mask, and I would still have to
compare two fields: the searched IP against from-IP & mask.

Pig - I'm familiar with Pig and use it often, but I can't think of a way to
write a Pig script that will do this type of "join". I'll ask the Pig users
group.

The search file is indeed large in terms of the number of records. However,
I don't see this as an issue yet, because I'm still puzzled about how to
write the job in plain MR. The join code looks for an exact match between
keys, and that is not what I need. Would a custom comparator which looks for
a match within the ranges be the right choice for this?

Thanks,
Tamir


Re: Join Variation

Posted by jason hadoop <ja...@gmail.com>.
If the search file data set is large, the issue becomes ensuring that only
the required portion of the search file is actually read, and that those
reads are ordered in the search file's key order.

If the data set is small, most any of the common patterns will work.

I haven't looked at Pig for a while - does Pig now use the indexes in map
files, and take into account that a data set is sorted?
Out of the box, the map-side join code (org.apache.hadoop.mapred.join) will
do a decent job of this, but the entire search file set will be read.
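
A sketch of the wiring, with made-up paths and SequenceFile inputs as
assumptions:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.join.CompositeInputFormat;

    public class MapSideJoinDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(MapSideJoinDriver.class);
            conf.setInputFormat(CompositeInputFormat.class);
            // "inner" pairs up rows whose keys the registered comparator deems
            // equal; each map() call then receives a key plus a TupleWritable
            // holding one row per source
            conf.set("mapred.join.expr", CompositeInputFormat.compose(
                "inner", SequenceFileInputFormat.class,
                new Path("/data/main"), new Path("/data/search")));
            // mapper, output types, and JobClient.runJob(conf) elided
        }
    }
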
To stop reading the entire search file, a record reader or join type would
need to be put together to:
a) skip to the first key of interest, using the index if available;
b) finish when the last possible key of interest has been delivered.




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422

Re: Join Variation

Posted by John Lee <j....@gmail.com>.
In addition to other suggestions, you could also take a look at
building a Cascading job with a custom Joiner class.

- John


Re: Join Variation

Posted by Peeyush Bishnoi <pe...@yahoo-inc.com>.
Hello Tamir,

I think a better and simpler way of doing this is through Pig.

http://wiki.apache.org/pig/PigOverview

Pig provides an SQL-like interface over Hadoop and supports the kind of
operation you need to do with your data quite easily.


Thanks,

---
Peeyush


Re: Join Variation

Posted by Stefan Podkowinski <sp...@gmail.com>.
Have you considered HBase for this particular task?
It looks like a simple lookup using the network mask as the key would solve
your problem.

It is also possible to derive the network class (A, B, C) directly from the
IP concerned. But I guess your "search file" will cover ranges in more
detail than just the class level.
