Posted to common-user@hadoop.apache.org by "Oliver B. Fischer" <ma...@swe-blog.net> on 2012/09/26 15:19:26 UTC
Programming Question / Joining Dataset
Hi all,
I have to join two large datasets A and B. I preprocess both datasets by
parsing the source text files and creating custom datatypes ADT and BDT
out of them.
Now I have to join these data. Both datasets A' and B' already have
the same datatype as key. But how can I pass both custom datatypes ADT
and BDT to the same reducer instance for joining?
Bye,
Oliver
Re: Programming Question / Joining Dataset
Posted by Bertrand Dechoux <de...@gmail.com>.
One option is a container type with generics that can hold either side of the join.
It works, but it is code that you would rather not write and maintain.
That's why solutions with a higher level of abstraction are needed.
As always, "premature optimization is the root of all evil", and in most
cases the safer bet is Hive, Pig, Cascading, and the like.
Regards
Bertrand
On Wed, Sep 26, 2012 at 3:39 PM, Oliver B. Fischer <ma...@swe-blog.net> wrote:
> Yes I know Hive and also Pig. Both are suitable for my problems but before
> starting with one of them I simply would like to know how to do it with
> pure MR. ;-)
>
> Bye,
>
> Oliver
>
>
> On 09/26/2012 03:36 PM, bharath vissapragada wrote:
>
>> Have you seen Hive [1]? It can join datasets over MapReduce. You can
>> also provide custom SerDes to read your file format (to avoid
>> pre-processing) and create your own data types (e.g. maps of maps,
>> arrays, etc.).
>>
>> [1] https://cwiki.apache.org/Hive/home.html
>>
>> On Wed, Sep 26, 2012 at 6:49 PM, Oliver B. Fischer
>> <mailsink@swe-blog.net <ma...@swe-blog.net>> wrote:
>>
>> Hi all,
>>
>> I have to join two large datasets A and B. I preprocess both datasets
>> by parsing the source text files and creating custom datatypes ADT
>> and BDT out of them.
>>
>> Now I have to join these data. Both datasets A' and B' already
>> have the same datatype as key. But how can I pass both custom
>> datatypes ADT and BDT to the same reducer instance for joining?
>>
>> Bye,
>>
>> Oliver
>>
>>
>>
>>
>> --
>> Regards,
>> Bharath .V
>> w: http://researchweb.iiit.ac.in/~bharath.v
>>
>
--
Bertrand Dechoux
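Bertrand's "container type with generics" can be sketched as a small either/or wrapper that carries one ADT or one BDT under a single reducer value type. The following is a minimal plain-Java sketch; the class and method names are invented for illustration, and a real Hadoop job would additionally need the class to implement Writable, which is omitted here.

```java
import java.util.Optional;

// A minimal tagged container holding either an "A-side" or a "B-side"
// record under one reducer value type. In a real Hadoop job this class
// would also implement Writable (write/readFields) so it can be shuffled.
public class TaggedUnion<A, B> {
    private final A left;   // non-null when this wraps an A-side record
    private final B right;  // non-null when this wraps a B-side record

    private TaggedUnion(A left, B right) {
        this.left = left;
        this.right = right;
    }

    public static <A, B> TaggedUnion<A, B> ofA(A value) {
        return new TaggedUnion<>(value, null);
    }

    public static <A, B> TaggedUnion<A, B> ofB(B value) {
        return new TaggedUnion<>(null, value);
    }

    public boolean isA() { return left != null; }

    public Optional<A> getA() { return Optional.ofNullable(left); }

    public Optional<B> getB() { return Optional.ofNullable(right); }

    public static void main(String[] args) {
        TaggedUnion<String, Integer> fromA = TaggedUnion.ofA("adt record");
        System.out.println(fromA.isA()); // prints "true"
    }
}
```

The reducer would then receive an Iterable of such containers and separate the two sides by checking isA(), which is exactly the boilerplate Bertrand says you would rather not write and maintain.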
Re: Programming Question / Joining Dataset
Posted by Kai Voigt <k...@123.org>.
The design pattern for this is called "Reduce-side Join". Enter it into Google and you will get a lot of details.
Kai
On 26.09.2012 at 15:39, "Oliver B. Fischer" <ma...@swe-blog.net> wrote:
> Yes, I know Hive and also Pig. Both are suitable for my problem, but before starting with one of them I would simply like to know how to do it with pure MR. ;-)
>
> Bye,
>
> Oliver
>
> On 09/26/2012 03:36 PM, bharath vissapragada wrote:
>> Have you seen Hive [1]? It can join datasets over MapReduce. You can
>> also provide custom SerDes to read your file format (to avoid
>> pre-processing) and create your own data types (e.g. maps of maps,
>> arrays, etc.).
>>
>> [1] https://cwiki.apache.org/Hive/home.html
>>
>> On Wed, Sep 26, 2012 at 6:49 PM, Oliver B. Fischer
>> <mailsink@swe-blog.net <ma...@swe-blog.net>> wrote:
>>
>> Hi all,
>>
>> I have to join two large datasets A and B. I preprocess both datasets
>> by parsing the source text files and creating custom datatypes ADT
>> and BDT out of them.
>>
>> Now I have to join these data. Both datasets A' and B' already
>> have the same datatype as key. But how can I pass both custom
>> datatypes ADT and BDT to the same reducer instance for joining?
>>
>> Bye,
>>
>> Oliver
>>
>>
>>
>>
>> --
>> Regards,
>> Bharath .V
>> w: http://researchweb.iiit.ac.in/~bharath.v
>
--
Kai Voigt
k@123.org
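The reduce-side join pattern Kai names can be simulated outside Hadoop in a few lines: each mapper tags its records with the source dataset, the shuffle groups the tagged records by join key, and the reducer pairs the two sides. The following is a plain-Java sketch of that control flow; it deliberately uses no Hadoop API, and all names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of a reduce-side join: tag each record with its
// source dataset, "shuffle" by grouping on the join key, then pair the
// two sides per key in the reduce step.
public class ReduceSideJoinSketch {

    // A tagged intermediate record, as a mapper would emit it.
    static final class Tagged {
        final String source; // "A" or "B"
        final String value;
        Tagged(String source, String value) { this.source = source; this.value = value; }
    }

    // Map phase: emit (key, tagged value) for both datasets, then group
    // by key -- this grouping is what the Hadoop shuffle does for free.
    static Map<String, List<Tagged>> shuffle(Map<String, String> a, Map<String, String> b) {
        Map<String, List<Tagged>> grouped = new LinkedHashMap<>();
        a.forEach((k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(new Tagged("A", v)));
        b.forEach((k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(new Tagged("B", v)));
        return grouped;
    }

    // Reduce phase: for each key, separate the values by tag and emit the
    // cross product of the two sides (an inner join).
    static Map<String, List<String>> reduce(Map<String, List<Tagged>> grouped) {
        Map<String, List<String>> joined = new LinkedHashMap<>();
        grouped.forEach((key, tagged) -> {
            List<String> as = new ArrayList<>();
            List<String> bs = new ArrayList<>();
            for (Tagged t : tagged) {
                if (t.source.equals("A")) as.add(t.value); else bs.add(t.value);
            }
            for (String av : as)
                for (String bv : bs)
                    joined.computeIfAbsent(key, x -> new ArrayList<>()).add(av + "|" + bv);
        });
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> a = Map.of("k1", "a1");
        Map<String, String> b = Map.of("k1", "b1");
        System.out.println(reduce(shuffle(a, b))); // prints "{k1=[a1|b1]}"
    }
}
```

In a real job the tag would travel inside the value type (or a container type as Bertrand suggests), and keys present in only one dataset simply produce no joined output, as an inner join should.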
Re: Programming Question / Joining Dataset
Posted by "Oliver B. Fischer" <ma...@swe-blog.net>.
Yes, I know Hive and also Pig. Both are suitable for my problem, but
before starting with one of them I would simply like to know how to do
it with pure MR. ;-)
Bye,
Oliver
On 09/26/2012 03:36 PM, bharath vissapragada wrote:
> Have you seen Hive [1]? It can join datasets over MapReduce. You can
> also provide custom SerDes to read your file format (to avoid
> pre-processing) and create your own data types (e.g. maps of maps,
> arrays, etc.).
>
> [1] https://cwiki.apache.org/Hive/home.html
>
> On Wed, Sep 26, 2012 at 6:49 PM, Oliver B. Fischer
> <mailsink@swe-blog.net <ma...@swe-blog.net>> wrote:
>
> Hi all,
>
> I have to join two large datasets A and B. I preprocess both datasets
> by parsing the source text files and creating custom datatypes ADT
> and BDT out of them.
>
> Now I have to join these data. Both datasets A' and B' already
> have the same datatype as key. But how can I pass both custom
> datatypes ADT and BDT to the same reducer instance for joining?
>
> Bye,
>
> Oliver
>
>
>
>
> --
> Regards,
> Bharath .V
> w: http://researchweb.iiit.ac.in/~bharath.v
Re: Programming Question / Joining Dataset
Posted by bharath vissapragada <bh...@gmail.com>.
Have you seen Hive [1]? It can join datasets over MapReduce. You can also
provide custom SerDes to read your file format (to avoid pre-processing)
and create your own data types (e.g. maps of maps, arrays, etc.).
[1] https://cwiki.apache.org/Hive/home.html
On Wed, Sep 26, 2012 at 6:49 PM, Oliver B. Fischer <ma...@swe-blog.net> wrote:
> Hi all,
>
> I have to join two large datasets A and B. I preprocess both datasets by
> parsing the source text files and creating custom datatypes ADT and BDT
> out of them.
>
> Now I have to join these data. Both datasets A' and B' already have the
> same datatype as key. But how can I pass both custom datatypes ADT and BDT
> to the same reducer instance for joining?
>
> Bye,
>
> Oliver
>
--
Regards,
Bharath .V
w: http://researchweb.iiit.ac.in/~bharath.v
Re: Programming Question / Joining Dataset
Posted by Bejoy Ks <be...@gmail.com>.
Hi Oliver
I have written up a small post on reduce-side joins; the implementation
matches your requirement:
http://kickstarthadoop.blogspot.in/2011/09/joins-with-plain-map-reduce.html
Regards
Bejoy KS
Re: Programming Question / Joining Dataset
Posted by Jay Vyas <ja...@gmail.com>.
1) One really easy but clumsy way is to encode/decode the ADT/BDT as
MapWritable objects when you write them (in your mappers), and then read
and decode them in the reducers.
2) The idiomatic way is to use a serialization framework like
Avro/Thrift/... This will take more work to get going, but in the long
run your code will read like standard Java (i.e. you can use Java POJOs
that are read/written by your serialization framework, rather than the
limited Writable framework, which doesn't really support ADTs).
So to exemplify 1, your Reducer signature might look like this:

  Reducer<Text, MapWritable, Text, IntWritable>

and your reduce method:

  @Override
  protected void reduce(Text key, Iterable<MapWritable> values, Context context)
          throws IOException, InterruptedException {
      Iterator<MapWritable> it = values.iterator();
      ADTA adtA = ADTA.readFromMap(it.next());
      ADTB adtB = ADTB.readFromMap(it.next());
      // Note: the order of values within a key is not guaranteed, so in
      // practice you would check a tag field (or use a secondary sort)
      // instead of relying on the A-side record arriving first.
      // Let's say we're summing the ages of A and B and emitting that as
      // the final value.
      context.write(key, new IntWritable(adtA.getAge() + adtB.getAge()));
  }

  public class ADTA {
      private Integer age;

      public Integer getAge() { return age; }

      public static ADTA readFromMap(MapWritable m) {
          ADTA adt = new ADTA();
          adt.age = Integer.parseInt(m.get(new Text("age")).toString());
          return adt;
      }
  }
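The round trip behind Jay's option 1 (the mapper flattens a custom type into field-name/value entries, the reducer rebuilds the type from them) can be sketched with a plain java.util.Map standing in for Hadoop's MapWritable. The Adt type and its fields below are invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the encode/decode round trip behind the MapWritable approach:
// the mapper flattens a custom type into field-name -> string entries,
// and the reducer reconstructs the type from those entries. A plain Map
// stands in for Hadoop's MapWritable here.
public class MapEncodingSketch {

    // A stand-in for a custom datatype such as ADT.
    static final class Adt {
        final String name;
        final int age;
        Adt(String name, int age) { this.name = name; this.age = age; }
    }

    // Mapper side: flatten the object into a map, one entry per field.
    static Map<String, String> encode(Adt adt) {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("name", adt.name);
        m.put("age", Integer.toString(adt.age));
        return m;
    }

    // Reducer side: rebuild the object from the map entries.
    static Adt decode(Map<String, String> m) {
        return new Adt(m.get("name"), Integer.parseInt(m.get("age")));
    }

    public static void main(String[] args) {
        Adt adt = decode(encode(new Adt("oliver", 40)));
        System.out.println(adt.name + " " + adt.age); // prints "oliver 40"
    }
}
```

This is the clumsiness Jay refers to: every field needs hand-written encode/decode code, which is exactly what Avro or Thrift schemas generate for you.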