Posted to common-user@hadoop.apache.org by "Oliver B. Fischer" <ma...@swe-blog.net> on 2012/09/26 15:19:26 UTC
Programming Question / Joining Dataset
Hi all,
I have to join two large datasets A and B. I preprocess both datasets by
parsing the source text files and creating custom datatypes ADT and BDT
out of them.
Now I have to join these data. Both datasets A' and B' already have
the same datatype as key. But how can I pass both custom datatypes ADT
and BDT to the same reducer instance for joining?
Bye,
Oliver
Re: Programming Question / Joining Dataset
Posted by Bertrand Dechoux <de...@gmail.com>.
One option is a container type with generics that can hold either side of the join.
It works, but it is code that you would rather not write and maintain.
That's why solutions with a higher level of abstraction are needed.
As always, "premature optimization is the root of all evil", and in most
cases the safer bet is Hive, Pig, Cascading, and the like.
Regards
Bertrand
On Wed, Sep 26, 2012 at 3:39 PM, Oliver B. Fischer <ma...@swe-blog.net> wrote:
> Yes I know Hive and also Pig. Both are suitable for my problems but before
> starting with one of them I simply would like to know how to do it with
> pure MR. ;-)
>
> Bye,
>
> Oliver
>
>
> On 09/26/2012 03:36 PM, bharath vissapragada wrote:
>
>> Have you seen Hive [1]? It can join datasets over MapReduce. You can
>> also provide custom SerDes to read your file format (to avoid
>> pre-processing) and create your own data types (e.g. maps of maps,
>> arrays, etc.).
>>
>> [1] https://cwiki.apache.org/Hive/home.html
>>
>> On Wed, Sep 26, 2012 at 6:49 PM, Oliver B. Fischer
>> <mailsink@swe-blog.net <ma...@swe-blog.net>> wrote:
>>
>> Hi all,
>>
>> I have to join two large datasets A and B. I preprocess both datasets
>> by parsing the source text files and creating custom datatypes ADT
>> and BDT out of them.
>>
>> Now I have to join these data. Both datasets A' and B' already
>> have the same datatype as key. But how can I pass both custom
>> datatypes ADT and BDT to the same reducer instance for joining?
>>
>> Bye,
>>
>> Oliver
>>
>>
>>
>>
>> --
>> Regards,
>> Bharath .V
>> w: http://researchweb.iiit.ac.in/~bharath.v
>>
>
--
Bertrand Dechoux
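Bertrand's "container type with generics" can be sketched as a small either/or wrapper that carries one ADT or one BDT under a single reducer value type. The following is a minimal plain-Java sketch; the class and method names are invented for illustration, and a real Hadoop job would additionally need the class to implement Writable, which is omitted here.

```java
import java.util.Optional;

// A minimal tagged container holding either an "A-side" or a "B-side"
// record under one reducer value type. In a real Hadoop job this class
// would also implement Writable (write/readFields) so it can be shuffled.
public class TaggedUnion<A, B> {
    private final A left;   // non-null when this wraps an A-side record
    private final B right;  // non-null when this wraps a B-side record

    private TaggedUnion(A left, B right) {
        this.left = left;
        this.right = right;
    }

    public static <A, B> TaggedUnion<A, B> ofA(A value) {
        return new TaggedUnion<>(value, null);
    }

    public static <A, B> TaggedUnion<A, B> ofB(B value) {
        return new TaggedUnion<>(null, value);
    }

    public boolean isA() { return left != null; }

    public Optional<A> getA() { return Optional.ofNullable(left); }

    public Optional<B> getB() { return Optional.ofNullable(right); }

    public static void main(String[] args) {
        TaggedUnion<String, Integer> fromA = TaggedUnion.ofA("adt record");
        System.out.println(fromA.isA()); // prints "true"
    }
}
```

The reducer would then receive an Iterable of such containers and separate the two sides by checking isA(), which is exactly the boilerplate Bertrand says you would rather not write and maintain.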
Re: Programming Question / Joining Dataset
Posted by Kai Voigt <k...@123.org>.
The design pattern for this is called "Reduce-side Join". Enter it into Google and you will get a lot of details.
Kai
On 26.09.2012 at 15:39, "Oliver B. Fischer" <ma...@swe-blog.net> wrote:
> Yes, I know Hive and also Pig. Both are suitable for my problem, but before starting with one of them I would simply like to know how to do it with pure MR. ;-)
>
> Bye,
>
> Oliver
>
> On 09/26/2012 03:36 PM, bharath vissapragada wrote:
>> Have you seen Hive [1]? It can join datasets over MapReduce. You can
>> also provide custom SerDes to read your file format (to avoid
>> pre-processing) and create your own data types (e.g. maps of maps,
>> arrays, etc.).
>>
>> [1] https://cwiki.apache.org/Hive/home.html
>>
>> On Wed, Sep 26, 2012 at 6:49 PM, Oliver B. Fischer
>> <mailsink@swe-blog.net <ma...@swe-blog.net>> wrote:
>>
>> Hi all,
>>
>> I have to join two large datasets A and B. I preprocess both datasets
>> by parsing the source text files and creating custom datatypes ADT
>> and BDT out of them.
>>
>> Now I have to join these data. Both datasets A' and B' already
>> have the same datatype as key. But how can I pass both custom
>> datatypes ADT and BDT to the same reducer instance for joining?
>>
>> Bye,
>>
>> Oliver
>>
>>
>>
>>
>> --
>> Regards,
>> Bharath .V
>> w: http://researchweb.iiit.ac.in/~bharath.v
>
--
Kai Voigt
k@123.org
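The reduce-side join pattern Kai names can be simulated outside Hadoop in a few lines: each mapper tags its records with the source dataset, the shuffle groups the tagged records by join key, and the reducer pairs the two sides. The following is a plain-Java sketch of that control flow; it deliberately uses no Hadoop API, and all names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of a reduce-side join: tag each record with its
// source dataset, "shuffle" by grouping on the join key, then pair the
// two sides per key in the reduce step.
public class ReduceSideJoinSketch {

    // A tagged intermediate record, as a mapper would emit it.
    static final class Tagged {
        final String source; // "A" or "B"
        final String value;
        Tagged(String source, String value) { this.source = source; this.value = value; }
    }

    // Map phase: emit (key, tagged value) for both datasets, then group
    // by key -- this grouping is what the Hadoop shuffle does for free.
    static Map<String, List<Tagged>> shuffle(Map<String, String> a, Map<String, String> b) {
        Map<String, List<Tagged>> grouped = new LinkedHashMap<>();
        a.forEach((k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(new Tagged("A", v)));
        b.forEach((k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(new Tagged("B", v)));
        return grouped;
    }

    // Reduce phase: for each key, separate the values by tag and emit the
    // cross product of the two sides (an inner join).
    static Map<String, List<String>> reduce(Map<String, List<Tagged>> grouped) {
        Map<String, List<String>> joined = new LinkedHashMap<>();
        grouped.forEach((key, tagged) -> {
            List<String> as = new ArrayList<>();
            List<String> bs = new ArrayList<>();
            for (Tagged t : tagged) {
                if (t.source.equals("A")) as.add(t.value); else bs.add(t.value);
            }
            for (String av : as)
                for (String bv : bs)
                    joined.computeIfAbsent(key, x -> new ArrayList<>()).add(av + "|" + bv);
        });
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> a = Map.of("k1", "a1");
        Map<String, String> b = Map.of("k1", "b1");
        System.out.println(reduce(shuffle(a, b))); // prints "{k1=[a1|b1]}"
    }
}
```

In a real job the tag would travel inside the value type (or a container type as Bertrand suggests), and keys present in only one dataset simply produce no joined output, as an inner join should.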
Re: Programming Question / Joining Dataset
Posted by "Oliver B. Fischer" <ma...@swe-blog.net>.
Yes, I know Hive and also Pig. Both are suitable for my problem, but
before starting with one of them I would simply like to know how to do
it with pure MR. ;-)
Bye,
Oliver
On 09/26/2012 03:36 PM, bharath vissapragada wrote:
> Have you seen Hive [1]? It can join datasets over MapReduce. You can
> also provide custom SerDes to read your file format (to avoid
> pre-processing) and create your own data types (e.g. maps of maps,
> arrays, etc.).
>
> [1] https://cwiki.apache.org/Hive/home.html
>
> On Wed, Sep 26, 2012 at 6:49 PM, Oliver B. Fischer
> <mailsink@swe-blog.net <ma...@swe-blog.net>> wrote:
>
> Hi all,
>
> I have to join two large datasets A and B. I preprocess both datasets
> by parsing the source text files and creating custom datatypes ADT
> and BDT out of them.
>
> Now I have to join these data. Both datasets A' and B' already
> have the same datatype as key. But how can I pass both custom
> datatypes ADT and BDT to the same reducer instance for joining?
>
> Bye,
>
> Oliver
>
>
>
>
> --
> Regards,
> Bharath .V
> w: http://researchweb.iiit.ac.in/~bharath.v
Re: Programming Question / Joining Dataset
Posted by bharath vissapragada <bh...@gmail.com>.
Have you seen Hive [1]? It can join datasets over MapReduce. You can also
provide custom SerDes to read your file format (to avoid pre-processing)
and create your own data types (e.g. maps of maps, arrays, etc.).
[1] https://cwiki.apache.org/Hive/home.html
On Wed, Sep 26, 2012 at 6:49 PM, Oliver B. Fischer <ma...@swe-blog.net> wrote:
> Hi all,
>
> I have to join two large datasets A and B. I preprocess both datasets by
> parsing the source text files and creating custom datatypes ADT and BDT
> out of them.
>
> Now I have to join these data. Both datasets A' and B' already have the
> same datatype as key. But how can I pass both custom datatypes ADT and BDT
> to the same reducer instance for joining?
>
> Bye,
>
> Oliver
>
--
Regards,
Bharath .V
w: http://researchweb.iiit.ac.in/~bharath.v
Re: Programming Question / Joining Dataset
Posted by Bejoy Ks <be...@gmail.com>.
Hi Oliver
I have written up a small post on reduce-side joins; the implementation
matches your requirement:
http://kickstarthadoop.blogspot.in/2011/09/joins-with-plain-map-reduce.html
Regards
Bejoy KS
Re: Programming Question / Joining Dataset
Posted by Jay Vyas <ja...@gmail.com>.
1) One really easy but clumsy way is to encode/decode the ADT/BDT as
MapWritable objects when you write them (in your mappers), and then read
and decode them in the reducers.
2) The idiomatic way is to use a serialization framework like
Avro/Thrift/... This will take more work to get going, but in the long
run your code will read like standard Java (i.e. you can use Java POJOs
that are read/written by your serialization framework, rather than the
limited Writable framework, which doesn't really support ADTs).
So to exemplify 1, your Reducer signature might look like this:

  Reducer<Text, MapWritable, Text, IntWritable>

and your reduce method:

  @Override
  protected void reduce(Text key, Iterable<MapWritable> values, Context context)
          throws IOException, InterruptedException {
      Iterator<MapWritable> it = values.iterator();
      ADTA adtA = ADTA.readFromMap(it.next());
      ADTB adtB = ADTB.readFromMap(it.next());
      // Note: the order of values within a key is not guaranteed, so in
      // practice you would check a tag field (or use a secondary sort)
      // instead of relying on the A-side record arriving first.
      // Let's say we're summing the ages of A and B and emitting that as
      // the final value.
      context.write(key, new IntWritable(adtA.getAge() + adtB.getAge()));
  }

  public class ADTA {
      private Integer age;

      public Integer getAge() { return age; }

      public static ADTA readFromMap(MapWritable m) {
          ADTA adt = new ADTA();
          adt.age = Integer.parseInt(m.get(new Text("age")).toString());
          return adt;
      }
  }
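The round trip behind Jay's option 1 (the mapper flattens a custom type into field-name/value entries, the reducer rebuilds the type from them) can be sketched with a plain java.util.Map standing in for Hadoop's MapWritable. The Adt type and its fields below are invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the encode/decode round trip behind the MapWritable approach:
// the mapper flattens a custom type into field-name -> string entries,
// and the reducer reconstructs the type from those entries. A plain Map
// stands in for Hadoop's MapWritable here.
public class MapEncodingSketch {

    // A stand-in for a custom datatype such as ADT.
    static final class Adt {
        final String name;
        final int age;
        Adt(String name, int age) { this.name = name; this.age = age; }
    }

    // Mapper side: flatten the object into a map, one entry per field.
    static Map<String, String> encode(Adt adt) {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("name", adt.name);
        m.put("age", Integer.toString(adt.age));
        return m;
    }

    // Reducer side: rebuild the object from the map entries.
    static Adt decode(Map<String, String> m) {
        return new Adt(m.get("name"), Integer.parseInt(m.get("age")));
    }

    public static void main(String[] args) {
        Adt adt = decode(encode(new Adt("oliver", 40)));
        System.out.println(adt.name + " " + adt.age); // prints "oliver 40"
    }
}
```

This is the clumsiness Jay refers to: every field needs hand-written encode/decode code, which is exactly what Avro or Thrift schemas generate for you.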