Posted to common-user@hadoop.apache.org by Rob Stewart <ro...@googlemail.com> on 2010/01/26 02:43:45 UTC
Join Hadoop Example problem
Hi there, I'm using Hadoop 0.20.1 and I'm trying to use the Join application
within hadoop-*-examples.jar. I can't figure out where I'm going wrong: it
isn't grouping the keys together as I would expect.
------------------------
> bin/hadoop dfs -cat join/a.txt
AAAAAAAA,a0
BBBBBBBB,a1
CCCCCCCC,a2
CCCCCCCC,a3
> bin/hadoop dfs -cat join/b.txt
AAAAAAAA,b0
BBBBBBBB,b1
BBBBBBBB,b2
BBBBBBBB,b3
> bin/hadoop dfs -cat join/c.txt
AAAAAAAA,c0
BBBBBBBB,c1
DDDDDDDD,c2
DDDDDDDD,c3
>
-----*RESULT*-----
> bin/hadoop dfs -text theOutputs/part-00000
AAAAAAAA [a0]
AAAAAAAA [b0]
AAAAAAAA [c0]
BBBBBBBB [c1]
BBBBBBBB [a1]
BBBBBBBB [b1]
BBBBBBBB [b2]
BBBBBBBB [b3]
CCCCCCCC [a2]
CCCCCCCC [a3]
DDDDDDDD [c2]
DDDDDDDD [c3]
-----------------------
So why has it not grouped all the AAAAAAAA's, etc., so that the output
instead looks like this:
AAAAAAAA [a0,b0,c0]
BBBBBBBB [a1,b1,c1]
BBBBBBBB [a1,b2,c1]
BBBBBBBB [a1,b3,c1]
CCCCCCCC [a2,,]
CCCCCCCC [a3,,]
DDDDDDDD [,,c2]
DDDDDDDD [,,c3]
?
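
The grouping asked for here is a full outer join on the key, with one tuple slot per input source and one output row per combination of matching values. The following is a minimal sketch of that semantics in plain Python, not the Hadoop API; the helper names parse and outer_join are made up for illustration, and the sample records are copied from a.txt, b.txt and c.txt above. The example's usage text also lists a -joinOp option (inner, outer, override), so it is worth checking which join type your version applies by default.

```python
from itertools import product

def parse(lines, sep=","):
    """Group 'key,value' lines into {key: [values...]}, keeping input order."""
    table = {}
    for line in lines:
        key, value = line.split(sep, 1)
        table.setdefault(key, []).append(value)
    return table

def outer_join(*tables):
    """Yield (key, slots) rows; a source without the key gets an empty slot."""
    for key in sorted(set().union(*tables)):
        per_source = [t.get(key, [""]) for t in tables]
        # One output row per combination of values across the sources.
        for combo in product(*per_source):
            yield key, list(combo)

a = parse(["AAAAAAAA,a0", "BBBBBBBB,a1", "CCCCCCCC,a2", "CCCCCCCC,a3"])
b = parse(["AAAAAAAA,b0", "BBBBBBBB,b1", "BBBBBBBB,b2", "BBBBBBBB,b3"])
c = parse(["AAAAAAAA,c0", "BBBBBBBB,c1", "DDDDDDDD,c2", "DDDDDDDD,c3"])

rows = [f"{key} [{','.join(slots)}]" for key, slots in outer_join(a, b, c)]
print("\n".join(rows))
```

Run on the three sample files, this prints the eight rows listed above, which only works because a, b and c are passed in as three separate sources.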
---------------------
I have another question. Instead of these key/value pairs, what if I
have two input files, list1.txt and list2.txt, both containing a list
of names, one line per name? I want to JOIN these input files BY the
names in each list. That is, I want to create an output file containing
a list of the names that appear in both input lists. Is it possible
to adapt the Join example packaged with Hadoop to implement this?
Many thanks,
Rob Stewart
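
For the second question, the result being asked for is the intersection of the two name lists, which is what an inner join produces when each name is treated as a key with no value. The following is a rough sketch of that logic in plain Python, not a Hadoop job; the function name names_in_both and the sample names are hypothetical, standing in for the contents of list1.txt and list2.txt.

```python
def names_in_both(list1_lines, list2_lines):
    """Return names present in both inputs, in list1 order, without repeats."""
    second = {line.strip() for line in list2_lines if line.strip()}
    seen = set()
    common = []
    for line in list1_lines:
        name = line.strip()
        if name in second and name not in seen:
            seen.add(name)
            common.append(name)
    return common

# Hypothetical sample data, one name per line as in the question.
list1 = ["alice\n", "bob\n", "carol\n", "alice\n"]
list2 = ["carol\n", "dave\n", "alice\n"]

common = names_in_both(list1, list2)
print("\n".join(common))
```

With real files, the two arguments could be open file objects, since the function only iterates over lines.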
Re: Join Hadoop Example problem
Posted by Alex Kozlov <al...@cloudera.com>.
Hi Rob, when you give the Join example a directory name, it treats all the
files in it as a single table (kind of counterintuitive, but very helpful if
you work with large data sets). Try creating three separate directories:
tablea/a.txt
tableb/b.txt
tablec/c.txt
and run the job as:
> bin/hadoop jar hadoop-*-examples.jar join \
    -D key.value.separator.in.input.line=',' \
    -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
    -outKey org.apache.hadoop.io.Text \
    join tablea tableb tablec theOutputs
Alex K
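
The effect described in this reply can be illustrated with a small sketch in plain Python (not the Hadoop API): when all three files are pooled into one source, there is nothing to join against, so every record comes out as its own one-slot tuple. The function name one_source_rows is made up for illustration, and the ordering of rows within a key is an assumption, since the sketch only sorts by key.

```python
# Sample records copied from a.txt, b.txt and c.txt in the original post.
a = ["AAAAAAAA,a0", "BBBBBBBB,a1", "CCCCCCCC,a2", "CCCCCCCC,a3"]
b = ["AAAAAAAA,b0", "BBBBBBBB,b1", "BBBBBBBB,b2", "BBBBBBBB,b3"]
c = ["AAAAAAAA,c0", "BBBBBBBB,c1", "DDDDDDDD,c2", "DDDDDDDD,c3"]

def one_source_rows(*files, sep=","):
    """Pool every file into a single source; each record stays a 1-slot tuple."""
    pooled = [line.split(sep, 1) for f in files for line in f]
    pooled.sort(key=lambda kv: kv[0])  # stable sort by key only
    return [f"{key} [{value}]" for key, value in pooled]

rows = one_source_rows(a, b, c)
print("\n".join(rows))
```

This yields the same twelve single-value rows as the part-00000 output in the original post (possibly in a different order within each key), which is consistent with the join/ directory having been read as one table.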
Re: Join Hadoop Example problem
Posted by Rob Stewart <ro...@googlemail.com>.
Good point, I missed that. It is:
> bin/hadoop jar hadoop-*-examples.jar join \
    -D key.value.separator.in.input.line=',' \
    -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
    -outKey org.apache.hadoop.io.Text \
    join/ theOutputs
Rob
Re: Join Hadoop Example problem
Posted by abhishek sharma <ab...@usc.edu>.
What is the exact command that you are giving when submitting the
jobs? I did not see it in your e-mail.
Abhishek