
Posted to common-user@hadoop.apache.org by Rob Stewart <ro...@googlemail.com> on 2010/01/26 02:43:45 UTC

Join Hadoop Example problem

Hi there, I'm using Hadoop 0.20.1 and I'm trying to use the Join application
in hadoop-*-examples.jar. I can't figure out where I'm going wrong: it isn't
grouping the keys together as I would expect.
------------------------
> bin/hadoop dfs -cat join/a.txt
AAAAAAAA,a0
BBBBBBBB,a1
CCCCCCCC,a2
CCCCCCCC,a3

> bin/hadoop dfs -cat join/b.txt
AAAAAAAA,b0
BBBBBBBB,b1
BBBBBBBB,b2
BBBBBBBB,b3

> bin/hadoop dfs -cat join/c.txt
AAAAAAAA,c0
BBBBBBBB,c1
DDDDDDDD,c2
DDDDDDDD,c3

>

-----*RESULT*-----
>bin/hadoop dfs -text theOutputs/part-00000
AAAAAAAA        [a0]
AAAAAAAA        [b0]
AAAAAAAA        [c0]
BBBBBBBB        [c1]
BBBBBBBB        [a1]
BBBBBBBB        [b1]
BBBBBBBB        [b2]
BBBBBBBB        [b3]
CCCCCCCC        [a2]
CCCCCCCC        [a3]
DDDDDDDD        [c2]
DDDDDDDD        [c3]
-----------------------


So why has it not grouped all the AAAAAAAA's etc., so that the output instead
looks like this:

AAAAAAAA        [a0,b0,c0]
BBBBBBBB        [a1,b1,c1]
BBBBBBBB        [a1,b2,c1]
BBBBBBBB        [a1,b3,c1]
CCCCCCCC        [a2,,]
CCCCCCCC        [a3,,]
DDDDDDDD        [,,c2]
DDDDDDDD        [,,c3]

?

---------------------

I have another question. Instead of these Key/Value pairs, what if I
have two input files list1.txt and list2.txt, both containing a list
of names, one line per name. I want to JOIN these input files BY the
names in each list. i.e. I want to create an output file containing a
list of the names that appear in both the input lists. Is it possible
to adapt the Join example packaged with Hadoop to implement this?
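
For what it's worth, here is a minimal sketch of the result being asked for in
this second question. It is not the Hadoop Join example's code, just plain
Python showing what an inner join on the whole line (name as key, no value)
would compute. It assumes both lists are already sorted, which as far as I know
is also what Hadoop's map-side join expects of its inputs. The helper name
names_in_both and the sample names are made up for illustration.

```python
def names_in_both(list1, list2):
    """Return the names present in both sorted lists, using one merge pass."""
    i = j = 0
    common = []
    while i < len(list1) and j < len(list2):
        if list1[i] == list2[j]:
            # Matching name: emit it once, even if it is repeated in the inputs.
            if not common or common[-1] != list1[i]:
                common.append(list1[i])
            i += 1
            j += 1
        elif list1[i] < list2[j]:
            i += 1  # name only in list1 so far, skip it
        else:
            j += 1  # name only in list2 so far, skip it
    return common


# Hypothetical contents of list1.txt and list2.txt, one name per line, sorted.
list1 = ["alice", "bob", "carol"]
list2 = ["bob", "carol", "dave"]
print(names_in_both(list1, list2))
```

The single forward pass over both lists only works because they are sorted. If
the lists are unsorted, sorting them first (or intersecting two sets) gives the
same answer.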


Many thanks,

Rob Stewart

Re: Join Hadoop Example problem

Posted by Alex Kozlov <al...@cloudera.com>.

Hi Rob, when you give the join example a directory name, it treats all the
files in that directory as a single table (kind of counterintuitive, but very
helpful if you work with large data sets).  Try creating 3 separate directories:

tablea/a.txt
tableb/b.txt
tablec/c.txt

and run the job as:

> bin/hadoop jar hadoop-*-examples.jar join \
    -D key.value.separator.in.input.line=',' \
    -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
    -outKey org.apache.hadoop.io.Text \
    tablea tableb tablec theOutputs
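
To illustrate why the directory layout matters, here is a rough sketch in plain
Python. It is not how Hadoop implements the join, only a model of the
semantics, and the function name outer_join is made up. Each input path is
modelled as one table of (key, value) pairs. Joining three tables yields
three-slot tuples, while putting all three files in one directory amounts to
joining a single table with nothing else.

```python
from itertools import product


def outer_join(tables):
    """Full outer join of several tables, each a list of (key, value) pairs."""
    keys = sorted({k for t in tables for k, _ in t})
    rows = []
    for key in keys:
        # Values for this key in each table; [""] stands for "no match here".
        per_table = [[v for k, v in t if k == key] or [""] for t in tables]
        # Cross product across tables, one output tuple per combination.
        for combo in product(*per_table):
            rows.append((key, "[" + ",".join(combo) + "]"))
    return rows


a = [("AAAAAAAA", "a0"), ("BBBBBBBB", "a1"), ("CCCCCCCC", "a2"), ("CCCCCCCC", "a3")]
b = [("AAAAAAAA", "b0"), ("BBBBBBBB", "b1"), ("BBBBBBBB", "b2"), ("BBBBBBBB", "b3")]
c = [("AAAAAAAA", "c0"), ("BBBBBBBB", "c1"), ("DDDDDDDD", "c2"), ("DDDDDDDD", "c3")]

print("three tables (tablea, tableb, tablec):")
for key, tup in outer_join([a, b, c]):
    print(key, tup)

print("one table (everything under join/):")
for key, tup in outer_join([a + b + c]):
    print(key, tup)
```

With three tables this gives the eight rows Rob expected. With one combined
table it gives twelve single-element rows like those in his actual output (row
order aside). One caveat that I am fairly sure of but have not re-checked
against 0.20.1: the example defaults to an inner join, so the rows with empty
slots (CCCCCCCC, DDDDDDDD) would only appear with `-joinOp outer` added to the
command line.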

Alex K
On Mon, Jan 25, 2010 at 6:25 PM, Rob Stewart <ro...@googlemail.com> wrote:

> Good point, I missed that. It is:
>
> > bin/hadoop jar hadoop-*-examples.jar join -D
> key.value.separator.in.input.line=',' -inFormat
> org.apache.hadoop.mapred.KeyValueTextInputFormat  -outKey
> org.apache.hadoop.io.Text  join/  theOutputs
>
> Rob

Re: Join Hadoop Example problem

Posted by Rob Stewart <ro...@googlemail.com>.

Good point, I missed that. It is:

> bin/hadoop jar hadoop-*-examples.jar join \
    -D key.value.separator.in.input.line=',' \
    -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
    -outKey org.apache.hadoop.io.Text \
    join/ theOutputs



Rob


2010/1/26 abhishek sharma <ab...@usc.edu>

> What is the exact command that you are giving when submitting the
> jobs? I did not see it in your e-mail.
>
> Abhishek

Re: Join Hadoop Example problem

Posted by abhishek sharma <ab...@usc.edu>.

What is the exact command that you are giving when submitting the
jobs? I did not see it in your e-mail.

Abhishek
