Posted to hdfs-user@hadoop.apache.org by Rex X <dn...@gmail.com> on 2016/01/22 18:50:43 UTC

What's the best way to do an outer join and an inner join of two SequentialTextFiles using Hadoop streaming and Python?

The two SequentialTextFiles correspond to two Hive tables, say tableA and
tableB, stored at

    hdfs://hive/tableA/YYYY/MM/DD/*/part-00000
and
    hdfs://hive/tableB/YYYY/MM/DD/*/part-00000

Both of them are partitioned by date, for example,

    hdfs://hive/tableA/2016/01/01/*/part-00000

Now we want to do a left outer join on tableA.id=tableB.id, for a date
range, for example, from 2015/12/01 to 2016/01/09.

Within Hive it is pretty easy:

    select * from tableA a left outer join tableB b
    on a.id=b.id
    where a.dt between '20151201' and '20160109'
    and b.dt between '20151201' and '20160109';


What's the best way to do an outer join and an inner join of these two
SequentialTextFiles using Hadoop streaming and Python?

Any comments will be appreciated!

Re: What's the best way to do an outer join and an inner join of two SequentialTextFiles using Hadoop streaming and Python?

Posted by Rex X <dn...@gmail.com>.

I Googled, but did not find any sample code.


On Fri, Jan 22, 2016 at 9:50 AM, Rex X <dn...@gmail.com> wrote:

> The two SequentialTextFiles correspond to two Hive tables, say tableA and
> tableB below on
>
>     hdfs://hive/tableA/YYYY/MM/DD/*/part-00000
> and
>     hdfs://hive/tableB/YYYY/MM/DD/*/part-00000
>
> Both of them are partitioned by date, for example,
>
>     hdfs://hive/tableA/2016/01/01/*/part-00000
>
> Now we want to do a left outer join on tableA.id=tableB.id, for a date
> range, for example, from 2015/12/01 to 2016/01/09.
>
> Within Hive it is pretty easy
>
>     select * from tableA a left outer join tableB b
>     on a.id=b.id
>     where a.dt is between '20151201' and '20160109'
>     and b.dt is between '20151201' and '20160109';
>
>
> What's the best way to do Outer join and Inner join of these two
> SequentialTextFiles using Hadoop streaming and Python ?
>
> Any comments will be appreciated!
>
>
>
>
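
For reference, one common way to do this kind of join with Hadoop streaming is a reduce-side join: the mapper tags each row with the table it came from, and the reducer combines rows that share an id. The following is only a minimal sketch under several assumptions that the thread does not confirm: rows are tab-delimited text with the id in the first field (Hive's default field delimiter is Ctrl-A, so in_sep may need changing), the table can be recognised from "/tableA/" in the input path, and unmatched right-hand rows are written as the string NULL. The names tag_for_path, map_line and reduce_lines are hypothetical helpers, not part of any library.

```python
import os
import sys
from itertools import groupby


def tag_for_path(path):
    """Return 'A' or 'B' depending on which table the input file belongs to.

    Assumption: the table name appears as a path component, as in the thread.
    """
    return "A" if "/tableA/" in path else "B"


def map_line(line, tag, in_sep="\t"):
    """Turn one input row into 'id<TAB>tag<TAB>rest' (id assumed to be field 1)."""
    key, _, rest = line.rstrip("\n").partition(in_sep)
    return "\t".join((key, tag, rest))


def reduce_lines(lines, how="left"):
    """Join mapper output that is already sorted by id.

    Rows for one id are buffered in memory, because by default the shuffle
    sorts on the key only and does not order the A/B tags within a key.
    how="left" keeps unmatched tableA rows; how="inner" drops them.
    """
    parsed = (l.rstrip("\n").split("\t", 2) for l in lines)
    for key, group in groupby(parsed, key=lambda r: r[0]):
        a_rows, b_rows = [], []
        for _, tag, rest in group:
            (a_rows if tag == "A" else b_rows).append(rest)
        if not b_rows and how == "left":
            b_rows = ["NULL"]  # placeholder for the missing tableB side
        for a in a_rows:
            for b in b_rows:
                yield "\t".join((key, a, b))


if __name__ == "__main__" and len(sys.argv) > 1:
    if sys.argv[1] == "map":
        # Streaming exposes job settings as environment variables with dots
        # replaced by underscores; the name differs between Hadoop versions.
        path = os.environ.get("mapreduce_map_input_file",
                              os.environ.get("map_input_file", ""))
        tag = tag_for_path(path)
        for line in sys.stdin:
            print(map_line(line, tag))
    elif sys.argv[1] == "reduce":
        how = sys.argv[2] if len(sys.argv) > 2 else "left"
        for row in reduce_lines(sys.stdin, how):
            print(row)
```

In this sketch the same script would be passed to the streaming job as the mapper (with the argument map) and as the reducer (with reduce left or reduce inner). The date range is not handled in the code: it would be selected by which partition directories are given as -input paths. Buffering both sides per id keeps the code short but assumes that no single id has more rows than fit in memory.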
