Posted to hdfs-user@hadoop.apache.org by Rex X <dn...@gmail.com> on 2016/01/22 18:50:43 UTC
What's the best way to do an outer join and an inner join of two
SequentialTextFiles using Hadoop streaming and Python?
The two SequentialTextFiles correspond to two Hive tables, say tableA and
tableB, stored at
hdfs://hive/tableA/YYYY/MM/DD/*/part-00000
and
hdfs://hive/tableB/YYYY/MM/DD/*/part-00000
Both of them are partitioned by date, for example,
hdfs://hive/tableA/2016/01/01/*/part-00000
Now we want to do a left outer join on tableA.id=tableB.id, for a date
range, for example, from 2015/12/01 to 2016/01/09.
Within Hive it is pretty easy:
select * from tableA a left outer join tableB b
on a.id=b.id
where a.dt between '20151201' and '20160109'
and b.dt between '20151201' and '20160109';
What's the best way to do an outer join and an inner join of these two
SequentialTextFiles using Hadoop streaming and Python?
Any comments would be appreciated!
Re: What's the best way to do Outer join and Inner join of two
SequentialTextFiles using Hadoop streaming and Python ?
Posted by Rex X <dn...@gmail.com>.
I Googled this, but did not find any sample code.
On Fri, Jan 22, 2016 at 9:50 AM, Rex X <dn...@gmail.com> wrote:
> The two SequentialTextFiles correspond to two Hive tables, say tableA and
> tableB below on
>
> hdfs://hive/tableA/YYYY/MM/DD/*/part-00000
> and
> hdfs://hive/tableB/YYYY/MM/DD/*/part-00000
>
> Both of them are partitioned by date, for example,
>
> hdfs://hive/tableA/2016/01/01/*/part-00000
>
> Now we want to do a left outer join on tableA.id=tableB.id, for a date
> range, for example, from 2015/12/01 to 2016/01/09.
>
> Within Hive it is pretty easy
>
> select * from tableA a left outer join tableB b
> on a.id=b.id
> where a.dt is between '20151201' and '20160109'
> and b.dt is between '20151201' and '20160109';
>
>
> What's the best way to do Outer join and Inner join of these two
> SequentialTextFiles using Hadoop streaming and Python ?
>
> Any comments will be appreciated!
>
>
>
>
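For reference, one common way to approach this with Hadoop streaming is a reduce-side join: the mapper tags each record with the table it came from, and the reducer combines the records that share an id. The following is only a minimal sketch under several assumptions that are not stated in the thread: the files are tab-delimited text with id as the first column (Hive's default text delimiter is actually "\x01"), the source table can be recognized from "/tableA/" in the input path, and unmatched rows are padded with the literal string NULL. The function names map_lines and reduce_lines are hypothetical.

```python
import os
import sys
from itertools import groupby

SEP = "\t"  # assumed field delimiter; Hive's default text delimiter is "\x01"


def map_lines(lines, input_path):
    """Tag each record with its source table and emit 'id<SEP>tag<SEP>rest'."""
    # Assumption: the table name is recognizable from the input path.
    tag = "A" if "/tableA/" in input_path else "B"
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key, _, rest = line.partition(SEP)  # assumes id is the first column
        yield SEP.join((key, tag, rest))


def reduce_lines(lines, how="left"):
    """Join tagged records sharing an id; input must arrive sorted by id."""
    parsed = (line.rstrip("\n").split(SEP, 2) for line in lines)
    for key, group in groupby(parsed, key=lambda rec: rec[0]):
        rows_a, rows_b = [], []
        # Buffers all rows of one id in memory, so very hot keys may be a problem.
        for _, tag, rest in group:
            (rows_a if tag == "A" else rows_b).append(rest)
        if not rows_b and how == "left":
            rows_b = ["NULL"]  # placeholder so unmatched tableA rows are kept
        for a in rows_a:
            for b in rows_b:
                yield SEP.join((key, a, b))


# Runs only when invoked as "script.py map" or "script.py reduce [left|inner]".
if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] in ("map", "reduce"):
    if sys.argv[1] == "map":
        # Streaming exposes job settings as environment variables with dots
        # replaced by underscores; the older name is kept as a fallback.
        path = (os.environ.get("mapreduce_map_input_file")
                or os.environ.get("map_input_file", ""))
        out = map_lines(sys.stdin, path)
    else:
        out = reduce_lines(sys.stdin, how=sys.argv[2] if len(sys.argv) > 2 else "left")
    for record in out:
        print(record)
```

Under these assumptions the same script would serve as both the streaming mapper and the reducer, and the date range would be selected by which partition directories of tableA and tableB are passed as job input. Streaming uses the text before the first tab as the key by default, so all records with the same id reach the same reducer. Passing "inner" instead of "left" drops tableA rows that have no match in tableB.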