Posted to user@spark.apache.org by Chhaya Vishwakarma <Ch...@lntinfotech.com> on 2014/03/19 09:57:18 UTC

Joining two HDFS files in Spark

Hi

I want to join two files from HDFS using the Spark shell.
Both files are tab-separated, and I want to join them on the second column.

I tried the code below, but it is not giving any output.

val ny_daily = sc.parallelize(List("hdfs://localhost:8020/user/user/NYstock/NYSE_daily"))

val ny_daily_split = ny_daily.map(line => line.split('\t'))

val enKeyValuePair = ny_daily_split.map(line => (line(0).substring(0, 5), line(3).toInt))


val ny_dividend = sc.parallelize(List("hdfs://localhost:8020/user/user/NYstock/NYSE_dividends"))

val ny_dividend_split = ny_dividend.map(line => line.split('\t'))

val enKeyValuePair1 = ny_dividend_split.map(line => (line(0).substring(0, 4), line(3).toInt))

enKeyValuePair1.join(enKeyValuePair)


But I am not getting any information on how to join the files on a particular column.
Please suggest.



Regards,
Chhaya Vishwakarma



Re: Joining two HDFS files in Spark

Posted by Shixiong Zhu <zs...@gmail.com>.
Do you want to read the file content in the following statement?

val ny_daily = sc.parallelize(List("hdfs://localhost:8020/user/user/NYstock/NYSE_daily"))

If so, you should use "textFile", e.g.,

val ny_daily = sc.textFile("hdfs://localhost:8020/user/user/NYstock/NYSE_daily")

"parallelize" is used to create a RDD from a collection.


Best Regards,
Shixiong Zhu



Re: Joining two HDFS files in Spark

Posted by Yana Kadiyska <ya...@gmail.com>.
Not sure what you mean by "not getting any information on how to join". If you mean that you can't see the result, I believe you need to collect the result of the join on the driver, as in:

val joinedRdd = enKeyValuePair1.join(enKeyValuePair)
joinedRdd.collect().foreach(println)
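
Note that collect() pulls the entire joined RDD back to the driver, which can be expensive for large files. A lighter-weight spot-check (a rough sketch along the same lines) is to sample a few records and count the rest:

// Print a handful of joined records and the total match count.
joinedRdd.take(10).foreach(println)
println("joined record count: " + joinedRdd.count())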


