Posted to user@spark.apache.org by Shushant Arora <sh...@gmail.com> on 2015/07/01 16:19:59 UTC

custom RDD in java

Hi

Is it possible to write a custom RDD in Java?

The requirement is: I have a list of SQL Server tables that need to be dumped
into HDFS.

So I have a
List<String> tables = Arrays.asList("dbname.tablename", "dbname.tablename2", ...);

then
JavaRDD<String> rdd = javasparkcontext.parallelize(tables);

JavaRDD<Iterable<String>> tablecontent = rdd.map(
    new Function<String, Iterable<String>>() { /* fetch the table and return its rows as an Iterable */ });

tablecontent.saveAsTextFile("hdfs path");


Inside rdd.map(new Function<String, Iterable<String>>) I cannot keep the
complete table content in memory, so I want to create my own RDD to handle it.

Thanks
Shushant
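
The "return the rows as an Iterable without materializing the table" part can be sketched in plain Java, with no Spark required: wrap the row source in an iterator that fetches rows on demand. The RowSource interface below is a hypothetical stand-in for a JDBC ResultSet, not an actual Spark or JDBC API.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;

public class LazyRows {
    // Hypothetical stand-in for a JDBC ResultSet: yields rows one at a time.
    interface RowSource {
        String nextRow(); // null when exhausted
    }

    // Wrap a RowSource in an Iterable; rows are fetched on demand, so the
    // whole table is never held in memory at once.
    static Iterable<String> rows(final RowSource src) {
        return new Iterable<String>() {
            public Iterator<String> iterator() {
                return new Iterator<String>() {
                    private String next = src.nextRow();
                    public boolean hasNext() { return next != null; }
                    public String next() {
                        String cur = next;
                        next = src.nextRow(); // pull the following row lazily
                        return cur;
                    }
                    public void remove() { throw new UnsupportedOperationException(); }
                };
            }
        };
    }

    public static void main(String[] args) {
        // Fake source backed by a small queue, just to show the wiring.
        final Deque<String> fake = new ArrayDeque<>(Arrays.asList("row1", "row2", "row3"));
        for (String r : rows(new RowSource() {
            public String nextRow() { return fake.pollFirst(); }
        })) {
            System.out.println(r);
        }
    }
}
```

An Iterable built this way could be returned from the map function above, so each row leaves the heap as soon as the consumer moves past it.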

Re: custom RDD in java

Posted by Feynman Liang <fl...@databricks.com>.
AFAIK RDDs can only be created on the driver, not the executors. Also,
`saveAsTextFile(...)` is an action and hence can also only be executed on
the driver.

As Silvio already mentioned, Sqoop may be a good option.


Re: custom RDD in java

Posted by Shushant Arora <sh...@gmail.com>.
The list of tables is not large; the RDD is created over the table list to
parallelize the work of fetching tables in multiple mappers at the same time.
Since the time taken to fetch a table is significant, I can't run that
sequentially.


The content of a table fetched by a map job is large, so one option is to dump
the content to HDFS using the filesystem API from inside the map function for
every few rows fetched.

I cannot keep the complete table in memory and then dump it to HDFS using the
below map function:

JavaRDD<Iterable<String>> tablecontent = tablelistrdd.map(
    new Function<String, Iterable<String>>() {
        public Iterable<String> call(String tablename) {
            // make a JDBC connection, fetch the table data,
            // populate a list, and return it
        }
    });
tablecontent.saveAsTextFile("hdfspath");

Here I wanted to create a custom RDD whose partitions would live in memory on
multiple executors, each holding part of the table data, and I would then have
called saveAsTextFile on that custom RDD directly to save to HDFS.
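
The "dump every few rows" approach can be sketched without Spark: buffer a small batch, write it out, flush, and clear, so at most batchSize rows ever sit on the heap. This is a plain-Java sketch, not Spark API; in the real job the Iterator would wrap a JDBC ResultSet and the Writer an HDFS output stream.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ChunkedDump {
    // Drain `rows` into `out` in batches of `batchSize`, flushing after each
    // batch so only a bounded number of rows is buffered at any moment.
    // Returns the number of rows written.
    static int dump(Iterator<String> rows, Writer out, int batchSize) throws IOException {
        List<String> batch = new ArrayList<>(batchSize);
        int written = 0;
        while (rows.hasNext()) {
            batch.add(rows.next());
            if (batch.size() == batchSize) {
                for (String r : batch) out.write(r + "\n");
                out.flush(); // rows leave the buffer after each flush
                written += batch.size();
                batch.clear();
            }
        }
        for (String r : batch) out.write(r + "\n"); // final partial batch
        out.flush();
        written += batch.size();
        return written;
    }

    public static void main(String[] args) throws IOException {
        StringWriter sink = new StringWriter();
        int n = dump(Arrays.asList("r1", "r2", "r3", "r4", "r5").iterator(), sink, 2);
        System.out.println(n + " rows written"); // 5 rows written
    }
}
```

Called from inside a map (or better, mapPartitions) function, this pattern sidesteps the need for a custom RDD: the executor streams rows straight to storage instead of returning them.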




Re: custom RDD in java

Posted by Feynman Liang <fl...@databricks.com>.
On Wed, Jul 1, 2015 at 7:19 AM, Shushant Arora <sh...@gmail.com>
 wrote:

> JavaRDD<String> rdd = javasparkcontext.parallelize(tables);


You are already creating an RDD in Java here ;)

However, it's not clear to me why you'd want to make this an RDD. Is the
list of tables so large that it doesn't fit on a single machine? If not,
you may be better off spinning up one Spark job for dumping each table in
tables using a JDBC datasource:
https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases


Re: custom RDD in java

Posted by Silvio Fiorito <si...@granturing.com>.
Sure, you can create custom RDDs. Haven’t done so in Java, but in Scala absolutely.



Re: custom RDD in java

Posted by Shushant Arora <sh...@gmail.com>.
OK, will evaluate these options, but is it possible to create an RDD in Java?



Re: custom RDD in java

Posted by Silvio Fiorito <si...@granturing.com>.
If all you’re doing is just dumping tables from SQLServer to HDFS, have you looked at Sqoop?

Otherwise, if you need to run this in Spark could you just use the existing JdbcRDD?
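
For context, JdbcRDD parallelizes a single table by splitting a numeric key range (lowerBound, upperBound, numPartitions) into per-partition WHERE-clause bounds. A rough sketch of that arithmetic in plain Java; this is my paraphrase of the idea, not Spark's source verbatim:

```java
public class RangeSplit {
    // Split the inclusive key range [lower, upper] into numPartitions
    // contiguous inclusive sub-ranges, the way a JdbcRDD-style source turns
    // lowerBound/upperBound/numPartitions into one bounded query per partition.
    static long[][] splitRange(long lower, long upper, int numPartitions) {
        long length = upper - lower + 1;
        long[][] bounds = new long[numPartitions][2];
        for (int i = 0; i < numPartitions; i++) {
            bounds[i][0] = lower + (i * length) / numPartitions;       // start
            bounds[i][1] = lower + ((i + 1) * length) / numPartitions - 1; // end
        }
        return bounds;
    }

    public static void main(String[] args) {
        // e.g. primary keys 1..100 over 4 partitions -> 1..25 26..50 51..75 76..100
        for (long[] b : splitRange(1, 100, 4)) {
            System.out.println(b[0] + ".." + b[1]);
        }
    }
}
```

Each sub-range then becomes a "? <= key AND key <= ?" query run by one executor, which is how the fetch of a large table gets spread across the cluster without any row set being held whole in memory.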

