Posted to user@spark.apache.org by Ninad Shringarpure <ni...@cloudera.com> on 2016/10/18 02:24:47 UTC

Fwd: jdbcRDD for data ingestion from RDBMS

Hi Team,

One of my client teams is trying to see if they can use Spark instead of
Sqoop to source data from an RDBMS. The data would be substantial, on the
order of billions of records.

From reading the documentation, I am not sure whether jdbcRDD is designed
to scale well to this volume of data. In addition, some built-in features
of Sqoop, such as --direct, might give better performance than a straight
JDBC read.

My primary question to this group is whether it is advisable to use
jdbcRDD for data sourcing, and whether we can expect it to scale. Also,
how would it compare to Sqoop performance-wise?

Please let me know your thoughts and any pointers if anyone in the group
has already implemented it.

Thanks,
Ninad

Re: jdbcRDD for data ingestion from RDBMS

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi,

If we are talking about billions of records, then it depends on your
network and RDBMS. In my experience Spark over JDBC works OK for dimension
tables of moderate size: you can open parallel connections to the RDBMS
(assuming the table has a primary key or another unique column to
partition on) to parallelise the read and pull the data "as is" into
Spark.
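
As a rough sketch, this is what such a partitioned read looks like with
the Spark 2.x DataFrame JDBC source (the URL, table name, credentials and
partition column below are made up for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-ingest").getOrCreate()

// Partitioned JDBC read: Spark opens numPartitions parallel connections,
// each scanning one range of the partition column (ideally an indexed,
// roughly uniformly distributed key).
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/warehouse")  // hypothetical
  .option("dbtable", "fact_orders")                          // hypothetical
  .option("user", "etl")
  .option("password", sys.env("DB_PASSWORD"))
  .option("partitionColumn", "order_id")  // primary key / unique numeric column
  .option("lowerBound", "1")
  .option("upperBound", "2000000000")
  .option("numPartitions", "32")          // number of parallel connections
  .load()

df.write.mode("overwrite").parquet("hdfs:///data/fact_orders")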

However, the other alternative is to get the data into HDFS using Sqoop,
or even Spark itself.

The third option is to use a bulk copy utility to export the RDBMS table
into a directory of flat files (CSV or similar), scp them to an HDFS host,
put them into HDFS, and then access the data through Hive external tables,
etc.
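
Once the files have landed in HDFS, a minimal sketch of exposing them as a
Hive external table from Spark (paths, table name and schema are
hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("csv-external")
  .enableHiveSupport()  // so CREATE EXTERNAL TABLE goes to the Hive metastore
  .getOrCreate()

// External table: dropping it later removes only the metadata;
// the files under /landing/orders stay in place.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS staging_orders (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DOUBLE
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION 'hdfs:///landing/orders'
""")

spark.sql("SELECT COUNT(*) FROM staging_orders").show()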

A real-time load of data using Spark JDBC makes sense if the RDBMS table
itself is fairly small; most dimension tables should satisfy this. This
approach is not advisable for FACT tables.

HTH



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




Re: jdbcRDD for data ingestion from RDBMS

Posted by Teng Qiu <te...@gmail.com>.
Hi Ninad, I believe the purpose of jdbcRDD is to use an RDBMS as an
additional data source during data processing; the main goal of Spark is
still analyzing data on an HDFS-like file system.
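
For reference, a minimal sketch of that usage with the low-level
org.apache.spark.rdd.JdbcRDD API (connection details, query and bounds are
made up; note the query must contain exactly two '?' placeholders, which
Spark fills with each partition's lower and upper bound):

import java.sql.DriverManager
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.JdbcRDD

val sc = new SparkContext(new SparkConf().setAppName("jdbcrdd-lookup"))

// 8 partitions => 8 parallel connections, each scanning its own id range.
// The JDBC driver jar must be on the executor classpath.
val rows = new JdbcRDD(
  sc,
  () => DriverManager.getConnection(
    "jdbc:postgresql://dbhost:5432/warehouse", "etl", sys.env("DB_PASSWORD")),
  "SELECT id, name FROM customers WHERE id >= ? AND id <= ?",
  1L,        // lowerBound
  1000000L,  // upperBound
  8,         // numPartitions
  rs => (rs.getLong(1), rs.getString(2))
)

rows.take(5).foreach(println)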

Using Spark as a data integration tool to transfer billions of records
from an RDBMS to HDFS etc. could work, but it may not be the best tool for
the job. Sqoop with --direct sounds better, though it has some
configuration cost; Sqoop is the tool to use for regular data integration
tasks.

I am not sure whether your client needs to transfer billions of records
periodically. If it is only an initial load, then for such a one-off task
a bash script with the database's COPY command may be easier and faster :)

Best,

Teng

