Posted to user@spark.apache.org by Tapan Upadhyay <ta...@gmail.com> on 2016/05/04 03:29:35 UTC

migration from Teradata to Spark SQL

Hi,

We are planning to move our ad-hoc queries from Teradata to Spark. We have
a huge volume of queries during the day. What is the best way to go about it?

1) Read data directly from the Teradata DB using Spark JDBC.

2) Import data via EOD Sqoop jobs into Hive tables stored as Parquet, and
then run queries on the Hive tables using Spark SQL or the Spark HiveContext.

Are there any other ways we could do this better/more efficiently?
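
Roughly, the two options would look something like this in Spark (Scala).
This is only a sketch: the host, credentials, and table names are made up,
and option 1 assumes the Teradata JDBC driver jar is on the Spark classpath.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object AdhocQuerySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("teradata-adhoc"))
    val sqlContext = new HiveContext(sc)

    // Option 1: read a table directly from Teradata over JDBC
    val directDf = sqlContext.read.format("jdbc")
      .option("url", "jdbc:teradata://td-host/DATABASE=sales")
      .option("driver", "com.teradata.jdbc.TeraDriver")
      .option("dbtable", "daily_orders")
      .option("user", "td_user")
      .option("password", "td_password")
      .load()
    directDf.registerTempTable("daily_orders_td")
    sqlContext.sql("SELECT COUNT(*) FROM daily_orders_td").show()

    // Option 2: query a Hive table that an EOD Sqoop job loaded as Parquet
    sqlContext.sql(
      "SELECT region, COUNT(*) FROM warehouse.daily_orders GROUP BY region"
    ).show()
  }
}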

Please guide.

Regards,
Tapan

Re: migration from Teradata to Spark SQL

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi,

How are you going to sync your data following the migration?

Spark SQL is a tool for querying data. It is not a database per se, like
Hive or anything else.

I am doing the same thing, migrating Sybase IQ to Hive.

Sqoop can do the initial ELT (read: ELT, not ETL). In other words, use Sqoop
to get the data as-is from Teradata into Hive tables, and then use Hive for
further cleansing etc.

It all depends on how you want to approach this, how many tables are
involved, and your schema. For example, are we talking about FACT tables
only? You could easily keep your DIMENSION tables in Teradata and use Spark
SQL to load data from both Teradata and Hive.
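
As a rough sketch of that mixed pattern (table and host names are made up;
it assumes a HiveContext named sqlContext and the Teradata JDBC driver on
the classpath):

// Pull the DIMENSION table, still in Teradata, over JDBC
val dimCustomer = sqlContext.read.format("jdbc")
  .option("url", "jdbc:teradata://td-host/DATABASE=dw")
  .option("driver", "com.teradata.jdbc.TeraDriver")
  .option("dbtable", "dim_customer")
  .option("user", "td_user")
  .option("password", "td_password")
  .load()
dimCustomer.registerTempTable("dim_customer")

// Join it against a FACT table that already lives in Hive
val result = sqlContext.sql(
  """SELECT c.customer_name, SUM(f.amount) AS total_amount
    |FROM warehouse.fact_sales f
    |JOIN dim_customer c ON f.customer_id = c.customer_id
    |GROUP BY c.customer_name""".stripMargin)
result.show()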

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 4 May 2016 at 02:29, Tapan Upadhyay <ta...@gmail.com> wrote:

> Hi,
>
> We are planning to move our ad-hoc queries from Teradata to Spark. We have
> a huge volume of queries during the day. What is the best way to go about it?
>
> 1) Read data directly from the Teradata DB using Spark JDBC.
>
> 2) Import data via EOD Sqoop jobs into Hive tables stored as Parquet, and
> then run queries on the Hive tables using Spark SQL or the Spark HiveContext.
>
> Are there any other ways we could do this better/more efficiently?
>
> Please guide.
>
> Regards,
> Tapan
>
>

Re: migration from Teradata to Spark SQL

Posted by Lohith Samaga M <Lo...@mphasis.com>.
Hi
Could you look at Apache Drill as a SQL engine on Hive?

Lohith



---- Tapan Upadhyay wrote ----

Thank you everyone for the guidance.

Jorn, our motivation is to move the bulk of our ad-hoc queries to Hadoop so that we have enough bandwidth on our DB for the important batch jobs/queries.

For implementing a lambda architecture, is it possible to get real-time updates from Teradata for any insert/update/delete, e.g. from the DB logs?

Deepak, should we query the data from Cassandra using Spark? How would it differ in terms of performance if we store our data in Hive tables (Parquet) and query them using Spark? If there is not much performance gain, why add one more layer of processing?

Mich, we plan to sync the data using hourly/EOD Sqoop jobs; we have not yet decided how frequently we would need to do that. It will be based on user requirements. If they need real-time data, we will need to think of an alternative. How are you doing the same for Sybase? How do you sync in real time?

Thank you!!


Regards,
Tapan Upadhyay
+1 973 652 8757

On Wed, May 4, 2016 at 4:33 AM, Alonso Isidoro Roman <al...@gmail.com> wrote:
I agree with Deepak, and I would try saving the data in both Parquet and Avro format. If you can, measure the performance and choose the best; it will probably be Parquet, but you should verify it for yourself.

Alonso Isidoro Roman.

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming must be the process of putting them in..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"


2016-05-04 9:22 GMT+02:00 Jörn Franke <jo...@gmail.com>:
Look at the lambda architecture.

What is the motivation of your migration?

On 04 May 2016, at 03:29, Tapan Upadhyay <ta...@gmail.com> wrote:

Hi,

We are planning to move our ad-hoc queries from Teradata to Spark. We have a huge volume of queries during the day. What is the best way to go about it?

1) Read data directly from the Teradata DB using Spark JDBC.

2) Import data via EOD Sqoop jobs into Hive tables stored as Parquet, and then run queries on the Hive tables using Spark SQL or the Spark HiveContext.

Are there any other ways we could do this better/more efficiently?

Please guide.

Regards,
Tapan




Re: migration from Teradata to Spark SQL

Posted by Tapan Upadhyay <ta...@gmail.com>.
Thank you everyone for the guidance.

*Jorn*, our motivation is to move the bulk of our ad-hoc queries to Hadoop
so that we have enough bandwidth on our DB for the important batch
jobs/queries.

For implementing a lambda architecture, is it possible to get real-time
updates from Teradata for any insert/update/delete, e.g. from the DB logs?

*Deepak*, should we query the data from Cassandra using Spark? How would it
differ in terms of performance if we store our data in Hive tables
(Parquet) and query them using Spark? If there is not much performance
gain, why add one more layer of processing?

*Mich*, we plan to sync the data using hourly/EOD Sqoop jobs; we have not
yet decided how frequently we would need to do that. It will be based on
user requirements. If they need real-time data, we will need to think of
an alternative. How are you doing the same for Sybase? How do you sync in
real time?

Thank you!!


Regards,
Tapan Upadhyay
+1 973 652 8757

On Wed, May 4, 2016 at 4:33 AM, Alonso Isidoro Roman <al...@gmail.com>
wrote:

> I agree with Deepak, and I would try saving the data in both Parquet and
> Avro format. If you can, measure the performance and choose the best; it
> will probably be Parquet, but you should verify it for yourself.
>
> Alonso Isidoro Roman.
>
> My favorite quotes (today):
> "If debugging is the process of removing software bugs, then programming
> must be the process of putting them in..."
>   - Edsger Dijkstra
>
> "If you pay peanuts you get monkeys"
>
>
> 2016-05-04 9:22 GMT+02:00 Jörn Franke <jo...@gmail.com>:
>
>> Look at the lambda architecture.
>>
>> What is the motivation of your migration?
>>
>> On 04 May 2016, at 03:29, Tapan Upadhyay <ta...@gmail.com> wrote:
>>
>> Hi,
>>
>> We are planning to move our ad-hoc queries from Teradata to Spark. We have
>> a huge volume of queries during the day. What is the best way to go about it?
>>
>> 1) Read data directly from the Teradata DB using Spark JDBC.
>>
>> 2) Import data via EOD Sqoop jobs into Hive tables stored as Parquet, and
>> then run queries on the Hive tables using Spark SQL or the Spark HiveContext.
>>
>> Are there any other ways we could do this better/more efficiently?
>>
>> Please guide.
>>
>> Regards,
>> Tapan
>>
>>
>

Re: migration from Teradata to Spark SQL

Posted by Alonso Isidoro Roman <al...@gmail.com>.
I agree with Deepak, and I would try saving the data in both Parquet and
Avro format. If you can, measure the performance and choose the best; it
will probably be Parquet, but you should verify it for yourself.
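
A rough way to compare the two (a sketch only: paths are made up, df stands
for any DataFrame of your data, and it assumes the com.databricks:spark-avro
package is on the classpath):

// Write the same data in both formats
df.write.parquet("/tmp/bench/orders_parquet")
df.write.format("com.databricks.spark.avro").save("/tmp/bench/orders_avro")

// Naive wall-clock timer
def time[T](label: String)(f: => T): T = {
  val start = System.nanoTime()
  val result = f
  println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
  result
}

time("parquet scan") {
  sqlContext.read.parquet("/tmp/bench/orders_parquet")
    .groupBy("region").count().collect()
}
time("avro scan") {
  sqlContext.read.format("com.databricks.spark.avro")
    .load("/tmp/bench/orders_avro")
    .groupBy("region").count().collect()
}

This is only a rough comparison; the result depends on your query shape and
OS caching, so run it against your own data and queries.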

Alonso Isidoro Roman.

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming
must be the process of putting them in..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"


2016-05-04 9:22 GMT+02:00 Jörn Franke <jo...@gmail.com>:

> Look at the lambda architecture.
>
> What is the motivation of your migration?
>
> On 04 May 2016, at 03:29, Tapan Upadhyay <ta...@gmail.com> wrote:
>
> Hi,
>
> We are planning to move our ad-hoc queries from Teradata to Spark. We have
> a huge volume of queries during the day. What is the best way to go about it?
>
> 1) Read data directly from the Teradata DB using Spark JDBC.
>
> 2) Import data via EOD Sqoop jobs into Hive tables stored as Parquet, and
> then run queries on the Hive tables using Spark SQL or the Spark HiveContext.
>
> Are there any other ways we could do this better/more efficiently?
>
> Please guide.
>
> Regards,
> Tapan
>
>

Re: migration from Teradata to Spark SQL

Posted by Jörn Franke <jo...@gmail.com>.
Look at the lambda architecture.

What is the motivation of your migration?

> On 04 May 2016, at 03:29, Tapan Upadhyay <ta...@gmail.com> wrote:
> 
> Hi,
> 
> We are planning to move our ad-hoc queries from Teradata to Spark. We have a huge volume of queries during the day. What is the best way to go about it?
> 
> 1) Read data directly from the Teradata DB using Spark JDBC.
> 
> 2) Import data via EOD Sqoop jobs into Hive tables stored as Parquet, and then run queries on the Hive tables using Spark SQL or the Spark HiveContext.
> 
> Are there any other ways we could do this better/more efficiently?
> 
> Please guide.
> 
> Regards,
> Tapan
> 

Re: migration from Teradata to Spark SQL

Posted by Deepak Sharma <de...@gmail.com>.
Hi Tapan
I would suggest an architecture where you have separate storage and
data-serving layers.
Spark is still best for batch processing of data.
So what I am suggesting here is: store your data as-is in a raw HDFS layer,
run your ELT in Spark on that raw data, and then store the
processed/transformed data in a NoSQL DB such as Cassandra, which can serve
the data and handle a large number of queries for you.
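
A minimal sketch of that flow (paths, keyspace, and column names are made
up; it assumes a HiveContext named sqlContext and the
spark-cassandra-connector package on the classpath):

// Raw layer: data landed as-is on HDFS
val raw = sqlContext.read.parquet("/data/raw/orders")

// ELT in Spark: cleanse/transform the raw data
val served = raw.filter("status = 'COMPLETE'")
  .select("order_id", "customer_id", "amount")

// Serving layer: write to Cassandra for high-concurrency reads
served.write.format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "serving", "table" -> "orders"))
  .mode("append")
  .save()

The point of the extra layer is that Cassandra handles many concurrent
point lookups far better than repeatedly scanning Parquet files with Spark.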

Thanks
Deepak

On Wed, May 4, 2016 at 6:59 AM, Tapan Upadhyay <ta...@gmail.com> wrote:

> Hi,
>
> We are planning to move our ad-hoc queries from Teradata to Spark. We have
> a huge volume of queries during the day. What is the best way to go about it?
>
> 1) Read data directly from the Teradata DB using Spark JDBC.
>
> 2) Import data via EOD Sqoop jobs into Hive tables stored as Parquet, and
> then run queries on the Hive tables using Spark SQL or the Spark HiveContext.
>
> Are there any other ways we could do this better/more efficiently?
>
> Please guide.
>
> Regards,
> Tapan
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net