Posted to user@spark.apache.org by Om...@sony.com on 2018/10/23 07:53:15 UTC
Triggering SQL on AWS S3 via Apache Spark
Hi guys,
We are using Apache Spark on a local machine.
I need to implement the scenario below.
In the initial load:
1. The CRM application will send a file to a folder. This file contains customer information for all customers and sits in a folder on the local server. The file name is: customer.tsv
* customer.tsv contains customerid, country, birth_month, activation_date, etc.
2. I need to read the contents of customer.tsv.
3. I will add a current-timestamp column to the data.
4. I will transfer customer.tsv to the S3 bucket customer.history.data (sketched below).
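A minimal PySpark sketch of steps 2-4, assuming the s3a:// connector is configured for the bucket; the local input path is hypothetical, and the column is named timestamp_field to match the row_number() expression used later in the post:

    # Read the TSV, stamp it, and write it to the history bucket.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("initial-load").getOrCreate()

    customers = spark.read.csv("file:///data/crm/customer.tsv",  # hypothetical local path
                               sep="\t", header=True)
    stamped = customers.withColumn("timestamp_field", F.current_timestamp())
    stamped.write.mode("overwrite").csv("s3a://customer.history.data/",
                                        sep="\t", header=True)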
In the daily loads:
1. The CRM application will send a new file containing the updated/deleted/inserted customer information.
File name is daily_customer.tsv
* daily_customer.tsv contains customerid, cdc_field, country, birth_month, activation_date, etc.
The cdc_field can be New-Customer, Customer-is-Updated, or Customer-is-Deleted.
2. I need to read the contents of daily_customer.tsv.
3. I will add a current-timestamp column to the data.
4. I will transfer daily_customer.tsv to the S3 bucket: customer.daily.data
5. I need to merge the two buckets, customer.history.data and customer.daily.data.
* Both buckets have timestamp fields, so for each customer I need the record with the latest timestamp.
* I can use row_number() over (partition by customer_id order by timestamp_field desc) as version_number.
* Then I can write the records whose version_number is 1 to the final bucket customer.dimension.data (see the merge sketch below).
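A hedged PySpark sketch of that merge, assuming both buckets hold the TSVs written above. The thread writes both "customerid" and "customer_id"; customer_id is used here, and dropping Customer-is-Deleted records from the dimension is an assumption the post does not spell out:

    # Union history and daily CDC data, keep the newest version per customer.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("customer-merge").getOrCreate()

    history = (spark.read.csv("s3a://customer.history.data/", sep="\t", header=True)
               .withColumn("cdc_field", F.lit(None).cast("string")))  # history has no CDC flag
    daily = spark.read.csv("s3a://customer.daily.data/", sep="\t", header=True)

    w = Window.partitionBy("customer_id").orderBy(F.col("timestamp_field").desc())
    latest = (history.unionByName(daily)
              .withColumn("version_number", F.row_number().over(w))
              .filter(F.col("version_number") == 1)
              # Assumption: deleted customers leave the dimension entirely.
              .filter(F.col("cdc_field").isNull()
                      | (F.col("cdc_field") != "Customer-is-Deleted"))
              .drop("version_number"))

    latest.write.mode("overwrite").csv("s3a://customer.dimension.data/",
                                       sep="\t", header=True)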
I am running Spark on premise.
* Can I query AWS S3 buckets using Spark SQL / DataFrames or RDDs on a local Spark cluster?
* Is this approach efficient? Will the queries transfer all of the historical data from AWS S3 to the local cluster?
* How can I implement this scenario more effectively? For example, by transferring only the daily data to AWS S3 and then running the queries on AWS.
* For instance, Athena can query data on AWS, but it is just a query engine. As far as I know, I cannot call it via an SDK and cannot write the results to a bucket/folder.
Thanks in advance,
Ömer
Re: Triggering SQL on AWS S3 via Apache Spark
Posted by Jörn Franke <jo...@gmail.com>.
Why not directly access the S3 file from Spark?
You need to configure the IAM roles so that the machine running the S3 code is allowed to access the bucket.
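A minimal sketch of that direct access from a local PySpark job; the credentials-provider setting shown is one option (the default chain also picks up access keys from the environment), and hadoop-aws with a matching AWS SDK must be on the classpath:

    # Query the S3 data in place over the s3a connector.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("s3-direct-read")
             .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                     "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
             .getOrCreate())

    customers = spark.read.csv("s3a://customer.history.data/customer.tsv",
                               sep="\t", header=True)
    customers.createOrReplaceTempView("customers")
    spark.sql("SELECT country, COUNT(*) FROM customers GROUP BY country").show()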
> On 24.10.2018 at 06:40, Divya Gehlot <di...@gmail.com> wrote:
> [snip]
Re: Triggering SQL on AWS S3 via Apache Spark
Posted by Divya Gehlot <di...@gmail.com>.
Hi Omer,
Here are a couple of solutions you can implement for your use case:
*Option 1:*
You can mount the S3 bucket as a local file system.
Here are the details: https://cloud.netapp.com/blog/amazon-s3-as-a-file-system
*Option 2:*
You can use AWS Glue for your use case (a hedged sketch follows).
Here are the details: https://aws.amazon.com/blogs/big-data/how-to-access-and-analyze-on-premises-data-stores-using-aws-glue/
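If Glue is chosen, a job might read the daily TSV from S3 into a DynamicFrame roughly as below; this is a skeleton under assumed job wiring, not the linked article's on-premises setup:

    # Hypothetical skeleton of a Glue job reading the daily TSV from S3.
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glue_ctx = GlueContext(SparkContext.getOrCreate())
    daily = glue_ctx.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://customer.daily.data/"]},
        format="csv",
        format_options={"separator": "\t", "withHeader": True},
    )
    daily_df = daily.toDF()  # continue with ordinary Spark transforms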
*Option 3:*
Store the file in the local file system and later push it to the S3 bucket (a boto3 sketch follows).
Here are the details: https://stackoverflow.com/questions/48067979/simplest-way-to-fetch-the-file-from-ftp-server-on-prem-put-into-s3-bucket
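For Option 3, the push itself can be a few lines of boto3; the local path is hypothetical, and the bucket name comes from the thread:

    # Upload the locally landed file to S3. Credentials come from the
    # usual boto3 chain (env vars, shared profile, or instance role).
    import boto3

    s3 = boto3.client("s3")
    s3.upload_file(Filename="/data/crm/daily_customer.tsv",  # hypothetical path
                   Bucket="customer.daily.data",
                   Key="daily_customer.tsv")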
Thanks,
Divya
On Tue, 23 Oct 2018 at 15:53, <Om...@sony.com> wrote:
> [snip]
Re: Triggering SQL on AWS S3 via Apache Spark
Posted by Gourav Sengupta <go...@gmail.com>.
I do not think security and governance have only now become important; they always were. Hortonworks and Cloudera have fantastic security implementations, and hence I mentioned updates via Hive.
Regards,
Gourav
On Wed, 24 Oct 2018, 17:32, <Om...@sony.com> wrote:
> [snip]
Re: Triggering SQL on AWS S3 via Apache Spark
Posted by Om...@sony.com.
Thank you Gourav,
Today I saw the article: https://databricks.com/session/apache-spark-in-cloud-and-hybrid-why-security-and-governance-become-more-important
It also seems interesting.
I was in a meeting; I will watch it as well.
From: Gourav Sengupta <go...@gmail.com>
[snip]
Re: Triggering SQL on AWS S3 via Apache Spark
Posted by Gourav Sengupta <go...@gmail.com>.
Also try reading about SCD (slowly changing dimensions); Hive may be a very good alternative as well for running updates on data.
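For concreteness, here is a hedged sketch of what an SCD Type 2 merge could look like in PySpark for this thread's data. Gourav only points at the concept; the bookkeeping columns (effective_from, effective_to, is_current) and the assumption that the change set carries the same business columns as the dimension are illustrative, not anything prescribed in the thread:

    # SCD Type 2: expire the current row of each changed customer and
    # append the new version. All column names beyond customer_id and
    # timestamp_field are assumptions.
    from pyspark.sql import functions as F

    def scd2_apply(dim, changes):
        current = dim.filter(F.col("is_current"))
        history = dim.filter(~F.col("is_current"))
        changed_ids = changes.select("customer_id").distinct()

        # Close out the current version of every changed customer.
        closed = (current.join(changed_ids, "customer_id", "left_semi")
                  .withColumn("effective_to", F.current_timestamp())
                  .withColumn("is_current", F.lit(False)))
        untouched = current.join(changed_ids, "customer_id", "left_anti")

        # Incoming records become the new current versions.
        opened = (changes
                  .withColumn("effective_from", F.col("timestamp_field"))
                  .withColumn("effective_to", F.lit(None).cast("timestamp"))
                  .withColumn("is_current", F.lit(True)))

        return (history.unionByName(untouched)
                       .unionByName(closed)
                       .unionByName(opened))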
Regards,
Gourav
On Wed, 24 Oct 2018, 14:53, <Om...@sony.com> wrote:
> [snip]
Re: Triggering SQL on AWS S3 via Apache Spark
Posted by Om...@sony.com.
Thank you very much 😊
From: Gourav Sengupta <go...@gmail.com>
[snip]
Re: Triggering SQL on AWS S3 via Apache Spark
Posted by Gourav Sengupta <go...@gmail.com>.
This is interesting: you asked and then (almost) answered the questions as well.
Regards,
Gourav
On Tue, 23 Oct 2018, 13:23, <Om...@sony.com> wrote:
> [snip]