Posted to user@spark.apache.org by Om...@sony.com on 2018/10/23 07:53:15 UTC

Triggering SQL on AWS S3 via Apache Spark

Hi guys,

We are using Apache Spark on a local machine.

I need to implement the scenario below.

In the initial load:

  1.  The CRM application will send a file to a folder on the local server. This file contains the information of all customers. File name: customer.tsv
     *   customer.tsv contains customerid, country, birty_month, activation_date, etc.
  2.  I need to read the contents of customer.tsv.
  3.  I will add a current-timestamp column to the data.
  4.  I will transfer customer.tsv to the S3 bucket: customer.history.data
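
A rough PySpark sketch of what I have in mind for this initial load (the local input path, the output layout and the load_timestamp column name are just placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

spark = SparkSession.builder.appName("customer-initial-load").getOrCreate()

# Read the full customer extract delivered by the CRM application
# (the local folder path is a placeholder).
customers = (
    spark.read
    .option("sep", "\t")
    .option("header", "true")
    .csv("file:///data/crm/customer.tsv")
)

# Add the load timestamp so later runs can pick the latest version per customer.
customers = customers.withColumn("load_timestamp", current_timestamp())

# Write to the history bucket; s3a:// needs the hadoop-aws module and credentials
# to be configured (see further down in the thread).
(
    customers.write
    .mode("overwrite")
    .option("sep", "\t")
    .option("header", "true")
    .csv("s3a://customer.history.data/customer/")
)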

In the daily loads:

  1.  The CRM application will send a new file which contains the updated/deleted/inserted customer records.

  File name is daily_customer.tsv

     *   daily_customer.tsv contains customerid, cdc_field, country, birty_month, activation_date, etc.

The cdc_field can be New-Customer, Customer-is-Updated, or Customer-is-Deleted.

  2.  I need to read the contents of daily_customer.tsv.
  3.  I will add a current-timestamp column to the data.
  4.  I will transfer daily_customer.tsv to the S3 bucket: customer.daily.data
  5.  I need to merge the two buckets customer.history.data and customer.daily.data.
     *   Both buckets have timestamp fields, so I need to query, for each customer, the record with the latest timestamp.
     *   I can use row_number() over(partition by customer_id order by timestamp_field desc) as version_number
     *   Then I can write the records whose version_number is 1 to the final bucket: customer.dimension.data
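
In PySpark the merge could look roughly like this sketch (column names follow the description above and should be adjusted to the real schema; treating history rows as having no cdc_field and dropping deleted customers are my assumptions):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("customer-dimension-merge").getOrCreate()

def read_tsv(path):
    return spark.read.option("sep", "\t").option("header", "true").csv(path)

history = read_tsv("s3a://customer.history.data/")
daily = read_tsv("s3a://customer.daily.data/")

# History rows have no cdc_field, so add it as null before the union
# (assumes the remaining columns are identical in both feeds).
history = history.withColumn("cdc_field", lit(None).cast("string"))
merged = history.unionByName(daily)

# Keep only the latest record per customer, as in the row_number() expression above.
w = Window.partitionBy("customerid").orderBy(col("timestamp_field").desc())
latest = (
    merged
    .withColumn("version_number", row_number().over(w))
    .filter(col("version_number") == 1)
    .drop("version_number")
)

# Optionally drop customers whose latest change is a deletion.
latest = latest.filter(
    col("cdc_field").isNull() | (col("cdc_field") != "Customer-is-Deleted")
)

latest.write.mode("overwrite").option("sep", "\t").option("header", "true").csv(
    "s3a://customer.dimension.data/"
)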

I am running Spark on premises.

  *   Can I query AWS S3 buckets using Spark SQL / DataFrames or RDDs on a local Spark cluster?
  *   Is this approach efficient? Will the queries transfer all of the historical data from AWS S3 to the local cluster?
  *   How can I implement this scenario more effectively? For example, by transferring only the daily data to AWS S3 and then running the queries on AWS.
     *   For instance, Athena can run queries on AWS, but it is just a query engine. As far as I know, I cannot call it via an SDK and cannot write the results to a bucket/folder.

Thanks in advance,
Ömer





Re: Triggering SQL on AWS S3 via Apache Spark

Posted by Jörn Franke <jo...@gmail.com>.
Why not directly access the S3 file from Spark?


You need to configure the IAM roles so that the machine running the S3 code is allowed to access the bucket.
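
For example, something along these lines (a minimal sketch; the hadoop-aws version must match your Hadoop build, and static access keys are shown only because an on-prem machine cannot use an EC2 instance role directly):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-from-on-prem")
    # hadoop-aws provides the s3a:// filesystem; pick the version matching your Hadoop build.
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.7")
    # Static credentials for an on-prem cluster; on EC2/EMR an instance profile (IAM role)
    # would be picked up automatically instead.
    .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
    .getOrCreate()
)

# Once configured, objects in the bucket can be read like any other path.
customers = (
    spark.read
    .option("sep", "\t")
    .option("header", "true")
    .csv("s3a://customer.history.data/")
)
customers.show(5)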


Re: Triggering SQL on AWS S3 via Apache Spark

Posted by Divya Gehlot <di...@gmail.com>.
Hi Omer ,
Here are a couple of solutions which you can implement for your use case:
*Option 1:*
You can mount the S3 bucket as a local file system.
Here are the details:
https://cloud.netapp.com/blog/amazon-s3-as-a-file-system
*Option 2:*
You can use AWS Glue for your use case.
Here are the details:
https://aws.amazon.com/blogs/big-data/how-to-access-and-analyze-on-premises-data-stores-using-aws-glue/

*Option 3:*
Store the file on the local file system and later push it to the S3 bucket.
Here are the details:
https://stackoverflow.com/questions/48067979/simplest-way-to-fetch-the-file-from-ftp-server-on-prem-put-into-s3-bucket
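
For option 3, a minimal boto3 sketch of the push step (the local path, bucket and key are placeholders; credentials come from the usual AWS configuration):

import boto3

# Credentials are picked up from ~/.aws/credentials, environment variables, or an IAM role.
s3 = boto3.client("s3")

# Upload the daily CRM extract from the local file system to the daily-load bucket.
s3.upload_file(
    Filename="/data/crm/daily_customer.tsv",   # local path (placeholder)
    Bucket="customer.daily.data",
    Key="2018-10-23/daily_customer.tsv",       # date-stamped key (placeholder)
)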

Thanks,
Divya


Re: Triggering SQL on AWS S3 via Apache Spark

Posted by Gourav Sengupta <go...@gmail.com>.
I do not think security and governance have only now become important; they always were.
Hortonworks and Cloudera have fantastic security implementations, and hence
I mentioned updates via Hive.

Regards,
Gourav


Re: Triggering SQL on AWS S3 via Apache Spark

Posted by Om...@sony.com.
Thank you Gourav,

Today I saw the article: https://databricks.com/session/apache-spark-in-cloud-and-hybrid-why-security-and-governance-become-more-important
It also seems interesting.
I was in a meeting; I will watch it as well.






Re: Triggering SQL on AWS S3 via Apache Spark

Posted by Gourav Sengupta <go...@gmail.com>.
Also, read about SCD (slowly changing dimensions), and note that Hive may be a very
good alternative as well for running updates on data.

Regards,
Gourav


Re: Triggering SQL on AWS S3 via Apache Spark

Posted by Om...@sony.com.
Thank you very much 😊






Re: Triggering SQL on AWS S3 via Apache Spark

Posted by Gourav Sengupta <go...@gmail.com>.
This is interesting: you asked the questions and then (almost) answered them as well.

Regards,
Gourav
