Posted to user@spark.apache.org by Sam Elamin <hu...@gmail.com> on 2017/02/12 11:04:19 UTC

ETL with Spark

Hey folks

Really simple question here. I currently have an ETL pipeline that reads from S3 and saves the data to an end store.

I have to read from a list of keys in S3, doing a raw extract and then saving each one. Only some of the extracts need a simple transformation, but overall the code looks the same.

I abstracted this logic into a method that takes an S3 path, applies the common transformations, and saves the result.

But the job takes about 10 minutes because I'm iterating over the list of keys one at a time.

Is it possible to do this asynchronously?

FYI, I'm using spark.read.json to read from S3 because it infers my schema.
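
Roughly, the loop looks like this (method and bucket names below are just illustrative, not my real code):

keys = ["2017/02/11/orders.json", "2017/02/11/customers.json"]   # illustrative keys

def extract_and_save(path):
    df = spark.read.json(path)          # schema is inferred per file
    df = common_transformations(df)     # the shared, simple transformations
    save_to_end_store(df)               # write to the end store

for key in keys:                        # one key at a time, hence the ~10 minutes
    extract_and_save("s3a://my-bucket/" + key)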

Regards
Sam

Re: ETL with Spark

Posted by Sam Elamin <hu...@gmail.com>.
Yup, I ended up doing just that. Thank you both.

Re: ETL with Spark

Posted by Miguel Morales <th...@gmail.com>.
You can parallelize the collection of S3 keys and then pass that to your map function so that the files are read in parallel.
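
Untested sketch of what I mean (bucket and key names are placeholders; spark.read isn't usable inside the map, so the per-key fetch uses boto3 and the raw JSON lines are handed back to spark.read.json, which still infers the schema):

import boto3

BUCKET = "my-bucket"                           # placeholder

def fetch_lines(key):
    # Runs on the executors: download one object and return its JSON lines.
    s3 = boto3.client("s3")                    # build the client per task; clients aren't serialisable
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    return body.decode("utf-8").splitlines()

keys = ["2017/02/11/orders.json", "2017/02/11/customers.json"]   # your in-memory list
json_lines = sc.parallelize(keys, numSlices=len(keys)).flatMap(fetch_lines)
df = spark.read.json(json_lines)               # accepts an RDD of JSON strings

From there you can apply your common transformations and save as before.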

Sent from my iPhone


Re: ETL with Spark

Posted by Sam Elamin <hu...@gmail.com>.
Thanks Ayan, but I was hoping to remove the dependency on a file and just use an in-memory list or dictionary.

From the reading I've done today, it seems the concept of a bespoke async method doesn't really apply in Spark, since the cluster deals with distributing the workload.


Am I mistaken?

Regards
Sam

Re: ETL with Spark

Posted by ayan guha <gu...@gmail.com>.
You can store the list of keys (I believe you use them in the source file path, right?) in a file, one key per line. Then you can read that file using sc.textFile (so you will get an RDD of file paths) and apply your function as a map.

r = sc.textFile(list_file).map(your_function)
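
A slightly fuller version of that line, as a sketch (the path is a placeholder; the list file needs to live somewhere the executors can read, e.g. S3/HDFS, and your_function has to fetch and transform a single key itself, since spark.read can't be called inside a map):

list_file = "s3a://my-bucket/config/keys.txt"    # placeholder: one key per line
r = sc.textFile(list_file).map(your_function)    # RDD of per-key results
r.count()                                        # any action triggers the per-key work in parallel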

HTH

-- 
Best Regards,
Ayan Guha