Posted to dev@airflow.apache.org by Conrad Lee <co...@parsely.com> on 2018/10/19 11:09:12 UTC

Transferring files between S3 and GCS

Hello Airflow community,

I'm interested in transferring data between S3 and Google Cloud Storage.  I
want to transfer data on the scale of hundreds of gigabytes to a few
terabytes.

Airflow already has an operator that could be used for this use-case:
the S3ToGoogleCloudStorageOperator.
However, looking over its implementation, it appears that all the data to
be transferred actually passes through the machine running Airflow.  That
seems completely unnecessary to me: it places a lot of burden on the
Airflow workers and is bottlenecked by the workers' bandwidth.  It could
even lead to out-of-disk errors like this one
<https://stackoverflow.com/questions/52400144/airflow-s3togooglecloudstorageoperator-no-space-left-on-device>.

I would much rather use Google Cloud's 'Transfer Service' for doing
this--that way the Airflow operator just needs to make an API call and
(optionally) keep polling the API until the transfer is done (this last bit
could be done in a sensor).  The heavy work of performing the transfer is
offloaded to the Transfer Service.
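
To make that concrete, here is a rough, untested sketch of the kind of
calls such an operator/sensor pair could make through
google-api-python-client and the storagetransfer v1 API.  The function
names and arguments below are just illustrative--this is not an existing
Airflow API:

import json
from datetime import date

from googleapiclient.discovery import build  # google-api-python-client


def create_s3_to_gcs_transfer(project_id, s3_bucket, gcs_bucket,
                              aws_access_key_id, aws_secret_access_key):
    """Submit a one-off Storage Transfer Service job and return its name."""
    client = build('storagetransfer', 'v1')  # application default credentials
    today = date.today()  # UTC nuances ignored in this sketch
    run_date = {'year': today.year, 'month': today.month, 'day': today.day}
    body = {
        'description': 'airflow-s3-to-gcs',
        'status': 'ENABLED',
        'projectId': project_id,
        # start and end on the same day => the job runs exactly once
        'schedule': {'scheduleStartDate': run_date,
                     'scheduleEndDate': run_date},
        'transferSpec': {
            'awsS3DataSource': {
                'bucketName': s3_bucket,
                'awsAccessKey': {'accessKeyId': aws_access_key_id,
                                 'secretAccessKey': aws_secret_access_key},
            },
            'gcsDataSink': {'bucketName': gcs_bucket},
        },
    }
    job = client.transferJobs().create(body=body).execute()
    return job['name']  # e.g. 'transferJobs/1234567890'


def transfer_is_done(project_id, job_name):
    """What a sensor's poke() could do: poll the job's transfer operations."""
    client = build('storagetransfer', 'v1')
    op_filter = json.dumps({'project_id': project_id, 'job_names': [job_name]})
    resp = client.transferOperations().list(
        name='transferOperations', filter=op_filter).execute()
    operations = resp.get('operations', [])
    # a real sensor should also inspect metadata['status'] for FAILED/ABORTED
    return bool(operations) and all(op.get('done') for op in operations)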

Was it an intentional design decision to avoid using the Google Transfer
Service?  If I create a PR that adds the ability to perform transfers with
the Google Transfer Service, should it

   - replace the existing operator
   - be an option on the existing operator (i.e., add an argument that
   toggles between 'local worker transfer' and 'google hosted transfer')
   - make a new operator

Thanks,
Conrad Lee

Re: Transferring files between S3 and GCS

Posted by Conrad Lee <co...@parsely.com>.
Thanks for the responses!

Chris: thanks, I would indeed be interested in your operator.  Could you
share the source code?

Chris and Guillermo: thanks for sharing your experience.  I can see why in
many situations you would not want to rely on the Google Transfer Service.
In my experience the Transfer Service is by far the fastest way to move
hundreds of gigabytes or a few terabytes of data between S3 and GCS.  I
haven't run into any reliability issues--maybe the service has become more
reliable recently.  That said, I've only ever tried having a few transfers
run concurrently; maybe if I tried more I'd run into reliability issues.

So the ideal solution seems to be an option (disabled by default) to use
the Google Transfer Service rather than actually downloading and uploading
the files through the worker machine/pod.  I might get around to
contributing this option myself in the next month or two.
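
In case it helps the discussion, here is roughly how I picture the toggle.
This is only an illustrative sketch, not the real
S3ToGoogleCloudStorageOperator; the flag name and the two helper methods
are placeholders:

from airflow.models import BaseOperator


class S3ToGCSSketchOperator(BaseOperator):
    """Illustrates how a use_transfer_service flag could branch."""

    def __init__(self, s3_bucket, gcs_bucket, use_transfer_service=False,
                 *args, **kwargs):
        super(S3ToGCSSketchOperator, self).__init__(*args, **kwargs)
        self.s3_bucket = s3_bucket
        self.gcs_bucket = gcs_bucket
        self.use_transfer_service = use_transfer_service

    def execute(self, context):
        if self.use_transfer_service:
            # delegate the copy to Google's Storage Transfer Service,
            # e.g. via something like the API sketch in my first mail
            return self._run_via_transfer_service()
        # default, current behaviour: pull every object through the worker
        return self._run_via_local_worker()

    def _run_via_transfer_service(self):
        raise NotImplementedError('placeholder for the API-based path')

    def _run_via_local_worker(self):
        raise NotImplementedError('placeholder for the existing path')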



Re: Transferring files between S3 and GCS

Posted by Guillermo Rodríguez Cano <ws...@gmail.com>.
Hello Conrad,

I am replying since I was the one who originally sent this contribution,
hence its design.

The actual reason to develop this operator was precisely to avoid the
Google Transfer Service, as we found it to be somewhat unreliable: at the
time we took this approach, the GCP resources used to retrieve data from
S3 were shared among GCP customers, so performance was lower and couldn't
really be predicted.
Our main scenario for this operator has so far been retrieving files from
S3 at a certain time so we can do some processing right afterwards, and so
on. The Google Transfer Service wouldn't guarantee delivery on time
(again, at the time I wrote this operator).

Unfortunately, the performance of the operator depends on the scheduler
and thus on the machine (or pod, for that matter) where it is executed,
and you can possibly end up with a failed task due to lack of memory.
This is not really the operator's problem but rather the GCS hook
implementation involved (if I am not mistaken, as I am writing from memory
now). I recall that the task would literally create a local copy of the
file after retrieving it from S3, and keep it until the transfer to GCS
was complete.
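
Roughly, from memory, the pattern is something along these lines
(simplified, not the exact operator code; the hooks' connection ids are
left at their defaults):

from tempfile import NamedTemporaryFile

from airflow.hooks.S3_hook import S3Hook
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook


def copy_one_key_via_worker(s3_bucket, s3_key, gcs_bucket, gcs_object):
    """Download-then-upload: the whole object is staged locally first."""
    s3_hook = S3Hook()
    gcs_hook = GoogleCloudStorageHook()

    key_obj = s3_hook.get_key(s3_key, bucket_name=s3_bucket)
    with NamedTemporaryFile(mode='wb') as tmp:
        key_obj.download_fileobj(tmp)  # the entire object lands on the worker
        tmp.flush()
        # only now does the upload to GCS start, re-reading the local copy
        gcs_hook.upload(gcs_bucket, gcs_object, tmp.name)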

Ideally you would want to stream the data from one hook to the other, but
I think the GCS hook doesn't support that yet. There is (or at least I
recall) some idea to convert all the hooks to the new API provided for
accessing GCP resources, and I have a vague recollection that this new API
would support something like this.
I never got into refactoring the hook, although I considered it; it is
clearly pending work, but a major piece of work I believe, given what it
involves.
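
Outside of the hooks, with the plain client libraries, something along
these lines should already be possible (an untested sketch; with a chunked
resumable upload neither a temp file nor the whole object has to sit on
the worker):

import boto3
from google.cloud import storage


def stream_s3_object_to_gcs(s3_bucket, s3_key, gcs_bucket, gcs_object):
    """Pipe an S3 object into GCS without staging it on local disk."""
    s3_body = boto3.resource('s3').Object(s3_bucket, s3_key).get()['Body']

    blob = storage.Client().bucket(gcs_bucket).blob(
        gcs_object, chunk_size=32 * 1024 * 1024)  # multiple of 256 KB
    # upload_from_file reads the stream chunk by chunk, so memory usage
    # stays around one chunk rather than the whole object
    blob.upload_from_file(s3_body)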

I think that replacing the current operator is not a good idea, but adding
an option to the operator to use the Google Transfer Service is better.
Note, though, that it may leave some DAGs stuck for hours, so some output
about the transfer status would probably be good from the user's
perspective.

/Guillermo



Re: Transferring files between S3 and GCS

Posted by Chris Fei <cf...@gmail.com>.
I ran into the same issue and ended up building a separate operator that
works as you describe, though I haven't submitted it as a PR. Happy to
share my implementation with you.
I found that it's useful to have both ways of transferring data.
Initially, I migrated all of my S3ToGCS tasks to use the transfer
service, but I found that its performance can be unreliable with some
combination of 1) transferring smaller datasets and 2) invoking many
transfers in parallel. The transfer service is a bit of a black box, so
when it doesn't work as expected you're stuck. Because of this, I ended
up migrating some of my tasks back to the original implementation. I would
definitely keep both options around--I don't think I have a preference
between a new operator and a param on the existing operator.
Chris

