Posted to user@beam.apache.org by Ramya Prasad via user <us...@beam.apache.org> on 2023/12/22 16:27:35 UTC

[Question] S3 Token Expiration during Read Step

Hello,

I am a developer trying to use Apache Beam, and I have a nuanced problem I
need help with. I have a pipeline that has to read 40 million records from
multiple Parquet files in AWS S3. The only way I can get the credentials I
need for this particular bucket is to call an API, which I do before the
pipeline executes, and I then store the credentials in the PipelineOptions
for the pipeline to use during the read. However, the credentials are only
valid for one hour, and my pipeline takes longer than one hour to run, so
after an hour of execution the pipeline fails with a credential-expiration
error. The only way I can refresh the credentials is by calling the API
again. Is there a way for me to do this in my pipeline while it's running?

Any help would be appreciated!

Thanks and sincerely,
Ramya
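
One pattern that fits this constraint (a minimal sketch only, not Beam- or
AWS-specific; `Token`, `RefreshingTokenProvider`, and the fetcher lambda are
hypothetical stand-ins for the credentials API) is to cache the token
together with its expiry and re-fetch on demand once it nears expiry, so
each long-running worker renews credentials itself instead of relying on
the one-hour token captured before submission:

```java
import java.io.Serializable;
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/** Hypothetical token holder: the credential value plus its expiry time. */
final class Token implements Serializable {
    final String value;
    final Instant expiresAt;
    Token(String value, Instant expiresAt) {
        this.value = value;
        this.expiresAt = expiresAt;
    }
}

/**
 * Serializable provider that caches a token and re-fetches it once the
 * cached copy is within a safety margin of expiry. The fetcher stands in
 * for the credentials API call (in a real pipeline it must itself be
 * serializable so it can be shipped to workers).
 */
final class RefreshingTokenProvider implements Serializable {
    private static final Duration MARGIN = Duration.ofMinutes(5);
    private final Supplier<Token> fetcher;
    private transient Token cached; // transient: re-fetched lazily on each worker

    RefreshingTokenProvider(Supplier<Token> fetcher) {
        this.fetcher = fetcher;
    }

    synchronized String get() {
        if (cached == null || Instant.now().isAfter(cached.expiresAt.minus(MARGIN))) {
            cached = fetcher.get(); // refresh before the old token lapses
        }
        return cached.value;
    }
}

class RefreshDemo {
    public static void main(String[] args) {
        // Fake API: each call returns a new one-hour token.
        final int[] calls = {0};
        RefreshingTokenProvider provider = new RefreshingTokenProvider(
            () -> new Token("token-" + (++calls[0]),
                            Instant.now().plus(Duration.ofHours(1))));

        String a = provider.get(); // first call fetches from the "API"
        String b = provider.get(); // still valid, served from the cache
        System.out.println(a + " " + b + " calls=" + calls[0]);
    }
}
```

The same shape is how AWS SDK credentials providers behave internally; the
open question for the thread is where to hook such a provider into Beam's
S3 filesystem configuration rather than storing static values in
PipelineOptions.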

______________________________________________________________________



The information contained in this e-mail may be confidential and/or proprietary to Capital One and/or its affiliates and may only be used solely in performance of work or services for Capital One. The information transmitted herewith is intended only for use by the individual or entity to which it is addressed. If the reader of this message is not the intended recipient, you are hereby notified that any review, retransmission, dissemination, distribution, copying or other use of, or taking of any action in reliance upon this information is strictly prohibited. If you have received this communication in error, please contact the sender and delete the material from your computer.




Re: [External Sender] Re: [Question] S3 Token Expiration during Read Step

Posted by Ramya Prasad via user <us...@beam.apache.org>.
Oops, not sure I replied to all but I'm using ParquetIO:

PCollection<GenericRecord> records =
    pipeline.apply(
        "Read parquet file in as Generic Records",
        ParquetIO.read(finalSchema)
            .from(beamReadPath)
            .withConfiguration(configuration));

The variable beamReadPath starts with the s3:// prefix, and I set the
initial credentials in the PipelineOptions object before the pipeline
is initialized.
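
That setup explains why the read eventually fails: anything stored in
PipelineOptions is effectively a snapshot serialized at submission time, so
a value refreshed later on the submitting machine never reaches the workers.
A minimal illustration of the effect (plain Java serialization used as a
stand-in for how a runner ships options; `Options` and `sessionToken` are
hypothetical names, not Beam classes):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SnapshotDemo {
    /** Stand-in for credentials stored in PipelineOptions. */
    static class Options implements Serializable {
        String sessionToken;
    }

    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        return bytes.toByteArray();
    }

    static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Options options = new Options();
        options.sessionToken = "token-issued-at-submit";

        byte[] shipped = serialize(options);      // "submission": options shipped out
        options.sessionToken = "token-refreshed"; // refreshing locally afterwards...

        Options onWorker = (Options) deserialize(shipped);
        System.out.println(onWorker.sessionToken); // ...never reaches the worker copy
    }
}
```

So any fix has to put the refresh logic on the worker side, not in the
options object built before launch.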


On Fri, Dec 22, 2023 at 10:39 AM XQ Hu via user <us...@beam.apache.org>
wrote:

> Can you share some code snippets about how to read from S3? Do you use the
> builtin TextIO?
>





Re: [External Sender] Re: [Question] S3 Token Expiration during Read Step

Posted by Ramya Prasad via user <us...@beam.apache.org>.
Yes, I'm using ParquetIO as below:

PCollection<GenericRecord> records =
    pipeline.apply(
        "Read parquet file in as Generic Records",
        ParquetIO.read(finalSchema)
            .from(beamReadPath)
            .withConfiguration(configuration));


On Fri, Dec 22, 2023 at 10:39 AM XQ Hu via user <us...@beam.apache.org>
wrote:

> Can you share some code snippets about how to read from S3? Do you use the
> builtin TextIO?
>





Re: [Question] S3 Token Expiration during Read Step

Posted by XQ Hu via user <us...@beam.apache.org>.
Can you share some code snippets about how to read from S3? Do you use the
built-in TextIO?
