Posted to user@beam.apache.org by Joe Cullen <jo...@gmail.com> on 2018/11/29 11:52:56 UTC

Inferring Csv Schemas

Hi all,

I have a pipeline reading CSV files, performing some transforms, and
writing to BigQuery. At the moment I'm reading the BigQuery schema from a
separate JSON file. If the CSV files had a new column added (and I wanted
to include this column in the resultant BigQuery table), I'd have to change
the JSON schema or the pipeline itself. Is there any way to autodetect the
schema using BigQueryIO? How do people normally deal with potential changes
to input CSVs?

Thanks,
Joe
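
As a rough illustration of the setup described above (this is not code from the
thread), here is a minimal Beam Java sketch in which the BigQuery schema is fixed
at pipeline-construction time, which is why a new CSV column means editing the
schema definition (or the JSON file it is loaded from). The bucket, project,
table, and field names are hypothetical.

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class CsvToBigQuery {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // The schema lives in the pipeline code (or a JSON file read at construction
    // time); adding a CSV column means updating this definition and redeploying.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("id").setType("STRING"),
        new TableFieldSchema().setName("value").setType("FLOAT")));

    p.apply("ReadCsv", TextIO.read().from("gs://my-bucket/input/*.csv"))
        .apply("CsvToTableRow",
            MapElements.into(TypeDescriptor.of(TableRow.class))
                .via((String line) -> {
                  // Naive CSV parsing for the sketch; real input may need a CSV library.
                  String[] cols = line.split(",", -1);
                  return new TableRow().set("id", cols[0]).set("value", Double.valueOf(cols[1]));
                }))
        .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")
            .withSchema(schema)
            .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));

    p.run();
  }
}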

Re: Inferring Csv Schemas

Posted by Reza Rokni <re...@google.com>.
Hi,

No problem and warm welcome to your Beam journey :-)

Yes, in this case when an issue was found, the BigQuery API was used to make
the changes.
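
For illustration only (not code from the thread), a minimal sketch of widening an
existing table with the BigQuery client library for Java; the dataset, table, and
column names are hypothetical, and BigQuery only accepts additive changes such as
appending a NULLABLE column.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.LegacySQLTypeName;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;
import java.util.ArrayList;
import java.util.List;

public class AddColumn {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    Table table = bigquery.getTable(TableId.of("my_dataset", "my_table"));

    // Copy the existing fields and append the newly observed column as NULLABLE,
    // so rows written before the change remain valid.
    List<Field> fields = new ArrayList<>(table.getDefinition().getSchema().getFields());
    fields.add(Field.newBuilder("new_column", LegacySQLTypeName.STRING)
        .setMode(Field.Mode.NULLABLE)
        .build());

    // Push the widened schema back to BigQuery.
    table.toBuilder()
        .setDefinition(StandardTableDefinition.of(Schema.of(fields.toArray(new Field[0]))))
        .build()
        .update();
  }
}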

Cheers
Reza

On Sat, 1 Dec 2018 at 03:38, Joe Cullen <jo...@gmail.com> wrote:

> That's great Reza, thanks! I'm still getting to grips with Beam and
> Dataflow so apologies for all the questions. I have a few more if that's ok:
>
> 1. When the article says "the schema would be mutated", does this mean the
> BigQuery schema?
> 2. Also, when the known good BigQuery schema is retrieved, and if it's the
> BigQuery schema being updated in the question above, is this done with the
> BigQuery API rather than BigQueryIO? In other words, what is the process
> behind the step "validate and mutate BQ schema" in the image?
>
> Thanks,
> Joe
>
> On 30 Nov 2018 16:48, "Reza Rokni" <re...@google.com> wrote:
>
> Hi Joe,
>
> That part of the blog could have been written a bit more clearly... I blame the
> writer ;-) So while that solution worked, it was inefficient; this is
> discussed in the next paragraph. Essentially, checking the validity of the
> schema on every element is not efficient, especially as the schema is normally
> fine. The next paragraph reads:
>
>
>
> *However, this design could not make use of the inbuilt efficiencies that
> BigQueryIO provided, and also burdened us with technical debt. Chris then
> tried various other tactics to beat the boss. In his words ... "The first
> attempt at fixing this inefficiency was to remove the costly JSON schema
> detection ‘DoFn’ which every metric goes through, and move it to a ‘failed
> inserts’ section of the pipeline, which is only run when there are errors
> on inserting into BigQuery,”*
>
> Cheers
> Reza
>
> On Fri, 30 Nov 2018 at 09:01, Joe Cullen <jo...@gmail.com>
> wrote:
>
>> Thanks Reza, that's really helpful!
>>
>> I have a few questions:
>>
>> "He used a GroupByKey function on the JSON type and then a manual check
>> on the JSON schema against the known good BigQuery schema. If there was a
>> difference, the schema would mutate and the updates would be pushed
>> through."
>>
>> If the difference was that a new column had been added to the JSON elements,
>> does there need to be any mutation? The JSON schema derived from the JSON
>> elements would already have this new column, and if BigQuery allows for
>> additive schema changes then this new JSON schema should be fine, right?
>>
>> But then I'm not sure how you would enter the 'failed inserts' section of
>> the pipeline (as the insert should have been successful).
>>
>> Have I misunderstood what is being mutated?
>>
>> Thanks,
>> Joe
>>
>> On Fri, 30 Nov 2018, 11:07 Reza Ardeshir Rokni <rarokni@gmail.com> wrote:
>>
>>> Hi Joe,
>>>
>>> You may find some of the info in this blog of interest; it's based on
>>> streaming pipelines, but the ideas are still useful.
>>>
>>>
>>> https://cloud.google.com/blog/products/gcp/how-to-handle-mutating-json-schemas-in-a-streaming-pipeline-with-square-enix
>>>
>>> Cheers
>>>
>>> Reza
>>>
>>> On Thu, 29 Nov 2018 at 06:53, Joe Cullen <jo...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have a pipeline reading CSV files, performing some transforms, and
>>>> writing to BigQuery. At the moment I'm reading the BigQuery schema from a
>>>> separate JSON file. If the CSV files had a new column added (and I wanted
>>>> to include this column in the resultant BigQuery table), I'd have to change
>>>> the JSON schema or the pipeline itself. Is there any way to autodetect the
>>>> schema using BigQueryIO? How do people normally deal with potential changes
>>>> to input CSVs?
>>>>
>>>> Thanks,
>>>> Joe
>>>>
>>>
>


Re: Inferring Csv Schemas

Posted by Joe Cullen <jo...@gmail.com>.
That's great Reza, thanks! I'm still getting to grips with Beam and
Dataflow so apologies for all the questions. I have a few more if that's ok:

1. When the article says "the schema would be mutated", does this mean the
BigQuery schema?
2. Also, when the known good BigQuery schema is retrieved, and if it's the
BigQuery schema being updated in the question above, is this done with the
BigQuery API rather than BigQueryIO? In other words, what is the process
behind the step "validate and mutate BQ schema" in the image?

Thanks,
Joe

On 30 Nov 2018 16:48, "Reza Rokni" <re...@google.com> wrote:

Hi Joe,

That part of the blog could have been written a bit more clearly... I blame the
writer ;-) So while that solution worked, it was inefficient; this is
discussed in the next paragraph. Essentially, checking the validity of the
schema on every element is not efficient, especially as the schema is normally
fine. The next paragraph reads:



*However, this design could not make use of the inbuilt efficiencies that
BigQueryIO provided, and also burdened us with technical debt. Chris then
tried various other tactics to beat the boss. In his words ... "The first
attempt at fixing this inefficiency was to remove the costly JSON schema
detection ‘DoFn’ which every metric goes through, and move it to a ‘failed
inserts’ section of the pipeline, which is only run when there are errors
on inserting into BigQuery,”*
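
As a rough sketch of that "failed inserts" branch (not code from the article),
assuming a streaming pipeline with an existing PCollection<TableRow> called rows
and a known good TableSchema; the table name is hypothetical and the repair step
is only a placeholder.

import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

class FailedInsertBranch {
  static void writeWithErrorPath(PCollection<TableRow> rows, TableSchema knownGoodSchema) {
    WriteResult result = rows.apply("WriteToBigQuery",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")
            .withSchema(knownGoodSchema)
            .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
            // Without a retry policy that eventually gives up, rejected rows are
            // retried indefinitely and never reach getFailedInserts().
            .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
            .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));

    // Only rows BigQuery rejected reach this branch, so the expensive schema
    // inspection runs on the error path instead of on every element.
    result.getFailedInserts().apply("HandleFailedRows",
        ParDo.of(new DoFn<TableRow, TableRow>() {
          @ProcessElement
          public void processElement(@Element TableRow row, OutputReceiver<TableRow> out) {
            // Placeholder: compare the row's keys against the known good schema,
            // widen the table via the BigQuery API if a new field appeared, and
            // re-emit the row for a second insert attempt.
            out.output(row);
          }
        }));
  }
}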

Cheers
Reza

On Fri, 30 Nov 2018 at 09:01, Joe Cullen <jo...@gmail.com> wrote:

> Thanks Reza, that's really helpful!
>
> I have a few questions:
>
> "He used a GroupByKey function on the JSON type and then a manual check on
> the JSON schema against the known good BigQuery schema. If there was a
> difference, the schema would mutate and the updates would be pushed
> through."
>
> If the difference was that a new column had been added to the JSON elements,
> does there need to be any mutation? The JSON schema derived from the JSON
> elements would already have this new column, and if BigQuery allows for
> additive schema changes then this new JSON schema should be fine, right?
>
> But then I'm not sure how you would enter the 'failed inserts' section of
> the pipeline (as the insert should have been successful).
>
> Have I misunderstood what is being mutated?
>
> Thanks,
> Joe
>
> On Fri, 30 Nov 2018, 11:07 Reza Ardeshir Rokni <rarokni@gmail.com> wrote:
>
>> Hi Joe,
>>
>> You may find some of the info in this blog of interest; it's based on
>> streaming pipelines, but the ideas are still useful.
>>
>>
>> https://cloud.google.com/blog/products/gcp/how-to-handle-mutating-json-schemas-in-a-streaming-pipeline-with-square-enix
>>
>> Cheers
>>
>> Reza
>>
>> On Thu, 29 Nov 2018 at 06:53, Joe Cullen <jo...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I have a pipeline reading CSV files, performing some transforms, and
>>> writing to BigQuery. At the moment I'm reading the BigQuery schema from a
>>> separate JSON file. If the CSV files had a new column added (and I wanted
>>> to include this column in the resultant BigQuery table), I'd have to change
>>> the JSON schema or the pipeline itself. Is there any way to autodetect the
>>> schema using BigQueryIO? How do people normally deal with potential changes
>>> to input CSVs?
>>>
>>> Thanks,
>>> Joe
>>>
>>


Re: Inferring Csv Schemas

Posted by Reza Rokni <re...@google.com>.
Hi Joe,

That part of the blog could have been written a bit more clearly... I blame the
writer ;-) So while that solution worked, it was inefficient; this is
discussed in the next paragraph. Essentially, checking the validity of the
schema on every element is not efficient, especially as the schema is normally
fine. The next paragraph reads:



*However, this design could not make use of the inbuilt efficiencies that
BigQueryIO provided, and also burdened us with technical debt. Chris then
tried various other tactics to beat the boss. In his words ... "The first
attempt at fixing this inefficiency was to remove the costly JSON schema
detection ‘DoFn’ which every metric goes through, and move it to a ‘failed
inserts’ section of the pipeline, which is only run when there are errors
on inserting into BigQuery,”*

Cheers
Reza

On Fri, 30 Nov 2018 at 09:01, Joe Cullen <jo...@gmail.com> wrote:

> Thanks Reza, that's really helpful!
>
> I have a few questions:
>
> "He used a GroupByKey function on the JSON type and then a manual check on
> the JSON schema against the known good BigQuery schema. If there was a
> difference, the schema would mutate and the updates would be pushed
> through."
>
> If the difference was that a new column had been added to the JSON elements,
> does there need to be any mutation? The JSON schema derived from the JSON
> elements would already have this new column, and if BigQuery allows for
> additive schema changes then this new JSON schema should be fine, right?
>
> But then I'm not sure how you would enter the 'failed inserts' section of
> the pipeline (as the insert should have been successful).
>
> Have I misunderstood what is being mutated?
>
> Thanks,
> Joe
>
> On Fri, 30 Nov 2018, 11:07 Reza Ardeshir Rokni <rarokni@gmail.com> wrote:
>
>> Hi Joe,
>>
>> You may find some of the info in this blog of interest; it's based on
>> streaming pipelines, but the ideas are still useful.
>>
>>
>> https://cloud.google.com/blog/products/gcp/how-to-handle-mutating-json-schemas-in-a-streaming-pipeline-with-square-enix
>>
>> Cheers
>>
>> Reza
>>
>> On Thu, 29 Nov 2018 at 06:53, Joe Cullen <jo...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I have a pipeline reading CSV files, performing some transforms, and
>>> writing to BigQuery. At the moment I'm reading the BigQuery schema from a
>>> separate JSON file. If the CSV files had a new column added (and I wanted
>>> to include this column in the resultant BigQuery table), I'd have to change
>>> the JSON schema or the pipeline itself. Is there any way to autodetect the
>>> schema using BigQueryIO? How do people normally deal with potential changes
>>> to input CSVs?
>>>
>>> Thanks,
>>> Joe
>>>
>>


Re: Inferring Csv Schemas

Posted by Joe Cullen <jo...@gmail.com>.
Thanks Reza, that's really helpful!

I have a few questions:

"He used a GroupByKey function on the JSON type and then a manual check on
the JSON schema against the known good BigQuery schema. If there was a
difference, the schema would mutate and the updates would be pushed
through."

If the difference was that a new column had been added to the JSON elements,
does there need to be any mutation? The JSON schema derived from the JSON
elements would already have this new column, and if BigQuery allows for
additive schema changes then this new JSON schema should be fine, right?

But then I'm not sure how you would enter the 'failed inserts' section of
the pipeline (as the insert should have been successful).

Have I misunderstood what is being mutated?

Thanks,
Joe
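
For illustration (not taken from the article), a small sketch of what that manual
check could look like: comparing the keys of a parsed row against the known good
schema to find any fields that would require an additive mutation. Names are
hypothetical.

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.HashSet;
import java.util.Set;

class SchemaCheck {
  // Returns field names present in the row but missing from the known good schema.
  static Set<String> newFields(TableRow row, TableSchema knownGoodSchema) {
    Set<String> known = new HashSet<>();
    for (TableFieldSchema field : knownGoodSchema.getFields()) {
      known.add(field.getName());
    }
    Set<String> extra = new HashSet<>(row.keySet());
    extra.removeAll(known);
    // A non-empty result means the table schema must be widened (for example via
    // the BigQuery API) before these rows can be inserted successfully.
    return extra;
  }
}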

On Fri, 30 Nov 2018, 11:07 Reza Ardeshir Rokni <rarokni@gmail.com> wrote:

> Hi Joe,
>
> You may find some of the info in this blog of interest; it's based on
> streaming pipelines, but the ideas are still useful.
>
>
> https://cloud.google.com/blog/products/gcp/how-to-handle-mutating-json-schemas-in-a-streaming-pipeline-with-square-enix
>
> Cheers
>
> Reza
>
> On Thu, 29 Nov 2018 at 06:53, Joe Cullen <jo...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I have a pipeline reading CSV files, performing some transforms, and
>> writing to BigQuery. At the moment I'm reading the BigQuery schema from a
>> separate JSON file. If the CSV files had a new column added (and I wanted
>> to include this column in the resultant BigQuery table), I'd have to change
>> the JSON schema or the pipeline itself. Is there any way to autodetect the
>> schema using BigQueryIO? How do people normally deal with potential changes
>> to input CSVs?
>>
>> Thanks,
>> Joe
>>
>

Re: Inferring Csv Schemas

Posted by Reza Ardeshir Rokni <ra...@gmail.com>.
Hi Joe,

You may find some of the info in this blog of interest; it's based on
streaming pipelines, but the ideas are still useful.

https://cloud.google.com/blog/products/gcp/how-to-handle-mutating-json-schemas-in-a-streaming-pipeline-with-square-enix

Cheers

Reza

On Thu, 29 Nov 2018 at 06:53, Joe Cullen <jo...@gmail.com> wrote:

> Hi all,
>
> I have a pipeline reading CSV files, performing some transforms, and
> writing to BigQuery. At the moment I'm reading the BigQuery schema from a
> separate JSON file. If the CSV files had a new column added (and I wanted
> to include this column in the resultant BigQuery table), I'd have to change
> the JSON schema or the pipeline itself. Is there any way to autodetect the
> schema using BigQueryIO? How do people normally deal with potential changes
> to input CSVs?
>
> Thanks,
> Joe
>