Posted to user@beam.apache.org by OrielResearch Eila Arich-Landkof <ei...@orielresearch.org> on 2019/01/13 06:58:13 UTC

transpose CSV transform

Hi all,

I am working with many CSV files whose common part is the row names, and
therefore my processing should be done by columns. My plan is to transpose
the tables and write the combined tables into BQ.
So, the code should:
1. transpose the tables (columns -> new_rows, rows -> new_columns), giving
new_rows x new_columns = new_table
2. extract the new_rows values from the new tables and write them to
BigQuery.

Is there an easy way to transpose the CSV files? I am avoiding the pandas
library because the tables could be very large. Should I be concerned about
the table size? Is this consideration relevant, or would Apache Beam be
able to handle the resources for pandas?

What are my other options? Is there a built-in transpose method that I am
not aware of?

Thanks for your help,
-- 
Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/

Re: transpose CSV transform

Posted by OrielResearch Eila Arich-Landkof <ei...@orielresearch.org>.
Thanks! Very helpful.
Eila

-- 
Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/

Re: transpose CSV transform

Posted by Robert Bradshaw <ro...@google.com>.
I am not aware of any built-in transform that can do this; however, it
should not be that difficult to do with a group-by-key.

Suppose one reads the CSV files into a PCollection of dictionaries of the
form {'original_column_1': value1, 'original_column_2': value2, ...}.
Suppose further that original_column_N is the index column (whose values
will become the new column names). To compute the transpose you can use the
PTransform

class Transpose(beam.PTransform):
    def __init__(self, index_column):
        self._index_column = index_column

    def expand(self, pcoll):
        return (pcoll
            # Map each row to tuples of the form (column_name, (index, value)).
            | beam.FlatMap(lambda original_row, ix_col: [
                (col, (original_row[ix_col], value))
                for col, value in original_row.items()
                if col != ix_col], self._index_column)
            # Group all (index, value) pairs for each column together.
            | beam.GroupByKey()
            # Map each group to a dictionary of the form {index: value},
            # recording the original column name as well.
            | beam.Map(lambda kv: dict(kv[1], original_column_name=kv[0])))
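
For instance, with 'original_column_N' as the index column, two
hypothetical input rows such as

    {'original_column_N': 'row1', 'a': 1, 'b': 2}
    {'original_column_N': 'row2', 'a': 3, 'b': 4}

would come out as one output dictionary per original column:

    {'original_column_name': 'a', 'row1': 1, 'row2': 3}
    {'original_column_name': 'b', 'row1': 2, 'row2': 4}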

You can then apply this to your PCollection by writing

transposed_pcoll = pcoll | Transpose('original_column_N')
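
For completeness, here is a minimal sketch of how the whole pipeline might
fit together. The file pattern, column names, and BigQuery table below are
made up for illustration, and the sketch assumes the destination table
already exists with the transposed schema (BigQuery needs the new column
names, i.e. the index values, known in advance and valid as field names):

import csv

import apache_beam as beam

# Hypothetical column names; in practice, derive these from the CSV header.
COLUMNS = ['original_column_1', 'original_column_2', 'original_column_N']

def parse_csv_line(line):
    # Parse one CSV line into a {column_name: value} dictionary.
    return dict(zip(COLUMNS, next(csv.reader([line]))))

with beam.Pipeline() as p:
    (p
     | beam.io.ReadFromText('gs://my-bucket/tables/*.csv',
                            skip_header_lines=1)
     | beam.Map(parse_csv_line)
     | Transpose('original_column_N')
     | beam.io.WriteToBigQuery(
         'my_project:my_dataset.transposed_table',
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
         create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))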



Re: transpose CSV transform

Posted by Sameer Abhyankar <sa...@google.com>.
Hi Eila - While I am not aware of a transpose transform available for CSV
files, there is a sample pipeline available to transpose a BigQuery table
and write the results to a different table[1]. It might be possible to
modify this to work on a CSV source.

[1]
https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/dataflow-bigquery-transpose

