Posted to user@beam.apache.org by OrielResearch Eila Arich-Landkof <ei...@orielresearch.org> on 2019/02/09 18:49:12 UTC

Re: transpose CSV transform

Thanks! Very helpful.
Eila

On Mon, Jan 14, 2019 at 4:35 AM Robert Bradshaw <ro...@google.com> wrote:

> I am not aware of any built-in transform that can do this; however, it
> should not be that difficult to do with a group-by-key.
>
> Suppose one reads in the CSV file to a PCollection of dictionaries of the
> format {'original_column_1': value1, 'original_column_2': value2, ...}.
> Suppose further that original_column_N is the index column (whose values
> will become the new column names). To compute the transpose, you can use the
> PTransform
>
> import apache_beam as beam
>
> class Transpose(beam.PTransform):
>     def __init__(self, index_column):
>         self._index_column = index_column
>
>     def expand(self, pcoll):
>         return (pcoll
>             # Map to tuples of the form (column_name, (index, value)).
>             | beam.FlatMap(lambda original_row, ix_col: [
>                 (col, (original_row[ix_col], value))
>                 for col, value in original_row.items()
>                 if col != ix_col], self._index_column)
>             # Group all (index, value) pairs for a column together.
>             | beam.GroupByKey()
>             # Map to dictionaries of the form {index: value}, keeping the
>             # original column name under 'original_column_name'.
>             | beam.Map(lambda kv: dict(kv[1], original_column_name=kv[0])))
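>
> For instance, on a hypothetical two-row input {'id': 'r1', 'a': 1, 'b': 2}
> and {'id': 'r2', 'a': 3, 'b': 4} with index column 'id', this would yield
> {'r1': 1, 'r2': 3, 'original_column_name': 'a'} and
> {'r1': 2, 'r2': 4, 'original_column_name': 'b'}.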
>
> You can then apply this to your PCollection by writing
>
> transposed_pcoll = pcoll | Transpose('original_column_N')
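>
> A minimal end-to-end sketch of the reading step this assumes (the file
> name, the hard-coded column list, and the helper parse_csv_line are
> illustrative assumptions, not part of the answer above):
>
> import csv
> import io
>
> import apache_beam as beam
>
> def parse_csv_line(line, column_names):
>     # Parse one CSV line into a {column_name: value} dict.
>     return dict(zip(column_names, next(csv.reader(io.StringIO(line)))))
>
> columns = ['original_column_1', 'original_column_2', 'original_column_N']
>
> with beam.Pipeline() as p:
>     transposed_pcoll = (p
>         | beam.io.ReadFromText('input.csv', skip_header_lines=1)
>         | beam.Map(parse_csv_line, columns)
>         | Transpose('original_column_N'))
>     # A write to BigQuery (e.g. beam.io.WriteToBigQuery) would follow here,
>     # once the schema of the transposed rows is known.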
>
>
> On Sun, Jan 13, 2019 at 5:19 PM Sameer Abhyankar <sa...@google.com>
> wrote:
>
>> Hi Eila - While I am not aware of a transpose transform available for CSV
>> files, there is a sample pipeline available to transpose a BigQuery table
>> and write the results to a different table[1]. It might be possible to
>> modify this to work on a CSV source.
>>
>> [1]
>> https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/dataflow-bigquery-transpose
>>
>>
>> On Sun, Jan 13, 2019 at 1:58 AM OrielResearch Eila Arich-Landkof <
>> eila@orielresearch.org> wrote:
>>
>>> Hi all,
>>>
>>> I am working with many CSV files where the common part is the row names,
>>> and therefore my processing should be by columns. My plan is to have the
>>> tables transposed and have the combined tables written into BQ.
>>> So, the code should perform:
>>> 1. transpose the tables (columns -> new_rows, rows -> new_columns).
>>> new_rows x new_columns = new_table
>>> 2. extract the new_rows values from the new_tables and write them to
>>> BigQuery.
>>>
>>> Is there an easy way to transpose the CSV files? I am avoiding the use of
>>> the pandas library because the tables could be very large. Should I be
>>> concerned about the table size? Is this consideration relevant, or would
>>> Apache Beam be able to handle the resources for pandas?
>>>
>>> What are my other options? Is there any built-in transpose method that I
>>> am not aware of?
>>>
>>> Thanks for your help,
>>> --
>>> Eila
>>> www.orielresearch.org
>>> https://www.meetup.com/Deep-Learning-In-Production/
>>>
>>>
>>>

-- 
Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/