Posted to dev@beam.apache.org by Unais T <tp...@gmail.com> on 2018/11/25 12:09:35 UTC

Reading CSV from google cloud storage to Data Flow

Hey guys,

One question.

I want to read a CSV file from Google Cloud Storage into Dataflow.
Which is the best method?

1.   Load the CSV into BigQuery and then read it with the BigQuerySource method
2.   Read from Cloud Storage directly into Dataflow (is there a source
method for CSV files on Cloud Storage, like `ReadFromText`?)

What's the best way to read CSV from Cloud Storage into Dataflow?

Re: Reading CSV from google cloud storage to Data Flow

Posted by Robert Bradshaw <ro...@google.com>.
The same holds true in Python: Read the files with TextIO and follow with a
Map operation that splits the lines into records.
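
A minimal sketch of that line-by-line approach, assuming each record fits on
one line and the files have a single header row. The names parse_csv_line and
run, and the file pattern passed to run, are hypothetical. The stdlib csv
module is used instead of line.split(',') so that quoted fields containing
commas survive:

```python
import csv


def parse_csv_line(line):
    """Parse one line of CSV text into a list of fields.

    Assumes the whole record is on this one line (no embedded newlines).
    """
    return next(csv.reader([line]))


def run(pattern):
    """Read every file matching `pattern` and print the parsed records."""
    # Imported here so that parse_csv_line itself has no Beam dependency.
    import apache_beam as beam

    with beam.Pipeline() as p:
        (p
         # skip_header_lines=1 is an assumption: one header row per file.
         | 'Read' >> beam.io.ReadFromText(pattern, skip_header_lines=1)
         | 'Parse' >> beam.Map(parse_csv_line)
         | 'Print' >> beam.Map(print))
```

It could be invoked as run('gs://some-bucket/some-dir/*.csv'), where the
bucket and directory are placeholders.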

This, of course, only works if you don't have newlines within your records.
If you do, you may need to use a DoFn that takes each filename as input and
reads the entire file (e.g. using the standard library csv parsers),
emitting the records (possibly followed by a Reshuffle), e.g.

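(p
 | beam.Create(filenames)  # filenames: a list of file paths
 # newline='' lets the csv module handle newlines inside quoted fields.
 # Note: the builtin open() reads local paths only.
 | beam.FlatMap(lambda path: csv.reader(open(path, newline='')))
 | beam.Reshuffle()
 | ...)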
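
Since the files in question live on Cloud Storage, here is a hedged sketch of
the same whole-file idea using Beam's FileSystems.open, which accepts gs://
paths as well as local ones. It assumes the files are UTF-8 and that the
binary stream FileSystems.open returns can be wrapped in io.TextIOWrapper.
The names read_csv_records, read_file and run are hypothetical:

```python
import csv
import io


def read_csv_records(binary_file, encoding='utf-8'):
    """Yield each CSV record from a binary file-like object as a list of fields.

    newline='' leaves line endings to the csv module, which is what allows
    quoted fields to contain embedded newlines.
    """
    text = io.TextIOWrapper(binary_file, encoding=encoding, newline='')
    for row in csv.reader(text):
        yield row


def run(filenames):
    """Read each file in `filenames` in full and print its records."""
    # Imported here so that read_csv_records has no Beam dependency.
    import apache_beam as beam
    from apache_beam.io.filesystems import FileSystems

    def read_file(path):
        # One file is read start to finish by a single worker.
        with FileSystems.open(path) as f:
            yield from read_csv_records(f)

    with beam.Pipeline() as p:
        (p
         | beam.Create(filenames)
         | beam.FlatMap(read_file)
         # Redistribute the records so later steps are not tied to one file per worker.
         | beam.Reshuffle()
         | beam.Map(print))
```

Each file is still parsed by one worker, so this only helps when individual
files are small enough for that.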

If your files are too big to read in a single mapper *and* have newlines
within records, you may have to implement something like
https://blog.etleap.com/2016/11/27/distributed-csv-parsing/


On Sun, Nov 25, 2018 at 2:29 PM Unais T <tp...@gmail.com> wrote:

> Python

Re: Reading CSV from google cloud storage to Data Flow

Posted by Unais T <tp...@gmail.com>.
Python

On Sun, Nov 25, 2018 at 4:54 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi Unais,
>
> Which SDK do you plan to use? Java or Python?
>
> For Java, I would use TextIO directly.
>
> Regards
> JB

Re: Reading CSV from google cloud storage to Data Flow

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Unais,

Which SDK do you plan to use? Java or Python?

For Java, I would use TextIO directly.

Regards
JB

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com