You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Henrique Molina <he...@gmail.com> on 2019/04/15 19:47:16 UTC

Hi, some sample about Extracting data from Xlsx ?

Hello

I would like to use best practices from Apache Beams to read Xlsx. however
I found examples only related with Cs extension.
someone there is sample using ParDo to Collect all columns and sheets from
Excel xlsx ?
Afterwards I will put into google Big query.

Thanks & Regards

Re: Hi, some sample about Extracting data from Xlsx ?

Posted by Matt Casters <ma...@gmail.com>.
Kettle indeed uses POI for xlsx but you can configure it in the Excel Input
step.  Kettle on Apache Beam would read the file(s) in a single thread as
discussed earlier on the user beam mailing list.
You can download a version with Beam over here:  http://www.kettle.be/

Cheers,
Matt
---
Matt Casters <m <mc...@gmail.com>
Senior Solution Architect, Kettle Project Founder




Op di 16 apr. 2019 om 23:39 schreef Pablo Estrada <pa...@google.com>:

> Hm I am not very familiar with POI, but if its transforms are able to take
> in a file descriptor, you should be able to use FileIO.match()[0] to find
> your files (local, or in GCS/S3/HDFS); and FileIO.readMatches()[1] to get
> file descriptors for these files.
>
> If the POI libraries require the files to be local in your machine, you
> may need to use FileSystems.copy[2] to move your files locally, and then
> analyze them.
>
> Let me know if those are some useful building blocks for your pipeline,
> Best
> -P.
>
> [0]
> https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/FileIO.html#match--
>
> [1]
> https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/FileIO.html#readMatches--
> [2]
> https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/FileSystems.html#copy-java.util.List-java.util.List-org.apache.beam.sdk.io.fs.MoveOptions...-
>
>
> On Mon, Apr 15, 2019 at 6:20 PM Henrique Molina <he...@gmail.com>
> wrote:
>
>> Hi Pablo ,
>> Thanks for your attention,
>> I so sorry, my bad written "Cs extension " I did means .csv extension !
>> The example like this: load-csv-file-from-google-cloud-storage
>> <https://kontext.tech/docs/DataAndBusinessIntelligence/p/load-csv-file-from-google-cloud-storage-to-bigquery-using-dataflow>
>>
>> I was think Using apache POI to read each row from sheet  throwing to
>> next ParDo an CellRow rows
>> same like that:
>> .apply("xlsxToMap", ParDo.of(new DoFn<CellRow, Map<String,String>()
>> {.....
>>
>> I don't know if it is more ellegant...
>>
>> If your have some Idea ! let me know . it will be welcome!!
>>
>>
>> On Mon, Apr 15, 2019 at 6:01 PM Pablo Estrada <pa...@google.com> wrote:
>>
>>> Hello Henrique,
>>>
>>> I am not aware of existing Beam transforms specifically used for reading
>>> in XLSX data. Can you share what you mean by "examples related with Cs
>>> extension"?
>>>
>>> I am aware of some Python libraries foir this sort of thing[1]. You
>>> could use the FileIO transforms in the Python SDK to find each file, and
>>> then write a DoFn that is able to read in data from these files. Check out
>>> this unit test using FileIO to read CSV files[2].
>>>
>>> Let me know if that helps, or if I went on the wrong direction of what
>>> you needed.
>>> Best
>>> -P.
>>>
>>> [1] https://openpyxl.readthedocs.io/en/stable/
>>> [2]
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio_test.py#L128-L148
>>>
>>> On Mon, Apr 15, 2019 at 12:47 PM Henrique Molina <
>>> henrique.molina@gmail.com> wrote:
>>>
>>>> Hello
>>>>
>>>> I would like to use best practices from Apache Beams to read Xlsx.
>>>> however I found examples only related with Cs extension.
>>>> someone there is sample using ParDo to Collect all columns and sheets
>>>> from Excel xlsx ?
>>>> Afterwards I will put into google Big query.
>>>>
>>>> Thanks & Regards
>>>>
>>>>
>>>

Re: Hi, some sample about Extracting data from Xlsx ?

Posted by Pablo Estrada <pa...@google.com>.
Hm I am not very familiar with POI, but if its transforms are able to take
in a file descriptor, you should be able to use FileIO.match()[0] to find
your files (local, or in GCS/S3/HDFS); and FileIO.readMatches()[1] to get
file descriptors for these files.

If the POI libraries require the files to be local in your machine, you may
need to use FileSystems.copy[2] to move your files locally, and then
analyze them.

Let me know if those are some useful building blocks for your pipeline,
Best
-P.

[0]
https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/FileIO.html#match--

[1]
https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/FileIO.html#readMatches--
[2]
https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/FileSystems.html#copy-java.util.List-java.util.List-org.apache.beam.sdk.io.fs.MoveOptions...-


On Mon, Apr 15, 2019 at 6:20 PM Henrique Molina <he...@gmail.com>
wrote:

> Hi Pablo ,
> Thanks for your attention,
> I so sorry, my bad written "Cs extension " I did means .csv extension !
> The example like this: load-csv-file-from-google-cloud-storage
> <https://kontext.tech/docs/DataAndBusinessIntelligence/p/load-csv-file-from-google-cloud-storage-to-bigquery-using-dataflow>
>
> I was think Using apache POI to read each row from sheet  throwing to next
> ParDo an CellRow rows
> same like that:
> .apply("xlsxToMap", ParDo.of(new DoFn<CellRow, Map<String,String>() {.....
>
> I don't know if it is more ellegant...
>
> If your have some Idea ! let me know . it will be welcome!!
>
>
> On Mon, Apr 15, 2019 at 6:01 PM Pablo Estrada <pa...@google.com> wrote:
>
>> Hello Henrique,
>>
>> I am not aware of existing Beam transforms specifically used for reading
>> in XLSX data. Can you share what you mean by "examples related with Cs
>> extension"?
>>
>> I am aware of some Python libraries foir this sort of thing[1]. You could
>> use the FileIO transforms in the Python SDK to find each file, and then
>> write a DoFn that is able to read in data from these files. Check out this
>> unit test using FileIO to read CSV files[2].
>>
>> Let me know if that helps, or if I went on the wrong direction of what
>> you needed.
>> Best
>> -P.
>>
>> [1] https://openpyxl.readthedocs.io/en/stable/
>> [2]
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio_test.py#L128-L148
>>
>> On Mon, Apr 15, 2019 at 12:47 PM Henrique Molina <
>> henrique.molina@gmail.com> wrote:
>>
>>> Hello
>>>
>>> I would like to use best practices from Apache Beams to read Xlsx.
>>> however I found examples only related with Cs extension.
>>> someone there is sample using ParDo to Collect all columns and sheets
>>> from Excel xlsx ?
>>> Afterwards I will put into google Big query.
>>>
>>> Thanks & Regards
>>>
>>>
>>

Re: Hi, some sample about Extracting data from Xlsx ?

Posted by Pablo Estrada <pa...@google.com>.
Hm I am not very familiar with POI, but if its transforms are able to take
in a file descriptor, you should be able to use FileIO.match()[0] to find
your files (local, or in GCS/S3/HDFS); and FileIO.readMatches()[1] to get
file descriptors for these files.

If the POI libraries require the files to be local in your machine, you may
need to use FileSystems.copy[2] to move your files locally, and then
analyze them.

Let me know if those are some useful building blocks for your pipeline,
Best
-P.

[0]
https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/FileIO.html#match--

[1]
https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/FileIO.html#readMatches--
[2]
https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/FileSystems.html#copy-java.util.List-java.util.List-org.apache.beam.sdk.io.fs.MoveOptions...-


On Mon, Apr 15, 2019 at 6:20 PM Henrique Molina <he...@gmail.com>
wrote:

> Hi Pablo ,
> Thanks for your attention,
> I so sorry, my bad written "Cs extension " I did means .csv extension !
> The example like this: load-csv-file-from-google-cloud-storage
> <https://kontext.tech/docs/DataAndBusinessIntelligence/p/load-csv-file-from-google-cloud-storage-to-bigquery-using-dataflow>
>
> I was think Using apache POI to read each row from sheet  throwing to next
> ParDo an CellRow rows
> same like that:
> .apply("xlsxToMap", ParDo.of(new DoFn<CellRow, Map<String,String>() {.....
>
> I don't know if it is more ellegant...
>
> If your have some Idea ! let me know . it will be welcome!!
>
>
> On Mon, Apr 15, 2019 at 6:01 PM Pablo Estrada <pa...@google.com> wrote:
>
>> Hello Henrique,
>>
>> I am not aware of existing Beam transforms specifically used for reading
>> in XLSX data. Can you share what you mean by "examples related with Cs
>> extension"?
>>
>> I am aware of some Python libraries foir this sort of thing[1]. You could
>> use the FileIO transforms in the Python SDK to find each file, and then
>> write a DoFn that is able to read in data from these files. Check out this
>> unit test using FileIO to read CSV files[2].
>>
>> Let me know if that helps, or if I went on the wrong direction of what
>> you needed.
>> Best
>> -P.
>>
>> [1] https://openpyxl.readthedocs.io/en/stable/
>> [2]
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio_test.py#L128-L148
>>
>> On Mon, Apr 15, 2019 at 12:47 PM Henrique Molina <
>> henrique.molina@gmail.com> wrote:
>>
>>> Hello
>>>
>>> I would like to use best practices from Apache Beams to read Xlsx.
>>> however I found examples only related with Cs extension.
>>> someone there is sample using ParDo to Collect all columns and sheets
>>> from Excel xlsx ?
>>> Afterwards I will put into google Big query.
>>>
>>> Thanks & Regards
>>>
>>>
>>

Re: Hi, some sample about Extracting data from Xlsx ?

Posted by Henrique Molina <he...@gmail.com>.
Hi Pablo ,
Thanks for your attention,
I so sorry, my bad written "Cs extension " I did means .csv extension !
The example like this: load-csv-file-from-google-cloud-storage
<https://kontext.tech/docs/DataAndBusinessIntelligence/p/load-csv-file-from-google-cloud-storage-to-bigquery-using-dataflow>

I was think Using apache POI to read each row from sheet  throwing to next
ParDo an CellRow rows
same like that:
.apply("xlsxToMap", ParDo.of(new DoFn<CellRow, Map<String,String>() {.....

I don't know if it is more ellegant...

If your have some Idea ! let me know . it will be welcome!!


On Mon, Apr 15, 2019 at 6:01 PM Pablo Estrada <pa...@google.com> wrote:

> Hello Henrique,
>
> I am not aware of existing Beam transforms specifically used for reading
> in XLSX data. Can you share what you mean by "examples related with Cs
> extension"?
>
> I am aware of some Python libraries foir this sort of thing[1]. You could
> use the FileIO transforms in the Python SDK to find each file, and then
> write a DoFn that is able to read in data from these files. Check out this
> unit test using FileIO to read CSV files[2].
>
> Let me know if that helps, or if I went on the wrong direction of what you
> needed.
> Best
> -P.
>
> [1] https://openpyxl.readthedocs.io/en/stable/
> [2]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio_test.py#L128-L148
>
> On Mon, Apr 15, 2019 at 12:47 PM Henrique Molina <
> henrique.molina@gmail.com> wrote:
>
>> Hello
>>
>> I would like to use best practices from Apache Beams to read Xlsx.
>> however I found examples only related with Cs extension.
>> someone there is sample using ParDo to Collect all columns and sheets
>> from Excel xlsx ?
>> Afterwards I will put into google Big query.
>>
>> Thanks & Regards
>>
>>
>

Re: Hi, some sample about Extracting data from Xlsx ?

Posted by Henrique Molina <he...@gmail.com>.
Hi Pablo ,
Thanks for your attention,
I so sorry, my bad written "Cs extension " I did means .csv extension !
The example like this: load-csv-file-from-google-cloud-storage
<https://kontext.tech/docs/DataAndBusinessIntelligence/p/load-csv-file-from-google-cloud-storage-to-bigquery-using-dataflow>

I was think Using apache POI to read each row from sheet  throwing to next
ParDo an CellRow rows
same like that:
.apply("xlsxToMap", ParDo.of(new DoFn<CellRow, Map<String,String>() {.....

I don't know if it is more ellegant...

If your have some Idea ! let me know . it will be welcome!!


On Mon, Apr 15, 2019 at 6:01 PM Pablo Estrada <pa...@google.com> wrote:

> Hello Henrique,
>
> I am not aware of existing Beam transforms specifically used for reading
> in XLSX data. Can you share what you mean by "examples related with Cs
> extension"?
>
> I am aware of some Python libraries foir this sort of thing[1]. You could
> use the FileIO transforms in the Python SDK to find each file, and then
> write a DoFn that is able to read in data from these files. Check out this
> unit test using FileIO to read CSV files[2].
>
> Let me know if that helps, or if I went on the wrong direction of what you
> needed.
> Best
> -P.
>
> [1] https://openpyxl.readthedocs.io/en/stable/
> [2]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio_test.py#L128-L148
>
> On Mon, Apr 15, 2019 at 12:47 PM Henrique Molina <
> henrique.molina@gmail.com> wrote:
>
>> Hello
>>
>> I would like to use best practices from Apache Beams to read Xlsx.
>> however I found examples only related with Cs extension.
>> someone there is sample using ParDo to Collect all columns and sheets
>> from Excel xlsx ?
>> Afterwards I will put into google Big query.
>>
>> Thanks & Regards
>>
>>
>

Re: Hi, some sample about Extracting data from Xlsx ?

Posted by Pablo Estrada <pa...@google.com>.
Hello Henrique,

I am not aware of existing Beam transforms specifically used for reading in
XLSX data. Can you share what you mean by "examples related with Cs
extension"?

I am aware of some Python libraries foir this sort of thing[1]. You could
use the FileIO transforms in the Python SDK to find each file, and then
write a DoFn that is able to read in data from these files. Check out this
unit test using FileIO to read CSV files[2].

Let me know if that helps, or if I went on the wrong direction of what you
needed.
Best
-P.

[1] https://openpyxl.readthedocs.io/en/stable/
[2]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio_test.py#L128-L148

On Mon, Apr 15, 2019 at 12:47 PM Henrique Molina <he...@gmail.com>
wrote:

> Hello
>
> I would like to use best practices from Apache Beams to read Xlsx. however
> I found examples only related with Cs extension.
> someone there is sample using ParDo to Collect all columns and sheets from
> Excel xlsx ?
> Afterwards I will put into google Big query.
>
> Thanks & Regards
>
>

Re: Hi, some sample about Extracting data from Xlsx ?

Posted by Pablo Estrada <pa...@google.com>.
Hello Henrique,

I am not aware of existing Beam transforms specifically used for reading in
XLSX data. Can you share what you mean by "examples related with Cs
extension"?

I am aware of some Python libraries foir this sort of thing[1]. You could
use the FileIO transforms in the Python SDK to find each file, and then
write a DoFn that is able to read in data from these files. Check out this
unit test using FileIO to read CSV files[2].

Let me know if that helps, or if I went on the wrong direction of what you
needed.
Best
-P.

[1] https://openpyxl.readthedocs.io/en/stable/
[2]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio_test.py#L128-L148

On Mon, Apr 15, 2019 at 12:47 PM Henrique Molina <he...@gmail.com>
wrote:

> Hello
>
> I would like to use best practices from Apache Beams to read Xlsx. however
> I found examples only related with Cs extension.
> someone there is sample using ParDo to Collect all columns and sheets from
> Excel xlsx ?
> Afterwards I will put into google Big query.
>
> Thanks & Regards
>
>