Posted to user@beam.apache.org by Eila Oriel Research <ei...@orielresearch.org> on 2021/12/22 11:37:11 UTC

Creating a dense matrix from a sparse matrix using Apache Beam

Hi all,

I am working with the Market Exchange Format (MEX).
A quick explanation:
it is a method for saving the values of a high-dimensional sparse matrix
in smaller files.
It consists of 3 files:
- a row names file, mapping names to indexes: (name_r1,1) (name_r2,2)
- a column names file, mapping names to indexes: (name_c1,1) (name_c2,2)
- a values file: the row and column coordinates (indexes) and the value at
that position in the matrix, e.g.
1,2,5 (the value in row 1, col 2 is 5)

The values file is large and is zipped to save space.
Right now I am reading the values file line by line, filtering for the
data I am interested in based on the coordinate values, and saving it to a
Python dataframe. Of course, this takes forever.
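
For reference, the gist of what I am doing now on a single machine is
roughly this (a sketch only; the file name and the set of wanted rows are
made up, and I am assuming gzip compression):

    import gzip
    import pandas as pd

    wanted_rows = {1, 5, 42}  # made-up row indexes of interest

    records = []
    with gzip.open("matrix_values.csv.gz", "rt") as f:
        for line in f:
            row, col, value = line.strip().split(",")
            if int(row) in wanted_rows:
                records.append((int(row), int(col), float(value)))

    df = pd.DataFrame(records, columns=["row", "col", "value"])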

Is there a way to do this with Apache Beam?
I know how to read the values from a zipped file and use a DoFn to filter
the relevant values, but the sink part is not clear to me. What would be
the best way to sink the data easily?
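
The read-and-filter part I have in mind looks roughly like this (a sketch
only, using the Map/Filter shorthands for DoFns; the bucket path and the
row filter are made up, and I am assuming a gzip-compressed text file,
which ReadFromText decompresses automatically based on the extension):

    import apache_beam as beam

    def parse_line(line):
        row, col, value = line.strip().split(",")
        return int(row), int(col), float(value)

    with beam.Pipeline() as p:
        filtered = (
            p
            | beam.io.ReadFromText("gs://my-bucket/matrix_values.csv.gz")
            | beam.Map(parse_line)
            | beam.Filter(lambda rcv: rcv[0] in {1, 5, 42})  # made-up filter
        )
        # ... and this is where I am unsure what the best sink would be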

Please let me know what you think.

Best,
-- 

Eila, Founder & CEO

Check out ORT’s new blog <https://www.orielresearch.com/blog>

Linkedin  <https://www.linkedin.com/company/oriel-research-therapeutics/>

Re: Creating a dense matrix from a sparse matrix using Apache Beam

Posted by Brian Hulette <bh...@google.com>.
I'm assuming you're using the Python SDK since you're talking about Python
dataframes; let me know if that's not the case.
This sounds like something the DataFrame API [1] may be able to help with:
it supports common pandas sinks (e.g. to_parquet [2]). Would it be
acceptable to partition the dataset by row, so that the data is processed
across multiple nodes and written out as separate files? For that to work
we'd need to know the column names when constructing the pipeline, and we
can only partition by row, not by column. So if the data is very wide, or
you can't read the column names independently, that approach may not work.
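
Roughly, that could look like the following (a sketch only; the paths,
column names, and filter predicate are placeholders for your actual data):

    import apache_beam as beam
    from apache_beam.dataframe.io import read_csv

    with beam.Pipeline() as p:
        # read_csv produces a deferred dataframe backed by the pipeline
        df = p | read_csv(
            "gs://my-bucket/matrix_values.csv",
            names=["row", "col", "value"])
        # placeholder predicate: keep only the rows of interest
        wanted = df[df.row <= 100]
        # write with a standard pandas sink, distributed across workers
        wanted.to_parquet("gs://my-bucket/output/values")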

Some other ideas:
- If the data is too wide for the DataFrame API approach, you may be able
to use ParquetIO [3] to write a distributed Parquet dataset that is
amenable to loading into a dataframe later (see the sketch after this
list).
- If you're using interactive Beam in a notebook, you can use ib.collect()
[4] to bring all of the data in a PCollection into notebook memory as a
pandas DataFrame.
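
For the ParquetIO option, a rough sketch (the schema, paths, and line
parsing below are placeholders for whatever your data actually needs):

    import apache_beam as beam
    import pyarrow as pa

    # placeholder schema for (row, col, value) records
    schema = pa.schema([
        ("row", pa.int64()),
        ("col", pa.int64()),
        ("value", pa.float64()),
    ])

    def to_record(line):
        row, col, value = line.strip().split(",")
        return {"row": int(row), "col": int(col), "value": float(value)}

    with beam.Pipeline() as p:
        (
            p
            | beam.io.ReadFromText("gs://my-bucket/matrix_values.csv.gz")
            | beam.Map(to_record)
            | beam.io.WriteToParquet(
                "gs://my-bucket/output/values", schema)
        )

And with interactive Beam, ib.collect() is just:

    from apache_beam.runners.interactive import interactive_beam as ib

    # materializes the PCollection into notebook memory as a pandas DataFrame
    df = ib.collect(filtered)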

Brian

[1] https://beam.apache.org/documentation/dsls/dataframes/overview/
[2] https://beam.apache.org/releases/pydoc/2.34.0/apache_beam.dataframe.frames.html#apache_beam.dataframe.frames.DeferredDataFrame.to_parquet
[3] https://beam.apache.org/releases/pydoc/2.34.0/apache_beam.io.parquetio.html
[4] https://beam.apache.org/releases/pydoc/2.34.0/apache_beam.runners.interactive.interactive_beam.html#apache_beam.runners.interactive.interactive_beam.collect
