Posted to dev@systemds.apache.org by GitBox <gi...@apache.org> on 2022/09/17 21:23:05 UTC

[GitHub] [systemds] Baunsgaard opened a new pull request, #1697: [SYSTEMDS-2699] CLA IO Compressed Matrix

Baunsgaard opened a new pull request, #1697:
URL: https://github.com/apache/systemds/pull/1697

   This commit adds the basic building blocks for writing a compressed matrix to disk, along with a basic test that writes a matrix and reads it back from disk.
   
   Further testing and full integration into DML are still needed, as is a mechanism to detect whether the format of the compression groups has changed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [systemds] Baunsgaard commented on pull request #1697: [SYSTEMDS-2699] CLA IO Compressed Matrix

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on PR #1697:
URL: https://github.com/apache/systemds/pull/1697#issuecomment-1250143724

   @mboehm7 




[GitHub] [systemds] mboehm7 commented on pull request #1697: [SYSTEMDS-2699] CLA IO Compressed Matrix

Posted by GitBox <gi...@apache.org>.
mboehm7 commented on PR #1697:
URL: https://github.com/apache/systemds/pull/1697#issuecomment-1250322402

   Well, regarding the overall design, I would recommend following the existing binary format. We write sequence files of key-value (index-block) pairs from both local and distributed writers so that the files can be read in any execution mode. Right now it seems you directly serialize the entire block, similar to what the buffer pool eviction did. 
   
   A version ID at the beginning of the file/blocks is fine, but we should strive to keep the file layout static. Except when new encoding schemes are added, this should be possible. 
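
To make the index-block idea above concrete, here is a minimal Python sketch of a file that starts with a version ID and then stores key-value (index-block) records. The record layout, the `write_blocks` / `read_blocks` names, and the use of raw bytes as block payloads are assumptions for illustration only; this is not the actual SystemDS binary format or its sequence-file API.

```python
import io
import struct

# Hypothetical layout (an assumption, not the SystemDS binary format):
#   file header: 4-byte big-endian version ID
#   each record: row-block index, column-block index, payload length, payload
HEADER = struct.Struct(">i")
RECORD = struct.Struct(">qqi")


def write_blocks(stream, version, blocks):
    """Write a version ID followed by (index, block) pairs."""
    stream.write(HEADER.pack(version))
    for (row_idx, col_idx), payload in blocks.items():
        stream.write(RECORD.pack(row_idx, col_idx, len(payload)))
        stream.write(payload)


def read_blocks(stream):
    """Read the version ID and all (index, block) pairs back into a dict."""
    (version,) = HEADER.unpack(stream.read(HEADER.size))
    blocks = {}
    while True:
        head = stream.read(RECORD.size)
        if not head:
            break  # end of stream: no more records
        row_idx, col_idx, length = RECORD.unpack(head)
        blocks[(row_idx, col_idx)] = stream.read(length)
    return version, blocks
```

Because every record carries its own block index, a reader does not depend on the order in which the records were written, which is what a layout shared by local and distributed readers would need.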




[GitHub] [systemds] mboehm7 commented on pull request #1697: [SYSTEMDS-2699] CLA IO Compressed Matrix

Posted by GitBox <gi...@apache.org>.
mboehm7 commented on PR #1697:
URL: https://github.com/apache/systemds/pull/1697#issuecomment-1250339190

   Please do not add these special cases / workarounds to the compiler.




[GitHub] [systemds] Baunsgaard closed pull request #1697: [SYSTEMDS-2699] CLA IO Compressed Matrix

Posted by GitBox <gi...@apache.org>.
Baunsgaard closed pull request #1697: [SYSTEMDS-2699] CLA IO Compressed Matrix
URL: https://github.com/apache/systemds/pull/1697




[GitHub] [systemds] Baunsgaard commented on pull request #1697: [SYSTEMDS-2699] CLA IO Compressed Matrix

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on PR #1697:
URL: https://github.com/apache/systemds/pull/1697#issuecomment-1250345011

   Okay, let's see what I can do. I have a few ideas.




[GitHub] [systemds] Baunsgaard commented on pull request #1697: [SYSTEMDS-2699] CLA IO Compressed Matrix

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on PR #1697:
URL: https://github.com/apache/systemds/pull/1697#issuecomment-1250334572

   I was thinking of the index-block design used in binary, but since the compression framework compresses an entire matrix in CP, I would have to decompose the compression into multiple blocks if we want this, and in CP, reading would have to combine them again. 
   
   I think this overcomplicates things unless we somehow make the compression able to combine different blocks with the same compression plan. 
   
   Furthermore, if we write a compressed distributed block-indexed matrix to disk, we get multiple blocks with different formats that would not combine nicely in CP anyway, forcing such a read to result in SP instructions.
   
   In the end, these problems make reading and writing the same way as binary blocks a bit challenging, especially if you want the same behavior.
   But I can suggest we always treat the compressed format as an index-block based file with a block size >= nCols && nRows ;)
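
The last suggestion can be checked with a little arithmetic: in a b x b blocked layout, the number of blocks per dimension is the dimension divided by the block size, rounded up, so a block size at least as large as both dimensions always yields exactly one block. A small Python sketch (the function name `block_grid` and the example sizes are hypothetical):

```python
def block_grid(n_rows, n_cols, block_size):
    """Return (number of row blocks, number of column blocks)."""
    # Integer ceiling division, so partially filled edge blocks are counted.
    row_blocks = (n_rows + block_size - 1) // block_size
    col_blocks = (n_cols + block_size - 1) // block_size
    return row_blocks, col_blocks


# A 2500 x 300 matrix with block size 1000 needs a 3 x 1 grid of blocks.
print(block_grid(2500, 300, 1000))
# With block size >= max(nRows, nCols), the whole matrix is a single block.
print(block_grid(2500, 300, 2500))
```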
   




[GitHub] [systemds] mboehm7 commented on pull request #1697: [SYSTEMDS-2699] CLA IO Compressed Matrix

Posted by GitBox <gi...@apache.org>.
mboehm7 commented on PR #1697:
URL: https://github.com/apache/systemds/pull/1697#issuecomment-1250343324

   Thanks - as I said, strive for clear semantics first and don't worry about performance/suboptimal compression ratios. Writing out b x b blocks according to the CP compression scheme is fine (with splitting of column groups across block boundaries). When reading b x b compressed blocks, take the compression plan of the first blocks that touch individual columns, and then merge the remaining blocks in. Once the initial version is ready and fully operational, we can talk about performance to minimize reallocations, etc.
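
As a rough sketch of the write side only, splitting column groups at column-block boundaries could look like the following Python snippet. Column groups are modeled as plain lists of 0-based column indices, and `split_column_groups` is a hypothetical helper, not part of the CLA code. The merge step on read is not shown.

```python
def split_column_groups(groups, block_size):
    """Split each column group at column-block boundaries.

    groups: list of lists of 0-based global column indices.
    Returns a dict mapping column-block index -> list of sub-groups,
    where column indices are rebased to be local to that block.
    """
    per_block = {}
    for group in groups:
        # Partition this group's columns by the block they fall into.
        parts = {}
        for col in group:
            parts.setdefault(col // block_size, []).append(col % block_size)
        for block_idx, local_cols in parts.items():
            per_block.setdefault(block_idx, []).append(local_cols)
    return per_block


# Three column groups over six columns, written with a block size of 4:
# the group [0, 1, 5] crosses the boundary and is split into two parts.
print(split_column_groups([[0, 1, 5], [2, 3], [4]], 4))
```

A group that lies entirely inside one block is passed through unchanged apart from the index rebasing, so only groups that straddle a boundary pay for the split.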




[GitHub] [systemds] Baunsgaard commented on pull request #1697: [SYSTEMDS-2699] CLA IO Compressed Matrix

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on PR #1697:
URL: https://github.com/apache/systemds/pull/1697#issuecomment-1250340915

   Agreed. Hence I was asking for suggestions. 




[GitHub] [systemds] Baunsgaard commented on pull request #1697: [SYSTEMDS-2699] CLA IO Compressed Matrix

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on PR #1697:
URL: https://github.com/apache/systemds/pull/1697#issuecomment-1250345974

   Thanks for the help




[GitHub] [systemds] Baunsgaard commented on pull request #1697: [SYSTEMDS-2699] CLA IO Compressed Matrix

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on PR #1697:
URL: https://github.com/apache/systemds/pull/1697#issuecomment-1250143469

   Since the compression format has a tendency to change, the files written will not always be fully supported across different versions. One way to detect changes or incompatible versions is to write an identifier at the beginning of the files:
   
   - GitHash
   - SystemDS version number
   
   Since the GitHash is not available at all times, we could use the SystemDS version number as a fallback. I do not personally like either solution; maybe someone else has some suggestions?
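
A minimal Python sketch of the fallback idea, assuming the identifier is stored as a short string at the start of the file. The `git:` / `version:` prefixes, the function names, and the example values are made up for illustration and are not an existing SystemDS convention.

```python
def file_identifier(git_hash, version):
    """Prefer the git hash; fall back to the version number if it is missing."""
    if git_hash:
        return "git:" + git_hash
    return "version:" + version


def is_compatible(stored_identifier, git_hash, version):
    """Treat a file as readable only if its identifier matches the running build."""
    return stored_identifier == file_identifier(git_hash, version)


# Hypothetical values: a build without git metadata falls back to its version.
print(file_identifier("abc123", "3.0.0"))
print(file_identifier(None, "3.0.0"))
```

An exact-match check like this is deliberately strict: it rejects files from any other build, including builds whose compression format did not actually change.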
   
   Other design decisions:
   
   1. For distributed, I intend to simply write each compressed block to a different file, like we already do.
   2. Parallel reading and writing could be done with many files; for instance, I could split each column group into a separate file instead of multiple blocks. Perhaps someone has some experience or ideas?
   
   Help / Comments appreciated
   

