You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@systemml.apache.org by GitBox <gi...@apache.org> on 2020/06/09 18:27:10 UTC

[GitHub] [systemml] muehlburger opened a new pull request #971: [WIP] - Add protobuf support to write and read FrameBlocks to HDFS

muehlburger opened a new pull request #971:
URL: https://github.com/apache/systemml/pull/971


   This pull requests adds a first protobuf support to read and write FrameBlocks to HDFS.
   
   Appreciate any feedback.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemml] muehlburger commented on pull request #971: [WIP] - Add protobuf support to write and read FrameBlocks to HDFS

Posted by GitBox <gi...@apache.org>.
muehlburger commented on pull request #971:
URL: https://github.com/apache/systemml/pull/971#issuecomment-649504796


   Thank you! It was a pleasure!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemml] muehlburger commented on a change in pull request #971: [WIP] - Add protobuf support to write and read FrameBlocks to HDFS

Posted by GitBox <gi...@apache.org>.
muehlburger commented on a change in pull request #971:
URL: https://github.com/apache/systemml/pull/971#discussion_r438322892



##########
File path: pom.xml
##########
@@ -1021,6 +1021,25 @@
 			<artifactId>maven-gpg-plugin</artifactId>
 			<version>1.6</version>
 		</dependency>
+
+		<!-- https://github.com/protocolbuffers/protobuf/tree/master/java -->
+		<dependency>
+			<groupId>com.google.protobuf</groupId>
+			<artifactId>protobuf-java</artifactId>
+			<version>3.11.0</version>

Review comment:
       is fixed

##########
File path: src/main/java/org/apache/sysds/parser/dml/DmlBaseListener.java
##########
@@ -1,4 +1,4 @@
-// Generated from org\apache\sysds\parser\dml\Dml.g4 by ANTLR 4.5.3
+// Generated from org/apache/sysds/parser/dml/Dml.g4 by ANTLR 4.5.3

Review comment:
       should be fixed now.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemml] muehlburger commented on a change in pull request #971: [WIP] - Add protobuf support to write and read FrameBlocks to HDFS

Posted by GitBox <gi...@apache.org>.
muehlburger commented on a change in pull request #971:
URL: https://github.com/apache/systemml/pull/971#discussion_r438064395



##########
File path: src/test/java/org/apache/sysds/test/functions/io/proto/FrameReaderWriterProtoTest.java
##########
@@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.test.functions.io.proto;
+
+import java.io.IOException;
+import java.util.Map;
+import java.util.Random;
+
+import org.apache.commons.lang3.tuple.Pair;
+import org.apache.sysds.common.Types;
+import org.apache.sysds.runtime.io.FrameReader;
+import org.apache.sysds.runtime.io.FrameReaderFactory;
+import org.apache.sysds.runtime.io.FrameWriter;
+import org.apache.sysds.runtime.io.FrameWriterFactory;
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+import org.apache.sysds.runtime.util.DataConverter;
+import org.apache.sysds.test.TestUtils;
+import org.apache.wink.json4j.JSONException;
+import org.junit.Test;
+
+public class FrameReaderWriterProtoTest
+{
+	private static final String FILENAME_SINGLE = "target/testTemp/functions/data/FrameReaderWriterProtoTest/testFrameBlock.proto";
+
+	@Test
+	public void testWriteReadFrameBlockSingleSingleFromHDFS() throws IOException, JSONException {
+		testWriteReadFrameBlockComplete(1,1, 4669201);
+	}

Review comment:
       I refactored and simplified the tests:
   - removed unnecessary JSONException which is never thrown in this code (so it does not make sense to use inheritance any more).
   - simplified method calls
   
   With the current implementation it should be clear what happens in every test.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemml] muehlburger commented on a change in pull request #971: [WIP] - Add protobuf support to write and read FrameBlocks to HDFS

Posted by GitBox <gi...@apache.org>.
muehlburger commented on a change in pull request #971:
URL: https://github.com/apache/systemml/pull/971#discussion_r437972257



##########
File path: src/main/resources/protobuf/Frame.proto
##########
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+syntax = "proto3";
+package sysds;
+option java_package = "org.apache.sysds.protobuf";
+option java_outer_classname = "SysdsProtos";
+option java_multiple_files = false;
+
+message Frame {
+    repeated Row rows = 1;
+}
+
+message Row {
+    repeated string column_names = 1;
+    repeated string column_data = 2;
+    repeated Schema column_schema = 3;
+}
+
+message Schema {
+    enum ValueType {
+        FP32 = 0;
+        FP64 = 1;
+        INT32 = 2;
+        INT64 = 3;
+        BOOLEAN = 4;
+        STRING = 5;
+        UNKNOWN = 6;

Review comment:
       I need this enum specified in this .proto File. This is used to generate the language specific (in our case Java) glue code.
   
   This list is taken from [ValueType](https://github.com/apache/systemml/blob/master/src/main/java/org/apache/sysds/common/Types.java#L72)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemml] muehlburger commented on a change in pull request #971: [WIP] - Add protobuf support to write and read FrameBlocks to HDFS

Posted by GitBox <gi...@apache.org>.
muehlburger commented on a change in pull request #971:
URL: https://github.com/apache/systemml/pull/971#discussion_r437983995



##########
File path: src/test/java/org/apache/sysds/test/functions/io/proto/FrameReaderWriterProtoTest.java
##########
@@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.test.functions.io.proto;
+
+import java.io.IOException;
+import java.util.Map;
+import java.util.Random;
+
+import org.apache.commons.lang3.tuple.Pair;
+import org.apache.sysds.common.Types;
+import org.apache.sysds.runtime.io.FrameReader;
+import org.apache.sysds.runtime.io.FrameReaderFactory;
+import org.apache.sysds.runtime.io.FrameWriter;
+import org.apache.sysds.runtime.io.FrameWriterFactory;
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+import org.apache.sysds.runtime.util.DataConverter;
+import org.apache.sysds.test.TestUtils;
+import org.apache.wink.json4j.JSONException;
+import org.junit.Test;
+
+public class FrameReaderWriterProtoTest
+{
+	private static final String FILENAME_SINGLE = "target/testTemp/functions/data/FrameReaderWriterProtoTest/testFrameBlock.proto";
+
+	@Test
+	public void testWriteReadFrameBlockSingleSingleFromHDFS() throws IOException, JSONException {
+		testWriteReadFrameBlockComplete(1,1, 4669201);
+	}

Review comment:
       Good point, will refactor that.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemml] Baunsgaard commented on a change in pull request #971: [WIP] - Add protobuf support to write and read FrameBlocks to HDFS

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on a change in pull request #971:
URL: https://github.com/apache/systemml/pull/971#discussion_r437914794



##########
File path: src/main/java/org/apache/sysds/parser/dml/DmlBaseListener.java
##########
@@ -1,4 +1,4 @@
-// Generated from org\apache\sysds\parser\dml\Dml.g4 by ANTLR 4.5.3
+// Generated from org/apache/sysds/parser/dml/Dml.g4 by ANTLR 4.5.3

Review comment:
       These files change because they are generated on your system not windows.
   If you revert the change on you branch in a new commit, and thereafter look inside our git ignore.
   then it will show how to ignore these files for future commits :).
   
   Same comment for all `src/main/java/org/apache/sysds/parser/dml/XXX.java` files.

##########
File path: pom.xml
##########
@@ -1021,6 +1021,25 @@
 			<artifactId>maven-gpg-plugin</artifactId>
 			<version>1.6</version>
 		</dependency>
+
+		<!-- https://github.com/protocolbuffers/protobuf/tree/master/java -->
+		<dependency>
+			<groupId>com.google.protobuf</groupId>
+			<artifactId>protobuf-java</artifactId>
+			<version>3.11.0</version>

Review comment:
       It is not 100% updated on their GitHub.
   cold you try the 3.12.X out?
   <https://mvnrepository.com/artifact/com.google.protobuf/protobuf-java>
   
   

##########
File path: src/main/java/org/apache/sysds/runtime/io/FrameReaderFactory.java
##########
@@ -58,6 +58,9 @@ public static FrameReader createFrameReader( FileFormat fmt, FileFormatPropertie
 				else
 					reader = new FrameReaderBinaryBlock();
 				break;
+			case PROTO:
+				reader = new FrameReaderProto();
+				break;

Review comment:
       I know it is out of scope, but could you check out if a parallel readers is possible, if so you don't have to add it but just write a TODO specifying it is, or if not write not possible and reason?

##########
File path: src/test/java/org/apache/sysds/test/functions/io/proto/FrameReaderWriterProtoTest.java
##########
@@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.test.functions.io.proto;
+
+import java.io.IOException;
+import java.util.Map;
+import java.util.Random;
+
+import org.apache.commons.lang3.tuple.Pair;
+import org.apache.sysds.common.Types;
+import org.apache.sysds.runtime.io.FrameReader;
+import org.apache.sysds.runtime.io.FrameReaderFactory;
+import org.apache.sysds.runtime.io.FrameWriter;
+import org.apache.sysds.runtime.io.FrameWriterFactory;
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+import org.apache.sysds.runtime.util.DataConverter;
+import org.apache.sysds.test.TestUtils;
+import org.apache.wink.json4j.JSONException;
+import org.junit.Test;
+
+public class FrameReaderWriterProtoTest
+{
+	private static final String FILENAME_SINGLE = "target/testTemp/functions/data/FrameReaderWriterProtoTest/testFrameBlock.proto";
+
+	@Test
+	public void testWriteReadFrameBlockSingleSingleFromHDFS() throws IOException, JSONException {
+		testWriteReadFrameBlockComplete(1,1, 4669201);
+	}

Review comment:
       Since these tests are the same as in `i0/json/freameReaderWriterJSONLTest`.
   You could make it into a class that extends and overwrites the json FrameReaderWriter, or another super class for both of them. 

##########
File path: src/main/java/org/apache/sysds/protobuf/SysdsProtos.java
##########
@@ -0,0 +1,2985 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+// Generated by the protocol buffer compiler.  DO NOT EDIT!
+// source: src/main/resources/protobuf/Frame.proto
+
+package org.apache.sysds.protobuf;
+
+public final class SysdsProtos {
+  private SysdsProtos() {}

Review comment:
       - If you change indentation to Tabs that would be great I don't know if it is possible while it is auto generated.. 
   - If you write the Generated in the first line above license then we have the same format as the DML files auto-generated
   - Also we have a code-style file in `/dev/docs/CodeStyle_eclipse.xml`, that could be applied again if possible

##########
File path: src/main/resources/protobuf/Frame.proto
##########
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+syntax = "proto3";
+package sysds;
+option java_package = "org.apache.sysds.protobuf";
+option java_outer_classname = "SysdsProtos";
+option java_multiple_files = false;
+
+message Frame {
+    repeated Row rows = 1;
+}
+
+message Row {
+    repeated string column_names = 1;
+    repeated string column_data = 2;
+    repeated Schema column_schema = 3;
+}
+
+message Schema {
+    enum ValueType {
+        FP32 = 0;
+        FP64 = 1;
+        INT32 = 2;
+        INT64 = 3;
+        BOOLEAN = 4;
+        STRING = 5;
+        UNKNOWN = 6;

Review comment:
       Is this list somewhat the same as in `src/main/python/systemds/onnx_systemds/onnx_helper.py`? 
   or rather is there some predetermined specific schema needed, and why is this specific one used here?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemml] Baunsgaard commented on pull request #971: [WIP] - Add protobuf support to write and read FrameBlocks to HDFS

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on pull request #971:
URL: https://github.com/apache/systemml/pull/971#issuecomment-645418101


   Also A nice SystemDS name of the PR and add it to a task inside `/docs/dev/task.txt`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemml] asfgit closed pull request #971: [WIP] - Add protobuf support to write and read FrameBlocks to HDFS

Posted by GitBox <gi...@apache.org>.
asfgit closed pull request #971:
URL: https://github.com/apache/systemml/pull/971


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemml] Baunsgaard commented on a change in pull request #971: [WIP] - Add protobuf support to write and read FrameBlocks to HDFS

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on a change in pull request #971:
URL: https://github.com/apache/systemml/pull/971#discussion_r441597515



##########
File path: src/main/java/org/apache/sysds/runtime/io/FrameWriterFactory.java
##########
@@ -55,6 +55,10 @@ public static FrameWriter createFrameWriter( FileFormat fmt, FileFormatPropertie
 				else
 					writer = new FrameWriterBinaryBlock();
 				break;
+
+			case PROTO:
+				writer = new FrameWriterProto();

Review comment:
       maybe add a parallel todo again.

##########
File path: src/main/java/org/apache/sysds/runtime/io/FrameWriterProto.java
##########
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.runtime.io;
+
+import java.io.IOException;
+import java.io.OutputStream;
+import java.util.Arrays;
+import java.util.Iterator;
+
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.sysds.conf.ConfigurationManager;
+import org.apache.sysds.protobuf.SysdsProtos;
+import org.apache.sysds.runtime.DMLRuntimeException;
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+import org.apache.sysds.runtime.util.HDFSTool;
+
+public class FrameWriterProto extends FrameWriter {
+	@Override
+	public void writeFrameToHDFS(FrameBlock src, String fname, long rlen, long clen)
+		throws IOException, DMLRuntimeException {
+		// prepare file access
+		JobConf job = new JobConf(ConfigurationManager.getCachedJobConf());
+		Path path = new Path(fname);
+
+		// if the file already exists on HDFS, remove it.
+		HDFSTool.deleteFileIfExistOnHDFS(fname);
+
+		// validity check frame dimensions
+		if(src.getNumRows() != rlen || src.getNumColumns() != clen) {
+			throw new IOException("Frame dimensions mismatch with metadata: " + src.getNumRows() + "x"
+				+ src.getNumColumns() + " vs " + rlen + "x" + clen + ".");
+		}
+
+		writeProtoFrameToHDFS(path, job, src, rlen, clen);
+	}
+
+	protected void writeProtoFrameToHDFS(Path path, JobConf jobConf, FrameBlock src, long rlen, long clen)
+		throws IOException {
+		FileSystem fileSystem = IOUtilFunctions.getFileSystem(path, jobConf);
+		writeProtoFrameToFile(path, fileSystem, src, 0, (int) rlen);
+		IOUtilFunctions.deleteCrcFilesFromLocalFileSystem(fileSystem, path);
+	}
+
+	protected void writeProtoFrameToFile(Path path, FileSystem fileSystem, FrameBlock src, int lowerRowBound,
+		int upperRowBound) throws IOException {
+		// what about > 2GB Protobuf Messages?

Review comment:
       does this mean that >2GB is impossible?

##########
File path: src/main/java/org/apache/sysds/runtime/io/FrameReaderProto.java
##########
@@ -0,0 +1,86 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sysds.runtime.io;
+
+import java.io.IOException;
+import java.io.InputStream;
+
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.mapred.FileInputFormat;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.sysds.common.Types;
+import org.apache.sysds.conf.ConfigurationManager;
+import org.apache.sysds.protobuf.SysdsProtos;
+import org.apache.sysds.runtime.DMLRuntimeException;
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+import org.apache.sysds.runtime.util.UtilFunctions;
+
+public class FrameReaderProto extends FrameReader {
+	@Override
+	public FrameBlock readFrameFromHDFS(String fname, Types.ValueType[] schema, String[] names, long rlen, long clen)
+		throws IOException, DMLRuntimeException {

Review comment:
       Runtime Exceptions does not need to be declared as thrown.

##########
File path: src/main/resources/protobuf/Frame.proto
##########
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+syntax = "proto3";
+package sysds;
+option java_package = "org.apache.sysds.protobuf";
+option java_outer_classname = "SysdsProtos";
+option java_multiple_files = false;
+
+message Frame {
+    repeated Row rows = 1;
+}
+
+message Row {
+    repeated string column_names = 1;
+    repeated string column_data = 2;
+    repeated Schema column_schema = 3;
+}
+
+message Schema {
+    enum ValueType {
+        FP32 = 0;
+        FP64 = 1;
+        INT32 = 2;
+        INT64 = 3;
+        BOOLEAN = 4;
+        STRING = 5;
+        UNKNOWN = 6;
+    }
+    repeated ValueType valueType = 1;
+}

Review comment:
       Just to make Git happy, add a single newline




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemml] muehlburger commented on a change in pull request #971: [WIP] - Add protobuf support to write and read FrameBlocks to HDFS

Posted by GitBox <gi...@apache.org>.
muehlburger commented on a change in pull request #971:
URL: https://github.com/apache/systemml/pull/971#discussion_r445350487



##########
File path: src/main/java/org/apache/sysds/runtime/io/FrameWriterProto.java
##########
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.runtime.io;
+
+import java.io.IOException;
+import java.io.OutputStream;
+import java.util.Arrays;
+import java.util.Iterator;
+
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.sysds.conf.ConfigurationManager;
+import org.apache.sysds.protobuf.SysdsProtos;
+import org.apache.sysds.runtime.DMLRuntimeException;
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+import org.apache.sysds.runtime.util.HDFSTool;
+
+public class FrameWriterProto extends FrameWriter {
+	@Override
+	public void writeFrameToHDFS(FrameBlock src, String fname, long rlen, long clen)
+		throws IOException, DMLRuntimeException {
+		// prepare file access
+		JobConf job = new JobConf(ConfigurationManager.getCachedJobConf());
+		Path path = new Path(fname);
+
+		// if the file already exists on HDFS, remove it.
+		HDFSTool.deleteFileIfExistOnHDFS(fname);
+
+		// validity check frame dimensions
+		if(src.getNumRows() != rlen || src.getNumColumns() != clen) {
+			throw new IOException("Frame dimensions mismatch with metadata: " + src.getNumRows() + "x"
+				+ src.getNumColumns() + " vs " + rlen + "x" + clen + ".");
+		}
+
+		writeProtoFrameToHDFS(path, job, src, rlen, clen);
+	}
+
+	protected void writeProtoFrameToHDFS(Path path, JobConf jobConf, FrameBlock src, long rlen, long clen)
+		throws IOException {
+		FileSystem fileSystem = IOUtilFunctions.getFileSystem(path, jobConf);
+		writeProtoFrameToFile(path, fileSystem, src, 0, (int) rlen);
+		IOUtilFunctions.deleteCrcFilesFromLocalFileSystem(fileSystem, path);
+	}
+
+	protected void writeProtoFrameToFile(Path path, FileSystem fileSystem, FrameBlock src, int lowerRowBound,
+		int upperRowBound) throws IOException {
+		// what about > 2GB Protobuf Messages?

Review comment:
       "Protobuf has a hard limit of 2GB, because many implementations use 32-bit signed arithmetic." (see [StackOverflow question](https://stackoverflow.com/questions/34128872/google-protobuf-maximum-size#:~:text=Protobuf%20has%20a%20hard%20limit,manually%20if%20you%20need%20to.)).
   
   I'm not sure if this applies to us, as we use Java and no C++. 
   As every Frame Row is defined as message in [Frame.proto](https://github.com/apache/systemml/blob/0daa501240b14d8eb6cff07a6b7b38017dec0414/src/main/resources/protobuf/Frame.proto#L29) this would mean that the limit applies to every row in a frame.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemml] Baunsgaard commented on pull request #971: [WIP] - Add protobuf support to write and read FrameBlocks to HDFS

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on pull request #971:
URL: https://github.com/apache/systemml/pull/971#issuecomment-649463903


   Hi @muehlburger !
   Thanks for your contribution! :superhero: :star: 
   I have squashed all your commits into one, and intend to push this to master after the testsuite is done on my fork.
   The only change i made (hopefully) is to add to the comment of the 2GB limit. We will have to see if that is going to be a problem in java as well.
   
   I hope you enjoyed committing to opensource and we hope to see you contributing again another time.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemml] muehlburger commented on pull request #971: [WIP] - Add protobuf support to write and read FrameBlocks to HDFS

Posted by GitBox <gi...@apache.org>.
muehlburger commented on pull request #971:
URL: https://github.com/apache/systemml/pull/971#issuecomment-645366329


   @Baunsgaard could you have another look at the changes? Would be happy to get your feedback! :-)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org