You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@systemds.apache.org by GitBox <gi...@apache.org> on 2022/07/27 01:36:48 UTC

[GitHub] [systemds] fathollahzadeh opened a new pull request, #1672: [GIO] New Implementation of IOGEN

fathollahzadeh opened a new pull request, #1672:
URL: https://github.com/apache/systemds/pull/1672

   This PR is a new implementation of the GIO (generating readers for custom datasets). In the new implementation, removed all hard-coded implementations for flat datasets and replaced them with code gen. One of the primary goals of GIO is to support single and multi-row representations of tuples in source datasets. This PR is for supporting both of them.
   An outline of the PR is:
   
   It supports Matrix and Frame
   It supports nested and flat datasets
   It supports non-standard datasets, i.e., an incomplete number of cols in a CSV file


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [systemds] fathollahzadeh closed pull request #1672: [GIO] New Implementation of IOGEN

Posted by "fathollahzadeh (via GitHub)" <gi...@apache.org>.
fathollahzadeh closed pull request #1672: [GIO] New Implementation of IOGEN
URL: https://github.com/apache/systemds/pull/1672


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [systemds] fathollahzadeh commented on a diff in pull request #1672: [GIO] New Implementation of IOGEN

Posted by GitBox <gi...@apache.org>.
fathollahzadeh commented on code in PR #1672:
URL: https://github.com/apache/systemds/pull/1672#discussion_r954119302


##########
src/main/java/org/apache/sysds/runtime/iogen/FormatIdentifying.java:
##########
@@ -0,0 +1,1270 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.runtime.iogen;
+
+import org.apache.sysds.lops.Lop;
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+import org.apache.sysds.runtime.matrix.data.MatrixBlock;
+import org.apache.sysds.runtime.matrix.data.Pair;
+
+import java.util.ArrayList;
+import java.util.BitSet;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.Set;
+
+public class FormatIdentifying {

Review Comment:
   Renamed, thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [systemds] Baunsgaard commented on a diff in pull request #1672: [GIO] New Implementation of IOGEN

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on code in PR #1672:
URL: https://github.com/apache/systemds/pull/1672#discussion_r953945189


##########
src/main/java/org/apache/sysds/lops/compile/Dag.java:
##########
@@ -304,9 +306,9 @@ private ArrayList<Instruction> doPlainInstructionGen(StatementBlock sb, List<Lop
 		
 		// filter out non-executable nodes
 		List<Lop> execNodes = nodes.stream()
-			.filter(l -> (!l.isDataExecLocation() 
+			.filter(l -> (!l.isDataIOGenExecLocation() && !l.isReaderGenExecLocation() && (!l.isDataExecLocation()

Review Comment:
   make a helper method in `l` that get this boolean out, (you do this rewrite multiple times.)
   `!l.isDataIOGenExecLocation() && !l.isReaderGenExecLocation()`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [systemds] fathollahzadeh commented on a diff in pull request #1672: [GIO] New Implementation of IOGEN

Posted by GitBox <gi...@apache.org>.
fathollahzadeh commented on code in PR #1672:
URL: https://github.com/apache/systemds/pull/1672#discussion_r954074191


##########
src/main/java/org/apache/sysds/runtime/io/FrameReaderJSONL.java:
##########
@@ -128,7 +128,8 @@ private static String getStringFromJSONPath(JSONObject jsonObject, String path)
 
 		}
 		if(temp == null){
-			throw new IOException("Could not traverse the JSON path: '" + path + "'!");
+			return null;
+			//throw new IOException("Could not traverse the JSON path: '" + path + "'!");

Review Comment:
   I have to revert the modification of this file; I did it just for a temporary check and it wrongly committed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [systemds] fathollahzadeh commented on a diff in pull request #1672: [GIO] New Implementation of IOGEN

Posted by GitBox <gi...@apache.org>.
fathollahzadeh commented on code in PR #1672:
URL: https://github.com/apache/systemds/pull/1672#discussion_r954006399


##########
src/main/java/org/apache/sysds/hops/DataOp.java:
##########
@@ -283,20 +308,28 @@ public Lop constructLops()
 		for (Entry<String, Integer> cur : _paramIndexMap.entrySet()) {
 			inputLops.put(cur.getKey(), getInput().get(cur.getValue()).constructLops());
 		}
+		if(_ioGenRead)
+			inputLops.put("iogenformat", _generateReaderOp.constructLops());
 
 		// Create the lop
 		switch(_op) 
 		{
 			case TRANSIENTREAD:
-				l = new Data(_op, null, inputLops, getName(), null, 
-						getDataType(), getValueType(), getFileFormat());
+				if(!_ioGenRead)
+					l = new Data(_op, null, inputLops, getName(), null, getDataType(), getValueType(), getFileFormat());
+				else
+					l = new DataIOGen(_op, null, inputLops, getName(), null, getDataType(), getValueType(), getIOGenFormat());
 				setOutputDimensions(l);
 				break;
 				
 			case PERSISTENTREAD:
-				l = new Data(_op, null, inputLops, getName(), null, 
-						getDataType(), getValueType(), getFileFormat());
-				l.getOutputParameters().setDimensions(getDim1(), getDim2(), _inBlocksize, getNnz(), getUpdateType());
+				if(!_ioGenRead){
+					l = new Data(_op, null, inputLops, getName(), null, getDataType(), getValueType(), getFileFormat());
+					l.getOutputParameters().setDimensions(getDim1(), getDim2(), _inBlocksize, getNnz(), getUpdateType());
+				}
+				else
+					l = new DataIOGen(_op, null, inputLops, getName(), null, getDataType(), getValueType(), getIOGenFormat());

Review Comment:
   our design for IOGEN is like this:
   read(sample=x, sample_raw=$3,  format=$4, data_type="matrix")
   this read doesn't have any output like the current SystemsDS regular reading.  We are saving the JAVA source code in "format=$4" and due the next readers call that format, it is compiled and used. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [systemds] Baunsgaard commented on a diff in pull request #1672: [GIO] New Implementation of IOGEN

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on code in PR #1672:
URL: https://github.com/apache/systemds/pull/1672#discussion_r953935719


##########
src/main/java/org/apache/sysds/hops/DataOp.java:
##########
@@ -283,20 +308,28 @@ public Lop constructLops()
 		for (Entry<String, Integer> cur : _paramIndexMap.entrySet()) {
 			inputLops.put(cur.getKey(), getInput().get(cur.getValue()).constructLops());
 		}
+		if(_ioGenRead)
+			inputLops.put("iogenformat", _generateReaderOp.constructLops());
 
 		// Create the lop
 		switch(_op) 
 		{
 			case TRANSIENTREAD:
-				l = new Data(_op, null, inputLops, getName(), null, 
-						getDataType(), getValueType(), getFileFormat());
+				if(!_ioGenRead)
+					l = new Data(_op, null, inputLops, getName(), null, getDataType(), getValueType(), getFileFormat());
+				else
+					l = new DataIOGen(_op, null, inputLops, getName(), null, getDataType(), getValueType(), getIOGenFormat());
 				setOutputDimensions(l);
 				break;
 				
 			case PERSISTENTREAD:
-				l = new Data(_op, null, inputLops, getName(), null, 
-						getDataType(), getValueType(), getFileFormat());
-				l.getOutputParameters().setDimensions(getDim1(), getDim2(), _inBlocksize, getNnz(), getUpdateType());
+				if(!_ioGenRead){
+					l = new Data(_op, null, inputLops, getName(), null, getDataType(), getValueType(), getFileFormat());
+					l.getOutputParameters().setDimensions(getDim1(), getDim2(), _inBlocksize, getNnz(), getUpdateType());
+				}
+				else
+					l = new DataIOGen(_op, null, inputLops, getName(), null, getDataType(), getValueType(), getIOGenFormat());

Review Comment:
   I think it would be good if you set the output parameters(dimensions) if known. (like in the not IOGen above).
   
   The previous (before your additions) is bad design setting the variables after the call to the method, but it could be you do it inside your constructor? if you do not i suggest to use the method on line 328.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [systemds] fathollahzadeh commented on a diff in pull request #1672: [GIO] New Implementation of IOGEN

Posted by GitBox <gi...@apache.org>.
fathollahzadeh commented on code in PR #1672:
URL: https://github.com/apache/systemds/pull/1672#discussion_r953999101


##########
src/main/java/org/apache/sysds/hops/DataOp.java:
##########
@@ -283,20 +308,28 @@ public Lop constructLops()
 		for (Entry<String, Integer> cur : _paramIndexMap.entrySet()) {
 			inputLops.put(cur.getKey(), getInput().get(cur.getValue()).constructLops());
 		}
+		if(_ioGenRead)
+			inputLops.put("iogenformat", _generateReaderOp.constructLops());
 
 		// Create the lop
 		switch(_op) 
 		{
 			case TRANSIENTREAD:
-				l = new Data(_op, null, inputLops, getName(), null, 
-						getDataType(), getValueType(), getFileFormat());
+				if(!_ioGenRead)
+					l = new Data(_op, null, inputLops, getName(), null, getDataType(), getValueType(), getFileFormat());
+				else
+					l = new DataIOGen(_op, null, inputLops, getName(), null, getDataType(), getValueType(), getIOGenFormat());

Review Comment:
   We are identifying the format and generating corresponding readers to that and then reusing it multiple times, I think it can be Transient.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [systemds] Baunsgaard commented on a diff in pull request #1672: [GIO] New Implementation of IOGEN

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on code in PR #1672:
URL: https://github.com/apache/systemds/pull/1672#discussion_r953964952


##########
src/main/java/org/apache/sysds/runtime/iogen/CustomProperties.java:
##########
@@ -19,117 +19,149 @@
 
 package org.apache.sysds.runtime.iogen;
 
-import org.apache.commons.logging.Log;
-import org.apache.commons.logging.LogFactory;
+import com.google.gson.Gson;
+import org.apache.sysds.common.Types;
+import org.apache.sysds.runtime.DMLRuntimeException;
 import org.apache.sysds.runtime.io.FileFormatProperties;
-
+import org.apache.sysds.runtime.matrix.data.Pair;
+import java.io.IOException;
 import java.io.Serializable;
+import java.nio.file.Files;
+import java.nio.file.Paths;
 import java.util.HashSet;
 
 public class CustomProperties extends FileFormatProperties implements Serializable {

Review Comment:
   a comment on this class, to specify what custom properties means would be nice.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [systemds] Baunsgaard commented on a diff in pull request #1672: [GIO] New Implementation of IOGEN

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on code in PR #1672:
URL: https://github.com/apache/systemds/pull/1672#discussion_r953950627


##########
src/main/java/org/apache/sysds/lops/compile/Dag.java:
##########
@@ -646,6 +649,13 @@ else if (node.getType() == Lop.Type.Nary) {
 					inst_string = node.getInstructions(inputs, 
 						node.getOutputParameters().getLabel());
 				}
+				else if(node instanceof ReaderGen)
+					inst_string = node.getInstructions();
+
+				else if(node instanceof DataIOGen){
+					inst_string = node.getInstructions();
+				}

Review Comment:
   These two else ifs should be combined. (and you need to decide if you want `{` or not)



##########
src/main/java/org/apache/sysds/lops/compile/Dag.java:
##########
@@ -380,10 +382,11 @@ private static List<Instruction> deleteUpdatedTransientReadVariables(StatementBl
 		
 		// first capture all transient read variables
 		for ( Lop node : nodeV ) {
-			if (node.isDataExecLocation()
+			if ((node.isDataIOGenExecLocation() && node.getDataType() == DataType.MATRIX)
+				|| (node.isDataExecLocation()

Review Comment:
   Format to clarify



##########
src/main/java/org/apache/sysds/runtime/io/FrameReaderJSONL.java:
##########
@@ -128,7 +128,8 @@ private static String getStringFromJSONPath(JSONObject jsonObject, String path)
 
 		}
 		if(temp == null){
-			throw new IOException("Could not traverse the JSON path: '" + path + "'!");
+			return null;
+			//throw new IOException("Could not traverse the JSON path: '" + path + "'!");

Review Comment:
   is the exception not better here?



##########
src/main/java/org/apache/sysds/lops/compile/Dag.java:
##########
@@ -466,13 +469,13 @@ private static List<Instruction> generateRemoveInstructions(StatementBlock sb) {
 	private static ArrayList<Instruction> generateInstructionsForInputVariables(List<Lop> nodes_v) {
 		ArrayList<Instruction> insts = new ArrayList<>();
 		for(Lop n : nodes_v) {
-			if (n.isDataExecLocation() 
+			if (n.isReaderGenExecLocation() || n.isDataIOGenExecLocation() || ( n.isDataExecLocation()

Review Comment:
   format



##########
src/main/java/org/apache/sysds/lops/compile/Dag.java:
##########
@@ -304,9 +306,9 @@ private ArrayList<Instruction> doPlainInstructionGen(StatementBlock sb, List<Lop
 		
 		// filter out non-executable nodes
 		List<Lop> execNodes = nodes.stream()
-			.filter(l -> (!l.isDataExecLocation() 
+			.filter(l -> (!l.isDataIOGenExecLocation() && !l.isReaderGenExecLocation() && (!l.isDataExecLocation()

Review Comment:
   is the parenthesis set correctly?



##########
src/main/java/org/apache/sysds/lops/compile/Dag.java:
##########
@@ -646,6 +649,13 @@ else if (node.getType() == Lop.Type.Nary) {
 					inst_string = node.getInstructions(inputs, 
 						node.getOutputParameters().getLabel());
 				}
+				else if(node instanceof ReaderGen)
+					inst_string = node.getInstructions();
+
+				else if(node instanceof DataIOGen){
+					inst_string = node.getInstructions();
+				}

Review Comment:
   Perhaps make a method on the node that returns true if 
   `node instanceof DataIOGen || node instanceof ReaderGen`



##########
src/main/java/org/apache/sysds/runtime/iogen/FormatIdentifying.java:
##########
@@ -0,0 +1,1270 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.runtime.iogen;
+
+import org.apache.sysds.lops.Lop;
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+import org.apache.sysds.runtime.matrix.data.MatrixBlock;
+import org.apache.sysds.runtime.matrix.data.Pair;
+
+import java.util.ArrayList;
+import java.util.BitSet;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.Set;
+
+public class FormatIdentifying {

Review Comment:
   I would call the class  FormatIdentifyer, since usually classes would not be verbs.
     



##########
src/main/java/org/apache/sysds/hops/DataOp.java:
##########
@@ -283,20 +308,28 @@ public Lop constructLops()
 		for (Entry<String, Integer> cur : _paramIndexMap.entrySet()) {
 			inputLops.put(cur.getKey(), getInput().get(cur.getValue()).constructLops());
 		}
+		if(_ioGenRead)
+			inputLops.put("iogenformat", _generateReaderOp.constructLops());
 
 		// Create the lop
 		switch(_op) 
 		{
 			case TRANSIENTREAD:
-				l = new Data(_op, null, inputLops, getName(), null, 
-						getDataType(), getValueType(), getFileFormat());
+				if(!_ioGenRead)
+					l = new Data(_op, null, inputLops, getName(), null, getDataType(), getValueType(), getFileFormat());
+				else
+					l = new DataIOGen(_op, null, inputLops, getName(), null, getDataType(), getValueType(), getIOGenFormat());
 				setOutputDimensions(l);
 				break;
 				
 			case PERSISTENTREAD:
-				l = new Data(_op, null, inputLops, getName(), null, 
-						getDataType(), getValueType(), getFileFormat());
-				l.getOutputParameters().setDimensions(getDim1(), getDim2(), _inBlocksize, getNnz(), getUpdateType());
+				if(!_ioGenRead){
+					l = new Data(_op, null, inputLops, getName(), null, getDataType(), getValueType(), getFileFormat());
+					l.getOutputParameters().setDimensions(getDim1(), getDim2(), _inBlocksize, getNnz(), getUpdateType());
+				}
+				else
+					l = new DataIOGen(_op, null, inputLops, getName(), null, getDataType(), getValueType(), getIOGenFormat());

Review Comment:
   Here in IOGen, i think it would be good if you set the output parameters(dimensions) if known. (like in not IOGen).
   The previous is bad design setting the variables after the call to the method, but it could be you do it inside your constructor.
   



##########
src/main/java/org/apache/sysds/runtime/io/FrameReaderFactory.java:
##########
@@ -69,6 +72,22 @@ public static FrameReader createFrameReader(FileFormat fmt, FileFormatProperties
 				reader = new FrameReaderProto();
 				break;
 
+			case IOGEN:
+				CustomProperties customProperties = (CustomProperties) props;
+				String[] path = customProperties.getFormat().split("/");
+				String className = path[path.length -1].split("\\.")[0];
+				Pair<String, CustomProperties> loadSrcReaderAndProperties = customProperties.loadSrcReaderAndProperties();
+
+				GenerateReader.GenerateReaderFrame frm = new GenerateReader.GenerateReaderFrame(loadSrcReaderAndProperties.getValue(),
+					loadSrcReaderAndProperties.getKey(), className);
+				try {
+					reader = frm.getReader();
+				}
+				catch(Exception e) {
+					throw new DMLRuntimeException("IOGEN Matrix Reader Error: " + e);

Review Comment:
   not `+ e`, should be `, e`, to allow the stack trace to go through.



##########
src/main/java/org/apache/sysds/runtime/iogen/CustomProperties.java:
##########
@@ -19,117 +19,149 @@
 
 package org.apache.sysds.runtime.iogen;
 
-import org.apache.commons.logging.Log;
-import org.apache.commons.logging.LogFactory;
+import com.google.gson.Gson;
+import org.apache.sysds.common.Types;
+import org.apache.sysds.runtime.DMLRuntimeException;
 import org.apache.sysds.runtime.io.FileFormatProperties;
-
+import org.apache.sysds.runtime.matrix.data.Pair;
+import java.io.IOException;
 import java.io.Serializable;
+import java.nio.file.Files;
+import java.nio.file.Paths;
 import java.util.HashSet;
 
 public class CustomProperties extends FileFormatProperties implements Serializable {

Review Comment:
   a comment on this class, to specify what custom properties mean would be nice.



##########
src/main/java/org/apache/sysds/runtime/iogen/FormatIdentifying.java:
##########
@@ -0,0 +1,1270 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.runtime.iogen;
+
+import org.apache.sysds.lops.Lop;
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+import org.apache.sysds.runtime.matrix.data.MatrixBlock;
+import org.apache.sysds.runtime.matrix.data.Pair;
+
+import java.util.ArrayList;
+import java.util.BitSet;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.Set;
+
+public class FormatIdentifying {
+
+	private int[][] mapRow;
+	private int[][] mapCol;
+	private int[][] mapLen;
+	private int actualValueCount;
+	private MappingProperties mappingProperties;
+	private ArrayList<RawIndex> sampleRawIndexes;
+
+	private static int nrows;
+	private static int ncols;
+	private int nlines;
+
+	private int windowSize = 20;
+	private int suffixStringLength = 50;
+	private ReaderMapping mappingValues;
+	private CustomProperties properties;
+
+	public FormatIdentifying(String raw, MatrixBlock matrix) throws Exception {
+		this.mappingValues = new ReaderMapping(raw, matrix);
+		this.runIdentification();
+	}
+
+	public FormatIdentifying(String raw, FrameBlock frame) throws Exception {
+		this.mappingValues = new ReaderMapping(raw, frame);
+		this.runIdentification();
+	}
+
+	private void runIdentification() {

Review Comment:
   This method needs to be split into reasonable parts, since it is currently over 400 lines long.



##########
src/main/java/org/apache/sysds/runtime/io/MatrixReaderFactory.java:
##########
@@ -117,6 +121,27 @@ public static MatrixReader createMatrixReader( ReadProperties props )  {
 				reader = (par & mcsr) ? new ReaderHDF5Parallel(fileFormatPropertiesHDF5) : new ReaderHDF5(
 					fileFormatPropertiesHDF5);
 				break;
+			case IOGEN:
+				CustomProperties customProperties;
+				if(props.formatProperties != null) {
+					customProperties = (CustomProperties) props.formatProperties;
+				}
+				else {
+					throw new DMLRuntimeException("Failed to create matrix reader with NULL CustomProperties");
+				}
+				String[] path = customProperties.getFormat().split("/");
+				String className = path[path.length -1].split("\\.")[0];
+				Pair<String, CustomProperties> loadSrcReaderAndProperties = customProperties.loadSrcReaderAndProperties();
+
+				GenerateReader.GenerateReaderMatrix grm = new GenerateReader.GenerateReaderMatrix(loadSrcReaderAndProperties.getValue(),
+					loadSrcReaderAndProperties.getKey(), className);
+				try {
+					reader = grm.getReader();
+				}
+				catch(Exception e) {
+					throw new DMLRuntimeException("IOGEN Matrix Reader Error: " + e);

Review Comment:
   not `+ e`



##########
src/main/java/org/apache/sysds/lops/DataIOGen.java:
##########
@@ -0,0 +1,348 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.lops;
+
+import org.apache.sysds.common.Types.DataType;
+import org.apache.sysds.common.Types.ExecType;
+import org.apache.sysds.common.Types.FileFormat;
+import org.apache.sysds.common.Types.OpOpData;
+import org.apache.sysds.common.Types.ValueType;
+import org.apache.sysds.parser.DataExpression;
+
+import java.util.HashMap;
+
+/**
+ * Lop to represent data objects. Data objects represent matrices, vectors,
+ * variables, literals. Can be for both input and output.
+ */
+public class DataIOGen extends Lop {
+	public static final String PREAD_PREFIX = "pREAD";
+	private final String formatType;
+	private final OpOpData _op;
+	private final boolean literal_var;
+	private HashMap<String, Lop> _inputParams;
+
+	/**
+	 * Method to create literal LOPs.
+	 *
+	 * @param vt           value type
+	 * @param literalValue literal value
+	 * @return literal low-level operator
+	 */
+	public static DataIOGen createLiteralLop(ValueType vt, String literalValue) {
+		// All literals have default format type of TEXT
+		return new DataIOGen(OpOpData.PERSISTENTREAD, null, null, null, literalValue, DataType.SCALAR, vt, FileFormat.TEXT.toString());
+	}
+
+	/**
+	 * Constructor to setup read or write LOP
+	 * In case of write: <code>input</code> must be provided. This will always be added as the first element in <code>input</code> array.
+	 * For literals: this function is invoked through the static method <code>createLiteralLop</code>.
+	 *
+	 * @param op                  operation type
+	 * @param input               low-level operator
+	 * @param inputParametersLops input lops
+	 * @param name                string name
+	 * @param literal             string literal
+	 * @param dt                  data type
+	 * @param vt                  value type
+	 * @param fmt                 file format
+	 */
+	public DataIOGen(OpOpData op, Lop input, HashMap<String, Lop> inputParametersLops, String name, String literal, DataType dt, ValueType vt,
+		String fmt) {
+		super(Type.Data, dt, vt);
+		_op = op;
+		literal_var = (literal != null);
+
+		// Either <code>name</code> or <code>literal</code> can be non-null.
+		if(literal_var) {
+			if(_op.isTransient())
+				throw new LopsException("Invalid parameter values while setting up a Data LOP -- transient flag is invalid for a literal.");
+			getOutputParameters().setLabel(literal);
+		}
+		else if(name != null) {
+			if(_op.isTransient())
+				getOutputParameters().setLabel(name); // tvar+name
+			else {
+				String code = _op == OpOpData.FUNCTIONOUTPUT ? "" : _op.isRead() ? "pREAD" : "pWRITE";
+				getOutputParameters().setLabel(code + name);
+			}
+		}
+		else {
+			throw new LopsException("Invalid parameter values while setting up a Data LOP -- the lop must have either literal value or a name.");
+		}
+
+		// WRITE operation must have an input Lops, we always put this
+		// input Lops as the first element of WRITE input. The parameters of
+		// WRITE operation are then put as the following input elements.
+		if(input != null && op.isWrite()) {
+			addInput(input);
+			input.addOutput(this);
+		}
+
+		_inputParams = inputParametersLops;
+
+		if(_inputParams != null) {
+			for(Lop lop : inputParametersLops.values()) {
+				addInput(lop);
+				lop.addOutput(this);
+			}
+			if(inputParametersLops.get(DataExpression.IO_FILENAME) != null) {
+				OutputParameters outParams = (inputParametersLops.get(DataExpression.IO_FILENAME)).getOutputParameters();
+				String fName = outParams.getLabel();
+				this.getOutputParameters().setFile_name(fName);
+			}
+		}
+
+		//set output format
+		formatType = fmt;
+		//outParams.setFormat(fmt);
+		setLopProperties();
+	}
+
+	private void setLopProperties() {
+		lps.setProperties(inputs, ExecType.INVALID);
+	}
+
+	/**
+	 * Data-Lop-specific method to set the execution type for persistent write.
+	 * TODO: split lops into MR/CP lop.
+	 *
+	 * @param et execution type
+	 */
+	public void setExecType(ExecType et) {
+		lps.execType = et;
+	}
+
+	/**
+	 * method to get format type for input, output files.
+	 *
+	 * @return file format
+	 */
+	public String getFileFormatType() {
+		return formatType;
+	}
+
+	@Override
+	public String toString() {
+		return getID() + ":" + "File_Name: " + getOutputParameters().getFile_name() + " " + "Label: " + getOutputParameters().getLabel() + " " + "Operation: = " + _op + " " + "Format: " + outParams.getFormat() + " Datatype: " + getDataType() + " Valuetype: " + getValueType() + " num_rows = " + getOutputParameters().getNumRows() + " num_cols = " + getOutputParameters().getNumCols() + " UpdateInPlace: " + getOutputParameters().getUpdateType();
+	}
+
+	/**
+	 * method to get operation type, i.e. read/write.
+	 *
+	 * @return operation type
+	 */
+

Review Comment:
   formatting



##########
src/main/java/org/apache/sysds/lops/compile/Dag.java:
##########
@@ -466,13 +469,13 @@ private static List<Instruction> generateRemoveInstructions(StatementBlock sb) {
 	private static ArrayList<Instruction> generateInstructionsForInputVariables(List<Lop> nodes_v) {
 		ArrayList<Instruction> insts = new ArrayList<>();
 		for(Lop n : nodes_v) {
-			if (n.isDataExecLocation() 
+			if (n.isReaderGenExecLocation() || n.isDataIOGenExecLocation() || ( n.isDataExecLocation()
 				&& !((Data) n).getOperationType().isTransient()
 				&& ((Data) n).getOperationType().isRead()
 				&& (n.getDataType() == DataType.MATRIX || n.getDataType() == DataType.FRAME 
-				   || n.getDataType() == DataType.LIST) )
+				   || n.getDataType() == DataType.LIST)))
 			{
-				if ( !((Data)n).isLiteral() ) {
+				if ( n.isDataIOGenExecLocation() || n.isReaderGenExecLocation() || (n.isDataExecLocation() && !((Data)n).isLiteral()) ) {

Review Comment:
   format



##########
src/main/java/org/apache/sysds/runtime/iogen/FormatIdentifying.java:
##########
@@ -0,0 +1,1270 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.runtime.iogen;
+
+import org.apache.sysds.lops.Lop;
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+import org.apache.sysds.runtime.matrix.data.MatrixBlock;
+import org.apache.sysds.runtime.matrix.data.Pair;
+
+import java.util.ArrayList;
+import java.util.BitSet;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.Set;
+
+public class FormatIdentifying {
+
+	private int[][] mapRow;
+	private int[][] mapCol;
+	private int[][] mapLen;
+	private int actualValueCount;
+	private MappingProperties mappingProperties;
+	private ArrayList<RawIndex> sampleRawIndexes;
+
+	private static int nrows;
+	private static int ncols;
+	private int nlines;
+
+	private int windowSize = 20;
+	private int suffixStringLength = 50;
+	private ReaderMapping mappingValues;
+	private CustomProperties properties;
+
+	public FormatIdentifying(String raw, MatrixBlock matrix) throws Exception {
+		this.mappingValues = new ReaderMapping(raw, matrix);
+		this.runIdentification();
+	}
+
+	public FormatIdentifying(String raw, FrameBlock frame) throws Exception {
+		this.mappingValues = new ReaderMapping(raw, frame);
+		this.runIdentification();
+	}
+
+	private void runIdentification() {
+
+		/* Index properties:

Review Comment:
   if this is documentation, move it to the method definition, or( even better in my opinion) move it to the class definition.



##########
src/main/java/org/apache/sysds/hops/DataOp.java:
##########
@@ -283,20 +308,28 @@ public Lop constructLops()
 		for (Entry<String, Integer> cur : _paramIndexMap.entrySet()) {
 			inputLops.put(cur.getKey(), getInput().get(cur.getValue()).constructLops());
 		}
+		if(_ioGenRead)
+			inputLops.put("iogenformat", _generateReaderOp.constructLops());
 
 		// Create the lop
 		switch(_op) 
 		{
 			case TRANSIENTREAD:
-				l = new Data(_op, null, inputLops, getName(), null, 
-						getDataType(), getValueType(), getFileFormat());
+				if(!_ioGenRead)
+					l = new Data(_op, null, inputLops, getName(), null, getDataType(), getValueType(), getFileFormat());
+				else
+					l = new DataIOGen(_op, null, inputLops, getName(), null, getDataType(), getValueType(), getIOGenFormat());

Review Comment:
   I believe Transient Reads, should not be affected by DataIOGen.
   Transient Reads, only read a matrix from a previous block of code, it should not be connected to IO.
   (Correct me if i am wrong)
   



##########
src/main/java/org/apache/sysds/parser/StatementBlock.java:
##########
@@ -915,8 +915,19 @@ private void validateAssignmentStatement(Statement current, DMLProgram dmlProg,
 				dmlProg, ids.getVariables(),currConstVars, conditional);
 		}
 		else { //all builtin functions and expressions
-			if( target == null  )
-				raiseValidateError("Missing variable assignment.", false);
+			if( target == null  ) {
+				// check if IOGEN
+				if (source instanceof DataExpression && ((DataExpression)source).getOpCode() == Expression.DataOp.READ){
+					if(!(((DataExpression)source).getVarParam(DataExpression.SAMPLE_RAW) != null &&

Review Comment:
   nesting if statements, could perhaps be rewritten to one if?



##########
src/main/java/org/apache/sysds/lops/compile/Dag.java:
##########
@@ -304,9 +306,9 @@ private ArrayList<Instruction> doPlainInstructionGen(StatementBlock sb, List<Lop
 		
 		// filter out non-executable nodes
 		List<Lop> execNodes = nodes.stream()
-			.filter(l -> (!l.isDataExecLocation() 
+			.filter(l -> (!l.isDataIOGenExecLocation() && !l.isReaderGenExecLocation() && (!l.isDataExecLocation()

Review Comment:
   make a helper method in `l` that get this boolean out, (you do this rewrite multiple times.)



##########
src/main/java/org/apache/sysds/runtime/iogen/GenerateReader.java:
##########
@@ -59,35 +92,25 @@ public GenerateReaderMatrix(SampleProperties sampleProperties) throws Exception
 			super(sampleProperties);
 		}
 
-		public GenerateReaderMatrix(String sampleRaw, MatrixBlock sampleMatrix) throws Exception {
-			super(new SampleProperties(sampleRaw, sampleMatrix));
+		public GenerateReaderMatrix(CustomProperties properties, String src, String className) {
+			super(properties, src, className);
 		}
 
 		public MatrixReader getReader() throws Exception {
-
-			boolean isMapped = readerMapping != null && readerMapping.isMapped();
-			if(!isMapped) {
-				throw new Exception("Sample raw data and sample matrix don't match !!");
-			}
-			CustomProperties ffp = readerMapping.getFormatProperties();
-			if(ffp == null) {
-				throw new Exception("The file format couldn't recognize!!");
-			}
-			// 2. Generate a Matrix Reader:
-			if(ffp.getRowPattern().equals(CustomProperties.GRPattern.Regular)) {
-				if(ffp.getColPattern().equals(CustomProperties.GRPattern.Regular)) {
-					matrixReader = new MatrixGenerateReader.MatrixReaderRowRegularColRegular(ffp);
-				}
-				else {
-					matrixReader = new MatrixGenerateReader.MatrixReaderRowRegularColIrregular(ffp);
-				}
-			}
-			else {
-				matrixReader = new MatrixGenerateReader.MatrixReaderRowIrregular(ffp);
-			}
+			Class[] cArg = new Class[1];
+			cArg[0] = CustomProperties.class;
+			matrixReader = (MatrixReader) CodegenUtils.compileClass(className, src).getDeclaredConstructor(cArg).newInstance(properties);
 			return matrixReader;
 		}
 
+		@Override public String getReaderString() {

Review Comment:
   `override` should be on line above.



##########
src/main/java/org/apache/sysds/runtime/io/FrameReaderFactory.java:
##########
@@ -69,6 +72,22 @@ public static FrameReader createFrameReader(FileFormat fmt, FileFormatProperties
 				reader = new FrameReaderProto();
 				break;
 
+			case IOGEN:
+				CustomProperties customProperties = (CustomProperties) props;
+				String[] path = customProperties.getFormat().split("/");
+				String className = path[path.length -1].split("\\.")[0];
+				Pair<String, CustomProperties> loadSrcReaderAndProperties = customProperties.loadSrcReaderAndProperties();
+
+				GenerateReader.GenerateReaderFrame frm = new GenerateReader.GenerateReaderFrame(loadSrcReaderAndProperties.getValue(),
+					loadSrcReaderAndProperties.getKey(), className);
+				try {
+					reader = frm.getReader();
+				}
+				catch(Exception e) {
+					throw new DMLRuntimeException("IOGEN Matrix Reader Error: " + e);

Review Comment:
   alternatively you could not catch it, and let it just throw the exception if you do not find your message informative..



##########
src/main/java/org/apache/sysds/lops/DataIOGen.java:
##########
@@ -0,0 +1,348 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.lops;
+
+import org.apache.sysds.common.Types.DataType;
+import org.apache.sysds.common.Types.ExecType;
+import org.apache.sysds.common.Types.FileFormat;
+import org.apache.sysds.common.Types.OpOpData;
+import org.apache.sysds.common.Types.ValueType;
+import org.apache.sysds.parser.DataExpression;
+
+import java.util.HashMap;
+
+/**
+ * Lop to represent data objects. Data objects represent matrices, vectors,
+ * variables, literals. Can be for both input and output.
+ */
+public class DataIOGen extends Lop {
+	public static final String PREAD_PREFIX = "pREAD";
+	private final String formatType;
+	private final OpOpData _op;
+	private final boolean literal_var;
+	private HashMap<String, Lop> _inputParams;
+
+	/**
+	 * Method to create literal LOPs.
+	 *
+	 * @param vt           value type
+	 * @param literalValue literal value
+	 * @return literal low-level operator
+	 */
+	public static DataIOGen createLiteralLop(ValueType vt, String literalValue) {
+		// All literals have default format type of TEXT
+		return new DataIOGen(OpOpData.PERSISTENTREAD, null, null, null, literalValue, DataType.SCALAR, vt, FileFormat.TEXT.toString());
+	}
+
+	/**
+	 * Constructor to setup read or write LOP
+	 * In case of write: <code>input</code> must be provided. This will always be added as the first element in <code>input</code> array.
+	 * For literals: this function is invoked through the static method <code>createLiteralLop</code>.
+	 *
+	 * @param op                  operation type
+	 * @param input               low-level operator
+	 * @param inputParametersLops input lops
+	 * @param name                string name
+	 * @param literal             string literal
+	 * @param dt                  data type
+	 * @param vt                  value type
+	 * @param fmt                 file format
+	 */
+	public DataIOGen(OpOpData op, Lop input, HashMap<String, Lop> inputParametersLops, String name, String literal, DataType dt, ValueType vt,
+		String fmt) {
+		super(Type.Data, dt, vt);
+		_op = op;
+		literal_var = (literal != null);
+
+		// Either <code>name</code> or <code>literal</code> can be non-null.
+		if(literal_var) {
+			if(_op.isTransient())
+				throw new LopsException("Invalid parameter values while setting up a Data LOP -- transient flag is invalid for a literal.");
+			getOutputParameters().setLabel(literal);
+		}
+		else if(name != null) {
+			if(_op.isTransient())
+				getOutputParameters().setLabel(name); // tvar+name
+			else {
+				String code = _op == OpOpData.FUNCTIONOUTPUT ? "" : _op.isRead() ? "pREAD" : "pWRITE";
+				getOutputParameters().setLabel(code + name);
+			}
+		}
+		else {
+			throw new LopsException("Invalid parameter values while setting up a Data LOP -- the lop must have either literal value or a name.");
+		}
+
+		// WRITE operation must have an input Lops, we always put this
+		// input Lops as the first element of WRITE input. The parameters of
+		// WRITE operation are then put as the following input elements.
+		if(input != null && op.isWrite()) {
+			addInput(input);
+			input.addOutput(this);
+		}
+
+		_inputParams = inputParametersLops;
+
+		if(_inputParams != null) {
+			for(Lop lop : inputParametersLops.values()) {
+				addInput(lop);
+				lop.addOutput(this);
+			}
+			if(inputParametersLops.get(DataExpression.IO_FILENAME) != null) {
+				OutputParameters outParams = (inputParametersLops.get(DataExpression.IO_FILENAME)).getOutputParameters();
+				String fName = outParams.getLabel();
+				this.getOutputParameters().setFile_name(fName);
+			}
+		}
+
+		//set output format
+		formatType = fmt;
+		//outParams.setFormat(fmt);
+		setLopProperties();
+	}
+
+	private void setLopProperties() {
+		lps.setProperties(inputs, ExecType.INVALID);
+	}
+
+	/**
+	 * Data-Lop-specific method to set the execution type for persistent write.
+	 * TODO: split lops into MR/CP lop.
+	 *
+	 * @param et execution type
+	 */
+	public void setExecType(ExecType et) {
+		lps.execType = et;
+	}
+
+	/**
+	 * method to get format type for input, output files.
+	 *
+	 * @return file format
+	 */
+	public String getFileFormatType() {
+		return formatType;
+	}
+
+	@Override
+	public String toString() {
+		return getID() + ":" + "File_Name: " + getOutputParameters().getFile_name() + " " + "Label: " + getOutputParameters().getLabel() + " " + "Operation: = " + _op + " " + "Format: " + outParams.getFormat() + " Datatype: " + getDataType() + " Valuetype: " + getValueType() + " num_rows = " + getOutputParameters().getNumRows() + " num_cols = " + getOutputParameters().getNumCols() + " UpdateInPlace: " + getOutputParameters().getUpdateType();

Review Comment:
   newlines and formatting.



##########
src/main/java/org/apache/sysds/lops/compile/Dag.java:
##########
@@ -304,9 +306,9 @@ private ArrayList<Instruction> doPlainInstructionGen(StatementBlock sb, List<Lop
 		
 		// filter out non-executable nodes
 		List<Lop> execNodes = nodes.stream()
-			.filter(l -> (!l.isDataExecLocation() 
+			.filter(l -> (!l.isDataIOGenExecLocation() && !l.isReaderGenExecLocation() && (!l.isDataExecLocation()

Review Comment:
   oh i see it is correct, but just very hard to find. Could you format it differently to make it clear?



##########
src/main/java/org/apache/sysds/runtime/iogen/FormatIdentifying.java:
##########
@@ -0,0 +1,1270 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.runtime.iogen;
+
+import org.apache.sysds.lops.Lop;
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+import org.apache.sysds.runtime.matrix.data.MatrixBlock;
+import org.apache.sysds.runtime.matrix.data.Pair;
+
+import java.util.ArrayList;
+import java.util.BitSet;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.Set;
+
+public class FormatIdentifying {
+
+	private int[][] mapRow;
+	private int[][] mapCol;
+	private int[][] mapLen;
+	private int actualValueCount;
+	private MappingProperties mappingProperties;
+	private ArrayList<RawIndex> sampleRawIndexes;
+
+	private static int nrows;

Review Comment:
   why static, is the nrows not determined by the values read?



##########
src/main/java/org/apache/sysds/common/Types.java:
##########
@@ -535,6 +535,16 @@ public String toString() {
 		}
 	}
 
+	public enum OpOpGenerateReader {

Review Comment:
   Do you need this enum if it only has one enum value?



##########
src/main/java/org/apache/sysds/runtime/iogen/GenerateReader.java:
##########
@@ -99,33 +122,24 @@ public GenerateReaderFrame(SampleProperties sampleProperties) throws Exception {
 			super(sampleProperties);
 		}
 
-		public GenerateReaderFrame(String sampleRaw, FrameBlock sampleFrame) throws Exception {
-			super(new SampleProperties(sampleRaw, sampleFrame));
+		public GenerateReaderFrame(CustomProperties properties, String src, String className) {
+			super(properties, src, className);
 		}
 
 		public FrameReader getReader() throws Exception {
-
-			boolean isMapped = readerMapping != null && readerMapping.isMapped();
-			if(!isMapped) {
-				throw new Exception("Sample raw data and sample frame don't match !!");
-			}
-			CustomProperties ffp = readerMapping.getFormatProperties();
-			if(ffp == null) {
-				throw new Exception("The file format couldn't recognize!!");
-			}
-			// 2. Generate a Frame Reader:
-			if(ffp.getRowPattern().equals(CustomProperties.GRPattern.Regular)) {
-				if(ffp.getColPattern().equals(CustomProperties.GRPattern.Regular)) {
-					frameReader = new FrameGenerateReader.FrameReaderRowRegularColRegular(ffp);
-				}
-				else {
-					frameReader = new FrameGenerateReader.FrameReaderRowRegularColIrregular(ffp);
-				}
-			}
-			else {
-				frameReader = new FrameGenerateReader.FrameReaderRowIrregular(ffp);
-			}
+			Class[] cArg = new Class[1];
+			cArg[0] = CustomProperties.class;
+			frameReader = (FrameReader) CodegenUtils.compileClass(className, src).getDeclaredConstructor(cArg).newInstance(properties);
 			return frameReader;
 		}
+
+		@Override public String getReaderString() {

Review Comment:
   format



##########
src/main/java/org/apache/sysds/runtime/iogen/RawIndex.java:
##########
@@ -0,0 +1,369 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.runtime.iogen;
+
+import org.apache.sysds.common.Types;
+import org.apache.sysds.lops.Lop;
+import org.apache.sysds.runtime.matrix.data.Pair;
+import org.apache.sysds.runtime.util.UtilFunctions;
+
+import java.util.ArrayList;
+import java.util.BitSet;
+import java.util.HashMap;
+
+public class RawIndex {
+	private String raw;
+	private int rawLength;
+	private BitSet numberBitSet;
+	private BitSet dotBitSet;
+	private BitSet eBitSet;
+	private BitSet plusMinusBitSet;
+	private BitSet reservedPositions;
+	private BitSet backupReservedPositions;
+	private HashMap<Double, ArrayList<Pair<Integer, Integer>>> actualNumericValues;
+	private HashMap<Double, ArrayList<Pair<Integer, Integer>>> dotActualNumericValues;
+	private HashMap<Double, ArrayList<Pair<Integer, Integer>>> dotEActualNumericValues;
+
+	public RawIndex(String raw) {
+		this.raw = raw;
+		this.rawLength = raw.length();
+		this.numberBitSet = new BitSet(rawLength);
+		this.dotBitSet = new BitSet(rawLength);
+		this.eBitSet = new BitSet(rawLength);
+		this.plusMinusBitSet = new BitSet(rawLength);
+		this.reservedPositions = new BitSet(rawLength);
+		this.backupReservedPositions = new BitSet(rawLength);
+		this.actualNumericValues = null;
+		this.dotActualNumericValues = null;
+		this.dotEActualNumericValues = new HashMap<>();
+
+		for(int i = 0; i < this.rawLength; i++) {
+			switch(raw.charAt(i)) {
+				case '0':
+				case '1':
+				case '2':
+				case '3':
+				case '4':
+				case '5':
+				case '6':
+				case '7':
+				case '8':
+				case '9':
+					this.numberBitSet.set(i);
+					break;
+				case '+':
+				case '-':
+					this.plusMinusBitSet.set(i);
+					break;
+				case '.':
+					this.dotBitSet.set(i);
+					break;
+				case 'e':
+				case 'E':
+					this.eBitSet.set(i);
+					break;
+			}
+		}
+		// Clean unnecessary sets
+		// Clean for "."
+		for(int i = dotBitSet.nextSetBit(0); i != -1; i = dotBitSet.nextSetBit(i + 1)) {
+			boolean flag = false;
+			if(i > 0) {
+				if(i < rawLength - 2) {
+					flag = !numberBitSet.get(i - 1) && !numberBitSet.get(i + 1) && !plusMinusBitSet.get(i + 1) && !eBitSet.get(i + 1);
+				}
+			}
+			else if(i == rawLength - 1) {
+				flag = !numberBitSet.get(i - 1);
+			}
+			else if(i == 0) {
+				if(i < rawLength - 2) {
+					flag = !numberBitSet.get(i + 1) && !plusMinusBitSet.get(i + 1) && !eBitSet.get(i + 1);
+				}
+				else if(i == rawLength - 1) {
+					flag = true;
+				}
+			}
+
+			if(flag)
+				dotBitSet.set(i, false);
+		}
+
+		// Clean for "+/-"
+		for(int i = plusMinusBitSet.nextSetBit(0); i != -1; i = plusMinusBitSet.nextSetBit(i + 1)) {
+			boolean flag;
+			if(i < rawLength - 1) {
+				flag = numberBitSet.get(i + 1);
+				if(!flag && i < rawLength - 2)
+					flag = dotBitSet.get(i + 1) && numberBitSet.get(i + 2);
+			}
+			else {
+				flag = false;
+			}
+			if(!flag)
+				plusMinusBitSet.set(i, false);
+		}
+
+		// Clean for "e/E"
+		for(int i = eBitSet.nextSetBit(0); i != -1; i = eBitSet.nextSetBit(i + 1)) {
+			boolean flag = false;
+			if((i == 1 && !numberBitSet.get(0)) || i == 0 || i == rawLength - 1) {
+				flag = false;
+			}
+			else if(i > 1 && i < rawLength - 2) {
+				flag = numberBitSet.get(i - 1) || (numberBitSet.get(i - 2) && dotBitSet.get(i - 1));
+				if(flag)
+					flag = numberBitSet.get(i + 1) || (numberBitSet.get(i + 2) && plusMinusBitSet.get(i + 1));
+			}
+			else if(i == rawLength - 2) {
+				flag = numberBitSet.get(rawLength - 1);
+			}
+			if(!flag)
+				eBitSet.set(i, false);
+		}
+		if(numberBitSet.cardinality() > 0)
+			extractNumericDotEActualValues();
+	}
+
+	public RawIndex() {}
+
+	public Pair<Integer, Integer> findValue(Object value, Types.ValueType valueType) {
+		if(valueType.isNumeric())
+			return findValue(UtilFunctions.getDouble(value));
+		else if(valueType == Types.ValueType.STRING) {
+			String os = UtilFunctions.objectToString(value);
+			if(os == null || os.length() == 0)
+				return null;
+			else
+				return findValue(os);
+		}
+		//		else if(valueType == Types.ValueType.BOOLEAN)
+		//			return findValue(UtilFunctions.objectToString())
+		else
+			return null;
+	}
+
+	public Pair<Integer, Integer> findValue(double value) {
+		//		extractNumericActualValues();
+		//		if(actualNumericValues.containsKey(value)){
+		//			return getValuePositionAndLength(actualNumericValues.get(value));
+		//		}
+		//
+		//		extractNumericDotActualValues();
+		//		if(dotActualNumericValues.containsKey(value)){
+		//			return getValuePositionAndLength(dotActualNumericValues.get(value));
+		//		}
+		//
+		//		extractNumericDotEActualValues();
+		if(dotEActualNumericValues.containsKey(value)) {
+			return getValuePositionAndLength(dotEActualNumericValues.get(value));
+		}
+		return null;
+	}
+
+	private Pair<Integer, Integer> findValue(String value) {
+		int index = 0;
+		boolean flag;
+		do {
+			flag = true;
+			index = this.raw.indexOf(value, index);
+			if(index == -1)
+				return null;
+
+			for(int i = index; i < index + value.length(); i++)
+				if(reservedPositions.get(i)) {
+					flag = false;

Review Comment:
   only one place you set flag. does this not mean you can put your return logic in here. and then remove the flag variable.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org