Posted to issues@systemml.apache.org by "Shirish Tatikonda (JIRA)" <ji...@apache.org> on 2016/02/23 08:42:18 UTC

[jira] [Commented] (SYSTEMML-153) Allow input data file without requiring corresponding metadata file

    [ https://issues.apache.org/jira/browse/SYSTEMML-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158480#comment-15158480 ] 

Shirish Tatikonda commented on SYSTEMML-153:
--------------------------------------------


I agree with Matthias and Berthold that inferring the format from file extensions is not appropriate.
I also see the usability argument made by Deron and others.

As a compromise, how about we introduce some rules for _format inference_, which get triggered when the mtd file is absent? 

Initial set of rules (may need to be hardened):
1) If the input data is in binary format, then format = "binary"
2) Else if the first line is {{%%MatrixMarket matrix coordinate real general}}, then format = "mm" (we already do this)
3) Else if the first line has three whitespace-delimited fields, then format = "text" (i.e., "text" takes precedence over "csv")
4) Else if the first line is perfectly delimited by a non-numeric character, then format = "csv"
5) Else, error out.

Furthermore, if the inferred data format is "binary" or "text" but the dimensions (and, for binary, the block dimensions) are unknown, we error out, as we currently do.
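
To make the proposed cascade concrete, here is a rough Python sketch of the rules above. It is only an illustration, not SystemML code: the names {{infer_format}} and {{require_dimensions}} are hypothetical, binary detection is left to the caller through an {{is_binary}} flag, rule 3 is tightened to require three numeric fields, and "perfectly delimited" in rule 4 is interpreted as "exactly one distinct non-numeric, non-whitespace character separates fields that all parse as numbers". Each of those choices is an assumption that would need to be hardened.

```python
MM_HEADER = "%%MatrixMarket matrix coordinate real general"
# Characters that may appear inside a numeric literal such as -1.5e+3.
NUMERIC_CHARS = set("0123456789.+-eE")


def _is_number(token):
    """Return True if the token parses as a floating-point number."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def infer_format(first_line, is_binary=False):
    """Infer the data format when no .mtd file is present (hypothetical helper).

    How binary input is detected is not specified in the proposal, so the
    caller passes that decision in through is_binary (an assumption).
    """
    # Rule 1: binary input.
    if is_binary:
        return "binary"
    line = first_line.strip()
    # Rule 2: MatrixMarket header on the first line.
    if line == MM_HEADER:
        return "mm"
    # Rule 3: three whitespace-delimited fields; "text" wins over "csv".
    # Requiring the fields to be numeric is an added assumption.
    fields = line.split()
    if len(fields) == 3 and all(_is_number(f) for f in fields):
        return "text"
    # Rule 4: exactly one distinct non-numeric delimiter character, and every
    # cell between delimiters parses as a number (one reading of "perfectly").
    delimiters = {c for c in line if c not in NUMERIC_CHARS and not c.isspace()}
    if len(delimiters) == 1:
        cells = line.split(delimiters.pop())
        if all(_is_number(c) for c in cells):
            return "csv"
    # Rule 5: nothing matched.
    raise ValueError("cannot infer data format from first line")


def require_dimensions(fmt, dims=None, block_dims=None):
    """Error out when an inferred format still needs unknown dimensions."""
    if fmt in ("binary", "text") and dims is None:
        raise ValueError("dimensions are required for format " + fmt)
    if fmt == "binary" and block_dims is None:
        raise ValueError("block dimensions are required for binary format")
```

Under this reading, a first line such as {{1 1 0.5}} is classified as "text" and {{1,2,3}} as "csv", while a csv file with a non-numeric header row would fall through to the error case, which is one of the places where the rules would need hardening.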



> Allow input data file without requiring corresponding metadata file
> -------------------------------------------------------------------
>
>                 Key: SYSTEMML-153
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-153
>             Project: SystemML
>          Issue Type: Improvement
>            Reporter: Deron Eriksson
>
> Right now a metadata file is required for an input data file. For example, a matrix.csv file would typically require a matrix.csv.mtd file. Creating a .mtd manually is a minor annoyance in terms of consumability of SystemML. It would be nice if there were some mechanism so that a metadata file does not need to be provided in all cases.
> One possibility is that if no metadata file is present, SystemML could assume a particular default format (for example, a comma-separated delimited file). The number of rows and columns could be determined by parsing the file. This might work well for small files but not necessarily well for enormous files.
> A possible way to solve this would be to use a file extension to indicate that you have a small input data file and you don't want to have to provide a metadata file. For example, you could have a matrix.csv-nomtd file. The .csv part of the extension indicates that it's a csv file, and the -nomtd part of the extension indicates that you don't want to provide metadata.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)