Posted to dev@systemml.apache.org by Ethan Xu <et...@us.ibm.com> on 2016/02/02 17:32:58 UTC

User friendly output of univariate statistics

DML is quite amazing. I was wondering if there is a user-friendly (more
human-readable) version of the output from Univar-Stats.dml? I ran
Univar-Stats.dml on my data set that contains 7 variables: two continuous,
one categorical. The output is a csv file on HDFS that looks like this:

1 1 10.0
2 1 123.0
2 7 469.0
3 1 122.0
3 7 419.0
4 1 34.852512104922082
4 7 0.40786451178676335
5 1 613.6600902369631
5 7 1.5322171660886
6 1 25.566777079580508
6 7 5.54382044429201915
7 1 0.219263232610989764
7 7 12.14558700418414E-4
8 1 0.5323447433694138
8 7 1.23151883029726626
9 1 0.28352047550156284
9 7 23.25049533659206
10 1 -0.5348573740280274
10 7 2023.294658877635
11 1 2.874872545380876E-4
11 7 1.874872545380876E-4
12 1 6.0017749742760714085
12 7 0.00237749742760714085
13 1 12.0
14 1 30.56066514110724
15 2 4.0
---- truncated (numbers randomly modified)

According to the documentation at
http://apache.github.io/incubator-systemml/algorithms-descriptive-statistics.html#univariate-statistics
the first column of the matrix represents the statistic type (minimum,
mean, etc.), the second column represents the variable ID, and the last
column gives the value of the statistic.
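
To make that layout concrete, here is a minimal Python sketch that parses such triples into a nested dictionary keyed by variable ID and then by statistic ID. The function name `parse_ijv` and the whitespace-separated input are assumptions made for this example; they are not part of Univar-Stats.dml.

```python
from collections import defaultdict

def parse_ijv(lines):
    """Parse 'i j v' triples into {variable_id: {stat_id: value}}."""
    stats = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        stat_id, var_id, value = int(parts[0]), int(parts[1]), float(parts[2])
        stats[var_id][stat_id] = value
    return dict(stats)

# A few triples taken from the listing above.
sample = ["1 1 10.0", "2 1 123.0", "2 7 469.0", "15 2 4.0"]
by_variable = parse_ijv(sample)
print(by_variable[1])
```

Keying by variable first means all statistics for one variable can be looked up together, which is the grouping requested later in this message.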

While the documentation is very clear and the results are consistent with
the output of other software such as R, I found the format a bit
inconvenient, since I have to consult the reference table (Table 1 at the
aforementioned link) to understand the summary statistics.

I understand that the purely numeric matrix format is easy to use as machine
input for later steps. An additional, more human-readable table would be
nice, since the main purpose of univariate statistics is often exploratory
data analysis, where a clear summary is essential.

Suggestions to consider for the readable summary, if there is not already
one:
1. Order the rows by variable (column 2) instead of statistic type
(column 1), so that summary statistics of the same variable are
grouped together.
2. Use actual statistic labels ("min", "mean", "skewness", etc.) instead of
IDs (1, 2, etc.).
3. Use actual predictor labels ("age", "gender", etc.) instead of IDs (1, 2,
etc.).
4. Use level labels for categorical predictors ("male", "female", etc.)
instead of IDs (1, 2, etc.).
5. Add counts of cases in each level of a categorical variable in addition
to the mode. This gives the distribution of the variable.
6. If the amount of data in the summary is manageable, perhaps
automatically pull the output of Univar-Stats.dml from HDFS to the local
machine and display the readable version on the terminal?
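
Suggestions 1 to 3 could be sketched as a small post-processing script along the following lines. This is only an illustration in Python: the names `STAT_LABELS` and `readable_summary` and the variable labels passed in are hypothetical, and the statistic ID mapping covers only the first four rows and should be checked against Table 1 of the documentation.

```python
# Assumed mapping of statistic IDs to labels; verify against Table 1 in the docs.
STAT_LABELS = {1: "min", 2: "max", 3: "range", 4: "mean"}

def readable_summary(triples, var_names):
    """Return 'variable statistic value' lines grouped by variable.

    triples:   iterable of (stat_id, var_id, value) tuples
    var_names: dict mapping variable ID to a label (hypothetical metadata)
    """
    # Sort by variable first, then by statistic, so each variable's rows are adjacent.
    rows = sorted(triples, key=lambda t: (t[1], t[0]))
    lines = []
    for stat_id, var_id, value in rows:
        var = var_names.get(var_id, f"var_{var_id}")      # fall back to the numeric ID
        stat = STAT_LABELS.get(stat_id, f"stat_{stat_id}")
        lines.append(f"{var} {stat} {value:g}")
    return lines

triples = [(1, 1, 10.0), (2, 1, 123.0), (2, 7, 469.0), (3, 1, 113.0), (4, 1, 60.0)]
for line in readable_summary(triples, {1: "age"}):
    print(line)
```

Unknown statistic or variable IDs fall back to their numeric form, so the script stays usable even with incomplete metadata.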

So the output could look like:

age min 10
age max 123
age range 113
age mean 60
...
gender female.count 1000
gender male.count 2000
gender mode male
...

or even a table format like in R:

age              gender
min    10        female 1000
max    123       male   2000
range  113       mode   male
mean   60        ...
...

Thanks much,

Ethan Xu


Re: User friendly output of univariate statistics

Posted by Shirish Tatikonda <sh...@gmail.com>.
You can easily accomplish that by simply writing the transpose of the stats
matrix -- i.e., change write(baseStats, $STATS) to write(t(baseStats), $STATS).
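
In IJV terms, transposing the matrix amounts to swapping the row and column index of every triple, so a row-ordered listing is then grouped by variable. The Python sketch below illustrates the effect on a few triples from the original post. It is not DML, `transpose_ijv` is a hypothetical helper, and it assumes the written output is listed in row-major order, which may not hold for every output format.

```python
def transpose_ijv(triples):
    """Swap row and column indices, then list the result in row-major order."""
    swapped = [(j, i, v) for i, j, v in triples]
    return sorted(swapped, key=lambda t: (t[0], t[1]))

# (stat_id, var_id, value) triples as in the original output
original = [(1, 1, 10.0), (2, 1, 123.0), (2, 7, 469.0), (3, 1, 122.0), (3, 7, 419.0)]

# After the transpose, each triple reads (var_id, stat_id, value).
for var_id, stat_id, value in transpose_ijv(original):
    print(var_id, stat_id, value)
```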



On Thu, Feb 4, 2016 at 7:33 PM, Ethan Xu <et...@gmail.com> wrote:

> Thanks for the clarification, Shirish. Is the current 'ijv' format matrix of
> the Univar-Stats.dml output used in any other built-in script?
>
> If not, I'd like to suggest a small change besides (or without) the
> user-friendly version that makes the results easier to read: switch 'i' and
> 'j' in the output. That is, order the rows of the matrix by variable
> (original j column), then by statistic type (original i column). This way
> the information for one variable is grouped together.
>
> There might be situations where grouping by statistic type makes more
> sense, but I feel the other way is more commonly used.
>
> Ethan

Re: User friendly output of univariate statistics

Posted by Ethan Xu <et...@gmail.com>.
Thanks for the clarification, Shirish. Is the current 'ijv' format matrix of
the Univar-Stats.dml output used in any other built-in script?

If not, I'd like to suggest a small change besides (or without) the
user-friendly version that makes the results easier to read: switch 'i' and
'j' in the output. That is, order the rows of the matrix by variable
(original j column), then by statistic type (original i column). This way
the information for one variable is grouped together.

There might be situations where grouping by statistic type makes more
sense, but I feel the other way is more commonly used.

Ethan

On Thu, Feb 4, 2016 at 10:10 PM, Shirish Tatikonda <
shirish.tatikonda@gmail.com> wrote:

> Just to clarify: the current output is actually a matrix, in which rows
> denote stats and columns denote input variables. So, the output you see is
> simply the univariate stats matrix in IJV format.
> In a general case, the primary data type for input/output and computations
> in SystemML is a *matrix* (of course, *scalar* as well) -- with one
> exception of a *frame* type (which is used only in the context of
> *transform*).
>
> I agree with you that providing user-friendly output as in R output is very
> useful for data scientists -- it however requires a lot of effort to
> support such a functionality.
>
> Shirish



-- 
Yifan "Ethan" Xu, PhD

Data Scientist / Statistician
Explorys, IBM Watson Health

Adjunct Faculty
Department of Epidemiology and Biostatistics
Case Western Reserve University

--------------
Email: ethan.yifanxu@gmail.com
Phone: (607) 760-6817
--------------

Re: User friendly output of univariate statistics

Posted by Shirish Tatikonda <sh...@gmail.com>.
Just to clarify: the current output is actually a matrix, in which rows
denote stats and columns denote input variables. So, the output you see is
simply the univariate stats matrix in IJV format.
In the general case, the primary data type for input/output and computations
in SystemML is a *matrix* (and, of course, *scalar*) -- with the one
exception of the *frame* type (which is used only in the context of
*transform*).

I agree with you that user-friendly output like R's is very useful for
data scientists; however, supporting such functionality requires a lot of
effort.

Shirish

On Wed, Feb 3, 2016 at 9:09 PM, Ethan Xu <et...@us.ibm.com> wrote:

> Thank you Deron. From my personal experience printing a single type of
> user-friendly result on console is usually enough for a quick inspection.
> However that's in an interactive environment (like R interactive session),
> where recreating the printout is simple.
>
> Since calling a DML script on Hadoop might trigger a MapReduce job, maybe
> it's better to save the user-friendly version as a file too? Or perhaps
> it's helpful to have a script that takes the original summary (plus some
> metadata) as input, and produces the user-friendly output?
>
> Best,
>
> Ethan

Re: User friendly output of univariate statistics

Posted by Ethan Xu <et...@us.ibm.com>.
Thank you Deron. From my personal experience printing a single type of 
user-friendly result on console is usually enough for a quick inspection. 
However that's in an interactive environment (like R interactive session), 
where recreating the printout is simple. 
 
Since calling a DML script on Hadoop might trigger a MapReduce job, maybe 
it's better to save the user-friendly version as a file too? Or perhaps 
it's helpful to have a script that takes the original summary (plus some 
metadata) as input, and produces the user-friendly output?
 
Best,
 
Ethan



From:   Deron Eriksson <de...@gmail.com>
To:     dev@systemml.incubator.apache.org
Date:   02/03/2016 01:13 AM
Subject:        Re: User friendly output of univariate statistics



Hi Ethan,

I think you make a great point with regards to the readability of the
output from Univar-Stats.dml.

Do you think outputting the user-friendly results in the format you
describe to the console while still writing the more mathematical results
to a file would be the type of behavior that you would find most useful? Or
would you like to see the user-friendly results sent to a file as well?

Also, I was wondering, do you think a single user-friendly format is
sufficient, or do you think that data scientists would like (or expect) to
be able to have multiple formats such as you described?

The table format is very interesting. Currently DML has a basic print
statement, but I don't believe it can be used to format data into columns,
such as in your table format example. It might be very nice to add a
C-style "printf" statement, which would allow results to be written to the
console in a more columnar format.

Does anyone else have any thoughts?

Deron



Re: User friendly output of univariate statistics

Posted by Deron Eriksson <de...@gmail.com>.
Hi Ethan,

I think you make a great point with regard to the readability of the
output from Univar-Stats.dml.

Do you think outputting the user-friendly results in the format you
describe to the console while still writing the more mathematical results
to a file would be the type of behavior that you would find most useful? Or
would you like to see the user-friendly results sent to a file as well?

Also, I was wondering, do you think a single user-friendly format is
sufficient, or do you think that data scientists would like (or expect) to
be able to have multiple formats such as you described?

The table format is very interesting. Currently DML has a basic print
statement, but I don't believe it can be used to format data into columns,
such as in your table format example. It might be very nice to add a
C-style "printf" statement, which would allow results to be written to the
console in a more columnar format.
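
For reference, this is the kind of columnar output that C-style format specifiers make possible. The sketch is written in Python rather than DML, purely to illustrate the idea; the helper name `format_table` and the column widths are arbitrary choices for the example.

```python
def format_table(rows, widths=(8, 14, 10)):
    """Left-align the two text columns and right-align the value column
    using printf-style specifiers such as %-8s and %10s."""
    template = f"%-{widths[0]}s%-{widths[1]}s%{widths[2]}s"
    return [template % row for row in rows]

rows = [
    ("age", "min", "10"),
    ("age", "max", "123"),
    ("gender", "female.count", "1000"),
]
for line in format_table(rows):
    print(line)
```

Fixed widths keep the columns aligned regardless of the length of each label, which a plain print of concatenated strings cannot guarantee.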

Does anyone else have any thoughts?

Deron

