Posted to dev@systemml.apache.org by Ethan Xu <et...@gmail.com> on 2016/04/14 22:37:14 UTC

'sample.dml' replaces rows with 0's

Hello,

I encountered an unexpected behavior from 'sample.dml' on a dataset on
Hadoop. Instead of splitting the data, it replaced rows of original data
with 0's. Here are the details:

I called sample.dml in an attempt to split a 35 million by 2396 numeric
matrix into two subsets of 80% and 20%. The two output subsets '1' and '2'
both still contain 35 million rows, instead of 80% and 20% of 35 million rows.

However it looks like 20% of the rows in '1' are replaced with 0's (but not
removed). It is as if line 66 of sample.dml (
https://github.com/apache/incubator-systemml/blob/master/scripts/utils/sample.dml)
that calls removeEmpty() doesn't exist.
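
For reference, the expected versus observed behavior can be sketched in a few
lines of Python. This is only an illustration, not the code of sample.dml: the
helper names mask_rows and remove_empty_rows are hypothetical, and the real
script operates on distributed matrices. It contrasts zeroing the unselected
rows (the symptom described above) with dropping them (what removeEmpty() is
expected to do).

```python
import random

def mask_rows(X, keep):
    """Zero out rows that are not selected; the row count stays the same."""
    return [row if k else [0.0] * len(row) for row, k in zip(X, keep)]

def remove_empty_rows(X):
    """Drop rows that contain only zeros (hypothetical stand-in for removeEmpty on rows)."""
    return [row for row in X if any(v != 0 for v in row)]

random.seed(0)
n = 1000
X = [[float(i + 1), float(2 * (i + 1))] for i in range(n)]  # toy data without zero entries
keep = [random.random() <= 0.8 for _ in range(n)]           # roughly 80% of rows selected

masked = mask_rows(X, keep)          # same number of rows, unselected rows are all zeros
subset = remove_empty_rows(masked)   # only the selected rows remain

print(len(masked))
print(len(subset))
```

If only the masking step took effect, the output would keep all n rows with
about 20% of them zeroed, which matches what I see in '1'.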

Here is the submission script:

printf "0.8\n0.2" | hadoop fs -put - /path/split-perc.csv
echo '{"data_type": "matrix", "value_type":"double", "rows": 2, "cols": 1,
"format": "csv"}' | hadoop fs -put - /path/split-perc.csv.mtd

## Split file to training and test sets
hadoop jar $sysJar /path.to.systemML/scripts/utils/sample.dml \
  -config=$sysConfCust -nvargs X=/path/originalData.csv \
  sv=/path/split-perc.csv O=/path/train-test ofmt=csv
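
The two small input files created above (the split vector and its metadata)
can also be generated programmatically. The following is a minimal sketch; the
function name split_vector_files is hypothetical, and the check that the
fractions sum to 1 is a sanity check added here, not something sample.dml is
known to require. The metadata fields are the ones used in the echo command.

```python
import json

def split_vector_files(fractions):
    """Return (csv_text, mtd_text) for a column vector of split fractions."""
    if any(f <= 0 for f in fractions):
        raise ValueError("each fraction must be positive")
    # Assumed sanity check: the subsets should cover the whole data set.
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions should sum to 1")
    csv_text = "\n".join(str(f) for f in fractions)
    mtd = {
        "data_type": "matrix",
        "value_type": "double",
        "rows": len(fractions),
        "cols": 1,
        "format": "csv",
    }
    return csv_text, json.dumps(mtd)

csv_text, mtd_text = split_vector_files([0.8, 0.2])
print(csv_text)
print(mtd_text)
```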


There were no error messages and all MR jobs were executed successfully.
What other information can I provide to diagnose the issue?

Thanks,

Ethan

Re: 'sample.dml' replaces rows with 0's

Posted by Matthias Boehm <mb...@us.ibm.com>.

Well, it looks like an issue of incorrect metadata propagation (wrong
propagation of dimensions through MR pmm instructions). The data itself
looks good if I write a 20% sample to textcell (which is what our
testsuite uses).
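
One possible reason why the same sample could look fine in textcell but not in
csv is sketched below. This is a guess, not a confirmed diagnosis, and the two
writer functions are hypothetical toy models rather than SystemML code. The
idea is that a sparse cell format lists only nonzero cells, so a wrong row
count in the metadata does not change the written data, whereas a dense writer
that trusts the declared dimensions would emit extra all-zero rows.

```python
def to_textcell(rows):
    """Sparse 'i j v' lines with 1-based indices; only nonzero cells are written."""
    return [f"{i} {j} {v}"
            for i, row in enumerate(rows, start=1)
            for j, v in enumerate(row, start=1)
            if v != 0]

def to_csv(rows, declared_rows, ncols):
    """Dense CSV lines; rows missing up to declared_rows are written as zeros (assumed behavior)."""
    padded = rows + [[0.0] * ncols for _ in range(declared_rows - len(rows))]
    return [",".join(str(v) for v in row) for row in padded]

sample = [[1.0, 2.0], [3.0, 4.0]]  # the actual sample has 2 rows
declared_rows = 3                  # stale metadata still claims 3 rows

print(to_textcell(sample))               # does not depend on declared_rows
print(to_csv(sample, declared_rows, 2))  # contains one extra all-zero row
```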

@Shirish: thanks for looking into it. Just fyi, while testing this on an
ultra-sparse scenario, I also encountered a runtime issue of deep copying
sparse rows (fix will be available tomorrow), so for now don't worry about
it if you encounter the same issue.

Regards,
Matthias





Re: 'sample.dml' replaces rows with 0's

Posted by Ethan Xu <et...@gmail.com>.

Another attempt to attach scripts.


Re: 'sample.dml' replaces rows with 0's

Posted by Ethan Xu <et...@gmail.com>.

Thank you, Shirish and Matthias, for looking into this issue. I have some
small updates from more runs.

Shirish, my browser told me that the scripts were attached, so there must
have been some connection issue. I have attached them again to this email and
hope they get through this time. I also tested the same scripts on small toy
data in local mode and they behaved correctly.

Matthias, you mentioned that in your testsuite the metadata was incorrect but
the dataset itself looked OK. In my case both the metadata and the data seem
to be incorrect. Here is how I confirmed this:

The output of sample-debug-noprint.dml (attached) consists of 4 files:
"1", "1.mtd" (attached as train-test-debug-noprint-1.mtd), "2", and "2.mtd"
(attached as train-test-debug-noprint-2.mtd).
The auto-generated metadata indicates there are 35478061 rows in "1".

   1. I replaced the automatically generated metadata file of "1" with a
   generic one (attached as 1-generic.mtd) which does not specify the number
   of rows.
   2. I ran a script (attached "countzeros.dml") to find the number of
   rows, as well as the number of 0's in each column of "1". The script
   returned that there were 35479057 rows in "1", which was 996 more than
   what's shown in the metadata (???).
   3. I ran the same script to count rows and 0's of the original data set
   on which 'sample-debug-print.dml' was run. The number of rows was 35478061.
   4. I computed the difference in the number of 0's (by column) between the
   original data and "1". The columns that contained no 0's in the
   original data set had 7099710 zeros in "1", which is roughly 20% of the
   row count.
   5. Therefore it still looks like, for some reason,
   'sample-debug-noprint.dml' randomly replaced 20% of the rows with 0's but
   didn't remove them. Also, the sizes of the original data and "1" are 178G
   and 186.3G on HDFS, respectively.
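
The check in the steps above can be sketched as follows. This is not the
attached countzeros.dml; it is a small Python stand-in (the function name
count_rows_and_zeros is hypothetical) that counts rows and zeros per column of
dense CSV lines, followed by the arithmetic on the figures reported above.

```python
def count_rows_and_zeros(csv_lines):
    """Return (row count, list of zero counts per column) for dense CSV lines."""
    n_rows = 0
    zeros = []
    for line in csv_lines:
        values = [float(v) for v in line.split(",")]
        if not zeros:
            zeros = [0] * len(values)
        for j, v in enumerate(values):
            if v == 0.0:
                zeros[j] += 1
        n_rows += 1
    return n_rows, zeros

# Tiny made-up stand-in for output "1": two of five rows are all zeros.
toy = ["1,2", "0,0", "3,4", "0,0", "5,6"]
toy_rows, toy_zeros = count_rows_and_zeros(toy)
print(toy_rows, toy_zeros)

# Figures reported in the steps above.
rows_counted = 35479057
rows_in_metadata = 35478061
zeros_per_clean_column = 7099710
extra_rows = rows_counted - rows_in_metadata
zero_share = zeros_per_clean_column / rows_counted
print(extra_rows)            # 996
print(round(zero_share, 4))  # 0.2001
```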

I did use a custom configuration for all the submissions. The configuration
file is also attached.

Thanks,

Ethan


Re: 'sample.dml' replaces rows with 0's

Posted by Shirish Tatikonda <sh...@gmail.com>.

Hi Ethan,

I just tried the script on a toy dataset and could reproduce this erroneous
behavior when run in Hadoop mode -- both local and Spark modes are good. I
will look into it.

BTW, you forgot to attach the scripts.

Shirish


Re: 'sample.dml' replaces rows with 0's

Posted by Ethan Xu <et...@gmail.com>.

OK this is interesting:

Scenario 1
I slightly modified 'sample.dml' to add statements that print the dimensions
of SM, P and iX, and ran it on the same data. The dimensions AND the output
were correct. That is, subsets '1' and '2' contain roughly 80% and 20% of the
original data.

Please see attached:
sample-debug.dml:
sample.dml with 3 print functions inserted
train-test-debug_1.mtd
train-test-debug_2.mtd:
meta data of outputs. Note 'rows' are correct.

Scenario 2
This is confusing, so I commented out the 'print' statements in 'sample.dml'
and ran it on the same data, and the output was INCORRECT. That is, subsets
'1' and '2' contain the same number of rows as the original data.

Please see attached:
sample-debug-noprint.dml:
3 print functions were commented out
train-test-debug-noprint_1.mtd
train-test-debug-noprint_2.mtd
meta data of outputs. Note 'rows' are incorrect.
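
A quick way to tell the two scenarios apart from the .mtd files alone is to
compare the 'rows' field with the expected share of the input. The sketch
below is illustrative only: the function name rows_look_split, the tolerance,
and the toy row counts are made up, and only the 'rows' field name comes from
the metadata format.

```python
import json

def rows_look_split(mtd_text, total_rows, fraction, tolerance=0.01):
    """True if the 'rows' field is within tolerance of fraction * total_rows."""
    rows = json.loads(mtd_text)["rows"]
    return abs(rows / total_rows - fraction) <= tolerance

total = 1000                          # toy input size
good = json.dumps({"rows": 803})      # made-up value close to 80% (as in Scenario 1)
bad = json.dumps({"rows": 1000})      # same row count as the input (as in Scenario 2)

good_ok = rows_look_split(good, total, 0.8)
bad_ok = rows_look_split(bad, total, 0.8)
print(good_ok, bad_ok)
```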

There were no errors in either trial.

Ethan
