Posted to dev@systemml.apache.org by Ethan Xu <et...@gmail.com> on 2016/04/14 22:33:56 UTC

parfor fails

Hello,

I have a quick question. The following script fails with this error:

org.apache.sysml.runtime.DMLRuntimeException: PARFOR: Failed to execute
loop in parallel.

Here is the dml script:

x=read($X);

print("number of rows of x = " + nrow(x));
print("number of cols of x = " + ncol(x));

parfor(i in 1:ncol(x), check=0){
    a = x[,i];
    print("number of 0's in col " + i + " = " + sum(a == 0));
}

where X is a 35 million by 2396 matrix (coded and dummy coded numerical
matrix) on HDFS. The script runs fine with regular 'for' loops.

Could someone explain why this script cannot run in parallel? Was it a
wrong way to code parfor?

Thanks,

Ethan

Re: parfor fails

Posted by Matthias Boehm <mb...@us.ibm.com>.
Thanks for following up on this Ethan. As a side note, meanwhile
SYSTEMML-635 has been resolved, so you don't necessarily need a
SystemML-config file anymore.

Furthermore, this new issue is understandable too. SystemML uses a local
tmp directory wherever it runs CP (single-node, in-memory) operations,
e.g., for evictions if necessary. Normally, this is the driver process (on
the node where you invoke SystemML), but it also applies to so-called remote
parfor jobs, where we run our CP operations in each task and thus
potentially on any node of the cluster. Most likely you don't have
permission to create the specified directory there. Could you please try
/tmp/systemml or any other directory where you have write access to work
around this?

Regards,
Matthias
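
For readers who want to try the suggestion above, here is a hedged sketch of the relevant part of a SystemML-config.xml. It assumes the element names root and localtmpdir used by the configuration template shipped with SystemML; please check them against the template for your version. The path is the one suggested in the message; any directory that can be created and written on every cluster node should do.

```xml
<root>
   <!-- Local working directory for CP operations (e.g., evictions).
        Assumption: it must be creatable and writable on every node
        that may run a remote parfor task, not only on the driver. -->
   <localtmpdir>/tmp/systemml</localtmpdir>
</root>
```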




From:	Ethan Xu <et...@gmail.com>
To:	dev@systemml.incubator.apache.org
Date:	04/16/2016 04:13 PM
Subject:	Re: parfor fails



Hi Matthias,

Thank you very much for the explanation and a better solution. s = colSums
(x==0) is more concise and works great!

As an experiment, I tried the original parfor script with a SystemML
configuration file provided. On my cluster it still fails with "PARFOR:
Failed to execute loop in parallel". It looks like the failed MR jobs are
caused by

Caused by: org.apache.sysml.runtime.DMLRuntimeException: Failed to create
non-existing local working directory:/path.to/ethan.xu/tmp/systemml

The directory '/path.to/ethan.xu/tmp/systemml' exists on the local server
and contains subdirectories named '_p22748_127.0.0.1' etc. It looks like
other SystemML jobs had no trouble writing to it.

The stderr and one failed MR log are attached.

Thanks,

Ethan

On Thu, Apr 14, 2016 at 11:14 PM, Matthias Boehm <mb...@us.ibm.com> wrote:
  Just for completeness: this issue is tracked as
  https://issues.apache.org/jira/browse/SYSTEMML-635, and the fix will be
  available tomorrow.

  Regards,
  Matthias

  Matthias Boehm---04/14/2016 07:53:43 PM---Hi Ethan, thanks for catching
  this issue. The parfor script itself is perfectly fine

  From: Matthias Boehm/Almaden/IBM@IBMUS
  To: dev@systemml.incubator.apache.org
  Cc: "Ethan Xu" <et...@gmail.com>
  Date: 04/14/2016 07:53 PM
  Subject: Re: parfor fails



  Hi Ethan,

  thanks for catching this issue. The parfor script itself is perfectly
  fine, but you encountered an interesting runtime bug. Usually, you can
  find the actual cause at the bottom of the stack trace or in previous
  exceptions. I was able to reproduce this issue if NO SystemML config file
  is provided (it fails on parsing this nonexistent config in the parfor MR
  job task setup). So the workaround is to put a SystemML-config.xml into
  the same directory. Interestingly, the issue did not show up in our
  test suite because we always specify a default configuration there (which
  was mandatory until recently).

  As a side note, we strongly recommend parfor over for loops here because
  it runs the entire loop in 1 instead of 2396 MR jobs due to automatic
  data partitioning. However, for the specific example at hand, a
  data-parallel formulation (with "s = colSums(x==0)") would be even better
  as it allows for partial aggregation and hence reduces shuffle.

  Regards,
  Matthias

  Ethan Xu ---04/14/2016 01:34:24 PM---Hello, I have a quick question. The
  following script fails with this error:

  From: Ethan Xu <et...@gmail.com>
  To: dev@systemml.incubator.apache.org
  Date: 04/14/2016 01:34 PM
  Subject: parfor fails



  Hello,

  I have a quick question. The following script fails with this error:

  org.apache.sysml.runtime.DMLRuntimeException: PARFOR: Failed to execute
  loop in parallel.

  Here is the dml script:

  x=read($X);

  print("number of rows of x = " + nrow(x));
  print("number of cols of x = " + ncol(x));

  parfor(i in 1:ncol(x), check=0){
    a = x[,i];
    print("number of 0's in col " + i + " = " + sum(a == 0));
  }

  where X is a 35 million by 2396 matrix (coded and dummy coded numerical
  matrix) on HDFS. The script runs fine with regular 'for' loops.

  Could someone explain why this script cannot run in parallel? Was it a
  wrong way to code parfor?

  Thanks,

  Ethan










Re: parfor fails

Posted by Ethan Xu <et...@gmail.com>.
Hi Matthias,

Thank you very much for the explanation and a better solution. s =
colSums(x==0) is more concise and works great!

As an experiment, I tried the original parfor script with a SystemML
configuration file provided. On my cluster it still fails with "PARFOR:
Failed to execute loop in parallel". It looks like the failed MR jobs are
caused by

Caused by: org.apache.sysml.runtime.DMLRuntimeException: Failed to create
non-existing local working directory:/path.to/ethan.xu/tmp/systemml

The directory '/path.to/ethan.xu/tmp/systemml' exists on the local server
and contains subdirectories named '_p22748_127.0.0.1' etc. It looks like
other SystemML jobs had no trouble writing to it.

The stderr and one failed MR log are attached.

Thanks,

Ethan


Re: parfor fails

Posted by Matthias Boehm <mb...@us.ibm.com>.
Just for completeness: this issue is tracked as
https://issues.apache.org/jira/browse/SYSTEMML-635, and the fix will be
available tomorrow.

Regards,
Matthias






Re: parfor fails

Posted by Matthias Boehm <mb...@us.ibm.com>.
Hi Ethan,

thanks for catching this issue. The parfor script itself is perfectly fine,
but you encountered an interesting runtime bug. Usually, you can find the
actual cause at the bottom of the stack trace or in previous exceptions. I
was able to reproduce this issue if NO SystemML config file is provided
(it fails on parsing this nonexistent config in the parfor MR job task
setup). So the workaround is to put a SystemML-config.xml into the same
directory. Interestingly, the issue did not show up in our test suite
because we always specify a default configuration there (which was
mandatory until recently).

As a side note, we strongly recommend parfor over for loops here because it
runs the entire loop in 1 instead of 2396 MR jobs due to automatic data
partitioning. However, for the specific example at hand, a data-parallel
formulation (with "s = colSums(x==0)") would be even better as it allows
for partial aggregation and hence reduces shuffle.

Regards,
Matthias
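
The data-parallel formulation suggested above could look roughly like the following sketch. It reuses the $X input parameter from the original script; the variable name s comes from the message, while the final print loop is an illustrative addition that only iterates over the small 1 x ncol(x) result vector.

```
# Read the input matrix, as in the original script.
x = read($X);

# (x == 0) yields a 0/1 indicator matrix; colSums adds it up per column,
# so s is a 1 x ncol(x) row vector holding the zero count of each column.
s = colSums(x == 0);

# Report the counts. This loop works on the small result vector only,
# not on the 35 million row input.
for (i in 1:ncol(s)) {
    print("number of 0's in col " + i + " = " + as.scalar(s[1, i]));
}
```

Compared with the parfor version, the column-wise counting happens in a single aggregate over x, which is what allows the partial aggregation mentioned in the message.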


