Posted to issues@systemml.apache.org by "Mike Dusenberry (JIRA)" <ji...@apache.org> on 2016/09/21 22:05:20 UTC

[jira] [Comment Edited] (SYSTEMML-946) OOM on spark dataframe-matrix / csv-matrix conversion

    [ https://issues.apache.org/jira/browse/SYSTEMML-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511329#comment-15511329 ] 

Mike Dusenberry edited comment on SYSTEMML-946 at 9/21/16 10:04 PM:
--------------------------------------------------------------------

[~mboehm7]  This uses a slightly modified version of the LeNet example I wrote (using the library), plus some logic to perform a hyperparameter search over it.  It has been working on small samples of the overall data (although today I am running into the null pointer issues in SYSTEMML-948 on the small data, which may be related to the updates for this JIRA).

Here's the code, using DML and the Python MLContext API.  Basically, I have a *conversion* script that scales the features and one-hot encodes the labels for both the training and validation splits of the data.  Then I have a *train* script that performs a hyperparameter search: it repeatedly samples random values for the various hyperparameters, trains a LeNet-like neural net, and saves the resulting accuracy to a file.

I want to be able to let this run for the next week without any errors. 

Conversion code:
{code}
script = """
# Rescale images from [0, 255] to [-1, 1]
X = (X / 255) * 2 - 1
X_val = (X_val / 255) * 2 - 1

# One-hot encode the labels
num_tumor_classes = 3
n = nrow(Y)
n_val = nrow(Y_val)
Y = table(seq(1, n), Y, n, num_tumor_classes)
Y_val = table(seq(1, n_val), Y_val, n_val, num_tumor_classes)
"""
outputs = ("X", "X_val", "Y", "Y_val")
script = dml(script).input(X=X_df, X_val=X_val_df, Y=Y_df, Y_val=Y_val_df).output(*outputs)
X, X_val, Y, Y_val = ml.execute(script).get(*outputs)
{code}
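
For context, here is roughly how the inputs above get wired up.  This is a minimal sketch; the file paths, CSV schema (one flattened 256x256x3 image per row), and the pre-existing {{sc}}/{{spark}} handles are assumptions for illustration, not the actual pipeline:
{code}
# Hypothetical setup sketch -- paths, schema, and variable names are
# assumptions, not the actual pipeline.  Assumes an existing SparkContext
# `sc` and SparkSession `spark`.
from systemml import MLContext, dml

ml = MLContext(sc)

# One flattened 256x256x3 image per row (3 * 256 * 256 = 196,608 columns);
# labels are single-column DataFrames with values in {1, 2, 3}.
X_df = spark.read.csv("train_images.csv", inferSchema=True)
Y_df = spark.read.csv("train_labels.csv", inferSchema=True)
X_val_df = spark.read.csv("val_images.csv", inferSchema=True)
Y_val_df = spark.read.csv("val_labels.csv", inferSchema=True)
{code}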

Training:
{code}
script = """
source("mnist_lenet.dml") as clf

i = 0
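# Search indefinitely (run is never set to FALSE); each iteration samples a
# fresh hyperparameter setting, trains, and writes the result to a file.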
run = TRUE
while(run) {
  # Hyperparameters & Settings
  lr = 10 ^ as.scalar(rand(rows=1, cols=1, min=-7, max=-1))
  mu = as.scalar(rand(rows=1, cols=1, min=0.5, max=0.9))
  decay = as.scalar(rand(rows=1, cols=1, min=0.9, max=1))
  lambda = 10 ^ as.scalar(rand(rows=1, cols=1, min=-7, max=-1))
  batch_size = 50
  epochs = 1
  iters = ceil(nrow(Y) / batch_size)

  # Train
  [W1, b1, W2, b2, W3, b3, W4, b4] = clf::train(X, Y, X_val, Y_val, C, Hin, Win, lr, mu, decay, lambda, batch_size, epochs, iters)

  # Eval
  probs = clf::predict(X, C, Hin, Win, W1, b1, W2, b2, W3, b3, W4, b4)
  [loss, accuracy] = clf::eval(probs, Y)
  probs_val = clf::predict(X_val, C, Hin, Win, W1, b1, W2, b2, W3, b3, W4, b4)
  [loss_val, accuracy_val] = clf::eval(probs_val, Y_val)

  # Save hyperparams
  str = "lr: " + lr + ", mu: " + mu + ", decay: " + decay + ", lambda: " + lambda
  name = "models/"+accuracy_val+","+accuracy+","+i
  write(str, name)
  i = i + 1
}
"""
script = dml(script).input(X=X, X_val=X_val, Y=Y, Y_val=Y_val, C=3, Hin=256, Win=256)
ml.execute(script)
{code}
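
One detail worth calling out: the {{10 ^ rand(...)}} pattern samples {{lr}} and {{lambda}} log-uniformly, the usual choice for hyperparameters that span several orders of magnitude.  A minimal DML illustration:
{code}
# The exponent is uniform in [-7, -1], so the sampled value is uniform on a
# log scale over [1e-7, 1e-1] instead of clustering near the upper end.
e = as.scalar(rand(rows=1, cols=1, min=-7, max=-1))
lr = 10 ^ e
print("sampled lr: " + lr)
{code}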

{{mnist_lenet.dml}}: Attached


> OOM on spark dataframe-matrix / csv-matrix conversion
> -----------------------------------------------------
>
>                 Key: SYSTEMML-946
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-946
>             Project: SystemML
>          Issue Type: Bug
>          Components: Runtime
>            Reporter: Matthias Boehm
>         Attachments: mnist_lenet.dml
>
>
> The decision on dense/sparse block allocation in our dataframeToBinaryBlock and csvToBinaryBlock data converters is based purely on the sparsity. This works very well for the common case of tall & skinny matrices. However, for scenarios with dense data but a huge number of columns, a single partition will rarely have the 1000 rows needed to fill an entire row of blocks. This leads to unnecessary allocation and dense-sparse conversion, as well as potential out-of-memory errors, because the temporary memory requirement can be up to 1000x larger than the input partition.
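
To make the up-to-1000x figure concrete for this workload (256x256x3 images, i.e., 196,608 columns), assuming SystemML's default 1000x1000 block size; this is a back-of-the-envelope sketch, not a measurement:
{code}
bytes per row            = 196,608 cols * 8 B         ~= 1.5 MB
one row of dense blocks  = 1000 rows * 1.5 MB         ~= 1.5 GB  (~197 column blocks)
128 MB input partition   => ~85 rows  => temp memory  ~= 12x the partition
1-row partition (worst)  =>              temp memory  ~= 1000x the partition
{code}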


