Posted to issues@systemml.apache.org by "Mike Dusenberry (JIRA)" <ji...@apache.org> on 2016/09/21 22:05:20 UTC
[jira] [Comment Edited] (SYSTEMML-946) OOM on spark dataframe-matrix / csv-matrix conversion
[ https://issues.apache.org/jira/browse/SYSTEMML-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511329#comment-15511329 ]
Mike Dusenberry edited comment on SYSTEMML-946 at 9/21/16 10:04 PM:
--------------------------------------------------------------------
[~mboehm7] This is using a slightly modified version of the LeNet example I wrote (using the library), plus some logic to perform a hyperparameter search over it. It has been working on small samples of the overall data, although today I am running into null pointer issues (SYSTEMML-948) with the small data that may be related to the updates for this JIRA.
Here's the code, using DML and the Python MLContext API. Basically, I have a **conversion** script that scales the features and one-hot encodes the labels, for both training and validation splits of the data. Then I have a **train** script that does a hyperparameter search by repeatedly sampling random values for the various hyperparameters and then trains a LeNet-like neural net, saving the accuracy to a file.
I want to be able to let this run for the next week without any errors.
Conversion code:
{code}
script = """
# Scale images to [-1, 1]
X = (X / 255) * 2 - 1
X_val = (X_val / 255) * 2 - 1
# One-hot encode the labels
num_tumor_classes = 3
n = nrow(Y)
n_val = nrow(Y_val)
Y = table(seq(1, n), Y, n, num_tumor_classes)
Y_val = table(seq(1, n_val), Y_val, n_val, num_tumor_classes)
"""
outputs = ("X", "X_val", "Y", "Y_val")
script = dml(script).input(X=X_df, X_val=X_val_df, Y=Y_df, Y_val=Y_val_df).output(*outputs)
X, X_val, Y, Y_val = ml.execute(script).get(*outputs)
{code}
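For reference, the {{table(seq(1, n), Y, n, num_tumor_classes)}} call builds a one-hot label matrix from 1-based class labels. A minimal NumPy sketch of the same semantics (the function name and example labels are illustrative, not from the script above):

```python
import numpy as np

def one_hot(y, num_classes):
    """Mimic DML's table(seq(1, n), Y, n, num_classes) for 1-based labels."""
    n = len(y)
    out = np.zeros((n, num_classes))
    # Shift the 1-based labels to 0-based column indices and set those cells to 1.
    out[np.arange(n), np.asarray(y, dtype=int) - 1] = 1
    return out

# Example: three samples with labels 1, 3, 2
print(one_hot([1, 3, 2], 3))
```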
Training:
{code}
script = """
source("mnist_lenet.dml") as clf
i = 0
run = TRUE
while(run) {
# Hyperparameters & Settings
lr = 10 ^ as.scalar(rand(rows=1, cols=1, min=-7, max=-1))
mu = as.scalar(rand(rows=1, cols=1, min=0.5, max=0.9))
decay = as.scalar(rand(rows=1, cols=1, min=0.9, max=1))
lambda = 10 ^ as.scalar(rand(rows=1, cols=1, min=-7, max=-1))
batch_size = 50
epochs = 1
iters = ceil(nrow(Y) / batch_size)
# Train
[W1, b1, W2, b2, W3, b3, W4, b4] = clf::train(X, Y, X_val, Y_val, C, Hin, Win, lr, mu, decay, lambda, batch_size, epochs, iters)
# Eval
probs = clf::predict(X, C, Hin, Win, W1, b1, W2, b2, W3, b3, W4, b4)
[loss, accuracy] = clf::eval(probs, Y)
probs_val = clf::predict(X_val, C, Hin, Win, W1, b1, W2, b2, W3, b3, W4, b4)
[loss_val, accuracy_val] = clf::eval(probs_val, Y_val)
# Save hyperparams
str = "lr: " + lr + ", mu: " + mu + ", decay: " + decay + ", lambda: " + lambda
name = "models/"+accuracy_val+","+accuracy+","+i
write(str, name)
i = i + 1
}
"""
script = dml(script).input(X=X, X_val=X_val, Y=Y, Y_val=Y_val, C=3, Hin=256, Win=256)
ml.execute(script)
{code}
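The {{10 ^ as.scalar(rand(rows=1, cols=1, min=-7, max=-1))}} pattern samples the learning rate and regularization strength log-uniformly, so every order of magnitude in [1e-7, 1e-1] is equally likely. A hedged Python equivalent of that sampling step (function name is illustrative):

```python
import random

def sample_log_uniform(low_exp, high_exp):
    """Sample 10^u with u uniform in [low_exp, high_exp], i.e. log-uniform over magnitudes."""
    return 10 ** random.uniform(low_exp, high_exp)

lr = sample_log_uniform(-7, -1)    # learning rate in [1e-7, 1e-1]
mu = random.uniform(0.5, 0.9)      # momentum is sampled plain-uniform, as in the DML above
print(lr, mu)
```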
{{mnist_lenet.dml}}: Attached
> OOM on spark dataframe-matrix / csv-matrix conversion
> -----------------------------------------------------
>
> Key: SYSTEMML-946
> URL: https://issues.apache.org/jira/browse/SYSTEMML-946
> Project: SystemML
> Issue Type: Bug
> Components: Runtime
> Reporter: Matthias Boehm
> Attachments: mnist_lenet.dml
>
>
> The decision on dense/sparse block allocation in our dataframeToBinaryBlock and csvToBinaryBlock data converters is based purely on sparsity. This works very well for the common case of tall & skinny matrices. However, in scenarios with dense data but a huge number of columns, a single partition will rarely have the 1000 rows needed to fill an entire row of blocks. This leads to unnecessary allocation and dense-sparse conversion, as well as potential out-of-memory errors, because the temporary memory requirement can be up to 1000x larger than the input partition.
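To make the 1000x figure concrete, a back-of-envelope sketch (the numbers are illustrative, not taken from the issue): with a block size of 1000 rows, a partition contributing only a single row of a very wide dense matrix still triggers allocation of a full 1000-row block row.

```python
# Illustrative back-of-envelope for the worst case described above.
block_rows = 1000          # rows per block row in SystemML's binary-block format
cols = 100_000             # wide, dense matrix
rows_in_partition = 1      # a partition may contribute a single row per block row
bytes_per_cell = 8         # dense doubles

partition_bytes = rows_in_partition * cols * bytes_per_cell
allocated_bytes = block_rows * cols * bytes_per_cell  # full row of blocks allocated up front

blowup = allocated_bytes / partition_bytes
print(blowup)  # -> 1000.0
```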
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)