Posted to user@spark.apache.org by Bibudh Lahiri <bi...@gmail.com> on 2016/05/05 22:37:06 UTC

How long should logistic regression take on this data?

Hi,
  I am doing the following exercise: I have 100 million labeled records
(total 2.7 GB data) in LibSVM (sparse) format, split across 200 files on
HDFS (each file ~14 MB), so each file has about 500K records. Only 50K of
these 100 million are labeled as "positive", and the rest are all
"negative". I am taking a sample of 50K from the "negative" set, merging it
with the 50K positives, and splitting the result into a 50% training set and
a 50% test set (50K records each). I am training an Elastic Net logistic
regression (with regularization turned off) on the training set, testing
its performance on the 50K test data points, and then applying the model to
the rest of the data (100 million - 100K records) to find the
class-conditional probabilities of those examples being positive.
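
  For reference, the sizes described above work out roughly as follows. This
is only a back-of-the-envelope sketch in plain Python, not the Spark job
itself; the variable names are illustrative, and treating 1 GB as 10^9 bytes
is an assumption.

```python
# Back-of-the-envelope sizes for the dataset described above.
# All names are illustrative; the figures come from the description.

TOTAL_RECORDS = 100_000_000   # labeled records overall
NUM_FILES = 200               # LibSVM files on HDFS
TOTAL_BYTES = 2.7e9           # ~2.7 GB, assuming 1 GB = 10^9 bytes
POSITIVES = 50_000            # records labeled "positive"
NEGATIVE_SAMPLE = 50_000      # negatives drawn to balance the classes

# Per-file and per-record sizes.
records_per_file = TOTAL_RECORDS // NUM_FILES
bytes_per_record = TOTAL_BYTES / TOTAL_RECORDS

# Fraction of the negative class needed to draw ~50K negatives.
negatives = TOTAL_RECORDS - POSITIVES
sample_fraction = NEGATIVE_SAMPLE / negatives

# Balanced set, split 50/50 into training and test.
balanced = POSITIVES + NEGATIVE_SAMPLE
train_size = balanced // 2
test_size = balanced - train_size

# Records left over to score with the trained model.
to_score = TOTAL_RECORDS - balanced

print(f"records per file:      {records_per_file:,}")
print(f"bytes per record:      {bytes_per_record:.1f}")
print(f"negative sample frac.: {sample_fraction:.6f}")
print(f"train / test sizes:    {train_size:,} / {test_size:,}")
print(f"records to score:      {to_score:,}")
```

  The sampling fraction comes out to about 0.0005, so nearly all of the
input is read only to be discarded before training; the model itself is fit
on just 50K records.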

  I have a 2-node cluster: one node is set up as the master and both nodes
run workers, with 10 GB of executor memory per node and 10 GB of driver
memory. My Hadoop cluster runs on the same machines as my Spark cluster. My
Spark application aborts after running for more than 3 hours, and in those 3
hours it does not even reach the logistic regression part; all of the time
goes into the sampling, filtering and merging. Any ballpark for how long
it should take? Are there any known benchmarks for logistic regression?

-- 
Bibudh Lahiri
Senior Data Scientist, Impetus Technologies
720 University Avenue, Suite 130
Los Gatos, CA 95129
http://knowthynumbers.blogspot.com/