Posted to user@spark.apache.org by Chih-Jen Lin <cj...@csie.ntu.edu.tw> on 2014/10/24 04:33:50 UTC

large benchmark sets for MLlib experiments

Hi MLlib users,

In August, when I gave a talk at Databricks, Xiangrui mentioned the
need for large public data sets for the development of MLlib.
At the moment, many people use problems from the LIBSVM data sets for
experiments. The file sizes of the larger ones (e.g., kddb) are about
20-30 GB.

To fulfill that need, we have provided a much larger data set,
"splice_site", on the LIBSVM data sets page
(http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/).
The training file is around 600 GB, while the test file is 300 GB.
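
For MLlib users who want to try it, here is a minimal sketch (not an
official recipe) of loading the LIBSVM-format file in the spark-shell,
where sc is already defined; the HDFS path is only a placeholder, so
point it at wherever you stored the training file:

  import org.apache.spark.mllib.util.MLUtils

  // Minimal sketch: load the LIBSVM-format training file as an
  // RDD[LabeledPoint]. The path below is a placeholder.
  val training = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/splice_site")
  println("number of training examples: " + training.count())

Given the 600 GB size, you will of course want the file on a
distributed file system and a cluster with enough aggregate memory or
disk to hold it.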

The full set of the same problem (3 TB) was used in
"A Reliable Effective Terascale Linear Learning System"
by Agarwal et al. The original set is from
"COFFIN: A Computational Framework for Linear SVMs" by
Sonnenburg and Franc.

Please note that this problem is highly unbalanced, so
accuracy is NOT a suitable criterion. You may use
auPRC (area under the precision-recall curve) instead.
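
In MLlib, one way to get auPRC is BinaryClassificationMetrics. A small
sketch, assuming you already have an RDD of (score, label) pairs from
whatever classifier you trained (the names here are just illustrative):

  import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
  import org.apache.spark.rdd.RDD

  // scoreAndLabels: (raw score or probability, true 0/1 label) pairs
  // produced by your model on the test set.
  def auPRC(scoreAndLabels: RDD[(Double, Double)]): Double = {
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    metrics.areaUnderPR()  // area under the precision-recall curve
  }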

We thank the data providers, Olivier Chapelle for providing a script
for data generation, and my students for their help.

We will keep adding more large sets in the future.

Enjoy,
Chih-Jen

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org