Posted to user@mahout.apache.org by Wei Zhang <we...@us.ibm.com> on 2014/08/07 16:07:06 UTC

where to find representative workload to benchmark mahout


Hello,

I am interested in benchmarking Mahout on different hardware/software
platforms, and I am looking for real or synthetic datasets, ideally between
tens of GB and a couple of TB in size.

I am particularly interested in the K-means, (naive) Bayesian Network, and
Collaborative Filtering (ALS-WR) implementations.

I found some potentially interesting benchmarks (synthetic and real), but I
have never actually tried any of them. I would like to hear whether anyone
can recommend which one is better to use, in terms of ease of use and
validity, or whether there are other alternatives.

(1) BigDataBench from ICT, Chinese Academy of Sciences
http://prof.ict.ac.cn/BigDataBench/
It has benchmarks (real and synthetic) for all 3 applications that I am
interested in.
(2) HiBench from Intel
https://github.com/intel-hadoop/HiBench/wiki
It has data for K-means (synthetic)
(3) SNAP from Stanford
http://snap.stanford.edu/
It has data for collaborative filtering (real)

Thank you very much!

Wei