Posted to user@spark.apache.org by AlexanderRiggers <al...@gmail.com> on 2014/07/04 00:25:26 UTC

Sample datasets for MLlib and Graphx

Hello!

I want to play around with several different cluster settings and measure
performance for MLlib and GraphX, and was wondering if anybody here could
point me to datasets for these applications, from 5GB upwards?

I am mostly interested in SVM and Triangle Count, but would be glad for any
help.

Best regards,
Alex



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Sample-datasets-for-MLlib-and-Graphx-tp8760.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Sample datasets for MLlib and Graphx

Posted by Nick Pentreath <ni...@gmail.com>.
The Kaggle data is not in libsvm format so you'd have to do some transformation.
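
As a rough illustration of the kind of transformation meant here, below is a minimal Python sketch that turns a dense numeric CSV into LIBSVM lines ("<label> <index>:<value> ..."). It assumes a header row, a binary 0/1 label column named "label", and purely numeric features; the column names and the tiny in-memory sample are hypothetical, and real competition files will usually need categorical encoding or other feature work first.

```python
import csv
import io


def row_to_libsvm(label, features):
    """Format one example as a LIBSVM line: '<label> <index>:<value> ...'.

    Indices are 1-based and ascending; zero-valued features are omitted.
    """
    parts = [str(label)]
    for index, value in enumerate(features, start=1):
        if value != 0.0:
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


def csv_to_libsvm(csv_file, label_column="label"):
    """Yield LIBSVM lines from a CSV with a header row and numeric columns."""
    reader = csv.DictReader(csv_file)
    feature_names = [name for name in reader.fieldnames if name != label_column]
    for record in reader:
        label = int(record[label_column])
        # Empty cells are treated as 0.0 (an assumption, not a general rule).
        features = [float(record[name] or 0.0) for name in feature_names]
        yield row_to_libsvm(label, features)


# Hypothetical three-feature sample standing in for a real competition file.
sample = io.StringIO("label,f1,f2,f3\n1,0.5,0,2\n0,0,1.25,0\n")
libsvm_lines = list(csv_to_libsvm(sample))
print("\n".join(libsvm_lines))
```

Files written this way should be readable with MLlib's MLUtils.loadLibSVMFile, although that step has not been run here.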


The Criteo and KDD Cup datasets are, if I recall, fairly large. The Criteo ad prediction data is around 2-3GB compressed, I think.

To my knowledge these are the largest easily available public binary classification datasets I've come across (very happy to be proved wrong about this though :)

On Thu, Jul 3, 2014 at 4:39 PM, AlexanderRiggers
<al...@gmail.com> wrote:

> Nick Pentreath wrote
>> Take a look at Kaggle competition datasets
>> - https://www.kaggle.com/competitions
> I was looking for files in LIBSVM format and never found anything of a
> larger size on Kaggle. Most competitions I've seen need data processing and
> feature generation, but maybe I have to take a second look.
> Nick Pentreath wrote
>> For graph stuff the SNAP has large network
>> data: https://snap.stanford.edu/data/
> Thanks

Re: Sample datasets for MLlib and Graphx

Posted by AlexanderRiggers <al...@gmail.com>.
Nick Pentreath wrote
> Take a look at Kaggle competition datasets
> - https://www.kaggle.com/competitions

I was looking for files in LIBSVM format and never found anything of a
larger size on Kaggle. Most competitions I've seen need data processing and
feature generation, but maybe I have to take a second look.


Nick Pentreath wrote
> For graph stuff the SNAP has large network
> data: https://snap.stanford.edu/data/

Thanks




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Sample-datasets-for-MLlib-and-Graphx-tp8760p8762.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Sample datasets for MLlib and Graphx

Posted by Nick Pentreath <ni...@gmail.com>.
Take a look at Kaggle competition datasets - https://www.kaggle.com/competitions

For SVM there are a couple of ad click prediction datasets of pretty large size.

For graph stuff, SNAP has large network data: https://snap.stanford.edu/data/
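
For checking Triangle Count results on a small sample before moving to a cluster, here is a minimal single-machine Python sketch. It assumes a whitespace-separated "src dst" edge list with "#" comment lines, which is how many of the SNAP downloads appear to be laid out; the toy graph is hypothetical, and this is a plain reference counter, not the GraphX implementation.

```python
from itertools import combinations


def parse_edge_list(lines):
    """Build an undirected adjacency map from 'src dst' lines.

    Lines starting with '#' are treated as comments; self-loops and
    duplicate edges are dropped.
    """
    neighbors = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src, dst = (int(token) for token in line.split()[:2])
        if src == dst:
            continue
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)
    return neighbors


def count_triangles(neighbors):
    """Count each triangle once by only looking at neighbors with larger ids."""
    total = 0
    for vertex, adjacent in neighbors.items():
        higher = sorted(n for n in adjacent if n > vertex)
        for a, b in combinations(higher, 2):
            if b in neighbors[a]:
                total += 1
    return total


# Hypothetical toy graph: triangles (1,2,3) and (2,3,4), plus a dangling edge.
sample = """# FromNodeId ToNodeId
1 2
2 3
3 1
2 4
3 4
4 5
""".splitlines()

graph = parse_edge_list(sample)
print(count_triangles(graph))
```

The total from a counter like this on a small subgraph can be compared against the sum of per-vertex counts from GraphX divided by three, since each triangle is reported at all three of its vertices there.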



