Posted to dev@mahout.apache.org by Robin Anil <ro...@gmail.com> on 2009/10/08 14:09:47 UTC

Example Datasets

We need a central place for all sample datasets used for examples and unit
tests. I am against putting them in the repo.
Any suggestions?

Robin

Re: Example Datasets

Posted by Robin Anil <ro...@gmail.com>.
Take a look at this repo: http://fimi.cs.helsinki.fi/data/
I am specifically talking about the retail and accidents datasets. I am using
a modified (comma-separated) version of them for FPGrowth testing.
The webdocs dataset looks good enough to be used for parallel FPGrowth
testing.

The question is: shall I use the URL to fetch them and then convert them to
the required format,
or keep the converted files in a repo such as
people.apache.org/~robinanil/datasets/ or something dedicated to Mahout?
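
The conversion step mentioned above could look roughly like the following
Python sketch. It assumes the source files hold one transaction per line with
whitespace-separated item ids, and that the "required format" is simply the
same items joined by commas. The function name is hypothetical and this is
not the code actually used for the FPGrowth tests.

```python
def convert_transactions(lines):
    """Yield comma-separated transactions from whitespace-separated ones.

    Assumes one transaction per input line; blank lines are skipped.
    """
    for line in lines:
        items = line.split()  # split on any run of whitespace
        if items:
            yield ",".join(items)
```

Feeding it an open file object and writing each yielded row to an output file
would convert a whole dataset without loading it into memory.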



On Thu, Oct 8, 2009 at 5:39 PM, Robin Anil <ro...@gmail.com> wrote:

> We need a central place for all sample datasets used for examples and unit
> tests. I am against putting them in the repo.
> Any suggestions?
>
> Robin
>

Re: Example Datasets

Posted by Ted Dunning <te...@gmail.com>.
For redistributable data, we should definitely lock down a version in our
distribution or an associated one.  This is true if only to make sure that
we don't get surprised by somebody rearranging their web site.

For non-redistributable but available data, I think having a download
procedure that pulls the data down from a URL is fine.  There is probably
even a Maven lifecycle phase that is appropriate (process-test-resources or
some such).
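
A download procedure of this kind might be sketched as below. This is a
Python illustration rather than a Maven plugin configuration. The function
names, the cache location, and the use of a SHA-256 checksum to detect a
changed upstream file are all assumptions made for the example, not anything
that exists in Mahout.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_dataset(url: str, target: Path, expected_sha256: str) -> Path:
    """Download url to target unless a verified copy is already cached.

    Raises ValueError if the downloaded bytes do not match the checksum,
    for example because the upstream site changed the file.
    """
    if target.exists() and sha256_of(target) == expected_sha256:
        return target  # cached copy is intact, no network access needed
    target.parent.mkdir(parents=True, exist_ok=True)
    partial = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, partial.open("wb") as out:
        shutil.copyfileobj(response, out)
    actual = sha256_of(partial)
    if actual != expected_sha256:
        partial.unlink()  # never leave an unverified file in the cache
        raise ValueError(f"checksum mismatch for {url}: got {actual}")
    partial.replace(target)  # move into place only after verification
    return target
```

Recording the expected checksum next to the URL gives some of the benefit of
locking down a version even when the data itself cannot be redistributed: a
test run fails loudly if the remote file is rearranged or modified.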

On Thu, Oct 8, 2009 at 5:32 AM, Sean Owen <sr...@gmail.com> wrote:

> Several data sets I use have distribution clauses that forbid or
> complicate redistribution, so I am not sure I can do that. Of course, we
> should check that for any other data set as well.
>
> On Thu, Oct 8, 2009 at 1:09 PM, Robin Anil <ro...@gmail.com> wrote:
> > We need a central place for all sample datasets used for examples and
> > unit tests. I am against putting them in the repo.
> > Any suggestions?
> >
> > Robin
> >
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Example Datasets

Posted by Sean Owen <sr...@gmail.com>.
Several data sets I use have distribution clauses that forbid or
complicate redistribution, so I am not sure I can do that. Of course, we
should check that for any other data set as well.

On Thu, Oct 8, 2009 at 1:09 PM, Robin Anil <ro...@gmail.com> wrote:
> We need a central place for all sample datasets used for examples and unit
> tests. I am against putting them in the repo.
> Any suggestions?
>
> Robin
>

Re: Example Datasets

Posted by Isabel Drost <is...@apache.org>.
On Thu, 8 Oct 2009 17:39:47 +0530
Robin Anil <ro...@gmail.com> wrote:

> We need a central place for all sample datasets used for examples and
> unit tests. I am against putting them in the repo.
> Any suggestions?

The data in question is the following:

http://fimi.cs.helsinki.fi/data/ (retail, accidents, webdocs). Retail
and accidents do not clearly include information on redistribution. I
cannot find any license note on webdocs.

I would suggest writing to the authors and requesting permission to
redistribute as part of Mahout to be on the safe side.

Isabel