Posted to dev@mahout.apache.org by Robin Anil <ro...@gmail.com> on 2009/10/08 14:09:47 UTC
Example Datasets
We need a central place for all sample datasets used for examples and unit
tests. I am against putting them in the repo.
Any suggestions?
Robin
Re: Example Datasets
Posted by Robin Anil <ro...@gmail.com>.
Take a look at this repository: http://fimi.cs.helsinki.fi/data/
I am specifically talking about the retail and accidents datasets. I am using
a modified (comma-separated) version of them for FPGrowth testing.
The webdocs dataset looks good enough to be used for parallel FPGrowth
testing.
The question is: shall I fetch them from the URL and then convert them to the
required format,
or keep the converted files somewhere like
people.apache.org/~robinanil/datasets/ or a location dedicated to Mahout?
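For illustration, the conversion step could be as small as the sketch below. It assumes the source files hold one transaction per line with item ids separated by whitespace, and that the "required format" is simply the same items joined by commas. The function name fimi_to_csv is hypothetical and not part of Mahout.

```python
def fimi_to_csv(lines):
    """Convert whitespace-separated transaction lines to comma-separated lines.

    Assumption: each input line is one transaction whose items are
    separated by spaces or tabs. Blank lines are dropped.
    """
    converted = []
    for line in lines:
        items = line.split()  # splits on any run of whitespace
        if items:
            converted.append(",".join(items))
    return converted


if __name__ == "__main__":
    import sys

    # Read transactions from stdin and write the converted form to stdout.
    for row in fimi_to_csv(sys.stdin):
        print(row)
```

Run as a filter, this reads the downloaded file on stdin and writes the comma-separated version to stdout, so the converted copy never needs to be stored anywhere permanent.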
On Thu, Oct 8, 2009 at 5:39 PM, Robin Anil <ro...@gmail.com> wrote:
> We need a central place for all sample datasets used for examples and unit
> tests? I am against putting it in the repo
> Any suggestions?
>
> Robin
>
Re: Example Datasets
Posted by Ted Dunning <te...@gmail.com>.
For redistributable data, we should definitely lock down a version in our
distribution or an associated one. This is true if only to make sure that
we don't get surprised by somebody rearranging their web site.
For non-redistributable but available data, I think having a download
procedure that sucks the data down from a URL is fine. There is probably
even a Maven life-cycle phase that is appropriate (process-test-resources or
some such).
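A minimal sketch of such a download procedure, under stated assumptions: the data is fetched once into a local cache, and an optional SHA-256 checksum pins the version so that a rearranged or changed upstream file is noticed rather than silently used. The names fetch_dataset and sha256_of, the cache location, and the checksum value are all hypothetical; nothing here is an existing Mahout or Maven facility.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_dataset(url, dest, expected_sha256=None):
    """Download url to dest unless dest already exists, then verify it.

    expected_sha256 (hypothetical, recorded by whoever pins the dataset)
    is checked on every call, including when the cached copy is reused.
    """
    dest = Path(dest)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        tmp = dest.with_name(dest.name + ".part")
        # Write to a temporary name first so an interrupted download
        # is never mistaken for a complete cached file.
        with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
            shutil.copyfileobj(response, out)
        tmp.replace(dest)
    if expected_sha256 is not None:
        actual = sha256_of(dest)
        if actual != expected_sha256:
            raise ValueError(f"checksum mismatch for {dest}: got {actual}")
    return dest
```

A build step or test fixture could call fetch_dataset with the dataset URL and a cache path under the build directory before the tests run; the second and later runs reuse the cached file and only re-check the checksum.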
On Thu, Oct 8, 2009 at 5:32 AM, Sean Owen <sr...@gmail.com> wrote:
> Several data sets I use have distribution clauses that forbid or
> complicate redistribution, so not sure I can do that. Of course we
> should check that on any other data set.
>
> On Thu, Oct 8, 2009 at 1:09 PM, Robin Anil <ro...@gmail.com> wrote:
> > We need a central place for all sample datasets used for examples and
> > unit tests? I am against putting it in the repo
> > Any suggestions?
> >
> > Robin
> >
>
--
Ted Dunning, CTO
DeepDyve
Re: Example Datasets
Posted by Sean Owen <sr...@gmail.com>.
Several data sets I use have distribution clauses that forbid or
complicate redistribution, so I am not sure I can do that. Of course we
should check that for any other data set as well.
On Thu, Oct 8, 2009 at 1:09 PM, Robin Anil <ro...@gmail.com> wrote:
> We need a central place for all sample datasets used for examples and unit
> tests? I am against putting it in the repo
> Any suggestions?
>
> Robin
>
Re: Example Datasets
Posted by Isabel Drost <is...@apache.org>.
On Thu, 8 Oct 2009 17:39:47 +0530
Robin Anil <ro...@gmail.com> wrote:
> We need a central place for all sample datasets used for examples and
> unit tests? I am against putting it in the repo
> Any suggestions?
The data in question is the following:
http://fimi.cs.helsinki.fi/data/ (retail, accidents, webdocs). Retail
and accidents do not clearly include information on redistribution. I
cannot find any license note on webdocs.
I would suggest writing to the authors and requesting permission to
redistribute as part of Mahout to be on the safe side.
Isabel