Posted to dev@systemds.apache.org by Shafaq Siddiqi <sh...@tugraz.at.INVALID> on 2021/03/24 14:04:55 UTC

Refactoring datasets in SystemDS

Hi,

Some of the test suites in SystemDS use external data files that are 
stored in the test package alongside the test files. I have observed 
that some tests use a dataset that resides in another test package. For 
example, the Iris dataset is used by the gmm and gmmPredict tests but is 
stored inside the transform test package. The same is true of the Salary 
dataset, which is used by several different tests.

In my opinion, it would be better to store all datasets inside the 
resource folder. The existing datasets would then be available up front 
and could be reused instead of introducing a new dataset every now and 
then, and referencing datasets across test suites would become simpler.


--br,
Shafaq Siddiqi

Re: Refactoring datasets in SystemDS

Posted by Mark Dokter <md...@know-center.at>.


On 24.03.2021 17:15, Matthias Boehm wrote:
> thanks for bringing this up - sounds good to me as well.
> 

+1
I also think it's a good suggestion.

However, I see another cause of repository bloat here. As with the
binaries, the datasets could be separated out, since not everybody who
checks out the source necessarily needs them. *If* Maven can be
configured to download the needed files on the first mvn package from a
third-party source (which we control, of course), that would be great
and we could modularize a bit better.
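
To make the download-on-first-use idea concrete, here is a minimal
sketch. It is not Maven configuration; it shows the same caching
behaviour in plain Java, as test setup code might do it. The class name
DatasetCache, the method fetchIfMissing, and the cache directory layout
are all hypothetical and not part of SystemDS.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of SystemDS: fetches a dataset only if
// it is not already present in a local cache directory.
class DatasetCache {

    /**
     * Returns the cached copy of fileName under cacheDir, downloading it
     * from source first if no cached copy exists yet.
     */
    static Path fetchIfMissing(URI source, Path cacheDir, String fileName)
            throws IOException {
        Path target = cacheDir.resolve(fileName);
        if (Files.exists(target)) {
            return target; // cache hit: the source is not contacted
        }
        Files.createDirectories(cacheDir);
        // Download into a temporary file first, so that an interrupted
        // transfer never leaves a truncated file under the final name.
        Path partial = Files.createTempFile(cacheDir, fileName, ".part");
        try (InputStream in = source.toURL().openStream()) {
            Files.copy(in, partial, StandardCopyOption.REPLACE_EXISTING);
        }
        Files.move(partial, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }
}
```

A test would call DatasetCache.fetchIfMissing(...) with the URI of the
hosted file and a cache directory outside the repository, so only the
first run pays for the download.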

> Regards,
> Matthias
> 

Regards, Mark


Re: Refactoring datasets in SystemDS

Posted by Matthias Boehm <mb...@gmail.com>.

thanks for bringing this up - sounds good to me as well.

Regards,
Matthias

On 3/24/2021 3:21 PM, arnab phani wrote:
> I agree.
> That way, we don't need to look through the folders for datasets while
> writing a new test.
> In addition to that, is it possible to write the test functions in a way
> that the test will automatically apply to all the datasets in /resource? If
> so, then it will be much easier to test with a new dataset --- we will just
> need to add it in the designated folder.
> 
> Regards,
> Arnab..

Re: Refactoring datasets in SystemDS

Posted by arnab phani <ph...@gmail.com>.

I agree.
That way, we don't need to look through the folders for datasets while
writing a new test.
In addition to that, is it possible to write the test functions in such a
way that a test automatically applies to all the datasets in /resource? If
so, it would be much easier to test with a new dataset --- we would just
need to add it to the designated folder.
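
As a rough sketch of how that could work: a helper lists every dataset
file in one folder, and a test then loops over the result. The class
name DatasetDiscovery, the method listCsvDatasets, and the restriction
to *.csv files are assumptions made for illustration, not existing
SystemDS code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of SystemDS: discovers the datasets a
// test should run against by listing a single shared folder.
class DatasetDiscovery {

    /**
     * Returns every regular file ending in ".csv" directly under
     * datasetDir, sorted by path so the test order is deterministic.
     */
    static List<Path> listCsvDatasets(Path datasetDir) throws IOException {
        // Files.list keeps a directory handle open, so close the stream.
        try (Stream<Path> entries = Files.list(datasetDir)) {
            return entries
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().endsWith(".csv"))
                .sorted()
                .collect(Collectors.toList());
        }
    }
}
```

In JUnit 4, such a list could be returned from a method annotated with
@Parameterized.Parameters, so that each file becomes its own test case.
Not every test makes sense for every dataset, though, so some filtering
per test would probably still be needed.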

Regards,
Arnab..
