Posted to derby-dev@db.apache.org by Igor Wiese <ig...@gmail.com> on 2015/12/10 01:04:05 UTC

Feedback of my Phd work in Derby

Hi, Derby Community.

My name is Igor Wiese, and I am a PhD student from Brazil. In my research I am
investigating two important questions: What makes two files change
together? Can we predict when they are going to co-change again?

I have investigated these questions on the Derby project. I collected data
from issue reports, discussions and commits, and used machine learning
techniques to build a prediction model.

I collected a total of 5266 commits in which a pair of files changed
together and could correctly predict 86% of them. The following were the
most useful pieces of information for predicting co-changes of files:

- number of lines of code added,

- number of lines of code removed,

- total number of lines of code added, modified and removed,

- number of words used to describe and discuss the issues, and

- median value of closeness, a social network measure obtained from issue
comments.

To illustrate, consider the following example from our analysis. For
release 10.10, the files "sql/catalog/DataDictionaryImpl.java" and
"impl/storeless/EmptyDictionary.java" changed together in 7 commits. In
another 4 commits, only the first file changed, but not the second.
Collecting contextual information for each commit made to the first file in
the previous release, we were able to predict all 7 commits in which both files
changed together in release 10.10, and we only issued 2 wrong predictions.
For this pair of files, the most important contextual information was the
number of lines of code added, removed and modified in each commit, and a
social network measure (constraint) obtained from issue comments.
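
As a rough illustration, the counts in this example can be expressed as precision and recall. The sketch below assumes that the 2 wrong predictions were false alarms (a co-change was predicted but the second file did not change); the class and method names are hypothetical and are not part of the study's tooling.

```java
// Hypothetical helper for illustration only; not part of the study's tooling.
final class CoChangeMetrics {

    // Fraction of issued co-change predictions that were correct.
    static double precision(int truePositives, int falsePositives) {
        return (double) truePositives / (truePositives + falsePositives);
    }

    // Fraction of actual co-changes that were predicted.
    static double recall(int truePositives, int falseNegatives) {
        return (double) truePositives / (truePositives + falseNegatives);
    }

    public static void main(String[] args) {
        int truePositives = 7;  // co-change commits that were predicted
        int falsePositives = 2; // assumption: the 2 wrong predictions were false alarms
        int falseNegatives = 0; // all 7 co-change commits were predicted

        System.out.println("precision = " + precision(truePositives, falsePositives));
        System.out.println("recall    = " + recall(truePositives, falseNegatives));
    }
}
```

Under that assumption, precision for this pair is 7/9 (about 0.78) and recall is 7/7 (1.0).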

- Do these results surprise you? Can you think of any explanation for the
results?

- Do you think that our rate of prediction is good enough to be used for
building tool support for the software community?

- Do you have any suggestions on what can be done to improve the change
recommendation?

You can visit our webpage to inspect the results in detail:
http://flosscoach.com/index.php/17-cochanges/69-derby

All the best,
Igor Wiese
PhD Candidate

Re: Feedback of my Phd work in Derby

Posted by Rick Hillegas <ri...@gmail.com>.
Hi Igor,

One comment inline...

On 12/9/15 4:04 PM, Igor Wiese wrote:
> - Do these results surprise you? Can you think of any explanation for
> the results?
>
These results do not surprise me. That is because DataDictionaryImpl and 
EmptyDictionary are both implementations of the DataDictionary interface. 
This is what happens during development:

1) Someone wants to add a language feature which requires new metadata 
capabilities.

2) The new capabilities are added to the real catalog implementation, 
which is DataDictionaryImpl.

3) In order to use the new capabilities, they must be exposed to other 
Derby components by having corresponding methods added to the 
DataDictionary interface.

4) That, in turn, forces the developer to add a vacuous stub method to 
EmptyDictionary.
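
The steps above can be sketched in Java roughly as follows. This is a simplified stand-in, not Derby's actual code: the three type names mirror the ones discussed here, but the methods describeTable and describeSequence are hypothetical and chosen only for illustration.

```java
// Simplified stand-in for the DataDictionary interface.
// Both method names are hypothetical, not Derby's real signatures.
interface DataDictionary {
    String describeTable(String name);

    // Step 3: the new capability is exposed to other components.
    String describeSequence(String name);
}

// Stand-in for the real catalog implementation.
final class DataDictionaryImpl implements DataDictionary {
    @Override
    public String describeTable(String name) {
        return "table " + name;
    }

    // Step 2: the real implementation of the new capability.
    @Override
    public String describeSequence(String name) {
        return "sequence " + name;
    }
}

// Stand-in for the storeless implementation.
final class EmptyDictionary implements DataDictionary {
    @Override
    public String describeTable(String name) {
        return null;
    }

    // Step 4: a vacuous stub, added only so that this class still compiles
    // after the interface gained a new method.
    @Override
    public String describeSequence(String name) {
        return null;
    }
}
```

Because the compiler rejects any class that does not implement every method of the interface, a commit that adds a method to the interface has to touch both implementations, which would account for the co-change pattern.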

I don't know if anyone uses the EmptyDictionary. At this point, it may 
be nothing more than a tax which has to be paid every time someone 
touches the data dictionary. EmptyDictionary is part of the storeless 
implementation of Derby which was apparently introduced in order to let 
people use the Derby parser to validate SQL syntax without actually 
running queries. That, at least, is the motivation described by 
http://mail-archives.apache.org/mod_mbox/db-derby-user/200612.mbox/%3C45704D0E.9030102@apache.org%3E 
and https://issues.apache.org/jira/browse/DERBY-2164. There are other 
solutions to that problem which have received more uptake in the 
community. See, for instance, 
https://issues.apache.org/jira/browse/DERBY-3946

Hope this explanation is useful,
-Rick