Posted to derby-dev@db.apache.org by Igor Wiese <ig...@gmail.com> on 2015/12/14 23:59:33 UTC

Information from Derby Community

Hi, Derby Community.

My name is Igor Wiese, a PhD student from Brazil. I sent an email a week
ago about my research. We received some visits to inspect the results,
but no feedback was provided.

I am investigating two important questions: What makes two files
change together? Can we predict when they are going to co-change
again?


I've tried to investigate these questions on the Derby project. I've
collected data from issue reports, discussions and commits, and used
some machine learning techniques to build a prediction model.


I collected a total of 5266 commits in which a pair of files changed
together and could correctly predict 86% of those commits. The most
useful pieces of information for predicting co-changes of files were:

- number of lines of code added,

- number of lines of code removed,

- sum of number of lines of code added, modified and removed,

- number of words used to describe and discuss the issues, and

- median value of closeness, a social network measure obtained from
issue comments.


To illustrate, consider the following example from our analysis. For
release 10.10, the files "sql/catalog/DataDictionaryImpl.java" and
"impl/storeless/EmptyDictionary.java" changed together in 7 commits.
In another 4 commits, only the first file changed, but not the second.
Collecting contextual information for each commit made to the first file
in the previous release, we were able to predict all 7 commits in
which both files changed together in release 10.10, and we only issued
2 wrong predictions. For this pair of files, the most important
contextual information was the number of lines of code added, removed
and modified in each commit, and a social network measure (constraint)
obtained from issue comments.
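
For readers who want to turn the numbers in this example into standard
metrics, here is a minimal sketch in Java. It assumes that the 2 wrong
predictions were false alarms among the 4 commits in which only the first
file changed; the message does not state this explicitly. The class and
method names are hypothetical and not part of the study's tooling.

```java
// Hypothetical helper for illustration only.
class CoChangeMetrics {
    // Fraction of issued co-change predictions that were correct.
    static double precision(int truePositives, int falsePositives) {
        return (double) truePositives / (truePositives + falsePositives);
    }

    // Fraction of actual co-changes that were predicted.
    static double recall(int truePositives, int falseNegatives) {
        return (double) truePositives / (truePositives + falseNegatives);
    }

    public static void main(String[] args) {
        int truePositives = 7;   // co-change commits predicted correctly (from the message)
        int falseNegatives = 0;  // all 7 co-changes were predicted (from the message)
        int falsePositives = 2;  // ASSUMPTION: the 2 wrong predictions were false alarms

        System.out.println("precision = " + precision(truePositives, falsePositives));
        System.out.println("recall    = " + recall(truePositives, falseNegatives));
    }
}
```

Under that assumption, precision is 7/9 (about 0.78) and recall is 7/7
(1.0) for this pair of files.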


- Do these results surprise you? Can you think of any explanation for
the results?

- Do you think that our rate of prediction is good enough to be used
for building tool support for the software community?

- Do you have any suggestions on what can be done to improve the change
recommendation?


You can visit our webpage to inspect the results in detail:
http://flosscoach.com/index.php/17-cochanges/69-derby


All the best,
Igor Wiese

PhD Candidate


-- 
=================================
Igor Scaliante Wiese
PhD Candidate - Computer Science @ IME/USP
Faculty in Dept. of Computing at Universidade Tecnológica Federal do Paraná

Re: Information from Derby Community

Posted by Igor Wiese <ig...@gmail.com>.
That is the main aim. As you said, our focus is "...wherever you have
undeclared dependencies among files".

Thanks for the feedback, Rick!
You helped me a lot!

All the best,
Igor Wiese







Re: Information from Derby Community

Posted by Rick Hillegas <ri...@gmail.com>.
This tool could be useful in tracking down methods whose switch 
statements need to be updated when, say, you add a new enum value. In 
general, this tool could be useful wherever you have undeclared 
dependencies among files and components, which the compiler can't track.
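
To make the enum case concrete, here is a small hypothetical Java sketch.
The names are invented for illustration and are not taken from Derby. A
classic switch statement with a default branch still compiles after a new
constant is added, so nothing flags the method that needs updating.

```java
class UndeclaredDependencyDemo {
    // Imagine this enum lives in one file...
    enum LockMode { SHARED, EXCLUSIVE, UPDATE /* newly added constant */ }

    // ...and this method lives in another. It was written before UPDATE existed.
    static String describe(LockMode mode) {
        switch (mode) {
            case SHARED:
                return "shared lock";
            case EXCLUSIVE:
                return "exclusive lock";
            default:
                // UPDATE silently lands here; the compiler reports no error.
                return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(LockMode.UPDATE)); // prints "unknown"
    }
}
```

A history of past commits in which the enum file and the switch file
changed together is the kind of signal that could surface this dependency.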

Thanks,
-Rick



Re: Information from Derby Community

Posted by Igor Wiese <ig...@gmail.com>.
That is the idea, Bryan.

Let's suppose that you are reviewing a certain issue, or have started to
work on an issue and changed a file. Our approach would recommend other
files likely to change together with it in this task, because those files
changed in past issues with the same "context" (context here means
issues with the same reporter, committer, similar number of lines of code
added, removed, modified, size of discussion, etc.).
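
As a rough illustration of the recommendation step only, the sketch below
counts how often files changed together in past commits and suggests the
most frequent partners of a given file. This is plain co-occurrence
counting, not the machine learning model described in this thread, and it
ignores the contextual features (reporter, committer, lines changed,
discussion size). All names and the sample history are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CoChangeRecommender {
    // file -> (partner file -> number of commits in which both changed)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    // Record one past commit, given the distinct files it touched.
    void addCommit(List<String> files) {
        for (String a : files) {
            for (String b : files) {
                if (!a.equals(b)) {
                    counts.computeIfAbsent(a, k -> new HashMap<>()).merge(b, 1, Integer::sum);
                }
            }
        }
    }

    // Suggest up to `limit` files that most often changed together with `file`.
    List<String> recommend(String file, int limit) {
        Map<String, Integer> partners = counts.getOrDefault(file, Map.of());
        List<String> result = new ArrayList<>(partners.keySet());
        result.sort((x, y) -> {
            int byCount = Integer.compare(partners.get(y), partners.get(x)); // higher count first
            return byCount != 0 ? byCount : x.compareTo(y);                  // then alphabetical
        });
        return result.subList(0, Math.min(limit, result.size()));
    }

    public static void main(String[] args) {
        CoChangeRecommender r = new CoChangeRecommender();
        // Hypothetical history, not real Derby commits.
        r.addCommit(List.of("A.java", "B.java"));
        r.addCommit(List.of("A.java", "B.java", "C.java"));
        r.addCommit(List.of("A.java"));
        System.out.println(r.recommend("A.java", 1)); // prints [B.java]
    }
}
```

A context-aware version would weight or filter the past commits by how
similar their issue context is to the current task before counting.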

Now, the idea is to provide a web service "as an oracle" that developers
from Apache could visit to obtain this information, but we are still
thinking about the best way to design and implement the tool.

Many thanks for your comment :)






Re: Information from Derby Community

Posted by Bryan Pendleton <bp...@gmail.com>.
> As a developer the normal "way" to find files to change together to
> complete an issue is based on our own experience, debugging or through
> the documentation, right?

Yes, I agree that is the normal way.

Also through code review, running tests, and messages from the compiler.

Is your idea that, given a database of change history as you have
described it, some tool would be able to notice when the developer
makes a certain type of change, and then suggest other related
changes that are typically made at the same time?

I think that's a pretty interesting idea.

bryan


Re: Information from Derby Community

Posted by Igor Wiese <ig...@gmail.com>.
Hi Rick. I received it this time :-)

For sure, your explanation was very useful :-).

But what about the general case
(http://flosscoach.com/index.php/17-cochanges/69-derby)? Suppose that
any two files changed together in some issues and did not change
together in others. We can predict 86% of these changes (in the Derby
project) using metrics from issues, lines of code and developer
communication.

As a developer the normal "way" to find files to change together to
complete an issue is based on our own experience, debugging or through
the documentation, right?

What do you think? Considering these aspects, are you surprised?

Do you think that our approach could be useful for developers
(newcomers, committers, code reviewers, testers)?

All the best,
Many thanks!
Igor Wiese





Re: Information from Derby Community

Posted by Rick Hillegas <ri...@gmail.com>.
Hi Igor,

I sent the following response to your first request for feedback. I 
don't know why you didn't receive my response. Here it is again...

---------

These results do not surprise me. That is because DataDictionaryImpl and 
EmptyDictionary are both implementations of the DataDictionary interface. 
This is what happens during development:

1) Someone wants to add a language feature which requires new metadata 
capabilities.

2) The new capabilities are added to the real catalog implementation, 
which is DataDictionaryImpl.

3) In order to use the new capabilities, they must be exposed to other 
Derby components by having corresponding methods added to the 
DataDictionary interface.

4) That, in turn, forces the developer to add a vacuous stub method to 
EmptyDictionary.
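
The four steps above can be sketched in a few lines of Java. This is a
heavily simplified, hypothetical stand-in: the real DataDictionary
interface is far larger, the method name getSequenceCount is invented for
illustration, and the "Sketch" suffix marks these types as not being the
real Derby classes.

```java
class InterfaceCoChangeDemo {
    public static void main(String[] args) {
        DataDictionarySketch real = new DataDictionaryImplSketch();
        DataDictionarySketch empty = new EmptyDictionarySketch();
        System.out.println(real.getSequenceCount() + " " + empty.getSequenceCount());
    }
}

// Simplified stand-in for the DataDictionary interface.
interface DataDictionarySketch {
    int getTableCount();
    // Step 3: the new capability is exposed through the interface.
    int getSequenceCount();
}

// Stand-in for the real catalog implementation (DataDictionaryImpl).
class DataDictionaryImplSketch implements DataDictionarySketch {
    public int getTableCount() { return 42; }    // placeholder value
    // Step 2: the new capability is implemented here.
    public int getSequenceCount() { return 3; }  // placeholder value
}

// Stand-in for EmptyDictionary.
class EmptyDictionarySketch implements DataDictionarySketch {
    public int getTableCount() { return 0; }
    // Step 4: without this vacuous stub the class no longer compiles,
    // so both implementations end up changing in the same commit.
    public int getSequenceCount() { return 0; }
}
```

Here the compiler does enforce the dependency, which is why the co-change
of these two files is so regular.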

I don't know if anyone uses the EmptyDictionary. At this point, it may 
be nothing more than a tax which has to be paid every time someone 
touches the data dictionary. EmptyDictionary is part of the storeless 
implementation of Derby which was apparently introduced in order to let 
people use the Derby parser to validate SQL syntax without actually 
running queries. That, at least, is the motivation described by 
http://mail-archives.apache.org/mod_mbox/db-derby-user/200612.mbox/%3C45704D0E.9030102@apache.org%3E 
and https://issues.apache.org/jira/browse/DERBY-2164. There are other 
solutions to that problem which have received more uptake in the 
community. See, for instance, 
https://issues.apache.org/jira/browse/DERBY-3946

Hope this explanation is useful,
-Rick
