Posted to dev@cloudstack.apache.org by Igor Wiese <ig...@gmail.com> on 2015/12/10 00:31:26 UTC

Feedback of my Phd work in Cloudstack Project

Hi, Cloudstack Community.

My name is Igor Wiese, and I am a PhD student from Brazil. In my research,
I am investigating two important questions: What makes two files change
together? Can we predict when they are going to co-change again?

I've tried to investigate these questions on the CloudStack project. I
collected data from issue reports, discussions and commits, and used some
machine learning techniques to build a prediction model.

I collected a total of 141 commits in which a pair of files changed
together and could correctly predict 60% of the commits. This was the most
useful information for predicting co-changes of files:

- sum of number of lines of code added, modified and removed,

- number of words used to describe and discuss the issues,

- number of comments in each issue,

- median value of closeness, a social network measure obtained from issue
comments, and

- median value of constraint, a social network measure obtained from issue
comments.
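
For readers unfamiliar with the social network measures, here is a minimal sketch of how a median closeness value could be derived from issue comments. It assumes the textbook definition of closeness (reachable nodes divided by the sum of shortest-path distances) on an undirected "who replied to whom" graph; the names and the reply list are invented, and the exact definition used in the study may differ.

```python
from collections import deque
from statistics import median


def build_graph(replies):
    """Build an undirected graph (dict of sets) from (commenter, replied_to) pairs."""
    graph = {}
    for a, b in replies:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def closeness(graph, node):
    """Closeness of node: reachable nodes / sum of shortest-path distances to them."""
    distances = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search gives shortest paths in an unweighted graph
        current = queue.popleft()
        for neighbour in graph[current]:
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    total = sum(distances.values())
    return (len(distances) - 1) / total if total else 0.0


# Hypothetical comment thread: bob talks to everyone, the others only to bob.
replies = [("alice", "bob"), ("bob", "carol"), ("bob", "dave")]
graph = build_graph(replies)
scores = {person: closeness(graph, person) for person in graph}
print(scores["bob"], median(scores.values()))
```

In this toy thread the central participant gets the highest closeness, and the median summarises the whole discussion in a single number that can be fed to a model as one feature.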

To illustrate, consider the following example from our analysis. For
release 4.4, the files "cloud/hypervisor/XenServerGuru.java" and
"cloud/hypervisor/guru/VMwareGuru.java" changed together in 3 commits. In
another 2 commits, only the first file changed, but not the second.
By collecting contextual information for each commit made to the first file
in the previous release (4.3), we were able to predict all 3 commits in
which both files changed together in release 4.4, with no false
positives. For this pair of files, the most important contextual
information was the number of lines of code added, removed and modified in
each commit, the number of comments in each issue, and social network
measures (closeness, density, constraint, hierarchy) obtained from issue
comments.
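
The counts in this example can be restated as precision and recall. The sketch below is only arithmetic on the numbers given above (3 co-change commits, all predicted, 0 false positives); the function name is made up for illustration and is not part of the study's tooling.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall); an undefined ratio is reported as 0.0."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# 3 co-changes predicted correctly, none missed, no false alarms on the
# 2 commits where only the first file changed.
precision, recall = precision_recall(true_positives=3, false_positives=0,
                                     false_negatives=0)
print(precision, recall)
```

For this one pair of files both values are 1.0, which is much better than the 60% reported over all pairs, so the example should be read as a best case rather than a typical one.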

- Do these results surprise you? Can you think of any explanation for the
results?

- Do you think that our rate of prediction is good enough to be used for
building tool support for the software community?

- Do you have any suggestions on what can be done to improve the change
recommendation?

You can visit our webpage to inspect the results in detail:
http://flosscoach.com/index.php/17-cochanges/67-cloudstack

All the best,
Igor Wiese
PhD Candidate

Re: Feedback of my Phd work in Cloudstack Project

Posted by Igor Wiese <ig...@gmail.com>.
Hi Vadim!

In fact, we are recommending files to change together without the
developer/newcomer needing to know about the code (structural dependencies,
for example) or needing to debug to find which files could change
together in a task.

We found many situations in which files are changed together but there
isn't any "natural" reason for that. For example, they aren't structurally
connected or in the same package. In such cases, it is not trivial to
"find" this coupling. Thus we can recommend at least some files to be
inspected by developers while they are performing changes.

The main idea is to "avoid" incomplete changes that could cause a new bug
to appear, and to avoid wasting time inspecting files or debugging the
system to find the files to change for an issue.

What do you think?

All the best,
Igor Wiese

2015-12-10 10:55 GMT-02:00 Vadim Kimlaychuk <va...@kickcloud.net>:

> Do I understand correctly that purpose of this work is to find tightly
> coupled classes automatically in order to inverse dependency later on?
>
> Vadim.
>
>



-- 
=================================
Igor Scaliante Wiese
PhD Candidate - Computer Science @ IME/USP
Faculty in Dept. of Computing at Universidade Tecnológica Federal do Paraná

Re: Feedback of my Phd work in Cloudstack Project

Posted by Vadim Kimlaychuk <va...@kickcloud.net>.
Do I understand correctly that the purpose of this work is to find tightly
coupled classes automatically in order to invert the dependencies later on?

Vadim.


Re: Feedback of my Phd work in Cloudstack Project

Posted by Igor Wiese <ig...@gmail.com>.
Hi Anshul. Thanks for your answer.

First of all, sorry about the webpage. I checked and now it is working:
http://flosscoach.com/index.php/17-cochanges/67-cloudstack. Let me know if
you are still having problems accessing the webpage.

About your questions:

1) What do you mean by "correctly predict 60% commits"?
-  Let's suppose that you changed cloud/hypervisor/XenServerGuru.java for
issue 10000. After committing this file, which other files might you need to
change to complete the work? We can collect data from the previous
issues/commits in which XenServerGuru.java changed in the previous release
and recommend which other files are most prone to change together with it
in the new issue that you are working on. In 60% of the commits to which we
applied our approach, we could correctly predict (recommend) files that
change together with cloud/hypervisor/XenServerGuru.java.

On the webpage you can check all "combinations" (pairs of files) that we
tested for the CloudStack project based on releases 4.1, 4.2, 4.3 and 4.4.
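
To make the recommendation idea concrete, here is a deliberately simple baseline sketch. It is NOT the prediction model described in this thread (which uses issue, communication and commit metrics); it only counts how often files changed together in past commits and recommends partners above a confidence threshold. The function names, the threshold and the toy history are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cochange_counts(commits):
    """Count commits per file and per unordered pair of files."""
    pair_counts = Counter()
    file_counts = Counter()
    for files in commits:
        unique = sorted(set(files))  # ignore duplicates, fix the pair order
        file_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return pair_counts, file_counts


def recommend(changed_file, commits, min_confidence=0.5):
    """Return (file, confidence) pairs; confidence is the share of past
    commits to changed_file that also touched the other file."""
    pair_counts, file_counts = cochange_counts(commits)
    total = file_counts[changed_file]
    if total == 0:
        return []  # no history, nothing to recommend
    result = []
    for (a, b), count in pair_counts.items():
        if changed_file in (a, b):
            other = b if a == changed_file else a
            confidence = count / total
            if confidence >= min_confidence:
                result.append((other, confidence))
    return sorted(result, key=lambda item: (-item[1], item[0]))


# Invented history shaped like the example: 3 joint commits, 2 solo commits.
history = [
    ["XenServerGuru.java", "VMwareGuru.java"],
    ["XenServerGuru.java", "VMwareGuru.java"],
    ["XenServerGuru.java", "VMwareGuru.java"],
    ["XenServerGuru.java"],
    ["XenServerGuru.java", "README.txt"],
]
print(recommend("XenServerGuru.java", history))
```

A counting baseline like this recommends the same partner for every commit to a file. The contextual features are what would let a model tell the 3 joint commits apart from the 2 solo ones.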

2) What are the feature measures you are giving as input here to system
(prediction model)?
In total we used 21 metrics, drawn from issues, communication/experience
and commits. Each pair of files that we tested used a different combination
of measures. From issues, for example, we used the names of the assignee and
reporter, and the size of the description plus discussion. From the
communication we got the number of comments, whether older committers from
the same size also made comments, social network measures from issues/pull
requests, etc. From commits, the number of lines added, modified and
removed.

3) What kind of output you are expecting?
Let's suppose two real scenarios. You are a newcomer and you have difficulty
completing your issues because you haven't read much code or don't know much
about the architecture. In such cases newcomers could use our approach (we
are building a tool) to receive recommendations while performing the task.

On the other hand, let's suppose that you are a core member and you are
reviewing a pull request. We could give you a list of files to check, so
you can see whether all of them are in the commit.

All the best,
Igor Wiese


2015-12-10 7:11 GMT-02:00 Anshul Gangwar <an...@citrix.com>:

> Before giving feedback I have some questions
>
> 1) What do you mean by "correctly predict 60% commits"?
> 2) What are the feature measures you are giving as input here to system
> (prediction model)?
> 3) What kind of output you are expecting?
>
> Web page link you have provided is not working.
>


-- 
=================================
Igor Scaliante Wiese
PhD Candidate - Computer Science @ IME/USP
Faculty in Dept. of Computing at Universidade Tecnológica Federal do Paraná

Re: Feedback of my Phd work in Cloudstack Project

Posted by Anshul Gangwar <an...@citrix.com>.
Before giving feedback I have some questions

1) What do you mean by "correctly predict 60% commits"?
2) What are the feature measures you are giving as input here to system (prediction model)?
3) What kind of output you are expecting?

Web page link you have provided is not working.



Re: Feedback of my Phd work in Cloudstack Project

Posted by Patrick Dube <pa...@gmail.com>.
The history around a new file isn't that of the file itself, but that of
the directory/package it would be in.

Cheers,

On Thu, Dec 10, 2015 at 3:01 PM, Igor Wiese <ig...@gmail.com> wrote:

> Hi Patrick
>
> The problem with new files is the absence of history to build the
> prediction models. I need at least some commits (10 commits for example).
> Yes, the link between files is what we are predicting. We can predict
> changes involving commands.properties, XML files in general, .txt files, or
> any source code extension :-)
>
> Thanks for the feedback.
>
>

Re: Feedback of my Phd work in Cloudstack Project

Posted by Igor Wiese <ig...@gmail.com>.
Hi Patrick

The problem with new files is the absence of history to build the
prediction models. I need at least some commits (10 commits for example).
Yes, the link between files is what we are predicting. We can predict
changes involving commands.properties, XML files in general, .txt files, or
any source code extension :-)

Thanks for the feedback.


2015-12-10 17:40 GMT-02:00 Patrick Dube <pa...@gmail.com>:

> Are you handling new files as well, or the links between sets of files (or
> packages)? As an example, if a user creates a new API cmd, then he will
> update the "commands.properties" file. Another example, if a VO file is
> updated, then there will be a db migration file added as well.
> Cool work,
>



-- 
=================================
Igor Scaliante Wiese
PhD Candidate - Computer Science @ IME/USP
Faculty in Dept. of Computing at Universidade Tecnológica Federal do Paraná

Re: Feedback of my Phd work in Cloudstack Project

Posted by Patrick Dube <pa...@gmail.com>.
Are you handling new files as well, or the links between sets of files (or
packages)? As an example, if a user creates a new API cmd, then he will
update the "commands.properties" file. Another example, if a VO file is
updated, then there will be a db migration file added as well.
Cool work,

On Thu, Dec 10, 2015 at 9:21 AM, Igor Wiese <ig...@gmail.com> wrote:

> Hi Sebastien.
>
> We used only 141 commits because we needed data from the issues. As my
> assumption is related to the contextual information from Issues and Social
> aspects, we need to aggregate commits and Issues.
>
> First, I collected the issues from JIRA and then i tryed to aggregate the
> commits that explicit made mentions to an issue collected. I only also used
> closed issues to obtain the confidence that the code used to build my
> models have been merged and checked by the community.
>
> That is the weak point of my approach. I need the past data from the
> issues. Sometimes it is not available for past time.
> It is in my plan to use also data from github to make the dataset more
> complete.
>
> All the best,
>
> 2015-12-10 11:22 GMT-02:00 sebgoa <ru...@gmail.com>:
>
> >
> > On Dec 10, 2015, at 12:31 AM, Igor Wiese <ig...@gmail.com> wrote:
> >
> > > Hi, Cloudstack Community.
> > >
> > > My name is Igor Wiese, phd Student from Brazil. In my research, I am
> > > investigating two important questions: What makes two files change
> > > together? Can we predict when they are going to co-change again?
> > >
> > > I've tried to investigate this question on the Cloudstack project. I've
> > > collected data from issue reports, discussions and commits and using
> some
> > > machine learning techniques to build a prediction model.
> > >
> > > I collected a total of 141 commits in which a pair of files changed
> > > together and could correctly predict 60% commits.
> >
> >
> > Hi Igor, why 141 commits ? Is that the only commits you found with only a
> > pair for changes ?
> >
> > My gut feeling is that you could check the entire history of the
> > CloudStack repo (~5 years worth of data) and work on different type of
> > tuples.
> >
> > 141 commits seems like a really small dataset.
> >
> > -Sebastien
> >
>
>
> --
> =================================
> Igor Scaliante Wiese
> PhD Candidate - Computer Science @ IME/USP
> Faculty in Dept. of Computing at Universidade Tecnológica Federal do Paraná
>

Re: Feedback of my Phd work in Cloudstack Project

Posted by Igor Wiese <ig...@gmail.com>.
Hi Sebastien.

We used only 141 commits because we needed data from the issues. Since my
hypothesis concerns contextual information from issues and social
aspects, we need to link commits to issues.

First, I collected the issues from JIRA, and then I tried to link the
commits that explicitly mention one of the collected issues. I also used
only closed issues, to be confident that the code used to build my
models had been merged and checked by the community.
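
A minimal sketch of the linking step described above, not the actual
research pipeline. It assumes commit messages cite issues with the
CLOUDSTACK-<number> key form and that issue statuses are already available
as a dictionary; the function name and the input shapes are hypothetical.

```python
import re

# Assumption: commit messages reference JIRA issues as CLOUDSTACK-<number>.
ISSUE_KEY = re.compile(r"\bCLOUDSTACK-(\d+)\b", re.IGNORECASE)


def link_commits_to_closed_issues(commits, issue_status):
    """Map each commit SHA to the closed issues its message mentions.

    commits: iterable of (sha, message) tuples (hypothetical input shape).
    issue_status: dict mapping an issue key to its status string.
    Commits that mention no closed issue are left out of the result.
    """
    linked = {}
    for sha, message in commits:
        # Normalise every mention to the upper-case key form.
        keys = {f"CLOUDSTACK-{num}" for num in ISSUE_KEY.findall(message)}
        closed = sorted(
            key for key in keys
            if issue_status.get(key, "").lower() == "closed"
        )
        if closed:
            linked[sha] = closed
    return linked
```

Commits without an explicit issue mention drop out at this step, which is
one reason the linked dataset is much smaller than the full commit history.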

That is the weak point of my approach: I need historical data from the
issues, and sometimes it is not available for older periods.
I also plan to use data from GitHub to make the dataset more complete.

All the best,

2015-12-10 11:22 GMT-02:00 sebgoa <ru...@gmail.com>:

>
> On Dec 10, 2015, at 12:31 AM, Igor Wiese <ig...@gmail.com> wrote:
>
> > Hi, Cloudstack Community.
> >
> > My name is Igor Wiese, phd Student from Brazil. In my research, I am
> > investigating two important questions: What makes two files change
> > together? Can we predict when they are going to co-change again?
> >
> > I've tried to investigate this question on the Cloudstack project. I've
> > collected data from issue reports, discussions and commits and using some
> > machine learning techniques to build a prediction model.
> >
> > I collected a total of 141 commits in which a pair of files changed
> > together and could correctly predict 60% commits.
>
>
> Hi Igor, why 141 commits ? Is that the only commits you found with only a
> pair for changes ?
>
> My gut feeling is that you could check the entire history of the
> CloudStack repo (~5 years worth of data) and work on different type of
> tuples.
>
> 141 commits seems like a really small dataset.
>
> -Sebastien
>
> > These were the most
> > useful information for predicting co-changes of files:
> >
> > - sum of number of lines of code added, modified and removed,
> >
> > - number of words used to describe and discuss the issues,
> >
> > - number of comments in each issue,
> >
> > - median value of closeness, a social network measure obtained from issue
> > comments, and
> >
> > - median value of constraint, a social network measure obtained from
> issue
> > comments.
> >
> > To illustrate, consider the following example from our analysis. For
> > release 4.4, the files "cloud/hypervisor/XenServerGuru.java" and
> > "cloud/hypervisor/guru/VMwareGuru.java " changed together in 3 commits.
> In
> > another 2 commits, only the first file changed, but not the second.
> > Collecting contextual information for each commit made to first file in
> the
> > previous release (4.3), we were able to predict all 3 commits in which
> both
> > files changed together in release 4.4, and we only issued 0 false
> > positives. For this pair of files, the most important contextual
> > information was the number of lines of code added, removed and modified
> in
> > each commit,the number of comments in each issue, and social network
> > measures (closeness, density, constraint, hierarchy) obtained from issue
> > comments.
> >
> > - Do these results surprise you? Can you think in any explanation for the
> > results?
> >
> > - Do you think that our rate of prediction is good enough to be used for
> > building tool support for the software community?
> >
> > - Do you have any suggestion on what can be done to improve the change
> > recommendation?
> >
> > You can visit our webpage to inspect the results in details:
> > http://flosscoach.com/index.php/17-cochanges/67-cloudstack
> >
> > All the best,
> > Igor Wiese
> > Phd Candidate
>
>


-- 
=================================
Igor Scaliante Wiese
PhD Candidate - Computer Science @ IME/USP
Faculty in Dept. of Computing at Universidade Tecnológica Federal do Paraná

Re: Feedback of my Phd work in Cloudstack Project

Posted by sebgoa <ru...@gmail.com>.
On Dec 10, 2015, at 12:31 AM, Igor Wiese <ig...@gmail.com> wrote:

> Hi, Cloudstack Community.
> 
> My name is Igor Wiese, phd Student from Brazil. In my research, I am
> investigating two important questions: What makes two files change
> together? Can we predict when they are going to co-change again?
> 
> I've tried to investigate this question on the Cloudstack project. I've
> collected data from issue reports, discussions and commits and using some
> machine learning techniques to build a prediction model.
> 
> I collected a total of 141 commits in which a pair of files changed
> together and could correctly predict 60% commits.


Hi Igor, why 141 commits? Are those the only commits you found in which just a pair of files changed?

My gut feeling is that you could check the entire history of the CloudStack repo (~5 years' worth of data) and work on different types of tuples.

141 commits seems like a really small dataset.
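
As a rough illustration of mining the whole history for co-changing pairs, here is a minimal sketch. It assumes the list of files touched by each commit has already been extracted (for example from `git log --name-only`); the function name and the `max_files` cutoff are hypothetical choices, not part of Igor's method.

```python
from collections import Counter
from itertools import combinations


def count_cochanges(commit_files, max_files=50):
    """Count how often each pair of files changes in the same commit.

    commit_files: iterable of lists of file paths, one list per commit.
    max_files: hypothetical cutoff that skips very large commits
               (bulk refactorings, merges) which would add noisy pairs.
    """
    pair_counts = Counter()
    for files in commit_files:
        unique = sorted(set(files))
        if len(unique) < 2 or len(unique) > max_files:
            continue
        # Each unordered pair is counted once per commit.
        pair_counts.update(combinations(unique, 2))
    return pair_counts
```

Calling `count_cochanges(history).most_common(20)` would list the most frequently co-changing pairs, and swapping `combinations(unique, 2)` for `combinations(unique, 3)` would give triples.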

-Sebastien

> These were the most
> useful information for predicting co-changes of files:
> 
> - sum of number of lines of code added, modified and removed,
> 
> - number of words used to describe and discuss the issues,
> 
> - number of comments in each issue,
> 
> - median value of closeness, a social network measure obtained from issue
> comments, and
> 
> - median value of constraint, a social network measure obtained from issue
> comments.
> 
> To illustrate, consider the following example from our analysis. For
> release 4.4, the files "cloud/hypervisor/XenServerGuru.java" and
> "cloud/hypervisor/guru/VMwareGuru.java " changed together in 3 commits. In
> another 2 commits, only the first file changed, but not the second.
> Collecting contextual information for each commit made to first file in the
> previous release (4.3), we were able to predict all 3 commits in which both
> files changed together in release 4.4, and we only issued 0 false
> positives. For this pair of files, the most important contextual
> information was the number of lines of code added, removed and modified in
> each commit,the number of comments in each issue, and social network
> measures (closeness, density, constraint, hierarchy) obtained from issue
> comments.
> 
> - Do these results surprise you? Can you think in any explanation for the
> results?
> 
> - Do you think that our rate of prediction is good enough to be used for
> building tool support for the software community?
> 
> - Do you have any suggestion on what can be done to improve the change
> recommendation?
> 
> You can visit our webpage to inspect the results in details:
> http://flosscoach.com/index.php/17-cochanges/67-cloudstack
> 
> All the best,
> Igor Wiese
> Phd Candidate