Posted to dev@cxf.apache.org by igorwiese <ig...@gmail.com> on 2015/12/10 00:29:03 UTC

Feedback of my Phd work in CXF project

Hi, CXF Community. 

My name is Igor Wiese, and I am a PhD student from Brazil. I am investigating
two important questions: What makes two files change together? Can we predict
when they are going to co-change again?

I have investigated these questions on the CXF project. I collected data
from issue reports, discussions, and commits, and used machine learning
techniques to build a prediction model.

I collected a total of 6384 commits in which a pair of files changed
together, and could correctly predict 86% of them. The most useful
information for predicting co-changes of files was:
- number of lines of code added, 
- number of lines of code removed, 
- sum of number of lines of code added, modified and removed, 
- number of words used to describe and discuss the issues, and 
- number of comments in each issue.
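
As a rough illustration of how the measures listed above could be gathered
for one commit, here is a minimal Python sketch. The CommitContext class, its
field names, and the choice to treat the first issue text as the description
and the rest as comments are all assumptions made for this example; they are
not taken from the actual model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CommitContext:
    """Hypothetical record of the contextual information for one commit."""
    lines_added: int
    lines_removed: int
    lines_modified: int
    # Assumption: first entry is the issue description, the rest are comments.
    issue_texts: List[str]

    def features(self) -> Dict[str, int]:
        """Return the five measures named in the list above."""
        return {
            "lines_added": self.lines_added,
            "lines_removed": self.lines_removed,
            # Sum of lines added, modified and removed.
            "lines_total": self.lines_added + self.lines_modified + self.lines_removed,
            # Words used to describe and discuss the issue.
            "issue_words": sum(len(text.split()) for text in self.issue_texts),
            # Comments are every text after the description.
            "issue_comments": max(len(self.issue_texts) - 1, 0),
        }


if __name__ == "__main__":
    ctx = CommitContext(10, 4, 3, ["Fix provider caching", "Looks good to me", "Merged"])
    print(ctx.features())
```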

To illustrate, consider the following example from our analysis. For release
2.7, the files "cxf/jaxrs/provider/AbstractJAXBProvider.java" and
"cxf/jaxrs/provider/JAXBElementProvider.java" changed together in 11
commits. In another 11 commits, only the first file changed, but not the
second. Collecting contextual information for each commit made to the first
file in release 2.6, we were able to predict 9 commits in which both files
changed together in release 2.7, with only one false positive and one wrong
prediction. For this pair of files, the most important contextual
information was the number of lines of code added in each commit, the number
of lines of code removed in each commit, the sum of lines of code removed,
added, and modified in each commit, and the number of words used to describe
and discuss the issues.

- Do these results surprise you? Can you think of any explanation for the
results?
- Do you think that our rate of prediction is good enough to be used for
building tool support for the software community?
- Do you have any suggestions on what can be done to improve the change
recommendation?

You can visit this webpage to inspect the results in detail:
http://flosscoach.com/index.php/17-cochanges/68-cxf

All the best, 
Igor Wiese
PhD Candidate



--
View this message in context: http://cxf.547215.n5.nabble.com/Feedback-of-my-Phd-work-in-CXF-project-tp5763765.html
Sent from the cxf-dev mailing list archive at Nabble.com.

Re: Feedback of my Phd work in CXF project

Posted by igorwiese <ig...@gmail.com>.
Thanks Sergey!




Re: Feedback of my Phd work in CXF project

Posted by Sergey Beryozkin <sb...@gmail.com>.
Well, long-term contributors usually know how the files are
connected, but I agree with Christian that it might help newcomers
navigate a project, etc.

Cheers, Sergey


-- 
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/

Re: Feedback of my Phd work in CXF project

Posted by igorwiese <ig...@gmail.com>.
Thanks Sergey. This could be a good "next step". I will think about it :-)
And how about recommendations while you are performing changes? Do you
think that, based on the accuracy we reported, it would be good to use such
a tool to help you in your tasks?

Is it difficult for you to find the files that change together in a task?

All the best,
Igor Wiese




Re: Feedback of my Phd work in CXF project

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi

What would also be interesting is if your tool could say: this is a list of
the modules which are most actively developed in a given project (based on
the number/frequency of changes to a given module), and as a next step give
some hints on what that means with respect to a given technology.

So if your tool can say:

1) this is a list of modules being actively developed, and 2) this
probably means these technologies are 'hot' etc, using the affected
modules' POM descriptions and some RDF links as a source for such
conclusions :-)
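
The first half of that idea (ranking modules by how often they change) could
be sketched roughly as follows in Python. The module_of rule (first path
segment), the function names, and the toy file paths are assumptions made for
illustration only; a real project layout would need its own mapping from
files to modules.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def module_of(path: str) -> str:
    """Assumption: the module is the first path segment."""
    return path.split("/", 1)[0]


def most_active_modules(commits: Iterable[Set[str]], top: int = 3) -> List[Tuple[str, int]]:
    """Rank modules by the number of commits that touched them."""
    counts: Counter = Counter()
    for files in commits:
        # Count each module once per commit, however many of its files changed.
        for module in {module_of(f) for f in files}:
            counts[module] += 1
    return counts.most_common(top)


if __name__ == "__main__":
    # Hypothetical commit history: each commit is the set of files it changed.
    history = [
        {"jms/A.java", "jms/B.java"},
        {"jms/A.java", "jaxrs/P.java"},
        {"jms/C.java"},
        {"jaxrs/P.java"},
        {"core/X.java"},
    ]
    print(most_active_modules(history, top=2))
```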

Cheers, Sergey


-- 
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/

Re: Feedback of my Phd work in CXF project

Posted by igorwiese <ig...@gmail.com>.
Hi Christian. Thank you again. You gave me good insights and ideas.

In fact, another researcher is checking whether there is a relation between
direct and "not direct" connections between classes, and comparing them with
code changes. We found that structural dependencies are not "good
predictors" of code changes, but we are still investigating this.

In my case, I did not consider structural dependencies because I use the
contextual information collected from "each commit". So, in this case,
this measure would probably be removed from my model because it would not
add explanatory power for predicting the changes. However, I plan to
consider this information as a "prior" indicator to boost my predictions.

Thanks again for the rich discussion.




Re: Feedback of my Phd work in CXF project

Posted by Christian Schneider <ch...@die-schneider.net>.
Thanks for the explanations.

I think the idea of giving people some hint that classes are connected 
(according to your rules) makes a lot of sense.
Like you said, this can help newcomers navigate around the code. For
this purpose an IDE plugin makes sense.

I am not sure about the review part. As a reviewer you will always see
the changeset, and you know you have to review all changes. Your tool could
report a file that was expected to change as well but did not. As this
would probably mean that a change was forgotten, a test should catch that.
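
The "expected to change but did not" report could look roughly like the
following Python sketch. The function name and the expected mapping (file ->
files predicted to change with it) are hypothetical; how such predictions
are produced is outside the scope of this example.

```python
from typing import Dict, Set


def missing_cochanges(changeset: Set[str], expected: Dict[str, Set[str]]) -> Set[str]:
    """Files predicted to change with something in the changeset but absent from it."""
    predicted: Set[str] = set()
    for path in changeset:
        predicted |= expected.get(path, set())
    return predicted - changeset


if __name__ == "__main__":
    # Hypothetical prediction: JMSConduit usually changes with JMSOldConfigHolder.
    expected = {"JMSConduit.java": {"JMSOldConfigHolder.java"}}
    print(missing_cochanges({"JMSConduit.java"}, expected))
```

An empty result means every predicted companion file is already part of the
changeset.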

Btw. JMSConduit and JMSOldConfigHolder are connected through 
JMSConfiguration. So there is no direct connection but they are also not 
far from each other structurally.
I am not sure if your system takes the connections into account but it 
would surely make sense to do so. For example if you find out that files 
are often changed together you could graphically show how they are 
connected structurally. This together with the statistical predictions 
could be some real help.
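
As a rough illustration of finding how two files are connected structurally,
here is a minimal breadth-first search in Python. The dependency graph below
is hand-written and only encodes the one connection mentioned above
(JMSConduit and JMSOldConfigHolder both linked to JMSConfiguration); the
function name and the undirected representation are assumptions.

```python
from collections import deque
from typing import Dict, List, Optional, Set


def structural_path(graph: Dict[str, Set[str]], start: str, goal: str) -> Optional[List[str]]:
    """Shortest chain of structural connections, found by breadth-first search."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        # Sort neighbours so the result is deterministic.
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None  # no structural connection found


if __name__ == "__main__":
    # Undirected toy graph: each class lists the classes it is connected to.
    graph = {
        "JMSConduit": {"JMSConfiguration"},
        "JMSOldConfigHolder": {"JMSConfiguration"},
        "JMSConfiguration": {"JMSConduit", "JMSOldConfigHolder"},
    }
    path = structural_path(graph, "JMSConduit", "JMSOldConfigHolder")
    print(" -> ".join(path))
    # prints: JMSConduit -> JMSConfiguration -> JMSOldConfigHolder
```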

Christian




-- 
Christian Schneider
http://www.liquid-reality.de

Open Source Architect
http://www.talend.com


Re: Feedback of my Phd work in CXF project

Posted by igorwiese <ig...@gmail.com>.
Hi Christian. Thanks for the answer.

Your first question is interesting. Usually this is the natural reason why
two files change together. We always expect some kind of structural
connection between classes (e.g. implements, extends, instantiation, etc.).
However, we found many cases (issues) with commits where the files are not
"structurally connected".

For example: JMSConduit.java and JMSOldConfigHolder.java are not
structurally connected, despite being in the same package. We found that in
15 commits they changed together, but in another 18 commits only JMSConduit
changed, without JMSOldConfigHolder.java. If you rely on a "natural" reason,
you can make 18 mistakes, or at least you will waste your time inspecting
JMSOldConfigHolder.java 18 times.
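
To make these counts concrete, here is a minimal Python sketch that tallies
how often a pair of files changed together versus how often only the first
one changed, and derives a naive co-change ratio from them. This is a simple
counting baseline written for illustration, not the prediction model
described in this thread; the function names and the synthetic commit
history (built to reproduce the 15 and 18 quoted above) are assumptions.

```python
from typing import Iterable, Set, Tuple


def cochange_counts(commits: Iterable[Set[str]], first: str, second: str) -> Tuple[int, int]:
    """Return (commits where both changed, commits where only `first` changed)."""
    together = 0
    first_only = 0
    for files in commits:
        if first in files:
            if second in files:
                together += 1
            else:
                first_only += 1
    return together, first_only


def naive_confidence(together: int, first_only: int) -> float:
    """Share of the first file's commits in which the second file also changed."""
    total = together + first_only
    return together / total if total else 0.0


if __name__ == "__main__":
    # Synthetic history reproducing the counts quoted above (15 and 18).
    history = (
        [{"JMSConduit.java", "JMSOldConfigHolder.java"}] * 15
        + [{"JMSConduit.java"}] * 18
    )
    together, first_only = cochange_counts(
        history, "JMSConduit.java", "JMSOldConfigHolder.java"
    )
    print(together, first_only)                              # 15 18
    print(round(naive_confidence(together, first_only), 2))  # 0.45
```

With a ratio of 15/33, always recommending the second file would be wrong
more often than right, which is why a rule based on past co-change alone is
not enough.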

Our assumption is that "this real reason" can be, in fact, "many different
reasons". Because of this, using only structural dependencies may not work
well in all situations, and can mislead developers.

A simple scenario:
- You are working on an issue and have committed the file JMSConduit.java.
What other files might you need to change to complete this issue?
- Based on the past issues/commits in which JMSConduit was changed, we
collect contextual information that describes the situations in which
JMSConduit changed with or without JMSOldConfigHolder.java, and then we can
recommend whether or not you should inspect this file.

We collect data from all possible combinations involving JMSConduit and
other files of the system.
- What we are reporting is that in 86% of the cases in which we tested these
combinations (you can check all of the combinations on the website), we
correctly predicted when both files would change together in a specific
issue/commit.

About the practical aspects (what can be done): a researcher from our
research group interviewed newcomers, and they said that it is difficult to
find the right files to change in their first contributions. In this case,
it is difficult for a newcomer to complete issues/pull requests because they
don't understand much of the code or the architecture. Debugging tasks are
also not trivial in all projects. In such cases newcomers could use our
approach (we are building a tool) to receive recommendations while
performing the task.

On the other hand, let's suppose that you are a core member and you are
reviewing a pull request. We could give you a list of files to check, to see
whether all of them are in the set of commits made to the issue/pull
request. Of course, we are not claiming that you should stop using test
cases or continuous integration. It is another tool to help during code
review tasks.

We are working on a prototype. We don't know yet whether we will build a
"monitor" as a web service that you could integrate into the issue
tracker, or a plugin for some IDE.

So the main idea here is to "avoid" incomplete changes that could cause a
new bug to appear, or to avoid wasting time inspecting files/debugging the
system to find the files to change in an issue.





Re: Feedback of my Phd work in CXF project

Posted by Christian Schneider <ch...@die-schneider.net>.
The criteria you mention sound a bit academic.

Isn't the real reason of the combined change rather that the classes are 
connected to each other?
I also do not yet get what the use cases for the prediction are.

Of course it is interesting that you can predict such changes, but what
can be done with this information?

Christian



-- 
Christian Schneider
http://www.liquid-reality.de

Open Source Architect
http://www.talend.com