Posted to dev@lucene.apache.org by Igor Wiese <ig...@gmail.com> on 2015/12/10 00:48:10 UTC

Feedback of my Phd work in Lucene and Solr project

Hi, Lucene and Solr Community.

My name is Igor Wiese, and I am a PhD student from Brazil. In my research I
am investigating two important questions: What makes two files change
together? Can we predict when they are going to co-change again?

I have investigated these questions on the Lucene and Solr projects. I
collected data from issue reports, discussions and commits, and used
machine learning techniques to build a prediction model.

For the Lucene project, I collected a total of 1382 commits in which a pair
of files changed together, and could correctly predict 66% of them. For
the Solr project, I collected a total of 111 commits in which a pair of
files changed together, and could correctly predict 47% of them.
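
To make the notion of "a pair of files changed together" concrete, here is a minimal sketch of how such commits could be counted. It is an illustration only, not the code used in the study; the function name and the sample commits are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_cochanges(commits):
    """Count, for every unordered pair of files, the commits that touch both."""
    pair_counts = Counter()
    for files in commits:
        # set() drops repeated paths; sorted() gives each pair a canonical order.
        for pair in combinations(sorted(set(files)), 2):
            pair_counts[pair] += 1
    return pair_counts


# Hypothetical commits, each given as the list of files it changed.
commits = [
    ["IndexWriter.java", "StandardDirectoryReader.java"],
    ["IndexWriter.java"],
    ["IndexWriter.java", "StandardDirectoryReader.java", "SegmentInfos.java"],
]

cochanges = count_cochanges(commits)
print(cochanges[("IndexWriter.java", "StandardDirectoryReader.java")])
```

In this toy input the pair IndexWriter.java / StandardDirectoryReader.java co-changes in two of the three commits.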

These were the most useful pieces of information for predicting co-changes of files:

- number of lines of code added,

- number of lines of code removed,

- sum of number of lines of code added, modified and removed,

- number of words used to describe and discuss the issues, and

- median value of closeness, a social network measure obtained from issue
comments.
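
For readers unfamiliar with the last metric, the sketch below shows one common definition of closeness centrality on an undirected, unweighted graph, followed by the median over all participants. How the study actually built the network from issue comments is not stated here, so the graph, the participant names and the helper function are all assumptions made for illustration.

```python
from collections import deque
from statistics import median


def closeness(graph, start):
    """Closeness of `start`: reachable nodes divided by the sum of their distances."""
    dist = {start: 0}
    queue = deque([start])
    while queue:  # breadth-first search gives shortest-path lengths
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


# Hypothetical network: an edge means two people commented on the same issue.
graph = {
    "alice": {"bob"},
    "bob": {"alice", "carol"},
    "carol": {"bob"},
}

scores = {person: closeness(graph, person) for person in graph}
median_closeness = median(scores.values())
print(scores["bob"], round(median_closeness, 3))
```

Here "bob" is directly connected to everyone and gets the highest closeness, while the median over the three participants is lower.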

To illustrate, consider the following example from our analysis of the
Lucene project. For release 4.7, the files "lucene/index/IndexWriter.java"
and "lucene/index/StandardDirectoryReader.java" changed together in 4
commits. In another 11 commits, only the first file changed, but not the
second. By collecting contextual information for each commit made to the
first file in the previous release, we were able to predict 3 of the commits
in which both files changed together in release 4.7, with 0 false positives
and one wrong prediction. For this pair of files, the most important
contextual information was the number of lines of code added in each commit,
the number of words used to describe and discuss the issues, the number of
comments in each issue, and the social network metric (closeness) obtained
from issue comments.
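
The arithmetic behind this example can be sketched as precision and recall. This is only an illustration: it assumes that the "one wrong prediction" is the fourth co-change commit that was missed (a false negative), which the message does not state explicitly, and the function name is hypothetical.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) for a set of co-change predictions."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Release 4.7 example: 4 real co-changes, 3 predicted, 0 false positives.
# Treating the single wrong prediction as a missed co-change is an assumption.
precision, recall = precision_recall(true_pos=3, false_pos=0, false_neg=1)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under that assumption, every recommendation issued for this pair was correct, and three out of four real co-changes were found.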

- Do these results surprise you? Can you think of any explanation for the
results?

- Do you think that our rate of prediction is good enough to be used for
building tool support for the software community?

- Do you have any suggestions on what can be done to improve the change
recommendations?

You can visit these web pages to inspect the results in detail:

Lucene Project: http://flosscoach.com/index.php/17-cochanges/73-lucene
Solr Project: http://flosscoach.com/index.php/17-cochanges/74-solr

All the best,
Igor Wiese
PhD Candidate

Re: Feedback of my Phd work in Lucene and Solr project

Posted by Igor Wiese <ig...@gmail.com>.
Hi Uwe

Thanks for helping me! I will inspect these results also!

All the best,
Igor Wiese

2015-12-10 16:45 GMT-02:00 Uwe Schindler <uw...@thetaphi.de>:

> [...]


-- 
=================================
Igor Scaliante Wiese
PhD Candidate - Computer Science @ IME/USP
Faculty in Dept. of Computing at Universidade Tecnológica Federal do Paraná

RE: Feedback of my Phd work in Lucene and Solr project

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

> We used commits recorded in SVN, not Git. That probably minimized the
> problem, but we got far fewer commits. In fact we analyzed 4 releases
> of Solr (1.1, 1.2, 1.3 and 1.4, the last one) and 10 releases of
> Lucene.

The same applies to SVN, too. I just posted the Git links for easier review.

Solr is (like Lucene) currently at version 5.3.1. Solr 1.4 is more than 5 years old, and that part of SVN is no longer used. As I said before, you were looking at the whole Lucene/Solr repository in your Lucene analysis (for all releases >= 3.0, which was the first merged one; since 3.0, Lucene and Solr have been released from one repository at the same time). You can see this easily in your table:

http://flosscoach.com/index.php/17-cochanges/73-lucene
See the 4th line, which starts with a Solr and a Lucene filename (subdirs lucene and solr). So you definitely looked at Lucene and Solr at the same time.

Because of this, the number of commits in the >5-year-old Solr repository is not comparable to the number of commits in the whole new merged project.
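
As a rough sketch of how such mixed pairs could be flagged in the data, the snippet below classifies each file by its top-level directory. It assumes paths are relative to the repository trunk; the helper names and the sample paths are hypothetical and only meant as an illustration.

```python
def subproject(path):
    """Return 'lucene', 'solr' or 'other' based on the top-level directory."""
    top = path.split("/", 1)[0]
    return top if top in ("lucene", "solr") else "other"


def is_cross_project(file_a, file_b):
    """True when one file lives under lucene/ and the other under solr/."""
    return {subproject(file_a), subproject(file_b)} == {"lucene", "solr"}


# Illustrative pairs; the paths are assumed to be relative to trunk.
pairs = [
    ("lucene/core/src/java/org/apache/lucene/index/IndexWriter.java",
     "lucene/core/src/java/org/apache/lucene/index/StandardDirectoryReader.java"),
    ("lucene/core/src/java/org/apache/lucene/index/IndexWriter.java",
     "solr/core/src/java/org/apache/solr/core/SolrCore.java"),
]

flags = [is_cross_project(a, b) for a, b in pairs]
print(flags)
```

Pairs flagged this way could be reported separately, so that a "Lucene" result does not silently include Solr files.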

> About the heavy commits: in some cases we found more than one commit for
> a single issue, and we also tried to identify heavy commits. However,
> we minimized the number of heavy commits by filtering out commits with
> more than 20 files (to inspect) and removing commits not associated
> with issues in JIRA.
>
> Just to let you know, I will report the situations you pointed out
> in my "threats to validity" discussion.
>
> I really liked your suggestion to inspect changes between
> sub-projects. I will think of a way to evaluate "inter-subproject"
> changes. We analyzed each project as an individual source,
> but it would be very interesting to check this relation when changes
> in Solr cause changes in Lucene and vice versa. It is good future
> work to do.

In your recent Lucene analysis reports (>= 3.0) you show changes from both projects, see above!

> Thanks for your comment and opinion about changes in IndexWriter and
> StandardDirectoryReader.

Thanks, too.

Greetings,
Uwe

> All the best,
> 
> Igor
> 
> 
> 2015-12-10 12:36 GMT-02:00 Uwe Schindler <uw...@thetaphi.de>:
> >
> > Hi,
> >
> >
> >
> > There is one general problem in the analysis:
> >
> >
> >
> > For approx. 5 1/2 years now, Lucene and Solr have been one project and are no longer separate (see https://github.com/apache/solr/blob/trunk/trunk_development_moved.txt; https://github.com/apache/lucene/blob/trunk/trunk_development_moved.txt). You can still look at commits of Solr and Lucene separately, but both are in one repository: https://github.com/apache/lucene-solr:
> >
> > - trunk/lucene (master/lucene on GitHub): contains the code of Lucene
> >
> > - trunk/solr (master/solr on GitHub): contains the code of Solr
> >
> > - trunk/dev-tools: shared development scripts, release management scripts
> >
> > But you have to be aware that many commits also change files in both sub-projects, because the projects are linked together: when you change something like an API in Lucene, you have to commit the adaptation in Solr at the same time. You see this in the first report, which your mail declared to cover only Lucene, but which in fact looked at the whole project. Many source code pairs also contained classes from both sub-projects.
> >
> > The reason why the number of commits in what you think of as “Solr” was low is easy to see: you just looked at Solr version 1.4, which was the last release of Solr alone. The old location in SVN is no longer used! The number of commits is lower there; if you look at both projects at the same time, you will see way more commits (your first statistic).
> >
> >
> >
> > MG>if the pairing was 100% accurate then yes, a predictor for both files changing indicates a design issue is lurking, i.e.
> > MG>IndexWriter and StandardDirectoryReader "share functionality", which would suggest breaking the shared methods out into an interface and
> > MG>refactoring IndexWriter and StandardDirectoryReader to each implement that shared interface
> > MG>if attributes are to be shared then perhaps an abstract class should be created to contain those shared attributes and implement
> > MG>the shared methods
> > MG>refactoring IndexWriter and StandardDirectoryReader to extend the abstract class should force the implementor to override/reuse
> > MG>shared attributes in the abstract base class?
> >
> >
> >
> > Keep in mind that a change in those 2 files may also mean something else: IndexWriter uses StandardDirectoryReader to delegate some tasks (near-real-time reader support; so there is no code duplication, both are just working together very closely). If you change StandardDirectoryReader’s package-private APIs (but also public APIs) and IndexWriter calls them, you have to change the caller, too. This is why you see a change in StandardDirectoryReader together with a change in IndexWriter. But the other way round is different: changes in IndexWriter seldom cause changes in StandardDirectoryReader, so what you see is correct.
> >
> > The same applies to Solr and Lucene: a change in Solr seldom causes a change in Lucene, but API-breaking changes in Lucene always cause a change in Solr. The Lucene 4 series (around 4.7, which you looked at) was very active in redoing APIs; we also changed to Java 7 in 4.8. So it is very likely that you see changes in Lucene that cause changes in Solr. The argument is the same: whenever you break an API, you have to fix the callers, too. E.g. Lucene 4.8 had many code refactoring issues (move from Java 6 to Java 7), cleanups of Javadocs, … which naturally touch many files.
> >
> > But in general, for a Subversion-based project like Lucene/Solr, it is very hard to rectify a design issue the way Martin Gainty did. In Apache Software Foundation projects based on Subversion (not Git), we generally have a single commit after the whole issue is resolved. We don’t record changes during development of those patches. Some of the patches that came out are very large changes, committed after several days or weeks of work. The likelihood that you see changes in many unrelated files in such a commit (we call it a “heavy commit”) is high. This differs from Git-based projects, where you commit much more often.
> >
> > Uwe
> >
> >
> >
> > -----
> >
> > Uwe Schindler
> >
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> >
> > http://www.thetaphi.de
> >
> > eMail: uwe@thetaphi.de
> >
> >
> >
> > From: Igor Wiese [mailto:igor.wiese@gmail.com]
> > Sent: Thursday, December 10, 2015 2:59 PM
> > To: dev@lucene.apache.org
> > Subject: Re: Feedback of my Phd work in Lucene and Solr project
> >
> >
> >
> > Hi MG. Thanks for the Portuguese :-)
> >
> > I really enjoyed your example. I don't know much about the Lucene/Solr architecture, but I completely agree. There is probably a design problem in this case, because the classes seem to be "related". But for some of the pairs of files that we tested, we couldn't make assumptions because it is not clear why the classes changed together. We probably need to manually inspect the set of issues where the files changed together to find the "reason". In some cases, that could be very difficult without enough know-how of the project.
> >
> > The good point is that for a newcomer, for example, it would be hard to find the relation that you mentioned. In such cases we could help :). Do you agree?
> >
> > I really enjoyed the idea of a "Maven plugin". We are creating a tool like a "web service" that could be integrated with the issue tracker, but I really liked your idea. I will think about it. Thanks!
> >
> > We probably couldn't predict with 100% accuracy in all cases :-). On average, as I mentioned, for Lucene we tested more than 1000 commits with 66% accuracy. For Solr the accuracy was low (47%). The reason for this low accuracy in Solr is probably related to the number of commits that we used to construct the prediction models. We used 10x fewer commits for Solr than for Lucene.
> >
> > Considering that in 3 out of every 4 commits we could give you good recommendations to change two files together, is that good? Do you think that could "save" you time in finding the correct files to complete the change?
> >
> > Thanks Again, MG
> >
> > All the best,
> >
> > Igor Wiese
> >
> >
> >
> >
> >
> > 2015-12-10 11:21 GMT-02:00 Martin Gainty <mg...@hotmail.com>:
> >
> >
> >
> >
> > ________________________________
> >
> > From: igor.wiese@gmail.com
> > Date: Wed, 9 Dec 2015 23:48:10 +0000
> > Subject: Feedback of my Phd work in Lucene and Solr project
> > To: dev@lucene.apache.org
> >
> > [...]
> >
> > MG>if the pairing was 100% accurate then yes, a predictor for both files changing indicates a design issue is lurking, i.e.
> > MG>IndexWriter and StandardDirectoryReader "share functionality", which would suggest breaking the shared methods out into an interface and
> > MG>refactoring IndexWriter and StandardDirectoryReader to each implement that shared interface
> > MG>if attributes are to be shared then perhaps an abstract class should be created to contain those shared attributes and implement
> > MG>the shared methods
> > MG>refactoring IndexWriter and StandardDirectoryReader to extend the abstract class should force the implementor to override/reuse
> > MG>shared attributes in the abstract base class?
> >
> >
> >
> > - Do these results surprise you? Can you think of any explanation for the results?
> >
> > - Do you think that our rate of prediction is good enough to be used for building tool support for the software community?
> >
> > MG>if the plugin can predict with 100% accuracy?
> >
> > - Do you have any suggestions on what can be done to improve the change recommendations?
> >
> > MG>create the tool as a Maven plugin so we can bind this functionality to one of the pre-compile phases, e.g. process-sources?
> >
> > You can visit these web pages to inspect the results in detail:
> >
> > Lucene Project: http://flosscoach.com/index.php/17-cochanges/73-lucene
> >
> > Solr Project: http://flosscoach.com/index.php/17-cochanges/74-solr
> >
> > All the best,
> > Igor Wiese
> >
> > Phd Candidate
> >
> >
> >
> > MG>Obrigado do EEUU
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Feedback of my Phd work in Lucene and Solr project

Posted by Igor Wiese <ig...@gmail.com>.
Hi Uwe.

We used commits recorded in SVN, not Git. That probably minimized the
problem, but we got far fewer commits. In fact we analyzed 4 releases
of Solr (1.1, 1.2, 1.3 and 1.4, the last one) and 10 releases of
Lucene.

About the heavy commits: in some cases we found more than one commit for
a single issue, and we also tried to identify heavy commits. However,
we minimized the number of heavy commits by filtering out commits with
more than 20 files (to inspect) and removing commits not associated
with issues in JIRA.
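
A minimal sketch of such a filter is shown below. It is an illustration rather than the study's actual code: it assumes that a commit is linked to JIRA when its message contains an issue key such as LUCENE-1234 or SOLR-567, and the function name and sample commits are hypothetical.

```python
import re

# Assumption: commits reference JIRA issues by key in the commit message.
ISSUE_KEY = re.compile(r"\b(?:LUCENE|SOLR)-\d+\b")
MAX_FILES = 20  # commits touching more files than this are treated as heavy


def keep_commit(message, files):
    """Keep a commit only if it is not heavy and it mentions a JIRA issue key."""
    return len(files) <= MAX_FILES and ISSUE_KEY.search(message) is not None


# Hypothetical commits as (message, changed files) tuples.
commits = [
    ("LUCENE-1234: fix reader reopen", ["IndexWriter.java"]),
    ("fix typo in javadocs", ["IndexWriter.java"]),
    ("SOLR-567: large cleanup", [f"File{i}.java" for i in range(21)]),
]

kept = [message for message, files in commits if keep_commit(message, files)]
print(kept)
```

Only the first commit survives: the second has no issue key and the third touches more than 20 files.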

Just to let you know, I will report the situations you pointed out
in my "threats to validity" discussion.

I really liked your suggestion to inspect changes between
sub-projects. I will think of a way to evaluate "inter-subproject"
changes. We analyzed each project as an individual source,
but it would be very interesting to check this relation when changes
in Solr cause changes in Lucene and vice versa. It is good future
work to do.

Thanks for your comment and opinion about changes in IndexWriter and
StandardDirectoryReader.

All the best,

Igor


2015-12-10 12:36 GMT-02:00 Uwe Schindler <uw...@thetaphi.de>:
>
> Hi,
>
>
>
> There is one general problem in the analysis:
>
>
>
> For approx. 5 1/2 years now, Lucene and Solr have been one project and are no longer separate (see https://github.com/apache/solr/blob/trunk/trunk_development_moved.txt; https://github.com/apache/lucene/blob/trunk/trunk_development_moved.txt). You can still look at commits of Solr and Lucene separately, but both are in one repository: https://github.com/apache/lucene-solr:
>
> - trunk/lucene (master/lucene on GitHub): contains the code of Lucene
>
> - trunk/solr (master/solr on GitHub): contains the code of Solr
>
> - trunk/dev-tools: shared development scripts, release management scripts
>
> But you have to be aware that many commits also change files in both sub-projects, because the projects are linked together: when you change something like an API in Lucene, you have to commit the adaptation in Solr at the same time. You see this in the first report, which your mail declared to cover only Lucene, but which in fact looked at the whole project. Many source code pairs also contained classes from both sub-projects.
>
> The reason why the number of commits in what you think of as “Solr” was low is easy to see: you just looked at Solr version 1.4, which was the last release of Solr alone. The old location in SVN is no longer used! The number of commits is lower there; if you look at both projects at the same time, you will see way more commits (your first statistic).
>
>
>
> MG>if the pairing was 100% accurate then yes a predictor for both files changing indicates a design issue is lurking i.e
> MG>IndexWriter and StandardDirectoryWriter "share functionality" which would suggest breaking shared methods to interface
> MG>refactoring IndexWriter and StandardDirectoryReader to each implement that shared Interface
> MG>if attributes are to be shared then perhaps an abstract class should be created to contain those shared attributes and implement
> MG>the shared methods
> MG>refactoring IndexWriter and StandardDirectoryReader to extend the abstract class should force implementor to override/reuse
> MG>shared attributes in the Abstract Base Class?
>
>
>
> Keep in mind that a change in those 2 files may also mean sonmething else: IndexWriter uses StandardDirectoryReader to delegate some tasks (Near Realtime Reader support, so there is no code duplication, both are just working together very closely). If you change StandardDirectoryReader’s package private APIs (but also public APIs) and IndexWriter calls them, you have to change the caller, too. This is why you see a change on StandardDirectoryReader together with a change in IndexWriter. But the other way round is different: Changes in IndexWriter seldomly cause changes in StandardDirectoryReader, so its correct what you see.
>
>
>
> The same applies to Solr and Lucene: A change in Solr seldomly causes a change in Lucene, but API breaking changes in Lucene always cause a change in Solr. The Lucene 4 series (around 4.7, which you looked at) was very active in redoing APIs, also we changed to Java 7 in 4.8. So it is very likely that you see changes in Lucene that cause changes in Solr. The argumentation is the same: Whenever you break API, you have to fix callers, too. E.g. Lucene 4.8 had many code refactoring issues (move from Java 6 to Java 7), cleanups of Javadocs,… which naturally touch many files.
>
>
>
> But in general for a subversion based-project like Lucene/Solr it is very hard to rectify a design issue like Martin Gainty did. In the Apache Software Foundation based Subversion projects (nor Git), we generally have a single commit after the whole issue is resolved. We don’t record changes during development of those patches. Some of the patches that came out are very large changes and committed after several days / weeks of work. The likelyhood that you see changes in many unrelated files is high. This differs from Git-based projects, where you commit much more often (we call it “heavy commit”).
>
> Uwe
>
>
>
> -----
>
> Uwe Schindler
>
> H.-H.-Meier-Allee 63, D-28213 Bremen
>
> http://www.thetaphi.de
>
> eMail: uwe@thetaphi.de
>
>
>
> From: Igor Wiese [mailto:igor.wiese@gmail.com]
> Sent: Thursday, December 10, 2015 2:59 PM
> To: dev@lucene.apache.org
> Subject: Re: Feedback of my Phd work in Lucene and Solr project
>
>
>
> Hi MG. Thanks for the portuguese :-)
>
> I really enjoyed your example. I don't know much about the Lucene/Solr architecture of, but I completely agree. Probably, there is a design problem in this case because the classes seem to be "related". But some of pairs of files that we tested, we couldn't make assumptions because it is not clear why the classes changed together. We probably need to manually inspect the set issues where the files changed together to find the "reason". In some cases, could be very difficuld without have enough know-how of the project.
>
> The good point is that "for a newcomer", for example, it would be hard to find the relation that you mentioned. In such cases we could help :). Do you agree?
>
> I really enjoyed the ideia of "maven plugin". We are creating a tool like a "web service" that could be integrated with the Issue Tracker, but.. i really liked your ideia. I will think about it. Thanks!
>
> Probably we couldn't predict with 100% of accuracy in all of cases :-). In average, as I mentioned, to Lucene we tested more than 1000 commits with 66% of accuracy. To solr the accuracy was low (47%). Probably, the reason to this low accuracy in Solr is related to the number of commits that we used to construct the prediction models. We used 10x less commits in Solr than Lucene.
>
> Considering that in each 4 commits, in 3 of them we could give you good recomendations to change two files together, is good? Do you think that could "save" your time to find the correct files to complete the change?
>
> Thanks Again, MG
>
> All the best,
>
> Igor Wiese
>
>
>
>
>
> 2015-12-10 11:21 GMT-02:00 Martin Gainty <mg...@hotmail.com>:
>
>
>
>
> ________________________________
>
> From: igor.wiese@gmail.com
> Date: Wed, 9 Dec 2015 23:48:10 +0000
> Subject: Feedback of my Phd work in Lucene and Solr project
> To: dev@lucene.apache.org
>
> Hi, Lucene and Solr Community.
>
>
>
> My name is Igor Wiese, phd Student from Brazil. In my research I am investigating two important questions: What makes two files change together? Can we predict when they are going to co-change again?
>
>
>
> I've tried to investigate this question on the Lucene and Solr project. I've collected data from issue reports, discussions and commits and using some machine learning techniques to build a prediction model.
>
>
>
> I collected a total of 1382 commits in which a pair of files changed together and could correctly predict 66% commits in the Lucene Project. For the Solr Project I collected a total of 111 commits in which a pair of files changed together and could correctly predict 47% commits.
>
>
>
> These were the most useful information for predicting co-changes of files:
>
> - number of lines of code added,
>
> - number of lines of code removed,
>
> - sum of number of lines of code added, modified and removed,
>
> - number of words used to describe and discuss the issues, and
>
> - median value of closeness, a social network measure obtained from issue comments.
>
>
>
> To illustrate, consider the following example in Lucene Project from our analysis. For release 4.7, the files "lucene/index/IndexWriter.java" and "lucene/index/StandardDirectoryReader.java" changed together in 4 commits. In another 11 commits, only the first file changed, but not the second. Collecting contextual information for each commit made to first file in previous release, we were able to predict 3 commits in which both files changed together in release 4.7, and we issued 0 false positive, and one wrong prediction. For this pair of files, the most important contextual information was the number of lines of code added in each commit, the number of words used to describe and discuss the issues, the number of comments in each issue and the social network metric (closeness) obtained from issue comments.
>
> MG>if the pairing was 100% accurate then yes a predictor for both files changing indicates a design issue is lurking i.e
> MG>IndexWriter and StandardDirectoryWriter "share functionality" which would suggest breaking shared methods to interface
> MG>refactoring IndexWriter and StandardDirectoryReader to each implement that shared Interface
> MG>if attributes are to be shared then perhaps an abstract class should be created to contain those shared attributes and implement
> MG>the shared methods
> MG>refactoring IndexWriter and StandardDirectoryReader to extend the abstract class should force implementor to override/reuse
> MG>shared attributes in the Abstract Base Class?
>
>
>
> - Do these results surprise you? Can you think in any explanation for the results?
>
> - Do you think that our rate of prediction is good enough to be used for building tool support for the software community?
>
> MG>if the plugin can predict with 100% accuracy?
>
> - Do you have any suggestion on what can be done to improve the change recommendation?
>
> MG>create the tool as a maven plugin so we can bind this functionality to one of the pre compile phases e.g. process-sources?
>
>
>
> You can visit a webpage to inspect the results in details:
>
> Lucene Project: http://flosscoach.com/index.php/17-cochanges/73-lucene
>
> Solr Project: http://flosscoach.com/index.php/17-cochanges/74-solr
>
> All the best,
> Igor Wiese
>
> Phd Candidate
>
>
>
> MG>Obrigado do EEUU
>
>
>
>
> --
>
> =================================
> Igor Scaliante Wiese
> PhD Candidate - Computer Science @ IME/USP
> Faculty in Dept. of Computing at Universidade Tecnológica Federal do Paraná




-- 
=================================
Igor Scaliante Wiese
PhD Candidate - Computer Science @ IME/USP
Faculty in Dept. of Computing at Universidade Tecnológica Federal do Paraná



RE: Feedback of my Phd work in Lucene and Solr project

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

 

There is one general problem in the analysis:

 

For approximately 5 1/2 years, Lucene and Solr have been one project and are no longer separate (see https://github.com/apache/solr/blob/trunk/trunk_development_moved.txt; https://github.com/apache/lucene/blob/trunk/trunk_development_moved.txt). You can still look at commits of Solr and Lucene separately, but both are in one repository, https://github.com/apache/lucene-solr:

- trunk/lucene (master/lucene on GitHub): contains the Lucene code

- trunk/solr (master/solr on GitHub): contains the Solr code

- trunk/dev-tools: shared development scripts and release management scripts

 

But you have to be aware that many commits also change files in both sub-projects, because the projects are linked together: when you change something like an API in Lucene, you have to commit the adaptation in Solr at the same time. You can see this in the first report, which your mail declared to cover only Lucene, but which in fact looked at the whole project. Many source code pairs also contained classes from both sub-projects.
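
One way to check for this is to classify each changed path by its top-level directory and flag commits that touch both sub-projects. The following is a minimal sketch, assuming paths are given relative to the repository root and may start with a branch directory such as "trunk"; the function names are made up for illustration and the file paths are only sample input.

```python
from typing import Iterable

KNOWN = ("lucene", "solr", "dev-tools")  # top-level directories listed above


def subproject(path: str) -> str:
    """Map a repository path to 'lucene', 'solr', 'dev-tools' or 'other'."""
    parts = [p for p in path.split("/") if p]
    # Drop a leading branch directory such as "trunk" if it is present.
    if parts and parts[0] == "trunk":
        parts = parts[1:]
    top = parts[0] if parts else ""
    return top if top in KNOWN else "other"


def touches_both(files: Iterable[str]) -> bool:
    """True if the commit changes files under both lucene/ and solr/."""
    projects = {subproject(f) for f in files}
    return {"lucene", "solr"} <= projects


# Sample input: one Lucene file and one Solr file in the same commit.
commit_files = [
    "trunk/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java",
    "trunk/solr/core/src/java/org/apache/solr/core/SolrCore.java",
]
cross_project = touches_both(commit_files)
```

Commits flagged this way could be counted separately, so that a report labelled "Lucene" does not silently include pairs with Solr classes.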

 

The reason why the number of commits in what you think of as “Solr” was low is easy to see: you only looked at Solr version 1.4, which was the last release of Solr as a standalone project. The old location in SVN is no longer used! That is why the number of commits is lower; if you look at both projects at the same time, you will see far more commits (your first statistic).

 

MG>if the pairing was 100% accurate then yes a predictor for both files changing indicates a design issue is lurking i.e
MG>IndexWriter and StandardDirectoryReader "share functionality" which would suggest breaking shared methods to interface
MG>refactoring IndexWriter and StandardDirectoryReader to each implement that shared Interface
MG>if attributes are to be shared then perhaps an abstract class should be created to contain those shared attributes and implement
MG>the shared methods
MG>refactoring IndexWriter and StandardDirectoryReader to extend the abstract class should force implementor to override/reuse
MG>shared attributes in the Abstract Base Class?

 

Keep in mind that a change in those 2 files may also mean something else: IndexWriter uses StandardDirectoryReader to delegate some tasks (near-real-time reader support, so there is no code duplication; the two just work together very closely). If you change StandardDirectoryReader’s package-private APIs (or its public APIs) and IndexWriter calls them, you have to change the caller, too. This is why you see a change in StandardDirectoryReader together with a change in IndexWriter. But the other way round is different: changes in IndexWriter seldom cause changes in StandardDirectoryReader, so what you see is correct.
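
This asymmetry can be made visible by measuring co-change in each direction separately. The sketch below is an illustration under assumptions, not the method used in the study: each commit is represented as a set of file names, and the helper name is hypothetical. The 4 joint commits and 11 IndexWriter-only commits are the figures from the original post; the number of commits touching only StandardDirectoryReader is not given in this thread and is assumed to be 0 here.

```python
from typing import Iterable, Optional, Set

IW = "lucene/index/IndexWriter.java"
SDR = "lucene/index/StandardDirectoryReader.java"


def cochange_confidence(commits: Iterable[Set[str]], a: str, b: str) -> Optional[float]:
    """Fraction of commits touching `a` that also touch `b` (None if none touch `a`)."""
    with_a = [files for files in commits if a in files]
    if not with_a:
        return None
    return sum(1 for files in with_a if b in files) / len(with_a)


# 4 commits changed both files, 11 changed only IndexWriter (release 4.7).
# Assumption: no commit changed only StandardDirectoryReader.
history = [{IW, SDR}] * 4 + [{IW}] * 11

iw_to_sdr = cochange_confidence(history, IW, SDR)   # 4 / 15
sdr_to_iw = cochange_confidence(history, SDR, IW)   # 4 / 4 under the assumption
```

With these numbers, a change in IndexWriter is followed by a change in StandardDirectoryReader in roughly a quarter of the commits, while (under the stated assumption) every change in StandardDirectoryReader comes with a change in IndexWriter.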

 

The same applies to Solr and Lucene: a change in Solr seldom causes a change in Lucene, but API-breaking changes in Lucene always cause a change in Solr. The Lucene 4 series (around 4.7, which you looked at) was very active in redoing APIs, and we also moved to Java 7 in 4.8. So it is very likely that you see changes in Lucene that cause changes in Solr. The reasoning is the same: whenever you break an API, you have to fix the callers, too. For example, Lucene 4.8 had many code refactoring issues (the move from Java 6 to Java 7), Javadoc cleanups, and so on, which naturally touch many files.

 

But in general, for a Subversion-based project like Lucene/Solr it is very hard to identify a design issue the way Martin Gainty did. In the Subversion-based (not Git-based) projects of the Apache Software Foundation, we generally have a single commit after the whole issue is resolved. We don’t record changes during the development of those patches. Some of the resulting patches are very large changes, committed after several days or weeks of work (we call this a “heavy commit”). The likelihood that you see changes in many unrelated files is high. This differs from Git-based projects, where you commit much more often.

Uwe

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: uwe@thetaphi.de

 



Re: Feedback of my Phd work in Lucene and Solr project

Posted by Igor Wiese <ig...@gmail.com>.
Hi MG. Thanks for the Portuguese :-)

I really enjoyed your example. I don't know much about the Lucene/Solr
architecture, but I completely agree. There is probably a design problem
in this case, because the classes seem to be "related". But for some of
the pairs of files that we tested, we couldn't make assumptions because it
is not clear why the classes changed together. We probably need to
manually inspect the set of issues in which the files changed together to
find the "reason". In some cases, this could be very difficult without
enough know-how of the project.

The good point is that "for a newcomer", for example, it would be hard to
find the relation that you mentioned. In such cases we could help :). Do
you agree?

I really enjoyed the idea of a "maven plugin". We are creating a tool like
a "web service" that could be integrated with the issue tracker, but I
really liked your idea. I will think about it. Thanks!

We probably can't predict with 100% accuracy in all cases :-). On
average, as I mentioned, for Lucene we tested more than 1000 commits with
66% accuracy. For Solr the accuracy was low (47%). The reason for this low
accuracy in Solr is probably related to the number of commits that we
used to build the prediction models. We used about 10x fewer commits for
Solr than for Lucene.

Considering that in 3 out of every 4 commits we could give you good
recommendations to change two files together, is that good? Do you think
that could save you time in finding the correct files to complete the
change?
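
The "3 out of 4" figure can be read as recall, with precision computed next
to it. The sketch below uses the IndexWriter/StandardDirectoryReader example
from the original post (3 co-changes predicted, 0 false positives) and
assumes that the "one wrong prediction" was a missed co-change, i.e. a false
negative; that reading is an assumption, and the function name is
hypothetical.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall); 0.0 where the denominator would be zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# 3 correctly predicted co-changes, 0 false positives,
# 1 missed co-change (assumed reading of "one wrong prediction").
precision, recall = precision_recall(tp=3, fp=0, fn=1)
```

Under that reading, precision is 3/3 = 1.0 and recall is 3/4 = 0.75 for this
pair of files. Reporting both values separately shows whether a tool mostly
raises false alarms or mostly misses co-changes.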

Thanks Again, MG
All the best,
Igor Wiese





-- 
=================================
Igor Scaliante Wiese
PhD Candidate - Computer Science @ IME/USP
Faculty in Dept. of Computing at Universidade Tecnológica Federal do Paraná

RE: Feedback of my Phd work in Lucene and Solr project

Posted by Martin Gainty <mg...@hotmail.com>.


From: igor.wiese@gmail.com
Date: Wed, 9 Dec 2015 23:48:10 +0000
Subject: Feedback of my Phd work in Lucene and Solr project
To: dev@lucene.apache.org

Hi, Lucene and Solr Community.

My name is Igor Wiese, phd Student from Brazil. In my research I am
investigating two important questions: What makes two files change
together? Can we predict when they are going to co-change again?

I've tried to investigate this question on the Lucene and Solr project.
I've collected data from issue reports, discussions and commits and using
some machine learning techniques to build a prediction model.

I collected a total of 1382 commits in which a pair of files changed
together and could correctly predict 66% commits in the Lucene Project. For
the Solr Project I collected a total of 111 commits in which a pair of
files changed together and could correctly predict 47% commits.

These were the most useful information for predicting co-changes of files:

- number of lines of code added,

- number of lines of code removed,

- sum of number of lines of code added, modified and removed,

- number of words used to describe and discuss the issues, and

- median value of closeness, a social network measure obtained from issue
comments.

To illustrate, consider the following example in Lucene Project from our
analysis. For release 4.7, the files "lucene/index/IndexWriter.java" and
"lucene/index/StandardDirectoryReader.java" changed together in 4 commits.
In another 11 commits, only the first file changed, but not the second.
Collecting contextual information for each commit made to first file in
previous release, we were able to predict 3 commits in which both files
changed together in release 4.7, and we issued 0 false positive, and one
wrong prediction. For this pair of files, the most important contextual
information was the number of lines of code added in each commit, the
number of words used to describe and discuss the issues, the number of
comments in each issue and the social network metric (closeness) obtained
from issue comments.

MG>if the pairing was 100% accurate then yes a predictor for both files changing indicates a design issue is lurking i.e
MG>IndexWriter and StandardDirectoryReader "share functionality" which would suggest breaking shared methods to interface
MG>refactoring IndexWriter and StandardDirectoryReader to each implement that shared Interface
MG>if attributes are to be shared then perhaps an abstract class should be created to contain those shared attributes and implement
MG>the shared methods
MG>refactoring IndexWriter and StandardDirectoryReader to extend the abstract class should force implementor to override/reuse
MG>shared attributes in the Abstract Base Class?

- Do these results surprise you? Can you think in any explanation for the
results?

- Do you think that our rate of prediction is good enough to be used for
building tool support for the software community?

MG>if the plugin can predict with 100% accuracy?

- Do you have any suggestion on what can be done to improve the change
recommendation?

MG>create the tool as a maven plugin so we can bind this functionality to one of the pre compile phases e.g. process-sources?

You can visit a webpage to inspect the results in details:

Lucene Project: http://flosscoach.com/index.php/17-cochanges/73-lucene
Solr Project: http://flosscoach.com/index.php/17-cochanges/74-solr

All the best,
Igor Wiese
Phd Candidate

MG>Obrigado do EEUU