Posted to general@incubator.apache.org by Roman Shaposhnik <ro...@shaposhnik.org> on 2014/04/03 07:49:23 UTC

Re: [DISCUSS] Proposal for a Black Duck POC

On Mon, Mar 31, 2014 at 2:13 AM, Rob Vesse <rv...@dotnetrdf.org> wrote:
> Roman
>
> Black Duck software certainly have a useful platform though it would be
> useful to know what they are considering using for the POC.

I think it would be fair to say that they are looking at us to set
requirements for this POC. This doesn't seem to be a drive-by
software donation on their part, but rather a genuine interest
in seeing how their software *and* services can be leveraged
by ASF.

> I would certainly recommend trying a POC but I'm not sure it is
> necessarily something you'd want to impose on all incoming projects in the
> long term.

Indeed. The proof is very much in the pudding. Personally
I'm curious and willing to collaborate with BlackDuck. And it
looks like at least you and Jim fall into the same category.
It'll be fun!

> My main concerns are that Protex, while very useful, is somewhat dumb,
> primarily due to the quality of its knowledge base.  For those who aren't
> aware, essentially the tool scans the code looking for files that have
> "signatures" that match other open source/proprietary code in the
> knowledge base.  The open source code is scraped from all sorts of public
> sites like SourceForge, GitHub, BitBucket etc.  For each match that occurs
> someone has to review the match and then indicate whether to
> exclude that match (i.e. it was a false positive) or to accept that match
> and attribute it appropriately.
>
> This is great in principle because it easily spots obvious plagiarism when
> it occurs.  The problem from my point of view is that the false positive
> rate is very high, and then you have to go through all the matches and
> manually state whether they are valid/invalid.  This ends up being very
> time consuming because for each match on your code you have to review all
> the possible matches to see if there actually is a genuine match and, if
> not, then go through a process of telling the tool that it is not.
>
> This is where the knowledge base starts to hurt you: there are lots of
> projects out there which check in everything, including things like
> auto-generated IDE project files, build tool reports, VCS ignore files etc.,
> which tend to have very high similarity and get flagged up as false
> positives constantly.  Ideally Apache projects won't themselves be
> checking these things in, so the chances of these getting flagged should be
> low.
>
> As a more practical example, I had a recent case where I was working
> through an analysis on some Hadoop-related code my company is considering
> open sourcing, which is primarily a collection of implementations of
> InputFormat and OutputFormat.  A good number of our code files were
> flagged as potential matches and, when reviewed, the only similarity was
> that we had the same set of imports as many other Hadoop ecosystem
> projects.  This is of course exacerbated by the fact that many developers
> use IDEs which organise their imports!  So I had to spend several hours
> checking each file and ticking boxes in Protex to say that this was
> original code and not plagiarised.
>
> I would definitely recommend carrying out a POC and seeing what people
> make of it but be aware that it can be a painful and time consuming
> process.

Good points! Definitely worth keeping in mind.
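For anyone on the list who hasn't seen this class of tool, the false-positive
effect Rob describes is easy to reproduce. Protex's actual fingerprinting is
proprietary; the toy sketch below just uses hashed line "shingles" and Jaccard
similarity (my own stand-ins, not Black Duck's algorithm) to show why two
unrelated files that share only a boilerplate import block still score a
nonzero match:

```python
import hashlib

def signatures(text, k=4):
    """Hash every k-line window ("shingle") of a file.

    Real scanners use proprietary fingerprints; hashed line
    shingles are just an illustrative stand-in.
    """
    lines = [l.strip() for l in text.strip().splitlines()]
    return {hashlib.sha1("\n".join(lines[i:i + k]).encode()).hexdigest()
            for i in range(max(1, len(lines) - k + 1))}

def similarity(a, b):
    """Jaccard similarity between the two files' signature sets."""
    sa, sb = signatures(a), signatures(b)
    return len(sa & sb) / len(sa | sb)

# Two files whose only overlap is an IDE-organised import block.
ours = """import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
class OurReader {}"""

theirs = """import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
class TheirReader {}"""

# The shared import window pushes similarity well above zero,
# so a naive scanner flags the pair for manual review.
print(similarity(ours, theirs))
```

With a knowledge base of millions of files, every one of those spurious
matches is a box someone has to tick by hand, which is exactly the cost
we'd want the POC to quantify.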

Thanks,
Roman.

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org