Posted to dev@community.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2016/02/03 02:39:13 UTC

Re: DRAT is now scanning Apache SVN code base!

Hi Karanjeet,

A good bunch of work has already gone into this and it is looking really
friggin smart indeed.
Interesting to see so many pieces of software come together and result in
something very easy to interpret.
Good work.
Lewis

On Mon, Feb 1, 2016 at 11:44 PM, <de...@community.apache.org>
wrote:

> Hello Everyone,
>
> With great pleasure, I would like to introduce DRAT (Distributed Release
> Audit Tool) which is a distributed, parallelized wrapper around Apache RAT
> to inspect for appropriate open source licensing in software projects.
> DRAT was started by my advisor, Chris Mattmann, in an effort to get RAT
> working on a very large code base. DRAT uses Apache OODT, Apache Tika, and
> Apache Solr.
>
> We are now auditing the complete Apache SVN code base to check for proper
> licenses. So far, we have scanned 171 / 191 repositories and
> illustrated the statistics for 133 of them through a D3 visualization
> located at http://drat.dyndns.org:8080/dratviz
>
> Projects should check out the MIME analysis of the code base and click
> around. Please also note that, due to the sheer size of the Apache code
> bases and the fact that we scanned and included all revisions in the
> Apache SVN repo, DRAT is not running in real time. We are running DRAT
> on the NSF supercomputer Wrangler, which has a petabyte of flash storage and the
> ability to stand up Hadoop and Spark clusters. We are also working on a
> paper describing our results.
>
> Please send feedback to me (Karanjeet Singh <ka...@usc.edu>),
> Professor Mattmann <ma...@usc.edu> and/or irds-L@mymaillists.usc.edu.
>
> Thanks & Regards,
> Karanjeet Singh
> C.S. Graduate Student
> University of Southern California
> karanjes@usc.edu | +1-213-675-9583

Re: DRAT is now scanning Apache SVN code base!

Posted by muktesh mishra <mu...@hotmail.com>.
Nice work, Karanjeet!!

It is helping in multiple ways and will surely improve over time.

Do we have plans to implement more custom attribute-based dashboards as well?

-Muktesh

> On Feb 3, 2016, at 7:12 AM, Mattmann, Chris A (3980) <ch...@jpl.nasa.gov> wrote:
> 
> Hey Tony,
> 
> Sorry we should have been more clear about that. We are using
> svn-dump (we aren’t doing this in real-time). Karanjeet can
> confirm, but from that dump file, I think we were able to
> generate this interactive app using DRAT in about a week or
> a little more. It’s worth noting that it’s *all* the revisions
> in that dump that we are scanning so this captures the evolution
> of the repo over time as well.
> 
> Thanks for the pointers to Git repos too. We’ll scan those
> with DRAT and Wrangler next.
> 
> Also hat tip to the Solr community and so forth too - our DRAT
> stats are powered by Apache Solr (and D3).
> 
> Cheers,
> Chris
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> 
> 
> 
> 
> -----Original Message-----
> From: Tony Stevenson <to...@pc-tony.com>
> Reply-To: "dev@community.apache.org" <de...@community.apache.org>
> Date: Wednesday, February 3, 2016 at 1:06 AM
> To: "dev@community.apache.org" <de...@community.apache.org>, Karanjeet Singh
> <ka...@usc.edu>
> Cc: "pierre.smits@gmail.com" <pi...@gmail.com>, Infra
> <in...@apache.org>
> Subject: Re: DRAT is now scanning Apache SVN code base!
> 
>> cc += infra@
>> 
>> Karanjeet,
>> 
>> I am writing to you whilst wearing my Infrastructure hat.
>> 
>> Please be careful if you are indeed recursing the entire ASF subversion
>> repository (http://svn.apache.org) - as you will quite likely run into
>> the aug-banning service.
>> Have you seen https://svn-dump.apache.org ?  This is an entire dump of
>> the SVN repo (at least the public one you are interested in). You can use
>> this, and it is updated monthly. If you really need fully up-to-date data,
>> you can use the dump and svnsync the remaining revisions.
>>
>> I guess this might be obvious, but I’ll mention it just in case.  A lot
>> of projects are using git repositories too, which are mirrored here:
>> github.com/apache/
>> 
>> 
>> 
>> --
>> Cheers,
>> Tony
>> 
>> On behalf of the Apache Infrastructure Team
>> 
>> -----------------------
>> http://www.pc-tony.com
>> GPG - 3072D/2543E323
>> -----------------------
>> 
>>> On 3 Feb 2016, at 08:58, Karanjeet Singh <ka...@usc.edu> wrote:
>>> 
>>> Thanks Pierre for your feedback.
>>> 
>>> Yes, the visualization corresponds to only 133 / 191 SVN projects (
>>> http://svn.apache.org/repos/asf/). We have successfully audited close to
>>> 175 projects and hopefully by the end of this week all the remaining
>>> projects should be covered. We will update the data once done.
>>> 
>>> Large repositories like "subversion" and "camel", with 493,420 files
>>> (approx. 9,723 MB) and 519,584 files (approx. 1,922 MB) respectively,
>>> take at most 36 hours to complete, which is quite a good number.
>>>
>>> For your second question, I don't have an answer yet. Our intention is
>>> to update this regularly, but Wrangler does not allow us to run a job
>>> for more than 48 hours. Therefore, for very large repositories like
>>> openoffice, spamassassin, myfaces, etc., which take more time to audit,
>>> it will be a challenge to split the repositories and scan them every
>>> time.
>>> 
>>> Best Regards,
>>> Karanjeet Singh
>>> CS Graduate Student
>>> University of Southern California
>>> karanjes@usc.edu | +1-213-675-9583
>>> 
>>> 
>>> On Wed, Feb 3, 2016 at 12:06 AM, Pierre Smits <pi...@gmail.com>
>>> wrote:
>>> 
>>>> Hi Karanjeet,
>>>>
>>>> This is surely an impressive piece of work. But I still notice that some
>>>> projects are missing in the overview. Is this a mere PoC not intended to
>>>> be complete? Or something that will be made available to all and be
>>>> updated regularly?
>>>> 
>>>> Best regards,
>>>> 
>>>> Pierre Smits
>>>> 
>>>> ORRTIZ.COM <http://www.orrtiz.com>
>>>> OFBiz based solutions & services
>>>>
>>>> OFBiz Extensions Marketplace
>>>> http://oem.ofbizci.net/oci-2/
>>>> 

Re: DRAT is now scanning Apache SVN code base!

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Hey Tony,

Sorry we should have been more clear about that. We are using
svn-dump (we aren’t doing this in real-time). Karanjeet can
confirm, but from that dump file, I think we were able to
generate this interactive app using DRAT in about a week or
a little more. It’s worth noting that it’s *all* the revisions
in that dump that we are scanning so this captures the evolution
of the repo over time as well.

Thanks for the pointers to Git repos too. We’ll scan those
with DRAT and Wrangler next.

Also hat tip to the Solr community and so forth too - our DRAT
stats are powered by Apache Solr (and D3).

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
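
For readers who want to see what "all the revisions in that dump" looks
like in practice, here is a minimal Python sketch, not DRAT's own code. It
assumes the standard svnadmin dump layout, in which every revision record
starts with a "Revision-number: N" header line. The function name and the
tiny in-memory sample are made up for illustration. A file body that
happens to contain a line starting with the same text would be miscounted,
so a robust parser would honour the Content-length headers instead.

```python
import io

REVISION_HEADER = b"Revision-number: "


def revision_numbers(stream):
    """Yield each revision number found in an SVN dump stream (binary mode)."""
    for line in stream:
        if line.startswith(REVISION_HEADER):
            # Everything after the header prefix is the decimal revision number.
            yield int(line[len(REVISION_HEADER):].strip())


# Hypothetical two-revision dump, trimmed down to its header lines.
sample = io.BytesIO(
    b"SVN-fs-dump-format-version: 2\n"
    b"\n"
    b"UUID: 00000000-0000-0000-0000-000000000000\n"
    b"\n"
    b"Revision-number: 0\n"
    b"Prop-content-length: 10\n"
    b"Content-length: 10\n"
    b"\n"
    b"PROPS-END\n"
    b"\n"
    b"Revision-number: 1\n"
    b"Prop-content-length: 10\n"
    b"Content-length: 10\n"
    b"\n"
    b"PROPS-END\n"
)

revisions = list(revision_numbers(sample))
print(len(revisions), "revisions; latest is r%d" % revisions[-1])
```

For a real dump file, the same function can be fed a file object opened
with mode "rb", so the dump is streamed line by line rather than loaded
into memory.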







Re: DRAT is now scanning Apache SVN code base!

Posted by Don Cunningham <ot...@gmail.com>.
On Feb 3, 2016 4:06 AM, "Tony Stevenson" <to...@pc-tony.com> wrote:

> cc += infra@
>
> Karanjeet,
>
> I am writing to you whilst wearing my Infrastructure hat.
>
> Please be careful if you are indeed recursing the entire ASF subversion
> repository (http://svn.apache.org) - as you will quite likely run into
> the aug-banning service.
> Have you seen https://svn-dump.apache.org ?  This is an entire dump of
> the SVN repo (at least the public one you are interested in). You can use
> this, and it is updated monthly. If you really need fully up-to-date data,
> you can use the dump and svnsync the remaining revisions.
>
> I guess this might be obvious, but I’ll mention it just in case.  A lot of
> projects are using git repositories too, which are mirrored here:
> github.com/apache/
>
>
>
> --
> Cheers,
> Tony
>
> On behalf of the Apache Infrastructure Team
>
> -----------------------
> http://www.pc-tony.com
> GPG - 3072D/2543E323
> -----------------------
>

Re: DRAT is now scanning Apache SVN code base!

Posted by Karanjeet Singh <ka...@usc.edu>.
Thanks Don and Tony.

Yes, we used the http://svn-dump.apache.org/ link to download the SVN
dump, and we are running DRAT on that.

The other link was just for reference.

I hope I am safe from the aug-banning service. :)

Best Regards,
Karanjeet Singh
C.S. Graduate Student
University of Southern California
karanjes@usc.edu | +1-213-675-9583

Re: DRAT is now scanning Apache SVN code base!

Posted by Don Cunningham <ot...@gmail.com>.
On Feb 3, 2016 4:06 AM, "Tony Stevenson" <to...@pc-tony.com> wrote:

> cc += infra@
>
> Karanjeet,
>
> I am writing to you whilst wearing my Infrastructure hat.
>
> Please be careful if you are indeed recursing the entire ASF subversion
> repository (http://svn.apache.org) - as you will quite likely run into
> the aug-banning service.
> Have you seen https://svn-dump.apache.org ?  This is an entire dump of
> the SVN repo (at least the public one you are interested in. You can use
> this, and it is updated monthly. If you really need fully upto date data
> you can use the dump, and svnsync the remaining revisions.
>
> I guess this might be obvious, but I’ll mention it just in case.  A lot of
> projects are using git repositories too. Which are mirrored here:
> github.com/apache/
>
>
>
> --
> Cheers,
> Tony
>
> On behalf of the Apache Infrastructure Team
>
> -----------------------
> http://www.pc-tony.com
> GPG - 3072D/2543E323
> -----------------------
>

Re: DRAT is now scanning Apache SVN code base!

Posted by Tony Stevenson <to...@pc-tony.com>.
cc += infra@

Karanjeet,

I am writing to you whilst wearing my Infrastructure hat.  

Please be careful if you are indeed recursing the entire ASF subversion repository (http://svn.apache.org), as you will quite likely run into the auto-banning service.
Have you seen https://svn-dump.apache.org ?  This is an entire dump of the SVN repo (at least the public one you are interested in). You can use this, and it is updated monthly. If you really need fully up-to-date data, you can use the dump and svnsync the remaining revisions.
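
A rough sketch of the dump-plus-svnsync route described above, assuming an uncompressed dump file has already been downloaded. The dump file name and the local mirror path are placeholders, not anything confirmed in this thread, and the sketch assumes the loaded dump matches the source repository revision for revision. The script only builds the command lines (standard svnadmin and svnsync subcommands) so they can be reviewed before anything is run.

```python
import shlex
from pathlib import Path


def mirror_commands(dump_file, repo_dir, source_url):
    """Return shell command lines for seeding a local mirror from a dump.

    All three arguments are supplied by the caller; this function does not
    touch the network or create any files, it only formats the commands.
    """
    repo = Path(repo_dir).absolute()
    dest_url = repo.as_uri()  # svnsync addresses the mirror by URL (file://...)
    hook = repo / "hooks" / "pre-revprop-change"
    q = shlex.quote
    return [
        # 1. Create an empty repository and load the monthly dump into it.
        f"svnadmin create {q(str(repo))}",
        f"svnadmin load {q(str(repo))} < {q(str(dump_file))}",
        # 2. svnsync sets revision properties, so this hook must permit that.
        f"printf '#!/bin/sh\\nexit 0\\n' > {q(str(hook))}",
        f"chmod +x {q(str(hook))}",
        # 3. Mark the non-empty mirror as a sync target, then fetch the rest.
        f"svnsync initialize --allow-non-empty {q(dest_url)} {q(source_url)}",
        f"svnsync synchronize {q(dest_url)}",
    ]


if __name__ == "__main__":
    # "asf.dump" and "/tmp/asf-mirror" are hypothetical names for illustration.
    for line in mirror_commands("asf.dump", "/tmp/asf-mirror",
                                "http://svn.apache.org/repos/asf/"):
        print(line)
```

Printing the commands instead of executing them keeps the sketch harmless; the final synchronize step is the only one that would pull revisions from the live server, and only those newer than the dump.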

I guess this might be obvious, but I’ll mention it just in case.  A lot of projects are using git repositories too, which are mirrored here: github.com/apache/



--
Cheers,
Tony

On behalf of the Apache Infrastructure Team

-----------------------
http://www.pc-tony.com
GPG - 3072D/2543E323
-----------------------

> On 3 Feb 2016, at 08:58, Karanjeet Singh <ka...@usc.edu> wrote:
> 
> Thanks Pierre for your feedback.
> 
> Yes, the visualization corresponds to only 133 / 191 SVN projects (
> http://svn.apache.org/repos/asf/). We have successfully audited close to
> 175 projects and hopefully by the end of this week all the remaining
> projects should be covered. We will update the data once done.
> 
> Large repositories like "subversion" and "camel", with 493,420 files
> (approx. 9,723 MB) and 519,584 files (approx. 1,922 MB) respectively, take
> no more than 36 hours to complete, which is quite a good number.
> 
> For your second question, I don't have an answer yet. Our intention is to
> update this regularly, but Wrangler does not allow us to run a job for more
> than 48 hours. Therefore, for very large repositories like openoffice,
> spamassassin, myfaces, etc., which take more time to audit, it will be a
> challenge to split the repositories every time and scan them.
> 
> Best Regards,
> Karanjeet Singh
> CS Graduate Student
> University of Southern California
> karanjes@usc.edu | +1-213-675-9583
> 
> 


Re: DRAT is now scanning Apache SVN code base!

Posted by Karanjeet Singh <ka...@usc.edu>.
Thanks Pierre for your feedback.

Yes, the visualization corresponds to only 133 / 191 SVN projects (
http://svn.apache.org/repos/asf/). We have successfully audited close to
175 projects and hopefully by the end of this week all the remaining
projects should be covered. We will update the data once done.

Large repositories like "subversion" and "camel", with 493,420 files
(approx. 9,723 MB) and 519,584 files (approx. 1,922 MB) respectively, take
no more than 36 hours to complete, which is quite a good number.

For your second question, I don't have an answer yet. Our intention is to
update this regularly, but Wrangler does not allow us to run a job for more
than 48 hours. Therefore, for very large repositories like openoffice,
spamassassin, myfaces, etc., which take more time to audit, it will be a
challenge to split the repositories every time and scan them.
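
One way to picture the splitting problem: estimate the scan time of each directory from its file count, then pack directories into jobs that stay under the wall-clock limit. The sketch below is a plain first-fit-decreasing packer and is not how DRAT actually partitions its work. The default throughput is only a loose estimate taken from the "subversion" figures above (493,420 files in up to 36 hours), and the function name and parameters are made up for illustration.

```python
FILES_PER_HOUR = 493_420 / 36   # rough rate implied by the "subversion" figures
JOB_LIMIT_HOURS = 48            # per-job limit on Wrangler mentioned above


def plan_jobs(file_counts, limit_hours=JOB_LIMIT_HOURS, rate=FILES_PER_HOUR):
    """Pack directories into jobs using first-fit decreasing.

    file_counts maps a directory name to its number of files. Returns a list
    of jobs, each a list of directory names whose estimated total scan time
    fits within limit_hours.
    """
    jobs = []  # each entry: [estimated_hours, [directory names]]
    largest_first = sorted(file_counts.items(), key=lambda kv: kv[1], reverse=True)
    for name, count in largest_first:
        hours = count / rate
        if hours > limit_hours:
            # A single directory that cannot fit must be split by the caller.
            raise ValueError(f"{name} alone needs ~{hours:.0f}h; split it further")
        for job in jobs:
            if job[0] + hours <= limit_hours:
                job[0] += hours
                job[1].append(name)
                break
        else:
            # No existing job has room, so open a new one.
            jobs.append([hours, [name]])
    return [names for _, names in jobs]
```

Sorting the largest directories first tends to leave fewer half-empty jobs than packing in arbitrary order, at the cost of needing file counts for every directory up front.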

Best Regards,
Karanjeet Singh
CS Graduate Student
University of Southern California
karanjes@usc.edu | +1-213-675-9583


On Wed, Feb 3, 2016 at 12:06 AM, Pierre Smits <pi...@gmail.com>
wrote:

> Hi Karanjeet,
>
> This is surely an impressive piece of work. But I still notice that some
> projects are missing in the overview. Is this a mere PoC not intended to be
> complete? Or something that will be made available to all and be updated
> regularly?
>
> Best regards,
>
> Pierre Smits
>
> ORRTIZ.COM <http://www.orrtiz.com>
> OFBiz based solutions & services
>
> OFBiz Extensions Marketplace
> http://oem.ofbizci.net/oci-2/
>

Re: DRAT is now scanning Apache SVN code base!

Posted by Pierre Smits <pi...@gmail.com>.
Hi Karanjeet,

This is surely an impressive piece of work. But I still notice that some
projects are missing in the overview. Is this a mere PoC not intended to be
complete? Or something that will be made available to all and be updated
regularly?

Best regards,

Pierre Smits

ORRTIZ.COM <http://www.orrtiz.com>
OFBiz based solutions & services

OFBiz Extensions Marketplace
http://oem.ofbizci.net/oci-2/

On Wed, Feb 3, 2016 at 2:39 AM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Karanjeet,
>
> A good bunch of work has already gone into this and it is looking really
> friggin smart indeed.
> Interesting to see so many pieces of software come together and result in
> something very easy to interpret.
> Good work.
> Lewis
>