
Posted to infrastructure-dev@apache.org by Steven Lloyd Wilson <sl...@wisc.edu> on 2013/10/22 04:03:16 UTC

Data question

Hello,

I first emailed the media contact with Apache and he recommended that I 
resend to this list, with a somewhat unorthodox request for 
information/direction.

I'm a PhD student writing a dissertation on the effects of the Internet 
on politics around the world. One of the variables that I'm looking at 
is how technically literate the populations of different countries are. 
The way I'm measuring this is through a variety of sources getting at 
open source downloads, usage of open source software, etc.

I've found some excellent information on Apache's site, including the 
map of where contributors are located, so I think that somewhere behind 
the scenes should be the specific data that I'm looking for: the number
of downloads for each project, number of contributors, number of mirrors,
etc. by year and country, since the start of the Apache Foundation.

Could you point me in the right direction on this matter?

Thanks!
Steven Wilson
PhD Candidate in Political Science
University of Wisconsin-Madison

Re: Data question

Posted by janI <ja...@apache.org>.

On 25 October 2013 05:59, Alex Harui <ah...@adobe.com> wrote:

> Sounds like an interesting challenge, but I still wonder what conclusions
> you can really draw.  If two PhD students happen to choose the same thesis
> topic and spend the same number of days writing their theses and somehow
> write the exact same thesis, but one saves drafts to a central server
> along the way while the other just keeps local backups and only delivers
> the final draft, was one really more active than the other?  The first one
> will show more "commits".
>

Exactly, that happened when I did the analysis on the AOO project. One of
the web translators showed up as the most active person, because every
single file was a commit, while the person who developed a very big
feature, at least 5 months of work offline, only had a single commit with
100 files or more.

In my opinion you cannot use these figures for this type of comparison. But
the idea of looking at the MLs is not bad; it will show a lot more about how
active the communities are. Cross-reference an extract from the MLs with
people.a.o and you have a "community activity level" for each country.
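
A minimal sketch of that cross-reference, in Python, might look like the
following. It assumes you have already downloaded a list archive in mbox
format and built your own address-to-country table from the public
people.apache.org data; the `country_of` mapping and the function names are
hypothetical, not something the ASF provides.

```python
import mailbox
from collections import Counter
from email.utils import parseaddr, parsedate_to_datetime


def sender_year(msg):
    """Return (lowercased sender address, year) for one message, or None."""
    addr = parseaddr(msg.get("From", ""))[1].lower()
    try:
        year = parsedate_to_datetime(msg.get("Date", "")).year
    except (TypeError, ValueError):
        # Missing or unparseable Date header: skip the message.
        return None
    return (addr, year) if addr else None


def activity_by_country(messages, country_of):
    """Count messages per (country, year); unknown senders are skipped.

    country_of is a hypothetical {address: country} table that you
    would have to build yourself from public committer data.
    """
    counts = Counter()
    for msg in messages:
        key = sender_year(msg)
        if key and key[0] in country_of:
            counts[(country_of[key[0]], key[1])] += 1
    return counts
```

With a local archive this could be called as
`activity_by_country(mailbox.mbox("dev.mbox"), country_of)`, where
`dev.mbox` is a placeholder file name. Message counts have the same
weakness as commit counts (one long mail versus ten short ones), so treat
the result as an indication only.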

rgds
jan I.



>
> -Alex
>
> On 10/24/13 8:51 PM, "Upayavira" <uv...@odoko.co.uk> wrote:
>
> >And, all commits go to publicly archived commit lists, no? Meaning if
> >you engage with mailing lists, you get access to commit activity for
> >free.
> >
> >The only missing part though is geography, knowing where in the world a
> >committer is, and with the advent of powerful web mail services, this
> >gets harder still, unless the committer publishes a FOAF file (I think
> >that's what it is called?)
> >
> >Upayavira
> >
> >On Thu, Oct 24, 2013, at 08:44 PM, janI wrote:
> >> On 24 October 2013 20:24, Santiago Gala <sa...@gmail.com>
> wrote:
> >>
> >> > On Thu, Oct 24, 2013 at 7:35 PM, Steven Lloyd Wilson
> >><slwilson4@wisc.edu
> >> > >wrote:
> >> >
> >> > > Hi Jan,
> >> > >
> >> > > My thought was that #commits would give an idea of how active the
> >> > > developers in that country were, in order to distinguish between a
> >> > country
> >> > > with a handful of developers that periodically commit, and a country
> >> > with a
> >> > > handful of developers that happen to be extraordinarily active ones.
> >> > >
> >> >
> >>
> >> As I tried to explain, you will get data that cannot be compared
> >> statistically. But of course it still gives an indication of activity
> >> level.
> >>
> >> These data are not back-end data, the apache project repos are publicly
> >> available, making it is possible for you to extract the repo log data.
> >> You
> >> need to cross reference the log data with data from people.apache.org.
> >>
> >>
> >> > >
> >> > Note that:
> >> > * the commits are typically surrounded by technical discussion in the
> >>devel
> >> > list
> >> > * for each commit an email is sent to a public list.
> >> >
> >> > You can reasonably infer the numbers you are looking for just using
> >>the
> >> > public email archive plus an analysis of email aliases and domains...
> >> >
> >> > This is the approach I decided to take in my Master Thesis, mostly to
> >>avoid
> >> > depending of the effort of other people...
> >> >
> >>
> >> This is a very good approach, when looking for activity levels, because
> >> it
> >> includes QA, documentation and all the items around the programming.
> >>
> >> Just to be clear, to my best knowledge, we dont have much better data
> >> internally, and as PCtony wrote, we would need extremely good reasons to
> >> provide information, which committers have chosen not to make publicly
> >> available.
> >>
> >> rgds
> >> jan I.
> >>
> >>
> >> >
> >> > Regards
> >> > Santiago
> >> >
> >> >
> >> > > What I'm trying to measure is the technical capability of the
> >>population,
> >> > > using open source activity (both in terms of development and use)
> >>as a
> >> > > proxy variable. I'm definitely open to suggestions of data that is
> >> > > available on the backend that might work better for this, but as
> >>Tony
> >> > > suggested, I made my best stab at what I thought would be good
> >>measures,
> >> > > and something that is likely to exist in your data.
> >> > >
> >> > > Best,
> >> > > Steven
> >> > >
> >> > >
> >> > >
> >> > > >On 23 October 2013 03:14, Steven Lloyd Wilson <sl...@wisc.edu>
> >> > wrote:
> >> > > >
> >> > > >> Thanks for the quick reply Tony.
> >> > > >>
> >> > > >> I certainly appreciate the need for keeping the personal data
> >>private,
> >> > > and
> >> > > >> have no interest in collecting data at that level. I'm looking
> >>for
> >> > > country
> >> > > >> level data, ideally year-by-year.
> >> > > >>
> >> > > >> So as a starting point, an output like this would be ideal:
> >>country,
> >> > > year,
> >> > > >> # of developers, # of total commits. For example:
> >> > > >> Mexico, 2009, 105, 5213
> >> > > >> Mexico, 2010, 117, 5598
> >> > > >>
> >> > > >
> >> > > >HI
> >> > > >just out of curiosity, why do you think #commits is a significant
> >>value
> >> > ?
> >> > > >
> >> > > >I tried to make a "top 10", for one of the bigger projects
> >> > > >(ApacheOpenOffice), and it turned out that a couple of the most
> >>active
> >> > > >active committers  did not even reach "top 10". Reason was that
> >>these
> >> > > >committers had few commits, but each commit contained with a lot of
> >> > files,
> >> > > >where as some of the web committers tended to do a commit for every
> >> > file.
> >> > > >Btw extracting the data from svn was very network demanding.
> >> > > >
> >> > > >rgds
> >> > > >jan I.
> >> > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > >> et cetera.
> >> > > >>
> >> > > >> Would that be something that would be easily extractable from the
> >> > > backend?
> >> > > >>
> >> > > >> Steven
> >> > > >>
> >> > > >>
> >> > > >> On 10/22/2013 02:13 AM, Tony Stevenson wrote:
> >> > > >>
> >> > > >>> Steven,
> >> > > >>>
> >> > > >>> Some of this information won't be so easy to get.  For example
> >>we
> >> > > cannot
> >> > > >>> tell you how many downloads each project has had, as almost all
> >>of
> >> > that
> >> > > >>> data is held locally by the mirrors and we don't currently
> >>collect
> >> > it.
> >> > > >>>
> >> > > >>> Other data is a little easier to collect, but I'm afraid some
> >>of it
> >> > is
> >> > > >>> likely considered personal data, so we'd almost certainly not
> >>release
> >> > > it to
> >> > > >>> a 3rd party. This is mostly because the data you have already
> >>found
> >> > is
> >> > > >>> constructed from other data, which is interspersed with some
> >>personal
> >> > > data.
> >> > > >>> A lot of the data will unfortunately be stored within the SVN
> >> > history.
> >> > > >>> However I suspect a lot of it will not be contained within the
> >>public
> >> > > repo,
> >> > > >>> though clearly some of it will be.
> >> > > >>>
> >> > > >>> The only way I can be more helpful to you, I think, is to ask
> >>you to
> >> > > give
> >> > > >>> us some specific requests for data and I can let you know if we
> >>can
> >> > > either
> >> > > >>> get that data, and if we are able to distribute it.  I realise
> >>this
> >> > > may not
> >> > > >>> be as helpful as you want, but we are prudent about releasing
> >>data.
> >> > > >>>
> >> > > >>>
> >> > > >>>
> >> > > >>>
> >> > > >>> On 22 Oct 2013, at 03:03, Steven Lloyd Wilson
> >><sl...@wisc.edu>
> >> > > wrote:
> >> > > >>>
> >> > > >>>  Hello,
> >> > > >>>>
> >> > > >>>> I first emailed the media contact with Apache and he
> >>recommended
> >> > that
> >> > > I
> >> > > >>>> resend to this list, with a somewhat unorthodox request for
> >> > > >>>> information/direction.
> >> > > >>>>
> >> > > >>>> I'm a PhD student writing a dissertation on the effects of the
> >> > > Internet
> >> > > >>>> on politics around the world. One of the variables that I'm
> >>looking
> >> > > at is
> >> > > >>>> how technically literate the populations of different
> >>countries are.
> >> > > The
> >> > > >>>> way I'm measuring this is through a variety of sources getting
> >>at
> >> > open
> >> > > >>>> source downloads, usage of open source software, etc.
> >> > > >>>>
> >> > > >>>> I've found some excellent information on Apache's site,
> >>including
> >> > the
> >> > > >>>> map of where contributors are located, so I think that
> >>somewhere
> >> > > behind the
> >> > > >>>> scenes should be the specific data that I'm looking for: the
> >>numbers
> >> > > of
> >> > > >>>> download for each project, number of contributors, number of
> >> > mirrors,
> >> > > etc.
> >> > > >>>> by year and country, since the start of the Apache Foundation.
> >> > > >>>>
> >> > > >>>> Could you point me in the right direction on this matter?
> >> > > >>>>
> >> > > >>>> Thanks!
> >> > > >>>> Steven Wilson
> >> > > >>>> PhD Candidate in Political Science
> >> > > >>>> University of Wisconsin-Madison
> >> > > >>>>
> >> > > >>>
> >> > > >>> Cheers,
> >> > > >>> Tony
> >> > > >>>
> >> > > >>> ------------------------------****----
> >> > >
> >> > > >>> Tony Stevenson
> >> > > >>>
> >> > > >>> tony@pc-tony.com
> >> > > >>> pctony@apache.org
> >> > > >>>
> >> > > >>> http://www.pc-tony.com
> >> > > >>>
> >> > > >>> GPG - 1024D/51047D66
> >> > > >>> ------------------------------****----
> >> > > >>>
> >> > > >>>
> >> > > >>>
> >> > > >>>
> >> > > >>>
> >> > > >>>
> >> > > >>>
> >> > > >>>
> >> > > >>
> >> > >
> >> >
>
>

Re: Data question

Posted by Tony Stevenson <to...@pc-tony.com>.

On 25 Oct 2013, at 07:32, janI <ja...@apache.org> wrote:

> On 25 October 2013 05:51, Upayavira <uv...@odoko.co.uk> wrote:
> 
>> And, all commits go to publicly archived commit lists, no? Meaning if
>> you engage with mailing lists, you get access to commit activity for
>> free.
>> 
> yes, thats the (for our bandwidth) free way of doing this, another is to
> "svn log" with a couple of options on the top of the repos.

Please don’t ever do this. Take the dump from http://svn-master.apache.org/dump/ and use that locally; otherwise you will add unnecessary load on eris/harmonia.

Jan, please refrain from telling people who are mining for data to use the live host; the dump file is there for a reason (so that we don’t have to sustain high I/O on the live nodes).
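
As a rough sketch of the local approach: assuming the dump has been loaded
into a repository on your own machine (for example with `svnadmin load`)
and the history exported with `svn log --xml -v` against that local copy,
something like the Python below tallies both commits and changed paths per
author. The function name is made up for illustration, and it only relies
on the `logentry`, `author` and `paths/path` elements of the XML log
output.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def commits_and_paths(svn_log_xml):
    """Map author -> [number of commits, number of changed paths]."""
    totals = defaultdict(lambda: [0, 0])
    for entry in ET.fromstring(svn_log_xml).iter("logentry"):
        # Some revisions carry no author element at all.
        author = entry.findtext("author", default="(no author)")
        totals[author][0] += 1
        totals[author][1] += len(entry.findall("paths/path"))
    return dict(totals)
```

Reporting both numbers side by side makes the effect discussed in this
thread visible: a committer with one commit touching a hundred files and a
committer with a hundred one-file commits end up with very different commit
counts but similar path counts.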


>> 
>> The only missing part though is geography, knowing where in the world a
>> committer is, and with the advent of powerful web mail services, this
>> gets harder still, unless the committer publishes a FOAF file (I think
>> that's what it is called?)
>> 
> 
> thats the name, and part of it (the part you need) is available on
> people.a.o, f.x. take the google map (google produces a file with all the
> markers) or go through http://people.apache.org/committers.html
> 
> both ways will work, but you need to do the data mining, the data wont be
> served on a silver plate :-)
> 
> rgds
> jan I
> 
>> 
>> Upayavira
>> 
>> On Thu, Oct 24, 2013, at 08:44 PM, janI wrote:
>>> On 24 October 2013 20:24, Santiago Gala <sa...@gmail.com> wrote:
>>> 
>>>> On Thu, Oct 24, 2013 at 7:35 PM, Steven Lloyd Wilson <
>> slwilson4@wisc.edu
>>>>> wrote:
>>>> 
>>>>> Hi Jan,
>>>>> 
>>>>> My thought was that #commits would give an idea of how active the
>>>>> developers in that country were, in order to distinguish between a
>>>> country
>>>>> with a handful of developers that periodically commit, and a country
>>>> with a
>>>>> handful of developers that happen to be extraordinarily active ones.
>>>>> 
>>>> 
>>> 
>>> As I tried to explain, you will get data that cannot be compared
>>> statistically. But of course it still gives an indication of activity
>>> level.
>>> 
>>> These data are not back-end data, the apache project repos are publicly
>>> available, making it is possible for you to extract the repo log data.
>>> You
>>> need to cross reference the log data with data from people.apache.org.
>>> 
>>> 
>>>>> 
>>>> Note that:
>>>> * the commits are typically surrounded by technical discussion in the
>> devel
>>>> list
>>>> * for each commit an email is sent to a public list.
>>>> 
>>>> You can reasonably infer the numbers you are looking for just using the
>>>> public email archive plus an analysis of email aliases and domains...
>>>> 
>>>> This is the approach I decided to take in my Master Thesis, mostly to
>> avoid
>>>> depending of the effort of other people...
>>>> 
>>> 
>>> This is a very good approach, when looking for activity levels, because
>>> it
>>> includes QA, documentation and all the items around the programming.
>>> 
>>> Just to be clear, to my best knowledge, we dont have much better data
>>> internally, and as PCtony wrote, we would need extremely good reasons to
>>> provide information, which committers have chosen not to make publicly
>>> available.
>>> 
>>> rgds
>>> jan I.
>>> 
>>> 
>>>> 
>>>> Regards
>>>> Santiago
>>>> 
>>>> 
>>>>> What I'm trying to measure is the technical capability of the
>> population,
>>>>> using open source activity (both in terms of development and use) as
>> a
>>>>> proxy variable. I'm definitely open to suggestions of data that is
>>>>> available on the backend that might work better for this, but as Tony
>>>>> suggested, I made my best stab at what I thought would be good
>> measures,
>>>>> and something that is likely to exist in your data.
>>>>> 
>>>>> Best,
>>>>> Steven
>>>>> 
>>>>> 
>>>>> 
>>>>>> On 23 October 2013 03:14, Steven Lloyd Wilson <sl...@wisc.edu>
>>>> wrote:
>>>>>> 
>>>>>>> Thanks for the quick reply Tony.
>>>>>>> 
>>>>>>> I certainly appreciate the need for keeping the personal data
>> private,
>>>>> and
>>>>>>> have no interest in collecting data at that level. I'm looking for
>>>>> country
>>>>>>> level data, ideally year-by-year.
>>>>>>> 
>>>>>>> So as a starting point, an output like this would be ideal:
>> country,
>>>>> year,
>>>>>>> # of developers, # of total commits. For example:
>>>>>>> Mexico, 2009, 105, 5213
>>>>>>> Mexico, 2010, 117, 5598
>>>>>>> 
>>>>>> 
>>>>>> HI
>>>>>> just out of curiosity, why do you think #commits is a significant
>> value
>>>> ?
>>>>>> 
>>>>>> I tried to make a "top 10", for one of the bigger projects
>>>>>> (ApacheOpenOffice), and it turned out that a couple of the most
>> active
>>>>>> active committers  did not even reach "top 10". Reason was that
>> these
>>>>>> committers had few commits, but each commit contained with a lot of
>>>> files,
>>>>>> where as some of the web committers tended to do a commit for every
>>>> file.
>>>>>> Btw extracting the data from svn was very network demanding.
>>>>>> 
>>>>>> rgds
>>>>>> jan I.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> et cetera.
>>>>>>> 
>>>>>>> Would that be something that would be easily extractable from the
>>>>> backend?
>>>>>>> 
>>>>>>> Steven
>>>>>>> 
>>>>>>> 
>>>>>>> On 10/22/2013 02:13 AM, Tony Stevenson wrote:
>>>>>>> 
>>>>>>>> Steven,
>>>>>>>> 
>>>>>>>> Some of this information won't be so easy to get.  For example we
>>>>> cannot
>>>>>>>> tell you how many downloads each project has had, as almost all
>> of
>>>> that
>>>>>>>> data is held locally by the mirrors and we don't currently
>> collect
>>>> it.
>>>>>>>> 
>>>>>>>> Other data is a little easier to collect, but I'm afraid some of
>> it
>>>> is
>>>>>>>> likely considered personal data, so we'd almost certainly not
>> release
>>>>> it to
>>>>>>>> a 3rd party. This is mostly because the data you have already
>> found
>>>> is
>>>>>>>> constructed from other data, which is interspersed with some
>> personal
>>>>> data.
>>>>>>>> A lot of the data will unfortunately be stored within the SVN
>>>> history.
>>>>>>>> However I suspect a lot of it will not be contained within the
>> public
>>>>> repo,
>>>>>>>> though clearly some of it will be.
>>>>>>>> 
>>>>>>>> The only way I can be more helpful to you, I think, is to ask
>> you to
>>>>> give
>>>>>>>> us some specific requests for data and I can let you know if we
>> can
>>>>> either
>>>>>>>> get that data, and if we are able to distribute it.  I realise
>> this
>>>>> may not
>>>>>>>> be as helpful as you want, but we are prudent about releasing
>> data.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 22 Oct 2013, at 03:03, Steven Lloyd Wilson <
>> slwilson4@wisc.edu>
>>>>> wrote:
>>>>>>>> 
>>>>>>>> Hello,
>>>>>>>>> 
>>>>>>>>> I first emailed the media contact with Apache and he recommended
>>>> that
>>>>> I
>>>>>>>>> resend to this list, with a somewhat unorthodox request for
>>>>>>>>> information/direction.
>>>>>>>>> 
>>>>>>>>> I'm a PhD student writing a dissertation on the effects of the
>>>>> Internet
>>>>>>>>> on politics around the world. One of the variables that I'm
>> looking
>>>>> at is
>>>>>>>>> how technically literate the populations of different countries
>> are.
>>>>> The
>>>>>>>>> way I'm measuring this is through a variety of sources getting
>> at
>>>> open
>>>>>>>>> source downloads, usage of open source software, etc.
>>>>>>>>> 
>>>>>>>>> I've found some excellent information on Apache's site,
>> including
>>>> the
>>>>>>>>> map of where contributors are located, so I think that somewhere
>>>>> behind the
>>>>>>>>> scenes should be the specific data that I'm looking for: the
>> numbers
>>>>> of
>>>>>>>>> download for each project, number of contributors, number of
>>>> mirrors,
>>>>> etc.
>>>>>>>>> by year and country, since the start of the Apache Foundation.
>>>>>>>>> 
>>>>>>>>> Could you point me in the right direction on this matter?
>>>>>>>>> 
>>>>>>>>> Thanks!
>>>>>>>>> Steven Wilson
>>>>>>>>> PhD Candidate in Political Science
>>>>>>>>> University of Wisconsin-Madison
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Tony
>>>>>>>> 
>>>>>>>> ------------------------------****----
>>>>> 
>>>>>>>> Tony Stevenson
>>>>>>>> 
>>>>>>>> tony@pc-tony.com
>>>>>>>> pctony@apache.org
>>>>>>>> 
>>>>>>>> http://www.pc-tony.com
>>>>>>>> 
>>>>>>>> GPG - 1024D/51047D66
>>>>>>>> ------------------------------****----
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> 


Cheers,
Tony

----------------------------------
Tony Stevenson

tony@pc-tony.com
pctony@apache.org

http://www.pc-tony.com

GPG - 1024D/51047D66
----------------------------------

Re: Data question

Posted by janI <ja...@apache.org>.

On 25 October 2013 05:51, Upayavira <uv...@odoko.co.uk> wrote:

> And, all commits go to publicly archived commit lists, no? Meaning if
> you engage with mailing lists, you get access to commit activity for
> free.
>
Yes, that's the free way (for our bandwidth) of doing this; another is to
run "svn log" with a couple of options at the top of the repos.


>
> The only missing part though is geography, knowing where in the world a
> committer is, and with the advent of powerful web mail services, this
> gets harder still, unless the committer publishes a FOAF file (I think
> that's what it is called?)
>

That's the name, and part of it (the part you need) is available on
people.a.o; for example, take the Google map (Google produces a file with
all the markers) or go through http://people.apache.org/committers.html

Both ways will work, but you need to do the data mining; the data won't be
served on a silver platter :-)
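
For the FOAF route, a hedged sketch in Python: it assumes a committer's
FOAF file is RDF/XML that records the location as a `geo:Point` in the W3C
WGS84 vocabulary, with `lat`/`long` given either as attributes or as child
elements. Whether a particular committer's file actually contains such a
point is not guaranteed, and the function name is only illustrative.

```python
import xml.etree.ElementTree as ET

FOAF = "http://xmlns.com/foaf/0.1/"
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"


def foaf_location(foaf_xml):
    """Return (name, lat, long) from a FOAF document, or None if no point."""
    root = ET.fromstring(foaf_xml)
    name = root.findtext(f".//{{{FOAF}}}name")
    point = root.find(f".//{{{GEO}}}Point")
    if point is None:
        return None

    def coord(tag):
        # RDF/XML allows a property as an attribute or as a child element.
        value = point.get(f"{{{GEO}}}{tag}") or point.findtext(f"{{{GEO}}}{tag}")
        return float(value) if value else None

    return name, coord("lat"), coord("long")
```

This only yields coordinates; turning them into a country still needs a
separate reverse-geocoding step, which is left out here.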

rgds
jan I

>
> Upayavira
>
> On Thu, Oct 24, 2013, at 08:44 PM, janI wrote:
> > On 24 October 2013 20:24, Santiago Gala <sa...@gmail.com> wrote:
> >
> > > On Thu, Oct 24, 2013 at 7:35 PM, Steven Lloyd Wilson <
> slwilson4@wisc.edu
> > > >wrote:
> > >
> > > > Hi Jan,
> > > >
> > > > My thought was that #commits would give an idea of how active the
> > > > developers in that country were, in order to distinguish between a
> > > country
> > > > with a handful of developers that periodically commit, and a country
> > > with a
> > > > handful of developers that happen to be extraordinarily active ones.
> > > >
> > >
> >
> > As I tried to explain, you will get data that cannot be compared
> > statistically. But of course it still gives an indication of activity
> > level.
> >
> > These data are not back-end data, the apache project repos are publicly
> > available, making it is possible for you to extract the repo log data.
> > You
> > need to cross reference the log data with data from people.apache.org.
> >
> >
> > > >
> > > Note that:
> > > * the commits are typically surrounded by technical discussion in the
> devel
> > > list
> > > * for each commit an email is sent to a public list.
> > >
> > > You can reasonably infer the numbers you are looking for just using the
> > > public email archive plus an analysis of email aliases and domains...
> > >
> > > This is the approach I decided to take in my Master Thesis, mostly to
> avoid
> > > depending of the effort of other people...
> > >
> >
> > This is a very good approach, when looking for activity levels, because
> > it
> > includes QA, documentation and all the items around the programming.
> >
> > Just to be clear, to my best knowledge, we dont have much better data
> > internally, and as PCtony wrote, we would need extremely good reasons to
> > provide information, which committers have chosen not to make publicly
> > available.
> >
> > rgds
> > jan I.
> >
> >
> > >
> > > Regards
> > > Santiago
> > >
> > >
> > > > What I'm trying to measure is the technical capability of the
> population,
> > > > using open source activity (both in terms of development and use) as
> a
> > > > proxy variable. I'm definitely open to suggestions of data that is
> > > > available on the backend that might work better for this, but as Tony
> > > > suggested, I made my best stab at what I thought would be good
> measures,
> > > > and something that is likely to exist in your data.
> > > >
> > > > Best,
> > > > Steven
> > > >
> > > >
> > > >
> > > > >On 23 October 2013 03:14, Steven Lloyd Wilson <sl...@wisc.edu>
> > > wrote:
> > > > >
> > > > >> Thanks for the quick reply Tony.
> > > > >>
> > > > >> I certainly appreciate the need for keeping the personal data
> private,
> > > > and
> > > > >> have no interest in collecting data at that level. I'm looking for
> > > > country
> > > > >> level data, ideally year-by-year.
> > > > >>
> > > > >> So as a starting point, an output like this would be ideal:
> country,
> > > > year,
> > > > >> # of developers, # of total commits. For example:
> > > > >> Mexico, 2009, 105, 5213
> > > > >> Mexico, 2010, 117, 5598
> > > > >>
> > > > >
> > > > >HI
> > > > >just out of curiosity, why do you think #commits is a significant
> value
> > > ?
> > > > >
> > > > >I tried to make a "top 10", for one of the bigger projects
> > > > >(ApacheOpenOffice), and it turned out that a couple of the most
> active
> > > > >active committers  did not even reach "top 10". Reason was that
> these
> > > > >committers had few commits, but each commit contained with a lot of
> > > files,
> > > > >where as some of the web committers tended to do a commit for every
> > > file.
> > > > >Btw extracting the data from svn was very network demanding.
> > > > >
> > > > >rgds
> > > > >jan I.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >> et cetera.
> > > > >>
> > > > >> Would that be something that would be easily extractable from the
> > > > backend?
> > > > >>
> > > > >> Steven
> > > > >>
> > > > >>
> > > > >> On 10/22/2013 02:13 AM, Tony Stevenson wrote:
> > > > >>
> > > > >>> Steven,
> > > > >>>
> > > > >>> Some of this information won't be so easy to get.  For example we
> > > > cannot
> > > > >>> tell you how many downloads each project has had, as almost all
> of
> > > that
> > > > >>> data is held locally by the mirrors and we don't currently
> collect
> > > it.
> > > > >>>
> > > > >>> Other data is a little easier to collect, but I'm afraid some of
> it
> > > is
> > > > >>> likely considered personal data, so we'd almost certainly not
> release
> > > > it to
> > > > >>> a 3rd party. This is mostly because the data you have already
> found
> > > is
> > > > >>> constructed from other data, which is interspersed with some
> personal
> > > > data.
> > > > >>> A lot of the data will unfortunately be stored within the SVN
> > > history.
> > > > >>> However I suspect a lot of it will not be contained within the
> public
> > > > repo,
> > > > >>> though clearly some of it will be.
> > > > >>>
> > > > >>> The only way I can be more helpful to you, I think, is to ask
> you to
> > > > give
> > > > >>> us some specific requests for data and I can let you know if we
> can
> > > > either
> > > > >>> get that data, and if we are able to distribute it.  I realise
> this
> > > > may not
> > > > >>> be as helpful as you want, but we are prudent about releasing
> data.
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> On 22 Oct 2013, at 03:03, Steven Lloyd Wilson <
> slwilson4@wisc.edu>
> > > > wrote:
> > > > >>>
> > > > >>>  Hello,
> > > > >>>>
> > > > >>>> I first emailed the media contact with Apache and he recommended
> > > that
> > > > I
> > > > >>>> resend to this list, with a somewhat unorthodox request for
> > > > >>>> information/direction.
> > > > >>>>
> > > > >>>> I'm a PhD student writing a dissertation on the effects of the
> > > > Internet
> > > > >>>> on politics around the world. One of the variables that I'm
> looking
> > > > at is
> > > > >>>> how technically literate the populations of different countries
> are.
> > > > The
> > > > >>>> way I'm measuring this is through a variety of sources getting
> at
> > > open
> > > > >>>> source downloads, usage of open source software, etc.
> > > > >>>>
> > > > >>>> I've found some excellent information on Apache's site,
> including
> > > the
> > > > >>>> map of where contributors are located, so I think that somewhere
> > > > behind the
> > > > >>>> scenes should be the specific data that I'm looking for: the
> numbers
> > > > of
> > > > >>>> download for each project, number of contributors, number of
> > > mirrors,
> > > > etc.
> > > > >>>> by year and country, since the start of the Apache Foundation.
> > > > >>>>
> > > > >>>> Could you point me in the right direction on this matter?
> > > > >>>>
> > > > >>>> Thanks!
> > > > >>>> Steven Wilson
> > > > >>>> PhD Candidate in Political Science
> > > > >>>> University of Wisconsin-Madison
> > > > >>>>
> > > > >>>
> > > > >>> Cheers,
> > > > >>> Tony
> > > > >>>
> > > > >>> ------------------------------****----
> > > >
> > > > >>> Tony Stevenson
> > > > >>>
> > > > >>> tony@pc-tony.com
> > > > >>> pctony@apache.org
> > > > >>>
> > > > >>> http://www.pc-tony.com
> > > > >>>
> > > > >>> GPG - 1024D/51047D66
> > > > >>> ------------------------------****----
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>
> > > >
> > >
>

Re: Data question

Posted by Alex Harui <ah...@adobe.com>.

Sounds like an interesting challenge, but I still wonder what conclusions
you can really draw.  If two PhD students happen to choose the same thesis
topic and spend the same number of days writing their theses and somehow
write the exact same thesis, but one saves drafts to a central server
along the way while the other just keeps local backups and only delivers
the final draft, was one really more active than the other?  The first one
will show more "commits".

-Alex

On 10/24/13 8:51 PM, "Upayavira" <uv...@odoko.co.uk> wrote:

>And, all commits go to publicly archived commit lists, no? Meaning if
>you engage with mailing lists, you get access to commit activity for
>free.
>
>The only missing part though is geography, knowing where in the world a
>committer is, and with the advent of powerful web mail services, this
>gets harder still, unless the committer publishes a FOAF file (I think
>that's what it is called?)
>
>Upayavira
>
>On Thu, Oct 24, 2013, at 08:44 PM, janI wrote:
>> On 24 October 2013 20:24, Santiago Gala <sa...@gmail.com> wrote:
>> 
>> > On Thu, Oct 24, 2013 at 7:35 PM, Steven Lloyd Wilson
>><slwilson4@wisc.edu
>> > >wrote:
>> >
>> > > Hi Jan,
>> > >
>> > > My thought was that #commits would give an idea of how active the
>> > > developers in that country were, in order to distinguish between a
>> > country
>> > > with a handful of developers that periodically commit, and a country
>> > with a
>> > > handful of developers that happen to be extraordinarily active ones.
>> > >
>> >
>> 
>> As I tried to explain, you will get data that cannot be compared
>> statistically. But of course it still gives an indication of activity
>> level.
>> 
>> These data are not back-end data, the apache project repos are publicly
>> available, making it is possible for you to extract the repo log data.
>> You
>> need to cross reference the log data with data from people.apache.org.
>> 
>> 
>> > >
>> > Note that:
>> > * the commits are typically surrounded by technical discussion in the
>>devel
>> > list
>> > * for each commit an email is sent to a public list.
>> >
>> > You can reasonably infer the numbers you are looking for just using
>>the
>> > public email archive plus an analysis of email aliases and domains...
>> >
>> > This is the approach I decided to take in my Master Thesis, mostly to
>>avoid
>> > depending of the effort of other people...
>> >
>> 
>> This is a very good approach, when looking for activity levels, because
>> it
>> includes QA, documentation and all the items around the programming.
>> 
>> Just to be clear, to my best knowledge, we dont have much better data
>> internally, and as PCtony wrote, we would need extremely good reasons to
>> provide information, which committers have chosen not to make publicly
>> available.
>> 
>> rgds
>> jan I.

Re: Data question

Posted by Upayavira <uv...@odoko.co.uk>.

And, all commits go to publicly archived commit lists, no? Meaning if
you engage with mailing lists, you get access to commit activity for
free.

The only missing part though is geography, knowing where in the world a
committer is, and with the advent of powerful web mail services, this
gets harder still, unless the committer publishes a FOAF file (I think
that's what it is called?)
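
A minimal sketch of the commit-list idea, assuming the notification mails
have already been downloaded as raw message text. The function name and the
sample addresses are hypothetical; only Python standard-library calls are
used, and nothing here is an official Apache tool.

```python
import email
import email.utils
from collections import Counter


def commits_per_sender_per_year(raw_messages):
    """Count commit-notification mails by (sender address, year).

    raw_messages: iterable of raw message strings, e.g. taken from a
    downloaded archive of a project's commits list (an assumption).
    """
    counts = Counter()
    for raw in raw_messages:
        msg = email.message_from_string(raw)
        _, address = email.utils.parseaddr(msg.get("From", ""))
        date_header = msg.get("Date")
        if not address or not date_header:
            continue  # skip mails without usable From/Date headers
        try:
            sent = email.utils.parsedate_to_datetime(date_header)
        except (TypeError, ValueError):
            continue  # skip mails whose Date header cannot be parsed
        counts[(address.lower(), sent.year)] += 1
    return counts
```

This only yields activity per address and year. As noted above, it says
nothing about geography, which has to come from somewhere else.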

Upayavira

On Thu, Oct 24, 2013, at 08:44 PM, janI wrote:
> On 24 October 2013 20:24, Santiago Gala <sa...@gmail.com> wrote:
> 
> > On Thu, Oct 24, 2013 at 7:35 PM, Steven Lloyd Wilson <slwilson4@wisc.edu>
> > wrote:
> >
> > > Hi Jan,
> > >
> > > My thought was that #commits would give an idea of how active the
> > > developers in that country were, in order to distinguish between a
> > > country with a handful of developers that periodically commit, and a
> > > country with a handful of developers that happen to be extraordinarily
> > > active ones.
> > >
> >
> 
> As I tried to explain, you will get data that cannot be compared
> statistically. But of course it still gives an indication of activity
> level.
> 
> These data are not back-end data; the Apache project repos are publicly
> available, making it possible for you to extract the repo log data. You
> need to cross-reference the log data with data from people.apache.org.
> 
> 
> > >
> > Note that:
> > * the commits are typically surrounded by technical discussion in the devel
> > list
> > * for each commit an email is sent to a public list.
> >
> > You can reasonably infer the numbers you are looking for just using the
> > public email archive plus an analysis of email aliases and domains...
> >
> > This is the approach I decided to take in my Master's thesis, mostly to
> > avoid depending on the effort of other people...
> >
>
> This is a very good approach when looking for activity levels, because it
> includes QA, documentation and all the items around the programming.
>
> Just to be clear, to the best of my knowledge, we don't have much better data
> internally, and as PCtony wrote, we would need extremely good reasons to
> provide information that committers have chosen not to make publicly
> available.
> 
> rgds
> jan I.

Re: Data question

Posted by janI <ja...@apache.org>.

On 24 October 2013 20:24, Santiago Gala <sa...@gmail.com> wrote:

> On Thu, Oct 24, 2013 at 7:35 PM, Steven Lloyd Wilson <slwilson4@wisc.edu>
> wrote:
>
> > Hi Jan,
> >
> > My thought was that #commits would give an idea of how active the
> > developers in that country were, in order to distinguish between a country
> > with a handful of developers that periodically commit, and a country with a
> > handful of developers that happen to be extraordinarily active ones.
> >
>

As I tried to explain, you will get data that cannot be compared
statistically. But of course it still gives an indication of activity level.

These data are not back-end data; the Apache project repos are publicly
available, making it possible for you to extract the repo log data. You
need to cross-reference the log data with data from people.apache.org.
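
A rough sketch of that cross-referencing step, assuming the log data has
already been reduced to one (committer id, year) pair per commit and that a
committer-to-country mapping has been built by hand from public profile
data. Both inputs and the function name are hypothetical. The output rows
follow the "country, year, # of developers, # of total commits" layout asked
for earlier in the thread.

```python
from collections import defaultdict


def country_year_table(commit_records, committer_country):
    """Aggregate commits into (country, year, developers, commits) rows.

    commit_records: iterable of (committer_id, year), one pair per commit.
    committer_country: dict mapping committer_id -> country name
        (hypothetical; it would have to be assembled from public data).
    Committers without a known country are grouped under "unknown".
    """
    developers = defaultdict(set)
    commits = defaultdict(int)
    for committer_id, year in commit_records:
        country = committer_country.get(committer_id, "unknown")
        key = (country, year)
        developers[key].add(committer_id)
        commits[key] += 1
    return [
        (country, year, len(developers[(country, year)]), commits[(country, year)])
        for country, year in sorted(commits)
    ]
```

Keeping an explicit "unknown" bucket makes it visible how much of the
activity cannot be placed geographically, instead of silently dropping it.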


> >
> Note that:
> * the commits are typically surrounded by technical discussion in the devel
> list
> * for each commit an email is sent to a public list.
>
> You can reasonably infer the numbers you are looking for just using the
> public email archive plus an analysis of email aliases and domains...
>
> This is the approach I decided to take in my Master's thesis, mostly to avoid
> depending on the effort of other people...
>

This is a very good approach when looking for activity levels, because it
includes QA, documentation and all the items around the programming.

Just to be clear, to the best of my knowledge, we don't have much better data
internally, and as PCtony wrote, we would need extremely good reasons to
provide information that committers have chosen not to make publicly
available.

rgds
jan I.



Re: Data question

Posted by Santiago Gala <sa...@gmail.com>.

On Thu, Oct 24, 2013 at 7:35 PM, Steven Lloyd Wilson <sl...@wisc.edu> wrote:

> Hi Jan,
>
> My thought was that #commits would give an idea of how active the
> developers in that country were, in order to distinguish between a country
> with a handful of developers that periodically commit, and a country with a
> handful of developers that happen to be extraordinarily active ones.
>
>
Note that:
* the commits are typically surrounded by technical discussion in the devel
list
* for each commit an email is sent to a public list.

You can reasonably infer the numbers you are looking for just using the
public email archive plus an analysis of email aliases and domains...

This is the approach I decided to take in my Master's thesis, mostly to avoid
depending on the effort of other people...
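
As an illustration only, the domain part of that analysis could start from
something like the sketch below. It assumes that a two-letter top-level
domain is a usable country hint. That is a weak heuristic: addresses at
gmail.com or apache.org carry no hint at all. The function name and the
small exclusion set are hypothetical choices, not taken from the thesis.

```python
# Two-letter TLDs that are widely used without any tie to the country
# (an illustrative, incomplete list).
GENERICALLY_USED = {"io", "co", "me", "tv"}


def country_hint_from_address(address):
    """Return the country-code TLD of an email address, or None.

    None means the address gives no geographic hint (generic TLD,
    missing "@", or a ccTLD that is commonly used generically).
    """
    _, separator, domain = address.strip().rpartition("@")
    if not separator:
        return None
    tld = domain.rsplit(".", 1)[-1].lower()
    if len(tld) == 2 and tld.isalpha() and tld not in GENERICALLY_USED:
        return tld
    return None
```

The share of addresses for which this returns None is itself worth
reporting, since it bounds how much the domain analysis can explain.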

Regards
Santiago



Re: Data question

Posted by Steven Lloyd Wilson <sl...@wisc.edu>.

Hi Jan,

My thought was that #commits would give an idea of how active the 
developers in that country were, in order to distinguish between a 
country with a handful of developers that periodically commit, and a 
country with a handful of developers that happen to be extraordinarily 
active ones.

What I'm trying to measure is the technical capability of the 
population, using open source activity (both in terms of development and 
use) as a proxy variable. I'm definitely open to suggestions of data 
that is available on the backend that might work better for this, but as 
Tony suggested, I made my best stab at what I thought would be good 
measures, and something that is likely to exist in your data.

Best,
Steven



Re: Data question

Posted by Rob Weir <ro...@apache.org>.

On Tue, Oct 22, 2013 at 2:13 AM, Tony Stevenson <to...@pc-tony.com> wrote:
> Steven,
>
> Some of this information won't be so easy to get.  For example we cannot tell you how many downloads each project has had, as almost all of that data is held locally by the mirrors and we don't currently collect it.
>

One exception is Apache OpenOffice, where we distribute install
sets primarily via the SourceForge network.  In that case we do have
daily download numbers, including a breakdown by country, operating
system, etc.  A summary report of downloads by country is here:

http://www.openoffice.org/stats/countries.html

SourceForge also has a REST API that you can query to generate other reports.

Regards,

-Rob


Re: Data question

Posted by janI <ja...@apache.org>.

On 23 October 2013 03:14, Steven Lloyd Wilson <sl...@wisc.edu> wrote:

> Thanks for the quick reply Tony.
>
> I certainly appreciate the need for keeping the personal data private, and
> have no interest in collecting data at that level. I'm looking for country
> level data, ideally year-by-year.
>
> So as a starting point, an output like this would be ideal: country, year,
> # of developers, # of total commits. For example:
> Mexico, 2009, 105, 5213
> Mexico, 2010, 117, 5598
>

Hi,
just out of curiosity, why do you think #commits is a significant value?

I tried to make a "top 10" for one of the bigger projects
(Apache OpenOffice), and it turned out that a couple of the most active
committers did not even reach the "top 10". The reason was that these
committers had few commits, but each commit contained a lot of files,
whereas some of the web committers tended to do a commit for every file.
Btw, extracting the data from svn was very network-demanding.
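
To make the difference visible, here is a small sketch that counts both
commits and changed paths per author. It assumes the log was exported with
`svn log --xml -v`, so that each log entry lists its changed paths. The
function name is hypothetical, and this is not the script used for the
analysis described above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def activity_per_author(svn_log_xml):
    """Return {author: (commit_count, changed_path_count)}.

    svn_log_xml: the XML text produced by `svn log --xml -v`.
    """
    stats = defaultdict(lambda: [0, 0])
    root = ET.fromstring(svn_log_xml)
    for entry in root.iter("logentry"):
        author = entry.findtext("author", default="(no author)")
        stats[author][0] += 1  # one more commit for this author
        stats[author][1] += len(entry.findall("paths/path"))  # files touched
    return {author: tuple(values) for author, values in stats.items()}
```

Ranking by the second number instead of the first would move the
large-commit developers up, although the number of files touched is only a
rough proxy for effort as well.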

rgds
jan I.




> et cetera.
>
> Would that be something that would be easily extractable from the backend?
>
> Steven
>
>
> On 10/22/2013 02:13 AM, Tony Stevenson wrote:
>
>> Steven,
>>
>> Some of this information won't be so easy to get.  For example we cannot
>> tell you how many downloads each project has had, as almost all of that
>> data is held locally by the mirrors and we don't currently collect it.
>>
>> Other data is a little easier to collect, but I'm afraid some of it is
>> likely considered personal data, so we'd almost certainly not release it to
>> a 3rd party. This is mostly because the data you have already found is
>> constructed from other data, which is interspersed with some personal data.
>> A lot of the data will unfortunately be stored within the SVN history.
>> However I suspect a lot of it will not be contained within the public repo,
>> though clearly some of it will be.
>>
>> The only way I can be more helpful to you, I think, is to ask you to give
>> us some specific requests for data so that I can let you know whether we
>> can get that data and whether we are able to distribute it.  I realise this
>> may not be as helpful as you want, but we are prudent about releasing data.
>>
>>
>>
>>
>> On 22 Oct 2013, at 03:03, Steven Lloyd Wilson <sl...@wisc.edu> wrote:
>>
>>> Hello,
>>>
>>> I first emailed the media contact with Apache and he recommended that I
>>> resend to this list, with a somewhat unorthodox request for
>>> information/direction.
>>>
>>> I'm a PhD student writing a dissertation on the effects of the Internet
>>> on politics around the world. One of the variables that I'm looking at is
>>> how technically literate the populations of different countries are. The
>>> way I'm measuring this is through a variety of sources getting at open
>>> source downloads, usage of open source software, etc.
>>>
>>> I've found some excellent information on Apache's site, including the
>>> map of where contributors are located, so I think that somewhere behind the
>>> scenes should be the specific data that I'm looking for: the numbers of
>>> download for each project, number of contributors, number of mirrors, etc.
>>> by year and country, since the start of the Apache Foundation.
>>>
>>> Could you point me in the right direction on this matter?
>>>
>>> Thanks!
>>> Steven Wilson
>>> PhD Candidate in Political Science
>>> University of Wisconsin-Madison
>>>
>>
>> Cheers,
>> Tony
>>
>> ----------------------------------
>> Tony Stevenson
>>
>> tony@pc-tony.com
>> pctony@apache.org
>>
>> http://www.pc-tony.com
>>
>> GPG - 1024D/51047D66
>> ----------------------------------

Re: Data question

Posted by Steven Lloyd Wilson <sl...@wisc.edu>.

Thanks for the quick reply Tony.

I certainly appreciate the need for keeping the personal data private, 
and have no interest in collecting data at that level. I'm looking for 
country level data, ideally year-by-year.

So as a starting point, an output like this would be ideal: country, 
year, # of developers, # of total commits. For example:
Mexico, 2009, 105, 5213
Mexico, 2010, 117, 5598

et cetera.

Would that be something that would be easily extractable from the backend?
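
As a rough sketch of the output format described above (country, year, # of
developers, # of total commits), the aggregation could look like the
following. It assumes a list of commit records that already carry a committer
id, a country, and a year. That mapping from committer to country is a
hypothetical input: it may be exactly the personal data that would not be
released. The records below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical commit records: (committer id, country, year).
COMMITS = [
    ("a", "Mexico", 2009),
    ("a", "Mexico", 2009),
    ("b", "Mexico", 2009),
    ("a", "Mexico", 2010),
]


def country_year_rows(commits):
    """Aggregate commit records into (country, year, developers, commits)."""
    developers = defaultdict(set)   # distinct committers per (country, year)
    totals = defaultdict(int)       # commit count per (country, year)
    for committer, country, year in commits:
        key = (country, year)
        developers[key].add(committer)
        totals[key] += 1
    return [
        (country, year, len(developers[(country, year)]), totals[(country, year)])
        for country, year in sorted(totals)
    ]


if __name__ == "__main__":
    for row in country_year_rows(COMMITS):
        print(", ".join(str(field) for field in row))
```

Only the aggregated rows leave the function, so no per-person data appears in
the result.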

Steven


On 10/22/2013 02:13 AM, Tony Stevenson wrote:
> Steven,
>
> Some of this information won't be so easy to get.  For example we cannot tell you how many downloads each project has had, as almost all of that data is held locally by the mirrors and we don't currently collect it.
>
> Other data is a little easier to collect, but I'm afraid some of it is likely considered personal data, so we'd almost certainly not release it to a 3rd party. This is mostly because the data you have already found is constructed from other data, which is interspersed with some personal data. A lot of the data will unfortunately be stored within the SVN history. However I suspect a lot of it will not be contained within the public repo, though clearly some of it will be.
>
> The only way I can be more helpful to you, I think, is to ask you to give us some specific requests for data so that I can let you know whether we can get that data and whether we are able to distribute it.  I realise this may not be as helpful as you want, but we are prudent about releasing data.
>
>
>
>
> On 22 Oct 2013, at 03:03, Steven Lloyd Wilson <sl...@wisc.edu> wrote:
>
>> Hello,
>>
>> I first emailed the media contact with Apache and he recommended that I resend to this list, with a somewhat unorthodox request for information/direction.
>>
>> I'm a PhD student writing a dissertation on the effects of the Internet on politics around the world. One of the variables that I'm looking at is how technically literate the populations of different countries are. The way I'm measuring this is through a variety of sources getting at open source downloads, usage of open source software, etc.
>>
>> I've found some excellent information on Apache's site, including the map of where contributors are located, so I think that somewhere behind the scenes should be the specific data that I'm looking for: the numbers of download for each project, number of contributors, number of mirrors, etc. by year and country, since the start of the Apache Foundation.
>>
>> Could you point me in the right direction on this matter?
>>
>> Thanks!
>> Steven Wilson
>> PhD Candidate in Political Science
>> University of Wisconsin-Madison
>
> Cheers,
> Tony
>
> ----------------------------------
> Tony Stevenson
>
> tony@pc-tony.com
> pctony@apache.org
>
> http://www.pc-tony.com
>
> GPG - 1024D/51047D66
> ----------------------------------

Re: Data question

Posted by Tony Stevenson <to...@pc-tony.com>.

Steven, 

Some of this information won't be so easy to get.  For example we cannot tell you how many downloads each project has had, as almost all of that data is held locally by the mirrors and we don't currently collect it.

Other data is a little easier to collect, but I'm afraid some of it is likely considered personal data, so we'd almost certainly not release it to a 3rd party. This is mostly because the data you have already found is constructed from other data, which is interspersed with some personal data. A lot of the data will unfortunately be stored within the SVN history. However I suspect a lot of it will not be contained within the public repo, though clearly some of it will be.

The only way I can be more helpful to you, I think, is to ask you to give us some specific requests for data so that I can let you know whether we can get that data and whether we are able to distribute it.  I realise this may not be as helpful as you want, but we are prudent about releasing data.

On 22 Oct 2013, at 03:03, Steven Lloyd Wilson <sl...@wisc.edu> wrote:

> Hello,
> 
> I first emailed the media contact with Apache and he recommended that I resend to this list, with a somewhat unorthodox request for information/direction.
> 
> I'm a PhD student writing a dissertation on the effects of the Internet on politics around the world. One of the variables that I'm looking at is how technically literate the populations of different countries are. The way I'm measuring this is through a variety of sources getting at open source downloads, usage of open source software, etc.
> 
> I've found some excellent information on Apache's site, including the map of where contributors are located, so I think that somewhere behind the scenes should be the specific data that I'm looking for: the numbers of download for each project, number of contributors, number of mirrors, etc. by year and country, since the start of the Apache Foundation.
> 
> Could you point me in the right direction on this matter?
> 
> Thanks!
> Steven Wilson
> PhD Candidate in Political Science
> University of Wisconsin-Madison

Cheers,
Tony

----------------------------------
Tony Stevenson

tony@pc-tony.com
pctony@apache.org

http://www.pc-tony.com

GPG - 1024D/51047D66
----------------------------------