Posted to dev@diversity.apache.org by Justin Mclean <ju...@classsoftware.com> on 2019/06/21 04:28:58 UTC

Gender of users on GitHub Apache organisation

Hi,

Something I thought might be of interest. I took all of the names of people in the Apache GitHub organisation [1] and ran them through a gender guesser.

Here are the results:
Total records: 2689
Male: 61.7%
Female: 3.9%
Unknown: 25.3%
No name: 9.1%
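
A tally like the one above could be produced roughly as follows. This is a minimal sketch, not the script actually used: `tally_genders`, `as_percentages`, `toy_guess` and the sample names are hypothetical, and the name lookup is passed in as a function so that a real guesser can be swapped in.

```python
from collections import Counter


def tally_genders(display_names, guess):
    """Count guessed genders for a list of display names.

    `guess` is any callable mapping a first name to a label such as
    "male", "female" or "unknown". Missing or empty names are counted
    under "no name".
    """
    counts = Counter()
    for display_name in display_names:
        parts = (display_name or "").split()
        if not parts:
            counts["no name"] += 1
            continue
        # Assumption: the first whitespace-separated token is the first name.
        counts[guess(parts[0])] += 1
    return counts


def as_percentages(counts):
    """Convert raw counts to percentages of the total, rounded to 0.1."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100.0 * n / total, 1) for label, n in counts.items()}


# Hypothetical stand-in for a real name database (illustration only).
TOY_NAMES = {"alice": "female", "bob": "male"}


def toy_guess(first_name):
    return TOY_NAMES.get(first_name.lower(), "unknown")


if __name__ == "__main__":
    sample = ["Alice Example", "Bob Example", "Zzz Example", "", None]
    print(as_percentages(tally_genders(sample, toy_guess)))
```

With the gender-guesser package named later in this thread, `gender_guesser.detector.Detector().get_gender` could be passed as `guess`; its labels would then appear as-is in the tally.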

You might get more accurate results by also looking at GitHub icons, as about 1/2 of those have photos.

So you can see that, just from someone’s first name, you have a good chance of knowing what their gender is, even if you may not realise it, and some people might be sharing that information when they don’t think they are.

Now the above is not likely to be entirely accurate for a number of reasons; for instance, people may not put their real names down. It is, however, a large sample of our committers: it represents about 40% of them (current total 6946) and perhaps a higher % of currently active committers.

It might be interesting to run this over time and see if it changes, or perhaps use it as a rough baseline for any survey results to see if a group is under- or over-represented in the answers.

The last committer survey [2], in which 765 people responded, had 92% male / 5% female / 3% other. Taking the male/female results above (and assuming unknown/no name is at the same ratio, which may be a big assumption) we get 94% male / 6% female, which is (very) close to those results.
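
The 94% / 6% figures come from rescaling the male and female percentages so that the two sum to 100, i.e. dropping the unknown and no-name records. A small sketch of that arithmetic, together with the coverage figure, using only numbers from this message (the function name `renormalise` is made up for illustration):

```python
def renormalise(male_pct, female_pct):
    """Rescale male/female percentages so that the two sum to 100."""
    known = male_pct + female_pct
    return 100.0 * male_pct / known, 100.0 * female_pct / known


if __name__ == "__main__":
    # Figures from the GitHub organisation tally in this message.
    male, female = renormalise(61.7, 3.9)
    print(f"{male:.1f}% male / {female:.1f}% female")

    # Share of all committers covered by the 2689 GitHub records.
    print(f"coverage: {100.0 * 2689 / 6946:.1f}%")
```

This treats the unknown and no-name records as having the same male/female split as the known ones, which is the big assumption flagged above.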

If people want the code or more details on exactly how I got these values, I’m happy to share.

Thanks,
Justin

1. https://github.com/orgs/apache/people
2. https://cwiki.apache.org/confluence/display/COMDEV/ASF+Committer+Diversity+Survey+-+2016

Re: Gender of users on GitHub Apache organisation

Posted by Justin Mclean <ju...@classsoftware.com>.
Hi,

Committee rosters:
Total: 4738
% Male: 75.3
% Female: 4.8
% Unknown: 19.9

The above is interesting, e.g. is that % drop just chance, an example of unconscious bias, or something else like the time commitment required? In case it’s not obvious, the difference between the previous data and this represents 2219 people who have signed ICLAs but have, for one reason or another, never become a committer. A committer vs PMC breakdown may also be interesting.
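
The breakdown for those 2219 people is not posted in this thread, but a rough estimate can be backed out from the two sets of figures that are (the ICLA totals of 6957 and the roster totals of 4738). A sketch, assuming the roster is a strict subset of the ICLA signers and bearing in mind that the inputs are rounded to 0.1%, so the result is only approximate; `implied_remainder` is a hypothetical helper:

```python
def implied_remainder(total_all, pct_all, total_subset, pct_subset):
    """Estimate percentages for the people in `all` but not in `subset`.

    Percentages are keyed by label. Inputs are rounded percentages, so
    the returned values are approximate.
    """
    remainder = total_all - total_subset
    result = {}
    for label in pct_all:
        count_all = total_all * pct_all[label] / 100.0
        count_subset = total_subset * pct_subset[label] / 100.0
        result[label] = 100.0 * (count_all - count_subset) / remainder
    return remainder, result


if __name__ == "__main__":
    # Figures as posted in this thread.
    icla = {"male": 69.8, "female": 5.7, "unknown": 24.5}
    roster = {"male": 75.3, "female": 4.8, "unknown": 19.9}
    people, pct = implied_remainder(6957, icla, 4738, roster)
    print(people)
    for label, value in pct.items():
        print(f"{label}: {value:.1f}%")
```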

Chairs:
Total: 213
% Male: 78.9
% Female: 6.1
% Unknown: 15.0

And again slightly different percentages.

What is perhaps more striking is the unknown % being reduced, from 25% to 20% (and to 15%); that seems indicative of something. The gender guesser I’m using [1] has a database of 40,000 names and includes multiple nationalities, but it could be favouring some sorts of names over others, as could we. Looking further into the unknown names may give some insights.

If anyone has any better insights into this, or can suggest what to look for, please speak up.

Thanks,
Justin

1. https://pypi.org/project/gender-guesser/



Re: Gender of users on GitHub Apache organisation

Posted by Justin Mclean <ju...@classsoftware.com>.
Hi,

> https://whimsy.apache.org/public/icla-info.json

Way ahead of you :-)

And here’s the result:

Total: 6957
% Male: 69.8
% Female: 5.7
% Unknown: 24.5

Note the female % is a tiny bit higher, perhaps (and I’m only guessing here) because a) they are more likely to hide their identity on GitHub or b) they are more likely to be involved in non-code donations?

Thanks,
Justin



Re: Gender of users on GitHub Apache organisation

Posted by Andrew Musselman <ak...@apache.org>.
Thanks, did not see that link.

On Fri, Jun 21, 2019 at 20:32 Justin Mclean <ju...@classsoftware.com>
wrote:

> Hi,
>
> > Thanks for taking an initial look Justin. Can you share what name guesser
> > you're using?
>
> I already did [1], it has a database of 40,000 names. Is it 100% correct?
> Probably not and some names can be either gender, but it probably gives a
> good indicator and is the same rough percentages as the committer survey
> taken a few years back.
>
> > Can I also ask if we are sure our community wants their "name" on GitHub
> > used in stats on "gender?”
>
> IMO I think as long as it’s aggregated there’s no harm, we not identifying
> anyone and groups have 100s or 1000s of people in them, but if anyone
> disagrees please speak up.
>
> BTW Whimsy has more accurate data and lists everyone so I’ve moved to
> using that data, rather than GitHub’s.
>
> > I tend toward not using data that wasn't intended for a purpose for a
> > purpose without letting people know, especially if we are planning to
> > publish figures.
>
> If this was published anywhere in detail, there would be careful
> consideration and discussion before doing so, currently it's just back of
> the envelope numbers based on some data we have, that may provide some
> insights.
>
> I think it’s confirms what we already know i.e. that the gender mix in our
> committer base is not the same as it is in others open source foundations,
> employment in ITC or the general population.
>
> Thanks,
> Justin
>
> 1.  https://pypi.org/project/gender-guesser/
>
>

Re: Gender of users on GitHub Apache organisation

Posted by Justin Mclean <ju...@classsoftware.com>.
Hi,

> I am adding Holden Karau, who's done similar analysis and has a script she
> was interested in sharing to expand collaboration.

Thanks for that, it would be interesting to know what approach was taken. I’ve met Holden in person before, and she might remember me.

Thanks,
Justin

Re: Gender of users on GitHub Apache organisation

Posted by Griselda Cuevas <gr...@google.com.INVALID>.
I am adding Holden Karau, who's done similar analysis and has a script she
was interested in sharing to expand collaboration.

On Fri, Jun 21, 2019, 9:33 PM Justin Mclean <ju...@classsoftware.com>
wrote:

> Hi,
>
> BTW I did generate the stats on a PMC basis and a few other ways, but
> don’t think it would be useful to post that here, given that roughly 5% of
> people are one gender, one or two people either way will make large swings
> in most PMCs. It did indicate that there are few PMC are probably more
> diverse than others with regards to gender, you might be able to guess who
> they are. (Hint this list is one of them).
>
> Thanks,
> Justin

Re: Gender of users on GitHub Apache organisation

Posted by Justin Mclean <ju...@classsoftware.com>.
Hi,

BTW I did generate the stats on a PMC basis and a few other ways, but I don’t think it would be useful to post that here: given that roughly 5% of people are one gender, one or two people either way will make large swings in most PMCs. It did indicate that a few PMCs are probably more diverse than others with regard to gender; you might be able to guess which they are. (Hint: this list is one of them.)
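
To illustrate why per-PMC figures swing so much, here is a small sketch of how the percentage moves when a single person is counted one way or the other. The group sizes are hypothetical, `share_range` is a made-up helper, and the 5% default is simply the rough figure quoted above.

```python
def share_range(group_size, expected_share=0.05, miscount=1):
    """Return (low, mid, high) percentages for a group of `group_size` people.

    `mid` is the percentage at the expected headcount (rounded to whole
    people); `low` and `high` show the effect of `miscount` people either
    way. `group_size` must be positive.
    """
    expected = round(group_size * expected_share)
    low = max(expected - miscount, 0)
    high = min(expected + miscount, group_size)
    return tuple(100.0 * n / group_size for n in (low, expected, high))


if __name__ == "__main__":
    # Hypothetical group sizes, from a small PMC up to a large aggregate.
    for size in (20, 50, 1000):
        low, mid, high = share_range(size)
        print(f"{size:>5} people: {low:.1f}% / {mid:.1f}% / {high:.1f}%")
```

In a group of 20, one person either way moves the figure between 0% and 10%, whereas in a group of 1000 the same change moves it by 0.1 of a percentage point.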

Thanks,
Justin

Re: Gender of users on GitHub Apache organisation

Posted by Justin Mclean <ju...@classsoftware.com>.
Hi,

> Thanks for taking an initial look Justin. Can you share what name guesser
> you're using?

I already did [1]; it has a database of 40,000 names. Is it 100% correct? Probably not, and some names can be either gender, but it probably gives a good indication and gives roughly the same percentages as the committer survey taken a few years back.

> Can I also ask if we are sure our community wants their "name" on GitHub
> used in stats on "gender?”

IMO, as long as it’s aggregated there’s no harm: we’re not identifying anyone and the groups have 100s or 1000s of people in them, but if anyone disagrees please speak up.

BTW Whimsy has more accurate data and lists everyone so I’ve moved to using that data, rather than GitHub’s.

> I tend toward not using data that wasn't intended for a purpose for a
> purpose without letting people know, especially if we are planning to
> publish figures.

If this were published anywhere in detail, there would be careful consideration and discussion before doing so. Currently it’s just back-of-the-envelope numbers based on some data we have, which may provide some insights.

I think it confirms what we already know, i.e. that the gender mix in our committer base is not the same as it is in other open source foundations, employment in ICT or the general population.

Thanks,
Justin

1.  https://pypi.org/project/gender-guesser/


Re: Gender of users on GitHub Apache organisation

Posted by Andrew Musselman <ak...@apache.org>.
Thanks for taking an initial look Justin. Can you share what name guesser
you're using?

Can I also ask if we are sure our community wants their "name" on GitHub
used in stats on "gender?"

I tend toward not using data that wasn't intended for a purpose for a
purpose without letting people know, especially if we are planning to
publish figures.

On Fri, Jun 21, 2019 at 18:32 Justin Mclean <ju...@me.com> wrote:

> Hi,
>
> BTW “unknown” doesn’t always mean the name is unknown, just that the
> gender may not be easily assumed from the name. IF I get time I see if I
> can break those two groups apart.
>
> Thanks,
> Justin

Re: Gender of users on GitHub Apache organisation

Posted by Justin Mclean <ju...@me.com>.
Hi,

BTW “unknown” doesn’t always mean the name is unknown, just that the gender may not be easily assumed from the name. If I get time I’ll see if I can break those two groups apart.
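
One way the two groups might be separated: gender-guesser documents distinct labels for a name it cannot find ("unknown") and a name used for either gender ("andy"), alongside "male", "female", "mostly_male" and "mostly_female". The sketch below buckets a list of such labels. How the labels were actually grouped for the figures in this thread is not stated, so the mapping and the bucket names are assumptions, and `split_unknowns` is a hypothetical helper.

```python
from collections import Counter

# Assumed grouping of gender-guesser labels (illustration only).
BUCKETS = {
    "male": "male",
    "mostly_male": "male",
    "female": "female",
    "mostly_female": "female",
    "andy": "ambiguous name",
    "unknown": "name not in database",
}


def split_unknowns(labels):
    """Count labels per bucket, keeping ambiguous and not-found apart."""
    counts = Counter()
    for label in labels:
        # Any unexpected label is treated as not found.
        counts[BUCKETS.get(label, "name not in database")] += 1
    return counts


if __name__ == "__main__":
    sample = ["male", "andy", "unknown", "mostly_female", "andy"]
    for bucket, n in split_unknowns(sample).items():
        print(f"{bucket}: {n}")
```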

Thanks,
Justin

Re: Gender of users on GitHub Apache organisation

Posted by Sam Ruby <ru...@intertwingly.net>.
On Fri, Jun 21, 2019 at 7:35 PM Justin Mclean <ju...@classsoftware.com> wrote:
>
> Although I just realised that whimsey data probably has the names of all committers and may be an easer source of that data.

https://whimsy.apache.org/public/icla-info.json

This is good work. Ideally, the scripts would be installed on either
the comdev or whimsy vm's and produce a live web page of results.

> Thanks,
> Justin
>
> 1. https://github.com/appeler/ethnicolr

Re: Gender of users on GitHub Apache organisation

Posted by Justin Mclean <ju...@classsoftware.com>.
Hi,

You can also predict the ethnic group based on name using [1], and again, looking at a name, you might be able to make a good guess at what ethnic group someone belongs to.

How useful is this? Probably not very, but perhaps changes over time may tell us something interesting? Or perhaps it could give some guesstimates of how many committers don’t speak English as their first language? Or perhaps it could indicate that we are, in some respects, a somewhat diverse bunch of people?

Obviously European is going to include European-sounding names from the USA, so don’t feel left out, USA. Having a Jewish last name may not mean you are Jewish, and having an East European name doesn’t mean you are from there, just that one of your ancestors may have been, etc.

But for interest, here's a simplified output from that script from all the names in Apache’s GitHub organisation:
European 710
East Asian 308
Indian 258
Jewish 189
East Europe 183
French 103
Germanic 102
Hispanic 99
Nordic 89
Italian 75
African 70
Japanese 52
Other 1
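
For comparison with the percentages used elsewhere in this thread, the counts above can be turned into shares of the classified names. This is a minimal sketch using only the numbers listed in this message; `shares` is a made-up helper and nothing here calls ethnicolr itself.

```python
# Counts as posted in this message (simplified ethnicolr output).
COUNTS = {
    "European": 710,
    "East Asian": 308,
    "Indian": 258,
    "Jewish": 189,
    "East Europe": 183,
    "French": 103,
    "Germanic": 102,
    "Hispanic": 99,
    "Nordic": 89,
    "Italian": 75,
    "African": 70,
    "Japanese": 52,
    "Other": 1,
}


def shares(counts):
    """Return the total and each label's percentage of it, rounded to 0.1."""
    total = sum(counts.values())
    return total, {k: round(100.0 * n / total, 1) for k, n in counts.items()}


if __name__ == "__main__":
    total, pct = shares(COUNTS)
    print(f"Total classified: {total}")
    for label, value in pct.items():
        print(f"{label}: {value}%")
```

The listed counts sum to 2239 names.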

Again, if anyone wants details on how I generated the above, just ask.

Although I just realised that Whimsy data probably has the names of all committers and may be an easier source of that data.

Thanks,
Justin

1. https://github.com/appeler/ethnicolr