09 March 2016

EigenCoder: Programming Stereotypes

There are a lot of stereotypes in the programming community. "Swift is used by a bunch of bearded hipsters." "C++ is for old people." "No one likes coding in Java." Well, it turns out that some of these might be true.

Approach

GitHub is likely the most popular open-source hosting platform in use today. It hosts open-source repositories for code written in a wide variety of languages. It also provides "trending" pages, which show the repositories in a given language that are especially popular right now.

The listings for these popular repositories also include the profile pictures of some of the most prolific committers on each project, so we can easily get a few dozen profile pictures of some of the busiest contributors to popular projects for any given language.

We also have access to a neat resource from Microsoft's Project Oxford called the "Face API" which, among other things, can detect a face in a given image and estimate some properties about the face. Is the subject smiling? What gender and age are they? Do they have facial hair?
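To make the shape of that data concrete, here is a minimal Python sketch that flattens one detected face into a flat record. The field names (faceAttributes, facialHair, moustache, beard, sideburns) are my best understanding of the Face API's JSON response, so treat them as assumptions. flatten_face is a hypothetical helper rather than code from the repository, and the sample values are made up.

```python
# Hypothetical helper: flatten one face entry from a Face API-style response.
# Field names are assumed from the Face API's JSON format; values are made up.

def flatten_face(face):
    """Return a flat dict of the attributes this post looks at."""
    attrs = face.get("faceAttributes", {})
    hair = attrs.get("facialHair", {})
    return {
        "age": attrs.get("age"),
        "gender": attrs.get("gender"),
        "smile": attrs.get("smile"),
        "moustache": hair.get("moustache"),
        "beard": hair.get("beard"),
        "sideburns": hair.get("sideburns"),
    }

# Made-up example of a single detected face (not real data).
sample_face = {
    "faceAttributes": {
        "age": 31.0,
        "gender": "male",
        "smile": 0.25,
        "facialHair": {"moustache": 0.5, "beard": 0.75, "sideburns": 0.0},
    }
}

record = flatten_face(sample_face)
print(record)
```

Missing attributes come back as None, which keeps pictures with partial results from breaking later summaries.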

Combining the two, we can estimate some interesting properties of the profile pictures of programmers working in various languages. Let's get started!

It should be noted that this is super non-scientific. Who knows how accurate the Face API is, or how accurately a user's GitHub profile picture maps to any aspect of their personality/identity? It's also unclear whether the most prolific contributors to popular repositories accurately represent a community. Also, small sample sizes. Etc., etc.

All code used in this post is available here: https://github.com/trestletech/eigencoder

Data

GitHub lists 25 repositories on its trending page and shows the top 5 committers for each. Some projects don't have 5 contributors, so fewer are shown. We remove duplicate usernames, then send each of the remaining profile pictures (up to 125 per language) to be analyzed by the Face API. Of course, not all pictures have (detectable) faces in them.
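The de-duplication step and the 125-picture ceiling can be sketched in a few lines of Python. This is only an illustration: unique_committers is a hypothetical helper, the usernames are made up, and the actual code in the repository may work differently.

```python
# Hypothetical sketch of the de-duplication step for one language.
REPOS_PER_PAGE = 25       # repositories listed on a trending page (from the post)
COMMITTERS_PER_REPO = 5   # top committers shown per repository (from the post)
MAX_PICTURES = REPOS_PER_PAGE * COMMITTERS_PER_REPO  # 125

def unique_committers(repos):
    """Collect usernames across repos, keeping the first occurrence of each."""
    seen = set()
    ordered = []
    for committers in repos:
        # Some repos list fewer than 5 committers; slicing handles both cases.
        for user in committers[:COMMITTERS_PER_REPO]:
            if user not in seen:
                seen.add(user)
                ordered.append(user)
    return ordered

# Made-up usernames: "alice" appears in two repos and is only counted once.
trending = [["alice", "bob"], ["alice", "carol", "dave"]]
users = unique_committers(trending)
print(users, len(users) <= MAX_PICTURES)
```

Because of duplicates and short contributor lists, the number of pictures actually sent for a language can be well below 125.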

In total, we get the following:

Language      Faces detected
ruby          71
r             38
javascript    60
java          47
html          59
go            53
cpp           34
c             24
python        49
php           66
perl          45
swift         49
csharp        61
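For a sense of scale, the short Python sketch below totals these counts and expresses each one as a share of the 125-picture ceiling. The counts are copied from the table. The share is not a detection rate, since the number of pictures actually sent per language isn't given here.

```python
# Face counts copied from the table in the post.
faces_detected = {
    "ruby": 71, "r": 38, "javascript": 60, "java": 47, "html": 59,
    "go": 53, "cpp": 34, "c": 24, "python": 49, "php": 66,
    "perl": 45, "swift": 49, "csharp": 61,
}
MAX_PICTURES = 125  # 25 repositories x 5 committers, the most that could be sent

total_faces = sum(faces_detected.values())
share_of_max = {lang: n / MAX_PICTURES for lang, n in faces_detected.items()}

# Print languages from most to fewest detected faces.
for lang, n in sorted(faces_detected.items(), key=lambda item: -item[1]):
    print(f"{lang:<12}{n:>4}{share_of_max[lang]:>8.0%}")
print("total", total_faces)
```

With only a few dozen faces per language, any per-language summary rests on a small sample.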

Gender

One of the properties returned by the Face API is a prediction of the gender of the subject of the photo. The results are pretty discouraging for anyone who's not a chauvinist...

Age

Age shows an interesting trend. Some people assume that "old-school" languages are only used by old people and that new, trendy languages are used by hipsters. It turns out that's not always true; Java, for instance, has the lowest median age.
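A per-language median like the one mentioned here can be computed as in the Python sketch below. The language names and ages are placeholders invented for illustration, not the data behind this post, and median_age_by_language is a hypothetical helper.

```python
from statistics import median

def median_age_by_language(records):
    """Group (language, age) pairs and return the median age per language."""
    ages = {}
    for lang, age in records:
        ages.setdefault(lang, []).append(age)
    return {lang: median(values) for lang, values in ages.items()}

# Placeholder records (made up): one estimated age per detected face.
records = [
    ("lang_a", 24.0), ("lang_a", 29.0), ("lang_a", 35.0),
    ("lang_b", 30.0), ("lang_b", 41.0),
]

medians = median_age_by_language(records)
youngest = min(medians, key=medians.get)  # language with the lowest median age
print(medians, youngest)
```

The median is a reasonable choice with samples this small, since a single badly estimated age moves it far less than it would move the mean.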

Smiles

Every programmer has a language that makes them miserable. So miserable, perhaps, that you can't even muster a smile for your GitHub profile picture.

The Face API returns a score from 0 to 1 approximating how much you're smiling. Programmers using certain languages seem happier than others. Maybe R programmers are just smiling about the crazy market for data scientists in this economy...

Facial Hair

If you've been coding for any length of time, you've met at least one mustached fellow riding a fixie and wearing skinny jeans who won't stop talking about Swift. Turns out that's a real stereotype.

I did not normalize for gender here.

Or we can look at each facial hair property returned by the Face API individually.
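The difference between a single combined score and the individual properties can be sketched in Python as follows. Two things here are assumptions: that the three properties are named moustache, beard, and sideburns, and that the combined score is a plain average of them. The post doesn't say how the scores were actually combined, and the sample values are made up.

```python
# Assumed names of the facial hair properties in a Face API-style result.
FACIAL_HAIR_KEYS = ("moustache", "beard", "sideburns")

def combined_facial_hair(face):
    """Average the three facial hair scores into one number (an assumed choice)."""
    return sum(face[k] for k in FACIAL_HAIR_KEYS) / len(FACIAL_HAIR_KEYS)

def mean_per_property(faces):
    """Mean of each facial hair score across a list of faces."""
    return {k: sum(f[k] for f in faces) / len(faces) for k in FACIAL_HAIR_KEYS}

# Made-up scores for two faces (not real data).
faces = [
    {"moustache": 0.5, "beard": 0.25, "sideburns": 0.0},
    {"moustache": 0.0, "beard": 0.75, "sideburns": 0.0},
]

combined = [combined_facial_hair(f) for f in faces]
per_property = mean_per_property(faces)
print(combined, per_property)
```

Both made-up faces get the same combined score, even though one is mostly moustache and the other mostly beard, which is why the individual properties are worth a separate look.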

Conclusion

I guess you should grow a mustache if you want to contribute to a successful C++ project?

In reality, you'd need a much larger sample size and some guarantees around the accuracy of the Face API before you could have any real confidence about the conclusions you draw from this data.