When Open Source and Open Science go hand-in-hand

With over 200 million projects, GitHub is the communal storage space of the open source world – and in a sense, the foundation of our digital society. The hosting service for software development and version control is involved in countless pieces of software we use daily. With so many software projects, and so many active developers, GitHub is a gold mine for researchers wanting to understand the open source software world. But getting good information out of it had long been hard. That changed when Georgios Gousios developed GHTorrent, a dataset that opens a window into the inner workings of GitHub.

In 2021 the European Commission issued a report with a not very poetic title: ‘The impact of open source software and hardware on technological independence, competitiveness and innovation in the EU economy’. Its results were very noteworthy, however. Taking everything into account, the researchers found that for every million euros the EU invests in open source software, economic returns total 60 million euros. “That is an incredible finding,” says Georgios Gousios, perhaps with a sense of pride: not only because the finding underlines his belief that software engineering, especially open source, should get more attention, but also because his GHTorrent dataset has been instrumental in gaining this economic insight.

Within one year we had an active userbase – light-speed by academic standards.

“By the time the lead researcher for the European Commission reached out to me, I was already somewhat used to seeing GHTorrent being used for interesting analyses. In fact, I started working on GHTorrent in late 2011, and by late 2012 it already had an active userbase – which is light-speed by academic standards.”
Just as striking gold in the past would start a gold rush, attracting thousands of prospectors, GHTorrent unleashed a minor gold rush of its own, creating an entire global community of data-miners. Gousios: “Some of these people even became friends.”

Founding a mining community

Making GHTorrent open to the public, and opening up a wealth of new information, has enabled researchers to study a plethora of subjects and to gain completely new insights into how the open source community acts. The dataset shines a light on what types of collaboration work very well, but also shows how ‘offline’ human biases work their way through into the online world.

One study used GHTorrent to see how different nationalities deal with each other when collaborating on open source software. Gousios: “It is a shame, really, but you can clearly see that people from countries with political tensions are less productive together, and scrutinize each other’s contributions more thoroughly.” Another study showed that contributions made by female developers are less well received than contributions by men, being placed under more scrutiny or rejected outright. This was not the case when the gender could not be deduced.

Yet for Gousios, this opens a way forward: “You need to know where your problems lie if you want to address them. I have also helped Microsoft in using GHTorrent, actually to the point where they now use their own variant of it.”

Gousios: “The thing is – and research on the GHTorrent data showcases this better than anything – that technical quality is not the only important factor when accepting code contributions in an open source setting. Biases, along with less malevolent factors such as personal preferences, aesthetics, and language barriers, are equally important.” For Gousios, that can only mean one thing: if we want to improve the quality of open source software – which he estimates could be 75% of the software we use – we also need to address these personal factors. Finding them where we can is the first step. “That is something that GHTorrent could improve on, actually. However, it is very hard to deliver data such as location or gender, as GitHub does not necessarily deliver that data.”

If you want to do science, you should do it completely open.

Seeing all this relevant data, and the blossoming of a small community, really hammered home the motto of Gousios’s mentor Diomidis Spinellis (with whom he won the 10-year Most Influential Paper Award for GHTorrent at the MSR 2022 conference): if you want to do science, you should do it open to the public. Even from the very beginning.

Connecting the data

Diomidis was much more than just a spiritual mentor for Gousios. “His design, especially in dealing with MySQL (software that manages the relations between data entries), is fundamental to GHTorrent, having remained virtually unchanged for 10 years.” Whatever happens on GitHub, whether someone changes code or two users post messages in a discussion, there is what is called an ‘event’. The first step of the programme is to collect all these different events. “But that’s only half the work, and arguably the easiest part,” says Gousios. “The trick is to make that data useful, by giving it a context and relating it to other data entries. For example, if an event has registered the rejection of someone’s coding contribution, that information only becomes useful if you know the context, so you can understand why that contribution has been rejected.” What sets GHTorrent apart is its ability to continuously connect ‘raw data’ with meaningful links, supplying – in a sense – extra data. Supplying meaningful links to each and every one of those events across 83 million developers who are working together is no small feat.
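
The two steps Gousios describes, collecting events and then relating them to one another, can be illustrated with a small sketch. It is not GHTorrent's actual schema or code: the event fields, table names and helper functions are hypothetical, and Python's built-in sqlite3 module stands in for MySQL so that the example runs on its own.

```python
import sqlite3

# Hypothetical raw events. The field names are illustrative only and do not
# reproduce GitHub's real event payloads or GHTorrent's schema.
RAW_EVENTS = [
    {"id": 1, "type": "pull_request_opened", "actor": "alice",
     "repo": "demo/app", "pr": 7},
    {"id": 2, "type": "comment", "actor": "bob",
     "repo": "demo/app", "pr": 7, "body": "Tests are missing."},
    {"id": 3, "type": "pull_request_closed", "actor": "bob",
     "repo": "demo/app", "pr": 7},
]


def build_database(events):
    """Step 1: store the raw events. Step 2: link each one to its pull request."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE pull_requests ("
               "repo TEXT, number INTEGER, PRIMARY KEY (repo, number))")
    db.execute("CREATE TABLE events ("
               "id INTEGER PRIMARY KEY, type TEXT, actor TEXT, "
               "repo TEXT, pr INTEGER, body TEXT)")
    for e in events:
        # The pull request row is the shared context that events point to.
        db.execute("INSERT OR IGNORE INTO pull_requests VALUES (?, ?)",
                   (e["repo"], e["pr"]))
        db.execute("INSERT INTO events VALUES (?, ?, ?, ?, ?, ?)",
                   (e["id"], e["type"], e["actor"], e["repo"], e["pr"],
                    e.get("body")))
    db.commit()
    return db


def context_for(db, repo, pr):
    """Return every event linked to one pull request, ordered by event id."""
    rows = db.execute(
        "SELECT type, actor, body FROM events "
        "WHERE repo = ? AND pr = ? ORDER BY id",
        (repo, pr))
    return rows.fetchall()
```

Calling context_for(build_database(RAW_EVENTS), "demo/app", 7) returns the opening, the review comment and the closing of the pull request together. The closing event alone says nothing about the reason; the linked comment is the kind of context Gousios refers to.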

Gousios: “Initially, people using GHTorrent were a little oblivious to the amount of data they were getting. Myself included, if I am being totally honest. Even personal data, such as e-mail addresses or full names, were included.” In fact, it took a good four years, until 2016, for problems to appear. “One user brought our attention to the fact that we shared privacy-sensitive data. This was already public information, mind you, we only connected the dots. But it did make us change how GHTorrent works: it still collects that data, but no longer shares it with its users.”

It all cost me a huge amount of extra time, but I also learned a huge amount of stuff.

The number of users also started to create another problem, especially between 2012 and 2014, when GitHub started to grow enormously. Gousios: “It seemed almost exponential. I suspect it was a network effect: as more developers became active on GitHub, more of their developer friends wanted to join in. That was great for the value of the dataset, but it also meant that I had to continuously upscale a platform that was growing exponentially. It really put a strain on my abilities, and I had to quickly become an expert in technologies like MySQL and MongoDB. I had to start using distributed systems, and make certain processes restartable so that GHTorrent could deal with failures. And when one of those failures inevitably happened, and the platform had to restart, I had to prevent it from re-downloading all the data – which took some extra coding.”
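
What a restartable process that avoids re-downloading might look like can be sketched in a few lines. This is one common approach, a checkpoint file, and not necessarily how GHTorrent does it: the function names, the JSON checkpoint format and the fetch callback are all assumptions made for illustration.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the set of item ids already fetched (empty on the first run)."""
    if not os.path.exists(path):
        return set()
    with open(path) as fh:
        return set(json.load(fh))


def save_checkpoint(path, done):
    """Write to a temporary file first, then swap it in, so that a crash
    during writing cannot leave a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(sorted(done), fh)
    os.replace(tmp, path)


def run(item_ids, fetch, checkpoint_path):
    """Fetch each item once; ids recorded in the checkpoint are skipped.

    `fetch` is a hypothetical callback that downloads one item and may raise.
    Returns the ids fetched during this run.
    """
    done = load_checkpoint(checkpoint_path)
    fetched = []
    for item_id in item_ids:
        if item_id in done:
            continue  # downloaded before an earlier failure, so skip it
        fetch(item_id)
        done.add(item_id)
        save_checkpoint(checkpoint_path, done)
        fetched.append(item_id)
    return fetched
```

If fetch fails halfway through, calling run again with the same checkpoint path resumes at the item that failed. Saving after every item keeps the sketch simple; a system handling millions of events would more likely save in batches.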

In other words, without having necessarily asked for it, Gousios became a ‘one-man Site Reliability Engineer’, a specific type of developer who keeps large systems running. “This cost me a huge amount of time, but I have also learned a huge amount of stuff.” Gousios explains that he could use many of the experiences he gained during these two years of growth in teaching. In 2016 he took over responsibility for the Big Data Processing course, revamping the material so that it included all state-of-the-art technologies.

“I have always felt very well supported by my colleagues here at TU Delft, although dealing with those difficulties could be quite stressful at times.” With a little smile he adds: “However, in retrospect I feel that I have definitely learned something through it: I cannot be an expert on everything, so best leave areas of expertise to the experts.”

“When I initially came to Delft, GHTorrent was not yet finished, but I was still given all the time and freedom I needed to finish it properly, and even maintain it afterwards.” Even when Gousios worked at the university of Nijmegen for a while, he was still allowed to host the programme on TU Delft’s hardware. Microsoft subsequently took over hosting and financial responsibility, but in 2020 GHTorrent came ‘back home’, and it has been hosted by TU Delft since. “So, for GHTorrent’s entire journey, TU Delft has been instrumental to its success, for which I am very thankful.”