As part of a mission to measure the productivity of AI-assisted developers, researchers at GitHub recently conducted an experiment comparing the coding speeds of a group using its Copilot code completion tool against a group relying solely on human capabilities.
GitHub Copilot is an AI pair programming service that launched publicly earlier this year for $10 per user per month or $100 per user per year. Ever since their launch, researchers have been curious whether these AI tools actually translate to increased developer productivity. The catch is that it’s not easy to identify the right metrics to measure performance changes.
Copilot runs as an extension in code editors such as Microsoft's VS Code. It generates code suggestions in multiple programming languages that users can accept, reject, or modify. Suggestions are provided by OpenAI Codex, a system that translates natural language into code and is based on OpenAI's GPT-3 language model.
SEE: What is coding and what is it used for? A beginner’s guide
Google Research and the Google Brain team concluded in July, after studying the impact of AI code suggestions on the productivity of more than 10,000 of its own developers, that the debate over relative speed of performance remains an "open question" — even though they found that a combination of traditional rule-based semantic engines and large language models, such as Codex/Copilot, "can be used to dramatically improve developer productivity with better code completion."
But how do we measure productivity? Other researchers earlier this year, using a small sample of 24 developers, found that Copilot did not necessarily improve task execution time or success rate. However, they found that Copilot saved developers the hassle of searching online for code snippets to fix particular problems. This is an important indicator of the ability of an AI tool like Copilot to reduce context switching, where developers leave the editor to fix a problem.
GitHub also surveyed over 2,600 developers, asking questions like, “Do people feel GitHub Copilot makes them more productive?” Its researchers also benefited from unique access to large-scale telemetry data and published the research in June. Among other things, the researchers found that between 60% and 75% of users feel more satisfied with their work when using Copilot, feel less frustrated when coding, and are able to focus on more satisfying work.
“In our research, we’ve seen that GitHub Copilot supports faster execution times, conserves developers’ mental energy, helps them focus on more satisfying work, and ultimately find more fun in the coding they do,” GitHub said.
GitHub researcher Dr. Eirini Kalliamvakou explained the approach: "We conducted multiple rounds of research, including qualitative (perceptual) and quantitative (observed) data to piece together the full picture. We wanted to verify: (a) Do actual user experiences confirm what we infer from the telemetry? (b) Does our qualitative feedback generalize to our large user base?"
Kalliamvakou, who participated in the original study, then ran an experiment involving 95 developers that focused on coding speed with and without Copilot.
This research revealed that the group that used Copilot (45 developers) completed the task on average in 1 hour and 11 minutes. The group that didn’t use Copilot (50 developers) completed it on average in 2 hours and 41 minutes. Thus, the group with Copilot was 55% faster than the group without.
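The reported speed-up can be checked directly from the averages above. A minimal sketch (illustrative arithmetic only; the exact figure depends on the unrounded completion times GitHub used):

```python
# Average completion times from the experiment, converted to minutes.
copilot_minutes = 1 * 60 + 11      # Copilot group: 1 hour 11 minutes
no_copilot_minutes = 2 * 60 + 41   # non-Copilot group: 2 hours 41 minutes

# Percentage reduction in completion time for the Copilot group.
reduction = (1 - copilot_minutes / no_copilot_minutes) * 100
print(f"{reduction:.1f}% less time")  # prints "55.9% less time"
```

The rounded averages work out to roughly 56% less time, in line with the 55% figure GitHub reported.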
Kalliamvakou also found that a higher percentage of the Copilot group completed the task: 78%, versus 70% of the non-Copilot group.
The experiment did not examine factors that contribute to productivity, such as context switching. However, previous GitHub research found that 73% of developers said Copilot helped them stay in the flow.
In an email, Kalliamvakou told ZDNET what these numbers mean in terms of context switching and developer productivity.
"Reporting 'staying in the flow' definitely involves less context switching, and we have further evidence. 77% of respondents said that by using GitHub Copilot, they spent less time searching," she wrote.
"The statement assesses a known context switch for developers, such as finding documentation or visiting Q&A sites like Stack Overflow to find answers or ask questions. With GitHub Copilot bringing information into the editor, developers don't need to leave the IDE as they often do," she explained.
But context switching alone cannot capture the full productivity improvement from AI code suggestions. There is also "good" and "bad" context switching, which makes its impact difficult to measure.
SEE: Data Scientist vs. Data Engineer: How Demand for These Roles is Changing
During a typical task, developers often switch between different activities, tools and sources of information, Kalliamvakou explained.
She pointed to a study published in 2014 which found that developers spend an average of 1.6 minutes on an activity before switching, or roughly 47 switches per hour.
"It's just because of the nature of their work and the multitude of tools they use, so it's considered a 'good' context switch. On the other hand, there is 'bad' context switching due to delays or interruptions," she said.
"We found in our previous research that this is very detrimental to productivity, as well as to developers' own sense of progress. Context switching is difficult to measure, because we don't have a good way to automatically distinguish between 'good' and 'bad' instances — or whether a switch is part of performing a task or disrupts developer flow and productivity. However, there are ways to assess context switching through self-reports and observations that we use in our research."
As for Copilot’s performance with other languages, Kalliamvakou says she’s interested in conducting experiments in the future.
"It was definitely a fun experiment to do. These controlled experiments take a long time as we try to make them bigger or more comprehensive, but I'd like to explore testing for other languages in the future," she said.
Kalliamvakou published other key findings from the GitHub large-scale survey in a blog post detailing her quest to find the most appropriate metrics to assess developer productivity.