Adrin Jalali

The Cost of AI in Open Source Maintenance

The burden of AI

Being a maintainer in the age of AI.

Ralf Gommers wrote a really nice blog post around 6 years ago about the cost of an open source contribution, where he talks about what happens when a contribution comes in, the challenges, and the bottlenecks. Since then, open source has become even more mainstream and the number of contributors has gone up, while the number of maintainers in many projects has stayed more or less the same.

On top of that, with the expansion of AI tools and their general availability, an increasing number of people seem to be trying their luck at “vibe contributing” to open source. This is an issue when contributions are submitted w/o a proper human review and the contributor doesn’t understand their own submission; and it’s quite easy to do. These days you can easily prompt your IDE, or develop an agent that reads a project’s contributing guidelines, finds an issue, and comments with a plan to fix it. Or, given a link to an issue, it can generate a “solution” and submit a pull request (PR).

One of the challenges we’ve been having as maintainers is dealing with LLM-generated comments announcing an intent to work on an issue, auto-generated issues, and low-quality, LLM-generated PRs. These come almost exclusively from “first time contributors” who seem to want a contribution in our project w/o understanding their submissions. The issue has gotten so bad that at times almost every second issue on our main repo gets at least one such message, and in many cases several. In most cases, had they read the issue thread, they’d have noticed that the issue isn’t ready for contributions in the first place.

AI Generated Contributions

The interactions we have with users who post LLM-generated content take a few different forms:

  • New Issues: We’ve seen AI-generated issues, sometimes multiple at once. Although at first some maintainers were willing to engage, we soon realised they’re mostly bogus, and that spending a significant amount of time checking whether they’re legitimate is not worth it, especially since those issues don’t come from a real use-case and are purely generated by AI. So even when they describe a real issue, nobody has actually asked for it.

  • New Pull Requests: The number of seemingly AI-generated PRs has gone up substantially. When it’s clear to us that a PR is generated by AI, it’s easy to close it and even block the user from the GitHub org. However, that takes maintainer time and attention, and not all AI-generated PRs are obviously so. In many cases, if not most, we only realise (or guess) a PR is AI-generated when the submitter ghosts the contribution and never responds to reviews. By that point, a reviewer has already spent substantial time reviewing and leaving helpful comments.

  • PR Reviews: Sometimes people (almost always first time contributors) leave AI-generated reviews on PRs. These are a bit easier for us to detect and mark as spam, but they confuse the original PR submitter if we don’t get to them fast enough. I’ve even seen conversations between a PR submitter and a reviewer where it’s clear the submitter is talking to an AI.

  • Comments on Issues: There are two kinds of AI-related comments we need to moderate these days:

    • First are the kind trying to solve the submitter’s problem, or to give advice on how to proceed. Similar to PR reviews, they can be very confusing and steer contributors in the wrong direction. We’ve even had cases where two AI agents seem to be talking to one another, and after blocking those users and hiding their comments as spam, most of a long issue thread is gone.
    • Second are the ones trying to “grab” an issue, to solve it and submit a PR, since that’s how our contributing guidelines recommend people contribute. It seems people are directing their agents to follow the project’s guidelines, and the agents mostly do. What keeps happening in practice is that multiple people leave very similar comments on the same issue, trying to “grab” it, while the issue hasn’t even been triaged and no maintainer has had time to check whether it’s a real issue.

Model/System Evaluation and Dataset Creation

Sometimes AI-generated content on the repo comes from an individual, a research group, or a company trying to gather real-world data by testing their models and systems on our projects. Those entities are effectively forcing these interactions on project maintainers, for free. We’ve even seen people submit issues anonymously while being open about the fact that it’s to gather validation data for a research paper. That’s insane: expecting free maintainer time to gather data, when maintainer time is exactly the bottleneck in open source project development.

Luckily, that trend seems to be declining, and many companies now pay maintainers an hourly rate to gather data on a separate repo, so the workflow on the project’s actual main repo is not affected.

If you want to test your models, first ask for permission, and budget for it in your research proposal or internal budget. Maintainers do not like to have bots on their repos w/o explicitly enabling them.

Trust and Rapport

A lot of open source is built on trust. We used to be able to easily trust that users had written the code they were submitting. However, generating code has become really cheap, while reviewing it is as expensive and time consuming as it has always been.

People used to contribute to projects they like; now many contribute to many projects just to build a portfolio, and they try to reduce the cost of building that portfolio as much as they can.

This has reduced our default level of trust. It changes the way we treat first time contributors, harming them by making their experience less enjoyable, and hurting the project by losing some potential long-term contributors.

Battling the Surge

In our community at scikit-learn, this has been a central point of discussion over the past couple of months. Not everyone approaches the issue the same way, and some are more tolerant of such content than others.

Some maintainers flag such issues / PRs with a “spam” label, to come back to later and see whether there’s been any activity from the original poster convincing them the item shouldn’t be marked as spam or the user banned.

Other maintainers ban any user who seems to be using AI when the generated code’s quality is not worth the review time.

GitHub has also released an ai-moderator action which uses AI to find AI-generated content. We’ve been discussing whether to use it in our project. We might test it out, but it doesn’t seem too hard to circumvent: the prompts used to detect AI-generated content are themselves open source, and easy to include in an adversary’s agent’s system prompt.
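For projects considering it, wiring such an action into a repository would look roughly like the sketch below. This is an assumption-laden illustration, not the action’s documented interface: the version tag and the input names (`token`, `spam-label`) are guesses based on common GitHub Action conventions, so check the action’s README for the real ones.

```yaml
# .github/workflows/ai-moderation.yml
# Sketch only: the `with:` inputs and version tag are assumptions,
# not the documented interface of github/ai-moderator.
name: AI content moderation

on:
  issues:
    types: [opened]
  issue_comment:
    types: [created]

permissions:
  issues: write          # needed to label / minimize flagged content
  pull-requests: write

jobs:
  moderate:
    runs-on: ubuntu-latest
    steps:
      - uses: github/ai-moderator@v1   # pin to a reviewed commit SHA in practice
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          spam-label: spam             # label applied to flagged items
```

Pinning the action to a reviewed commit SHA (rather than a tag) is the usual precaution for any third-party action that gets write access to issues and PRs.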

Another idea has been to give first time contributors some sort of task to prove they’re human, but that creates quite a bit of noise and makes the experience less pleasant for people genuinely trying to contribute.

We’re also thinking of asking people to check a box when submitting issues and PRs stating that they understand our rules of play, including our policy on AI generated content.
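For issues, GitHub’s issue forms already support required checkboxes, so this can be enforced declaratively. A minimal sketch (the file name, field ids, and wording here are illustrative, not our actual template):

```yaml
# .github/ISSUE_TEMPLATE/bug_report.yml  (illustrative sketch)
name: Bug report
description: Report a bug in the project
body:
  - type: textarea
    id: description
    attributes:
      label: Describe the bug
    validations:
      required: true
  - type: checkboxes
    id: contribution-policy
    attributes:
      label: Contribution policy
      options:
        - label: >-
            I have read the contributing guidelines, including the policy on
            AI-generated content, and I understand everything in this submission.
          required: true   # the form cannot be submitted unchecked
```

Pull request templates, by contrast, are plain Markdown, so there a `- [ ]` checkbox remains a convention rather than a hard requirement unless a bot enforces it.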

Use AI the Right Way!

In this day and age, it would be unreasonable to expect folks not to use any AI tools. Many of us use these tools one way or another. However, you should never submit contributions w/o understanding what you’re submitting, and you should spend at least as much time creating your contribution as it takes a maintainer to review it. These are some of the ways you can engage with your AI tools:

  • Explain the codebase: Existing tools are rather good at explaining what a piece of code does, or at showing you which parts of a codebase are relevant to a given task. They can substantially shorten your learning curve on an existing codebase. Note that there’s no need to share these explanations on the issue tracker; maintainers already know their codebase.

  • Help with boilerplate: In many cases, some boilerplate code needs to be written, and sometimes repeated a few times, in a contribution. The existing tools are rather good at auto-completing such code to speed up your development process.

  • Give you a starting point: You can use coding agents to get to a starting point that gives you an idea of what a solution to the problem might look like. However, you should never submit that generated code w/o understanding it and, most probably, modifying it to fit the codebase’s style and to keep it maintainable by humans. Note that sometimes this might actually slow you down: very often, writing code from scratch is faster than fixing AI-generated code.

Closing Thoughts

All of this is wearing maintainers down, and maintainers are losing patience. When these contributions started coming in as issues or PRs, we’d welcome some and try to engage. However, as time passes, the number of such posts has increased, and their submitters either ghost the contribution after submitting or cannot act on reviews, since they never understood the code they were submitting in the first place. As a result, we as maintainers increasingly feel that engaging with these activities wastes our time and distracts us from productive work.

If you’re somebody who’s thinking of doing open source work solely based on AI contributions, w/o understanding what you’re doing, please don’t! You’re harming the project, and everybody who’s genuinely trying to contribute.

And if you genuinely tried to contribute but got blocked before even the first interaction, please know that the vast majority of maintainers out there are happy to work with you if they know it’s actually you who wants to learn and contribute! Contact them and resolve the issue.

Published

Sep 10, 2025

Category

open-source
