Developers warned: GitHub Copilot code may be licensed

Questions surround GitHub Copilot's use of open source code, but it's a Supreme Court decision on Warhol's art that developers should keep an eye on, according to one legal expert.

UPDATE, 11/7/2022: The Joseph Saveri Law Firm and Matthew Butterick, a lawyer and programmer, filed a class-action lawsuit against GitHub, Microsoft and OpenAI on November 3 concerning alleged open-source license copyright violations arising from the use of GitHub Copilot.

Two named plaintiffs, J. Doe 1 and J. Doe 2, who own copyrighted materials made available publicly on GitHub, brought the class action suit "on behalf of themselves and all others similarly situated," according to court documents.

In response to ongoing Copilot copyright issues, GitHub plans to add new capabilities to Copilot in 2023, according to a blog post attributed to Ryan Salva, vice president of product at GitHub. With these updates, developers should be able to locate licensing information for code fragments and access an inventory of similar code found in GitHub public repositories. GitHub previously introduced a feature in June allowing developers to automatically block suggestions of more than 150 characters that match public code.

"We’ve been committed to innovating responsibly with Copilot from the start and will continue to evolve the product to best serve developers," a GitHub spokesperson said in an email November 4.

Accusations that GitHub Copilot steals code have intensified debate in the tech industry about what constitutes fair use of intellectual property, and raised the question of who is responsible when AI suggestions include unattributed but licensed code.

GitHub Copilot, released in June, translates natural language to suggestions for lines of code, which can range from boilerplate code to complex algorithms. The Codex artificial intelligence model behind GitHub Copilot is a natural language processor (NLP) trained on tens of millions of public repositories of code, including the majority of Python code stored on GitHub. Although Codex's developer, OpenAI, believes that the NLP is an instance of transformative fair use, legal investigations and the court of public opinion are beginning to challenge that notion.

Developers have raised questions this year about whether AI pair programmers produce code that can qualify as transformative fair use or if they infringe on copyrights. But this week saw that change from words to action when a team of class-action litigators at Joseph Saveri Law Firm in San Francisco launched an investigation into a potential lawsuit against GitHub Copilot.

"Open source software creators, users and owners have serious concerns regarding Microsoft's new Copilot auto-coding product," the team stated on the law firm's website. "Microsoft is profiting from others' work by disregarding the conditions of the underlying open source licenses and other legal requirements."

The potential legal quagmire has one developer ill at ease.

"It is deeply concerning to me because how this plays out is going to determine a lot about which machine learning models get generated -- which will directly impact the usefulness of them," said Chris Riley, senior manager of developer relations at marketing tech firm HubSpot. "For example, if Microsoft loses [a lawsuit], that will open the door to sue OpenAI."

In turn, if OpenAI is sued, then other tools that use the technology, such as content creation tool Jasper, might be off-limits -- which will have an unknown effect on current projects, Riley said.

But the potential for lawsuits isn't limited to product creators. Copilot users may be breaking copyright law if they use copyrighted material, said attorney Aron Solomon, head of strategy and chief legal analyst at Esquire Digital.

"If the user of the Microsoft product is aware that they are knowingly using copyrighted material, it's the same as if any of us knowingly use copyrighted material, absent a transformative use," he said. "Transformative fair use of code would either have to alter the code itself or it transforms what the code does."

The Copilot FAQ states that "GitHub does not own the suggestions GitHub Copilot generates. The code you write with GitHub Copilot's help belongs to you, and you are responsible for it."

Thus, developers should take steps to avoid legal problems down the road, Solomon said. Developers should do their due diligence, perhaps by pasting suggested code snippets into Google to ensure there's no copyright attached, he said.

"Or at least that very little of their code is subject to copyright," Solomon said. "It's like you or me using an image in a piece we write," he said. "If I Google 'Mona Lisa' and just save the first image I find, I'm pretty sure someone has rights to it. Very different from me Googling 'Creative Commons Mona Lisa' then going through the affirmative steps to make sure I attribute correctly."
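As a rough illustration of that kind of due diligence, the Python sketch below compares a suggested snippet against a small local collection of code whose licenses are already known, ignoring whitespace differences. The corpus, the function names and the matching rule are all assumptions made for illustration; the 150-character threshold simply echoes the figure GitHub cites for its public-code filter, and nothing here reflects how Copilot or that filter actually works.

```python
import re

# Hypothetical threshold, borrowed from the 150-character figure GitHub cites
# for its public-code filter. The matching logic itself is an assumption.
MIN_MATCH_CHARS = 150


def normalize(code: str) -> str:
    """Collapse runs of whitespace so reformatting does not hide a match."""
    return re.sub(r"\s+", " ", code).strip()


def find_licensed_matches(suggestion: str, corpus: dict[str, str],
                          min_chars: int = MIN_MATCH_CHARS) -> list[str]:
    """Return the names of corpus entries that share a verbatim run of at
    least min_chars normalized characters with the suggestion."""
    needle = normalize(suggestion)
    if len(needle) < min_chars:
        return []  # too short to flag under this rule
    hits = []
    for name, source in corpus.items():
        haystack = normalize(source)
        # Slide a fixed-size window over the suggestion and search for it.
        for start in range(len(needle) - min_chars + 1):
            if needle[start:start + min_chars] in haystack:
                hits.append(name)
                break
    return hits
```

A developer might call `find_licensed_matches(suggestion, corpus)` with `corpus` mapping file names to source text collected from projects whose license terms they have already read. An empty result only means no long verbatim overlap with that particular collection; it says nothing about code outside it.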

The GitHub Copilot code legal battle

The Joseph Saveri Law Firm's website includes several news articles and social media criticisms, including one from programmer Chris Green. He tweeted in June that Copilot includes code from repositories with restrictive licenses -- including his.

"I checked if it had code I had written at my previous employer that has a license allowing its use only for free games and requiring attaching the license," Green said in his tweet. "Yeah, it does."

Last weekend, developer Tim Davis, a professor of computer science and engineering at Texas A&M University, also claimed on Twitter that Copilot had emitted large chunks of his copyrighted code with no attribution and no license.

Microsoft did not respond to an email request for comment. However, Alex Graveley, principal engineer at GitHub and creator of GitHub Copilot, responded to Davis's tweet. "The code in question is different from the example given. Similar, but different."

But transformative fair use boils down to more than how similar, or different, code appears. The Supreme Court has previously opined that a work is transformative if it "adds something new, with a further purpose or different character, altering the first with new expression, meaning or message."

The opinion, which concerned a parody of Roy Orbison's "Pretty Woman" by members of the rap music group 2 Live Crew, established that a parody -- which repurposes or transforms the original artist's intent -- can qualify as fair use. At first glance, the connection between pop song and AI-generated code may be unclear. However, the same court precedents that govern fair use with art and music also govern code, Solomon said.

A key precedent in any lawsuit concerning reused code is last year's Supreme Court ruling in Google v. Oracle. The court ruled that Google's copying of Java API code to build its Android mobile operating system fell under the definition of fair use. In other words, the justices held that Google did not violate copyright law and that a use of code can be transformative, contrary to Oracle's argument that uses of software are nontransformative, Solomon said.

But this could change when the Supreme Court rules on the Andy Warhol Foundation for the Visual Arts, Inc. v. Lynn Goldsmith, et al., case, Solomon said. That case concerns a portrait of musician Prince by rock photographer Lynn Goldsmith, and whether a work by Andy Warhol based on Goldsmith's photograph had transformed it.

As with the music-related precedent, aspects of this case bear on whether reused code has been transformed.

"If Warhol's 'Orange Prince' was just an orange-y version of the Goldsmith photo upon which it's clearly based, then it's an intellectual property violation," Solomon said. "If new use of code just reuses the piece(s) of code without a transformation, there remains a legal argument that it's just theft."
