The Stable Diffusion lawsuits may cause an earthquake in the AI world
When I asked the software to draw "Mickey Mouse" in front of the McDonald's sign, it generated the image you see above.
Stable Diffusion can do this because it is trained on hundreds of millions of example images collected from the web. Some of these images are in the public domain or released under licenses such as Creative Commons. Many others are not, and artists and photographers around the world are not happy about it.
In January, three visual artists filed a class-action copyright lawsuit against Stability AI, the startup that created Stable Diffusion. In February, image licensing giant Getty filed its own lawsuit.
"Stability AI has copied more than 120,000 photos, along with associated captions and metadata, from Getty Images' collection without permission or compensation from Getty Images," Getty wrote in its lawsuit.
Generative AI is a new technology, and courts have never ruled on its copyright implications. There are some strong arguments that copyright's fair use doctrine allows Stability AI to use these images. But there are also strong arguments on the other side. Courts could plausibly rule that Stability AI violated copyright law on a massive scale.
It would be a legal earthquake for an industry still in its infancy. Building cutting-edge generative AI would then require licensing content from thousands, perhaps millions, of copyright holders. That process would likely be so slow and expensive that only a few large companies could afford it, and even then the resulting models would probably be worse. Smaller companies might be shut out of the industry entirely.
A "sophisticated collage tool"?
The plaintiffs in the class action describe Stable Diffusion as a "sophisticated collage tool" that contains "compressed copies" of its training images. If that were true, the case would be a near-certain win for the plaintiffs.
But Eric Wallace, a computer scientist at the University of California, Berkeley, said that claim contains "technical inaccuracies" and "significantly stretches the truth."
Wallace points out that Stable Diffusion is only a few gigabytes in size—far too small to contain compressed copies of all, or even very many, of its training images. Instead, Stable Diffusion works by first converting the user's prompt into a latent representation: a list of numbers summarizing the content of an image.
Just as you can identify points on the Earth's surface by their latitude and longitude, Stable Diffusion characterizes an image by its "coordinates" in "image space". Then, it converts this latent representation into an image.
If you ask Stable Diffusion to paint "Golden Retriever Watercolor on the Beach", it will produce an image similar to the one at the top left of this grid. To do this, it first converts the prompt into the corresponding latent representation: a list of numbers summarizing the elements that should be present in the picture. Perhaps a positive value at position 17 means a dog, a negative value at position 54 means a beach, a positive value at position 73 means a watercolor painting, and so on.
I made these numbers up for illustrative purposes; the real latent representation is more complex and not easily interpretable by humans. But there is some list of numbers corresponding to this prompt, and Stable Diffusion uses that latent representation to generate the image.
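The idea of a latent representation can be sketched in code. This is strictly a toy illustration built on the made-up numbers above: the function, the positions, and their meanings are all invented here, and the real text encoder produces a high-dimensional tensor whose entries have no clean human-readable meaning.

```python
# Toy sketch only: the positions (17, 54, 73) and their meanings echo the
# invented example in the text. The real model works nothing like this.
LATENT_SIZE = 128

def toy_encode_prompt(prompt: str) -> list[float]:
    """Map a prompt to an invented latent vector (illustration only)."""
    latent = [0.0] * LATENT_SIZE
    words = prompt.lower()
    if "retriever" in words or "dog" in words:
        latent[17] = 1.0   # pretend a positive value here means "dog"
    if "beach" in words:
        latent[54] = -1.0  # pretend a negative value here means "beach"
    if "watercolor" in words:
        latent[73] = 1.0   # pretend a positive value here means "watercolor"
    return latent

latent = toy_encode_prompt("Golden Retriever Watercolor on the Beach")
print(latent[17], latent[54], latent[73])  # 1.0 -1.0 1.0
```

The point of the sketch is only that a short prompt becomes a fixed-size list of numbers, and the image generator works from that list rather than from any stored training image.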
The images in the other three corners were also generated by Stable Diffusion, using the following prompts:
- Top right: "Still life DSLR photo of a bowl of fruit"
- Bottom left: "The Eiffel Tower in the style of Starry Night"
- Bottom right: "An architectural sketch of a skyscraper"
The point of the six-by-six grid is to illustrate that Stable Diffusion's latent space is continuous: the software can draw not only an image of a dog or a bowl of fruit, but also images "between" a dog and a bowl of fruit. For example, the third image in the first row depicts a slightly fruity dog sitting on a blue plate.
Or look along the bottom row. As you move from left to right, the shape of the building gradually changes from the Eiffel Tower to a skyscraper, and the style changes from a Van Gogh painting to an architectural sketch.
The continuity of Stable Diffusion's latent space lets the software generate latent representations, and thus images, for concepts that never appear in its training data. There may be no image of the Eiffel Tower in the style of "Starry Night" in Stable Diffusion's training set. But there are many images of the Eiffel Tower, plus many images of "Starry Night." Stable Diffusion learns from both and can then produce images that blend the two concepts.
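The continuity argument can be made concrete with a small sketch. The two-element "latent vectors" below are invented stand-ins for two corners of the grid; a real blend would interpolate much larger tensors, but the arithmetic is the same in spirit.

```python
def lerp(a: list[float], b: list[float], t: float) -> list[float]:
    """Linearly interpolate between two latent vectors: t=0 returns a, t=1 returns b."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]

# Invented two-element stand-ins for two corners of the grid.
dog_latent = [1.0, 0.0]    # "golden retriever watercolor"
fruit_latent = [0.0, 1.0]  # "bowl of fruit"

# Six evenly spaced blends, like one row of the six-by-six grid.
row = [lerp(dog_latent, fruit_latent, i / 5) for i in range(6)]
# The third entry is mostly dog with some fruit mixed in: roughly [0.6, 0.4].
```

Every point along that row is a valid latent representation, even though most of them correspond to images no one has ever drawn.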
How Stable Diffusion was trained
How did Stable Diffusion learn to do this?
A novice painter might go to an art museum and try to make an exact replica of a famous painting. The first few tries won't be great, but she gets a little better each time. If she persists long enough, she will master the style and technique of the paintings she reproduces.
The process of training an image-generating network like Stable Diffusion is similar, except it happens at a much larger scale. The training process uses a pair of networks designed to first map an image into a latent space and then use only its latent representation to reproduce the original image.
Like a novice painter, the system initially does a poor job: the first images the network generates look like random noise. But after each image, the software scores its success or failure and adjusts its parameters to do slightly better on the next one.
The key point is that each training image should have only a small effect on the network's behavior. The network learns general characteristics of dogs, beaches, watercolors, and so on, but it should not learn how to reconstruct any particular training image. Learning to do that is called "overfitting," and network designers work hard to avoid it.
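The claim that each image only nudges the network can be sketched as a single parameter update. The learning rate below is an invented illustrative value, not Stable Diffusion's actual training configuration, and real training updates millions of parameters at once.

```python
LEARNING_RATE = 1e-4  # invented illustrative value

def training_step(params: list[float], gradient: list[float]) -> list[float]:
    """Nudge each parameter a tiny step against the error gradient."""
    return [p - LEARNING_RATE * g for p, g in zip(params, gradient)]

params = [0.5, -0.2]
updated = training_step(params, [1.0, -1.0])
# Each parameter moves by only 0.0001, so a single image barely changes the
# model. (Avoiding overfitting also requires a large, varied dataset and
# other safeguards; a small step size alone is not enough.)
```

Over hundreds of millions of such tiny steps, the network accumulates general knowledge about dogs and beaches without memorizing any one photograph, at least when training goes as intended.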
This matters because copyright law protects creative expression, not facts about the world. You can copyright a particular drawing of a dog, but you cannot copyright the fact that dogs have two eyes, four legs, a tail, and so on. A network that avoids overfitting therefore stands on firmer legal ground.
The case for fair use
In the mid-2000s, Google began scanning library books to build a book search engine. Authors responded by suing Google and its library partners for copyright infringement.
Google argued that its scanning was fair use, emphasizing that scanned books were never displayed to users in full. In two rulings, in 2014 and 2015, appeals courts sided with Google and its library partners. "The result of a word search is different in purpose, character, expression, meaning, and message from the page (and the book) from which it is drawn," the court wrote in the 2014 ruling.
Other copyright rulings point in the same direction. In 2009, another appeals court dismissed a copyright lawsuit against the anti-plagiarism service Turnitin. Students had sued, arguing that the company violated their copyrights by keeping copies of their papers without permission. The court disagreed, noting that Turnitin never published the students' papers and that the service was not a substitute for them.
In short, the law provides a great deal of leeway for what legal scholar Matthew Sag calls non-expressive use of copyrighted works: uses in which the works are "read" only by computer programs, not by humans.
Stability AI has yet to respond to the lawsuits, but experts I spoke to expect the company to compare Stable Diffusion to services like Google Book Search and Turnitin. It might point out that training images are only "viewed" by computer programs, not by humans. Some experts, including Sag, believe this should be a winning argument for Stability AI.
I'm not so sure. A key premise of the "non-expressive use" defense is that Stable Diffusion learns only uncopyrightable facts from its training images, not creative expression. That is mostly true, but not entirely, and the exceptions could significantly complicate Stability AI's legal defense.
Not just Stable Diffusion
Stable Diffusion is an open source product that has been incorporated into other image generation tools, including Midjourney. Midjourney is also named as a defendant in a class action lawsuit against Stability AI.
OpenAI and Microsoft are also facing lawsuits over GitHub Copilot, a code-completion AI powered by OpenAI's Codex, a descendant of GPT-3. It may only be a matter of time before these companies face lawsuits over the use of copyrighted text to train ChatGPT and GPT-4. OpenAI's DALL-E, Google's Bard, and other generative AI systems could also be vulnerable to lawsuits if plaintiffs can prove they were trained on copyrighted material.
If the plaintiffs win, which seems like a real possibility, it would throw the nascent industry into turmoil. Many (maybe even most) companies that offer generative image and language models might be forced to shut them down. Companies would scramble to assemble public-domain and licensed datasets.
Big players like Google, Microsoft, and Meta would have an inherent advantage here. Not only do they have the cash to sign licensing deals with major copyright holders like Getty, but they may also have their users' permission to train models on user data.
I think the long-term outcome would be further consolidation around these big tech companies. Some of them are already at the forefront of this emerging technology thanks to heavy R&D spending. But today they face competition from challengers like Stability AI, a startup that successfully trained Stable Diffusion for about $6 million.
But if companies lose these lawsuits, the cost of training cutting-edge models would rise sharply, and it might become practically impossible for a new company to train a model that competes with an incumbent's. That wouldn't spell the end of AI startups: big companies could license their models to smaller ones. But it would represent a sea change in the structure of the industry.