3 Questions: How AI Image Generators Could Help Robots | MIT News


AI image generators, which create fantastical views at the intersection of dreams and reality, are bubbling all over the web. Their entertainment value is demonstrated by an ever-expanding trove of whimsical, random images serving as indirect portals to the brains of human designers. A simple text prompt delivers an almost instantaneous image, satisfying our primitive brains, which are wired for instant gratification.

Although it may seem nascent, the field of AI-generated art dates back to the 1960s, with early attempts that used rule-based symbolic approaches to create technical images. As models that untangle and parse words have grown more sophisticated, the explosion of generative art has sparked debate around copyright, misinformation and bias, all mired in hype and controversy. Yilun Du, a doctoral student in the Department of Electrical Engineering and Computer Science and an affiliate of MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), recently developed a new method that makes models like DALL-E 2 more creative and gives them a better understanding of the scene. Here, Du describes how these models work, whether this technical infrastructure can be applied to other fields, and how we draw the line between AI and human creativity.

Q: AI-generated images use so-called “stable diffusion” models to turn words into stunning images in just moments. But for every image used, there is usually a human behind it. So what is the line between AI and human creativity? How do these models actually work?

A: Imagine all the images you could get on Google Search and their associated captions. This is the diet these models are fed. They are trained on all these images and their captions to generate images similar to the billions of images they have seen on the internet.

Let’s say a model has seen a lot of pictures of dogs. It’s trained so that when it is given a similar text prompt like “dog”, it can generate a photo that looks a lot like the many dog images it has already seen. Now, more methodologically, how this all works goes back to a very old class of models called “energy-based models”, originating in the 1970s or 1980s.

In energy-based models, an energy landscape over images is constructed, which is used to simulate physical dissipation to generate images. When you drop a dot of ink into water and it dissipates, for example, at the end you just get a uniform texture. But if you try to reverse this dissipation process, you gradually recover the original dot of ink in the water. Or let’s say you have a very complex tower of blocks, and if you hit it with a ball, it collapses into a pile of blocks. This pile is very messy, and there is not really any structure to it. To resurrect the tower, you can try reversing the collapse to regenerate your original tower of blocks.

These generative models generate images in a very similar way. Initially you have a very nice image, which is gradually dissolved into random noise, and the model basically learns to simulate the reverse of that process: it starts from random noise and goes back to the original image, iteratively refining the image to make it more and more realistic.
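The noising-and-reversal idea can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not the method behind DALL-E 2 or any production system: the “images” are single numbers clustered around a made-up value, and the denoising direction (the hypothetical `score` function) comes from an exact formula that exists only because the toy data is Gaussian. In a real system, a trained neural network would supply that direction. All names and constants below are invented for the example.

```python
import math
import random

# Toy "dataset": 1-D points clustered around DATA_MEAN (a stand-in for images).
# These constants are assumptions chosen purely for illustration.
DATA_MEAN = 2.0
DATA_STD = 0.5
MAX_NOISE = 10.0  # noise level at which a sample looks like pure noise


def add_noise(x0, noise_level, rng):
    """Forward process: corrupt a clean sample with Gaussian noise."""
    return x0 + noise_level * rng.gauss(0.0, 1.0)


def score(x, noise_level):
    """Direction that makes a noisy sample more likely at this noise level.

    Exact here only because the toy data is Gaussian; a real model would
    replace this function with a trained neural network.
    """
    variance = DATA_STD ** 2 + noise_level ** 2
    return -(x - DATA_MEAN) / variance


def generate(rng, steps=1000):
    """Reverse process: start from noise and refine it step by step."""
    x = rng.gauss(0.0, MAX_NOISE)  # begin with (approximately) pure noise
    levels = [MAX_NOISE * (1 - i / steps) for i in range(steps + 1)]
    for current, nxt in zip(levels, levels[1:]):
        # Euler step of the ODE dx/d(noise) = -noise * score, run from
        # high noise down to zero noise.
        x += (nxt - current) * (-current * score(x, current))
    return x
```

Calling `generate(random.Random(0))` repeatedly should return values clustered near `DATA_MEAN`, even though every run starts from widely scattered noise, while `add_noise` shows the opposite direction, in which a clean sample is blurred away.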

When it comes to the line between AI and human creativity, you can say that these models are really trained on people’s creativity. The internet offers all types of paintings and images that people have created in the past. These models are trained to recapitulate and generate the images that have been on the internet. As a result, these models are more like crystallizations of what people have spent their creativity on for hundreds of years.

At the same time, because these models are trained on what humans have designed, they can generate works of art very similar to what humans have done in the past. They can find patterns in the art that people have created, but it’s much harder for these models to generate creative images on their own.

If you try to enter a prompt like “abstract art” or “unique art” or similar, the model doesn’t really understand the creative aspect of human art. Rather, the models recapitulate what people have done in the past, so to speak, instead of generating fundamentally new and creative art.

Since these models are trained on large swaths of images from the internet, many of these images are likely copyrighted. You don’t know exactly what the model is retrieving when it generates new images, so there’s a big question as to how you can even tell whether the model is using copyrighted images. If the model depends, in some sense, on some copyrighted images, then are those new images copyrighted? This is another question to address.

Q: Do you believe that the images generated by diffusion models encode some kind of understanding of the natural or physical worlds, dynamically or geometrically? Are there efforts to “teach” image generators the basics of the universe that babies learn so early?

A: Do they encode some understanding of the natural and physical worlds? I definitely think so. If you ask a model to generate a stable configuration of blocks, it definitely generates a stable configuration. If you tell it to generate an unstable configuration of blocks, the result does look very unstable. Or if you say “a tree next to a lake,” it’s pretty much able to generate that.

In a sense, it seems that these models have captured a broad aspect of common sense. But the problem that still keeps us very far from really understanding the natural and physical world is that when you try to generate infrequent combinations of words, ones that you or I can very easily imagine in our minds, these models cannot.

For example, if you say “put a fork on a plate,” that happens all the time. If you ask the model to generate this, it easily can. If you say “put a plate on a fork,” again, it’s very easy for us to imagine what that would look like. But if you put that into one of these large models, you’ll never get a plate on a fork. Instead, you get a fork on a plate, because the models learn to recapitulate all the images they’ve been trained on. They can’t really generalize well to combinations of words they haven’t seen.

A fairly well-known example is an astronaut on horseback, which the model can easily do. But if you say a horse rides an astronaut, that still generates a person riding a horse. It seems that these models capture a lot of correlations in the datasets they are trained on, but they don’t actually capture the underlying causal mechanisms of the world.

Another commonly used example is a very complicated text description, such as one object to the right of another, a third object in front, and a third or fourth one flying. The model is really only able to satisfy maybe one or two of the objects. This could be partly due to the training data, as it is rare to have very complicated captions. But it could also suggest that these models are not very structured. You can imagine that if you get very complicated natural language prompts, there is no way the model can accurately represent all the details of the components.

Q: You recently developed a new method that uses multiple models to create more complex images with a better understanding of generative art. Are there potential applications of this framework outside of the image or text domains?

A: We were really inspired by one of the limitations of these models. When you give these models very complicated scene descriptions, they are not able to correctly generate images that match them.

One thought is that, since this is a single model with a fixed computation graph, meaning you can only use a fixed amount of computation to generate an image, there is no way to use more computing power when you get an extremely complicated prompt.

If I gave a human a description of a scene that was, say, 100 lines long versus a scene that was one line long, a human artist could spend a lot more time on the former. These models don’t really have the sensibility to do that. So we’re proposing that, with very complicated prompts, you could actually compose many different independent models together and have each individual model represent a part of the scene you want to depict.
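One way to picture this kind of composition is to give each model its own “energy” that is low when its part of the description is satisfied, then add the energies together and refine a sample against the sum. The Python sketch below illustrates that idea under stated assumptions and is not Du’s implementation: the “scene” is a single 2-D point, the two hypothetical models (`grad_on_right`, `grad_at_bottom`) are hand-written formulas instead of trained networks, and combining the models by summing their gradients is itself an assumption made for this example.

```python
import math
import random


def grad_on_right(point):
    """Hypothetical model A: energy (x - 1)^2, low when x is near +1."""
    x, _ = point
    return (2.0 * (x - 1.0), 0.0)


def grad_at_bottom(point):
    """Hypothetical model B: energy (y + 1)^2, low when y is near -1."""
    _, y = point
    return (0.0, 2.0 * (y + 1.0))


def compose(*grads):
    """Combine models by summing their energy gradients.

    Adding energies corresponds to asking for a sample that every
    individual model considers likely at the same time.
    """
    def total(point):
        parts = [g(point) for g in grads]
        return (sum(p[0] for p in parts), sum(p[1] for p in parts))
    return total


def refine(grad, rng, steps=500, step_size=0.05, temperature=0.0):
    """Start from a random point and move downhill on the energy.

    With temperature > 0, Gaussian noise is added at each step
    (Langevin-style sampling); with 0 this is plain gradient descent.
    """
    point = (rng.gauss(0.0, 3.0), rng.gauss(0.0, 3.0))
    noise_scale = math.sqrt(2.0 * step_size * temperature)
    for _ in range(steps):
        gx, gy = grad(point)
        point = (
            point[0] - step_size * gx + noise_scale * rng.gauss(0.0, 1.0),
            point[1] - step_size * gy + noise_scale * rng.gauss(0.0, 1.0),
        )
    return point
```

Refining against `compose(grad_on_right, grad_at_bottom)` settles near the point (1, -1), which satisfies both constraints, whereas using `grad_on_right` alone fixes only the x-coordinate and leaves y wherever the random start put it. Adding a third constraint means adding a third function, not retraining a single larger model.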

We find that this allows our model to generate more complicated scenes, or scenes in which the different aspects are rendered together more accurately. Moreover, this approach can be applied generally across a variety of fields. While image generation is probably the most successful application currently, generative models have actually seen all types of applications in a variety of fields. You can use them to generate different robot behaviors, synthesize 3D shapes, provide better scene understanding, or design new materials. You could potentially compose multiple desired factors to generate the exact material you need for a particular application.

One thing that interests us a lot is robotics. In the same way that you can generate different images, you can also generate different robot trajectories (the path and the timeline), and by composing different models together, you can generate trajectories with different combinations of skills. If you have natural language specifications for jumping and for obstacle avoidance, you can also compose those models together and then generate robot trajectories that both jump and avoid an obstacle.

Similarly, if we want to design proteins, we can specify different functions or aspects with language-like descriptions, such as the type or functionality of the protein, analogous to how we use language to specify the content of images. We could then compose them together to generate new proteins that can potentially perform all of these given functions.

We also explored the use of diffusion models for 3D shape generation, where you can use this approach to generate and design 3D assets. Normally, designing 3D assets is a very complicated and laborious process. By composing different models together, it becomes much easier to generate shapes from requests like “I want a four-legged 3D shape, with this style and height”, potentially automating parts of 3D asset design.