Hi, this is Wayne again, with the topic "DALL·E: AI Made This Thumbnail!".
What if I told you there's a system right now that can take natural language input? So, whatever description you want, just make something up, and it will take that text and turn it into a surprisingly realistic image of exactly what you described. You type "an astronaut riding a horse" and it spits out a brand new image of an astronaut riding a horse. You type "teddy bears shopping for groceries" and boom, there's an image of teddy bears shopping for groceries. You type "a bowl of soup that is a portal to another dimension" and boom, my god, it's a bowl of soup that's a portal to another dimension. And it's not just one; it actually spits out ten different versions, across a spectrum of variation, in any art style.
Whatever you want, you name it, and it can draw it. So what is happening here? How does it work? And what happens if I try? So, first things first: yes, this does exist. This is a real thing. It's called DALL·E 2, and it's an AI research project by a company called OpenAI, one of the many Elon Musk co-founded companies at this point. And so the purpose of this AI, specifically, is to create original, realistic images and art from a text description. This is Aditya Ramesh, a researcher and co-creator of DALL·E 1 and DALL·E 2; he's easily the most qualified person to explain what's happening here. "So the way DALL·E 1 generates images is, it generates an image starting from the top left and moving in raster order, row by row. Diffusion works completely differently."
"The way diffusion works is, we train a model to reverse a corruption process that's applied to clean images." So it's kind of hard to wrap your head around, but basically, there are two main AI technologies behind DALL·E 2; they're called CLIP and diffusion. CLIP is the part that's matching images to text, and it basically uses that matching to train the computer to understand the concepts in images, so it can generate new images of those same concepts. So when I asked it for an astronaut riding a horse, for example, it's not just making a mosaic of images it found online. It knows the idea of what an astronaut is. It knows what the concept of riding means. It knows what a horse is, and, maybe most impressively, it knows what an aesthetically pleasing image to humans looks like. So then it can create a completely new visual version of this idea, one that hasn't existed before. Now, CLIP doesn't really have the ability to do the pretty high-resolution images all by itself; it's more about generating the gist of an image based on those concepts. The sketch below gives a rough feel for the matching half.
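This is just a minimal sketch of CLIP-style text-image scoring, using OpenAI's open-source CLIP package rather than DALL·E 2's internal model (which isn't public); "photo.jpg" is a placeholder path for whatever image you want to test:

```python
# Sketch of CLIP-style matching with OpenAI's open-source CLIP release.
# Install with: pip install git+https://github.com/openai/CLIP.git
# "photo.jpg" is a placeholder; swap in any image you like.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
prompts = ["an astronaut riding a horse",
           "teddy bears shopping for groceries",
           "a bowl of soup that is a portal to another dimension"]
tokens = clip.tokenize(prompts).to(device)

with torch.no_grad():
    img_emb = model.encode_image(image)      # image -> shared embedding space
    txt_emb = model.encode_text(tokens)      # captions -> the same space
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    sims = (img_emb @ txt_emb.T).squeeze(0)  # cosine similarity per caption

for prompt, score in zip(prompts, sims.tolist()):
    print(f"{score:.3f}  {prompt}")
```

Whichever caption scores highest is the one CLIP thinks best matches the image; DALL·E 2 essentially runs that idea in reverse, using a text embedding as the target for generating an image.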
Producing the actual image is where diffusion comes in, and diffusion is super impressive. Basically, by teaching a computer to corrupt an image by adding Gaussian noise, it can then learn to un-corrupt, or enhance, an image by removing that noise. It's kind of like step one: draw the circle; step two: draw the rest of the owl. There's a toy version of that training idea right below.
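This is only a toy sketch of the core loop, assuming PyTorch; the random tensors stand in for real training images, and actual systems use large U-Net denoisers with carefully tuned noise schedules:

```python
# Toy diffusion training loop: corrupt clean data with Gaussian noise,
# then train a model to predict (and thus be able to remove) that noise.
import torch
import torch.nn as nn

T = 1000                                     # number of corruption steps
betas = torch.linspace(1e-4, 0.02, T)        # noise schedule
alphas_bar = torch.cumprod(1.0 - betas, 0)   # cumulative signal fraction

# Stand-in denoiser: a tiny MLP that predicts the added noise.
denoiser = nn.Sequential(nn.Linear(28 * 28 + 1, 256), nn.ReLU(),
                         nn.Linear(256, 28 * 28))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

for step in range(100):                      # fake data, just to show the loop
    x0 = torch.rand(64, 28 * 28)             # pretend these are clean images
    t = torch.randint(0, T, (64,))           # random corruption level per image
    noise = torch.randn_like(x0)
    ab = alphas_bar[t].unsqueeze(1)
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * noise     # forward: corrupt x0
    inp = torch.cat([xt, t.float().unsqueeze(1) / T], dim=1)
    loss = ((denoiser(inp) - noise) ** 2).mean()      # learn to predict noise
    opt.zero_grad(); loss.backward(); opt.step()
```

Once trained, generation is just running that denoising step over and over, starting from pure noise and ending at a clean image.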
Now, I don't know if you've ever seen the website thispersondoesnotexist.com, but if you haven't, you should check it out. It shows you a surprisingly realistic image of a face, but, as you might have guessed, this person does not exist. It's not a real face. It's actually using AI to look at thousands of faces and then generate a new face from that information, one that's shockingly realistic but, it turns out, is not a real human. So DALL·E 2 is like a way more advanced, generalized version of that, for anything. When you open it up, it's literally just a blank text box where you can type in whatever you want to create. Now, of course, as you can probably imagine, with all these concerns and possibilities, this isn't just a tool that's available to the public. It's not like anyone can use it.
OpenAI has kept this mostly behind closed doors, limited to a very small, hand-selected group of people. But for a day, they gave me the keys, and I was able to generate whatever I wanted, which of course means I had to ask it to finally reveal to us what the long-awaited Apple Car would look like. I mean, this is an opportunity unlike any other. So I typed it in, I waited my ten seconds with bated breath, and then the secret was finally unveiled. Oh. Right. Of course. I don't know why I expected anything different. But for real, okay, so the OpenAI team was kind enough to allow me to feed DALL·E 2 whatever I want, so I decided to start pretty simple and then get a little more complex as we go. So: a blue apple, and a bowl of oranges.
So, okay, these are good. These are actually... I mean, that was an extremely easy one, but the sharpness, the realism, the lighting even, just to create these brand new out of nothing. There is so much detail in this one, it's kind of hard to believe it isn't real. Okay: an elderly kangaroo.
I mean, I don't know what I expected an elderly kangaroo to look like, specifically. Maybe I pictured gray hair or something, but I buy it. The fact that, again, it's not a real photo, but it looks like a real photo of an elderly kangaroo, that is very impressive. A wise elephant staring at the moon at night. Whoa, okay, so that is definitely a wise elephant, he or she is in fact staring at the moon, and it is definitely at night. It's not bad. The moon does look a little bit wonky if you look a little closer; on some of these it's not perfect, but the elephant is very real-looking.
Okay, let's get a little more specific here: a teddy bear doing surgery on a grape, in the style of a 1990s cartoon. Oh my god, look at these cartoons. Sometimes it misses, which is totally understandable. Also, it seems to have chosen scissors instead of maybe more realistic actual surgery tools; I'll get to why in a minute. But the facial expressions, the feet, everything... I mean, that is a teddy bear doing surgery on a grape. All right, this one's for Mac, the studio dog: a Kooikerhondje (I'm pronouncing that wrong) using a camera on a movie set. Wow. That is... okay, if you couldn't already tell, that is the name of the dog breed. And the closer you inspect each individual image, the more the photorealism part kind of falls apart, which maybe isn't shocking, because this is kind of a crazy thing to have a picture of. But the detail in the dog breed, and it actually using the camera in the pictures, is crazy good.
I wonder, if we could post that to Mac's Instagram, whether anybody would notice that it's not a real picture. I'd probably figure it out. All right: a robot woman guarding a wall of computers. Wow, okay, so many interesting details and decisions being made in these images. The word "guarding" implies a bit of a pose, and there are a couple of different guarding poses here, which is cool. The computers, for the most part, are also pretty convincing if you don't zoom in too much. And it's interesting that none of the walls of computers go all the way up to the ceiling either. But that is definitely a robot woman guarding that wall of computers.
All right, what if we go: a tiger discovering the lost city of Atlantis. Wow, okay, these are more of an art style, probably because, one, there won't be any photorealistic reference images of the lost city of Atlantis, so I imagine it'll look better this way, and two, this is a crazy image to create. With each of these, they're great without zooming in and pixel-peeping, and they very much accomplish the goal of illustrating a tiger discovering Atlantis, like I asked. The crazy part here, though, to me, is how much imagination it's using; I'm actually getting more than what I asked for. The facial expressions, poses, orientation of things,
reflections, even the accurate lighting and shadows: it's crazy. I asked for a tiger discovering Atlantis here, but it decided to add trees and birds and a moon all by itself. All right, here we go, here we go: a painting, inspired by the Mona Lisa, of a goat taking pictures with an iPad. This is my new favorite thing. You can really just go off the rails with complexity, and it just gets them right. Almost all of these goats have hands too, which is hilarious, but the drawings themselves have actually also stayed true to the theme: it's a painting in the style of the Mona Lisa, and the tablets are all, you know, varying levels of convincing iPads. Wow. I'm going to put these all on Twitter, by the way, in one big thread, plus a few extras if they don't make it into the video, so definitely hit the link below if you want to see those.
But last, but not least: a cyclops riding a tractor listening to AirPods, in the style of The Simpsons. I mean, come on. Maybe it's not a perfect cyclops, and it is interesting that it's chosen over-ear headphones for all of the headphones and not, you know, AirPods earbuds, but I feel like there's nothing this can't do. This is one of those AI tools
that's so good, it almost brings up more questions than it answers. Like, why does a tool like this even exist in the first place? Well, DALL·E 2 is a research project, not a consumer product, and OpenAI's goal is to create good, safe, general AI, which is really hard. Like, there are a lot of really, really good task-specific AI systems that'll do things from detecting cancer in X-rays, to self-driving cars that navigate the streets, to just sharpening photos in Photoshop. But the whole general-AI thing, which needs a ton of information to be able to navigate a ton of different situations, is a whole other challenge. I mean, think a Tesla robot walking around the earth completing tasks for you; that's the level we're talking about here. And so being able to recognize objects in images and associate them very quickly and accurately is a big part of that. Now, are there things that DALL·E doesn't do well? Yes, actually, there are both some intentional and some unintentional shortcomings of DALL·E 2 as it exists right now. So, on the intended side, the library of images that DALL·E references is massive, but it doesn't have any images of adult content or illegal activity or violence, so it doesn't create images with that stuff in them. Makes sense. That's probably why we got scissors in the teddy bear's hand instead of a knife, because that was the next-closest association the AI was able to make for that surgery. And you also can't ask for imagery of specific identities of people. So you can ask for "a man robbing a bank", but you can't ask for "Marques Brownlee robbing a bank", as curious as I am about what type of image that would spit out.
You can't; that would be dangerous, for obvious reasons. But also, DALL·E 2 is known to have some quirks. One of them is that it doesn't do very well with variable binding, which is basically what can happen when you ask for the relative position of objects in an image. So if I ask for a red cube on top of a blue cube, it might just give you a blue cube on top of a red cube. And we actually saw this in one of the images I got back for "a blue apple and a bowl of oranges": right there, that's clearly an orange in a bowl of blue apples, which is kind of funny. It also, for whatever reason, doesn't do written text well. So, sometimes it can give you certain letters, but if you ask it for, like, a sign that says a certain word, it'll almost never actually give you that word. There's actually a pretty hilarious Twitter thread of someone asking DALL·E for signs with certain things on them, over and over again, just to see what random text it spits out, which is also pretty funny. But this is the type of stuff that they'll be working on for DALL·E 3 and future versions, as you can imagine. And it's funny: with every shortcoming they found, there was also an equally awesome accidental upside they discovered, too. Like, this diffusion method can also transform images. So you can take an existing image and run the model over and over to push it more and more towards any prompt you want. So you can take this plain jacket, for example, and slowly turn it into a Jackson Pollock painting, or take this picture of a cat and slowly turn it into a samurai master, or take a picture of a piece of tech and slowly un-modernize it, over and over; like, look what it does to this iPhone, it turns it back into an older and older phone. There's a rough sketch of that image-to-image trick right below.
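DALL·E 2's own editing features aren't publicly exposed, so, purely as an illustration, here's roughly how the same diffusion-based image-to-image idea looks with the open-source diffusers library and a Stable Diffusion checkpoint; the model ID and file names here are placeholders, not anything DALL·E-specific:

```python
# Image-to-image with diffusion via the open-source diffusers library.
# This is an analogous public tool, not DALL·E 2's actual API; swap in
# any compatible checkpoint and your own input/output file names.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",          # one public checkpoint
    torch_dtype=torch.float16).to("cuda")

init = Image.open("plain_jacket.jpg").convert("RGB").resize((512, 512))

# "strength" controls how far the model pushes the image toward the prompt:
# low values stay close to the original, high values transform it more.
result = pipe(prompt="a jacket painted like a Jackson Pollock canvas",
              image=init, strength=0.6, guidance_scale=7.5).images[0]
result.save("pollock_jacket.png")
```

Run it repeatedly, feeding each output back in as the next input, and you get exactly that "slowly turn it into" effect described above.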
The point is, it's modifying existing images based on other existing concepts, and it's kind of sick. So, is this going to be taking people's jobs? Well, lucky for you, if you want the exact answer to that question, that's literally the concept we tackled in the new studio video, so I'll link it below the like button if you want to watch it. But we pitted DALL·E 2 up against Tim, who is the graphic designer here at the MKBHD studio, and their jobs are kind of, basically, the same thing:
to turn the words coming out of my mouth into a good-looking image. Spoiler alert: if you give Tim enough time, he'll make something better. But in ten seconds, DALL·E is able to spit out a bunch of different variations, and while the images might be a bit fuzzy around the edges, or have weird text, or fall apart when you zoom in on faces or hands or objects, this tool, as presently constructed, is amazing for brainstorming ideas and concepts, things that would normally take much longer to create. It's truly an amazing side effect of the development of this AI that it's able to make a tool where the images it spits out aren't necessarily supposed to be finished, final pieces of work, but they are a great starting point for making stuff later. That's actually exactly what we did with this video's thumbnail, which started off as an image generated by DALL·E when it was told to make a robot hand drawing.
So I have no doubt that there will be versions of DALL·E in the future that make even higher-resolution and more photorealistic images, and then, even better, quick animations, and then video clips, and then whole movies even, all on our way to this general-AI goal that we're working towards. What a time to be alive. Thanks for watching. Catch you guys in the next one. Peace.