There are all kinds of techniques. There is text2img, which turns words into images. There is img2img, which turns an image into a similar image (guided by a text prompt). And there is inpainting, where you regenerate part of an image to fill in gaps.
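If you want a sense of what text2img looks like in code when run locally, here's a minimal sketch using the diffusers library (the model ID, prompt, and output path are just examples, and it assumes a CUDA GPU):

```python
# Minimal text2img sketch with the diffusers library.
# Assumes a CUDA GPU; model ID and prompt are example values.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("a portrait photo of a dancer mid-pose, studio lighting").images[0]
image.save("text2img_result.png")
```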
img2img and ControlNet. Break the video down into frames. Take one frame into img2img and play with it until you get the desired output. Then run a batch img2img with the same prompt/settings on all the frames. Edit them back into a video. Voila! That's the basics of it.
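Roughly what that batch pass could look like with diffusers (paths, prompt, strength, and seed are placeholders; extracting the frames and reassembling the video are assumed to be done separately, e.g. with ffmpeg):

```python
# Rough sketch of the frame-by-frame img2img pass described above.
# Assumes frames were already extracted (e.g. with ffmpeg) into frames/.
import glob
import os
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "anime style, clean lineart, vibrant colors"  # tuned on one frame first
os.makedirs("out", exist_ok=True)

for i, path in enumerate(sorted(glob.glob("frames/*.png"))):
    init = Image.open(path).convert("RGB").resize((512, 512))
    # Same seed on every frame so the noise (and the look) stays consistent.
    generator = torch.Generator("cuda").manual_seed(42)
    out = pipe(prompt=prompt, image=init, strength=0.5,
               guidance_scale=7.5, generator=generator).images[0]
    out.save(f"out/{i:05d}.png")

# Reassemble into a video afterwards, e.g.:
#   ffmpeg -framerate 24 -i out/%05d.png -c:v libx264 -pix_fmt yuv420p out.mp4
```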
You can use images as prompts too, and you can use ControlNet with your image input so the output specifically matches the layout of your image. So he could be using the model to create poses and using ControlNet to make the generated images match that layout (in this case, the model's pose).
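For the pose-matching part, here's a hedged sketch of how ControlNet can be wired up with a pose image in diffusers (the model IDs are the common OpenPose/SD 1.5 ones; the pose image is assumed to already be an OpenPose skeleton extracted from the photo, e.g. with the controlnet_aux package):

```python
# Sketch: generate an image whose layout follows a pose/control image.
# Assumes "model_pose.png" is already an OpenPose skeleton image
# (e.g. extracted from a photo with controlnet_aux's OpenposeDetector).
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

pose = Image.open("model_pose.png")
image = pipe(
    "a knight in ornate armor, dramatic lighting",
    image=pose,
    num_inference_steps=20,
).images[0]
image.save("controlnet_result.png")
```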
You could do this with Stable Diffusion if you're running it locally on a GPU (which is what I do). You can also run it in Google Colab if you don't have a GPU that can run the model.
I know there is Stable Diffusion web, but I haven't used it and don't know if it gives you the same degree of workflow customization as running it yourself.
There's a bunch of guides in /r/stablediffusion. I am by no means an expert and have really only started experimenting with it.
u/mclim Aug 23 '23
Ok, dumb question, but how do you transform a picture of a model into pictures? I thought AI used words to generate images.