Can anyone with access to it test it out on the specific example in the post (fursona ref sheets) and show examples? If it is actually able to generate those with high accuracy, that would seem to be an unbelievable improvement.
Haven't tested it on that, but I did try it with an anime character who might be somewhat similar: Yoshida Yuuko, a.k.a. Shamiko. Given that she's a very specific sort of demon girl, it was completely impossible to get anywhere close to her appearance with any pre-existing model.
I took ten high-quality pictures off Pixiv, spent five minutes cropping them, and ran Dreambooth overnight on a 3090. That turned out to be enormous overkill; the final model was massively overtrained, though the intermediate checkpoints were fine.
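For anyone curious how to spot that, comparing the intermediate checkpoints against each other is roughly this kind of loop. This is a minimal sketch using the Hugging Face diffusers API, assuming each checkpoint was exported as a full diffusers pipeline (the paths, instance token, and prompt below are placeholders, not the exact ones I used):

```python
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical checkpoint directories saved at different points during training.
checkpoints = [
    "dreambooth-out/checkpoint-800",
    "dreambooth-out/checkpoint-1600",
    "dreambooth-out/final",
]

# Fixed prompt and seed so the only variable is how far training went.
prompt = "a painting of sks girl running through a forest"
seed = 1234

for ckpt in checkpoints:
    pipe = StableDiffusionPipeline.from_pretrained(ckpt, torch_dtype=torch.float16).to("cuda")
    generator = torch.Generator("cuda").manual_seed(seed)
    image = pipe(prompt, generator=generator, num_inference_steps=50, guidance_scale=7.5).images[0]
    image.save(f"{ckpt.replace('/', '_')}.png")
```

If the later checkpoints start reproducing the training crops near-verbatim or lose prompt flexibility, that's the overtraining showing up.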
The output is perfect, even higher quality than what the base model was already good at. This remains true even if I wander well outside the scope of the input pictures, e.g. different art styles, or Shamiko in a spacesuit wandering Mars, or...
How about Shamiko on a highly improbable forest run? A completely different art style / "medium", in a completely different pose from any of the input images. Mind you, it's lower quality; that's mostly because I didn't do any cleanup or img2img work.
It's still not as good as the real thing. She's missing her tail in both of these pictures (though she had it in others), but even if I filtered for that, the tail lacks the expressivity it usually has. Shamiko's an emotional girl, and a lot of that shows in tail behaviour...
I doubt I need to tell you that there is no way to say "Jealous Shamiko, with tail curled protectively around Momo". Not yet, at any rate.
>Jealous Shamiko, with tail curled protectively around Momo
Why can't this be done? It also seems like an object-composition problem, so assuming the model has some concept of "Momo" like it does for "forest", it would seem to be possible? Is this just a limitation of the Dreambooth finetuning process?
Also, if you happen to have more samples of the Dreambooth output, could you share? (I want to see that Shamiko in a spacesuit wandering Mars...)
I'm interested to know how well diffusion models can generalize across style (I guess this could be tested by keeping a fixed prompt and varying the initial seed state). Have diffusion models successfully learned some latent space for "style"? This doesn't matter for real-world objects since all apples look pretty much the same, but it matters a lot for 2D art where artists usually have a unique style (which I guess would be defined by proportions, palette, etc.).
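Concretely, what I mean by that test, sketched with the Hugging Face diffusers API (the model name and prompt are placeholders I made up):

```python
import torch
from diffusers import StableDiffusionPipeline

# Placeholder model; any Stable Diffusion checkpoint in diffusers format works.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a girl standing in a forest, watercolor style"  # fixed prompt

# Same prompt, different initial noise: shows how much variation comes from
# the seed alone while the "style" part of the prompt is held constant.
for seed in [0, 1, 2, 3]:
    generator = torch.Generator("cuda").manual_seed(seed)
    image = pipe(prompt, generator=generator).images[0]
    image.save(f"seed_{seed}.png")
```

The complementary test would be to hold the seed fixed and swap only the trailing style keywords ("watercolor", "oil painting", "pixel art") to see whether style is disentangled from content.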
It is an object-composition problem, and the AI _can’t usually handle those_.
The tail is also highly stylised, frequently taking on completely inorganic poses such as “jagged Pikachu-style shock/surprise line”.
One or the other of those might be manageable, at least by generating fifty pictures and picking the best. Both in combination mean human input is absolutely necessary. Which doesn’t make the AI useless, by any means; it’s fully capable of acting as a collaborator.
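For what it’s worth, the “generate fifty and pick the best” step is cheap to script; a rough sketch with the diffusers API (the model path and prompt here are placeholders):

```python
import os
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/finetuned-model", torch_dtype=torch.float16  # placeholder path
).to("cuda")

prompt = "sks girl running through a forest, detailed background"  # placeholder

# Generate a batch of candidates and dump them to disk for a human to sort through.
os.makedirs("candidates", exist_ok=True)
for i in range(50):
    generator = torch.Generator("cuda").manual_seed(i)
    pipe(prompt, generator=generator).images[0].save(f"candidates/{i:03d}.png")
```

The human-in-the-loop part is the selection and any img2img touch-up afterwards; the batch itself is just GPU time.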
That definitely is impressive, and thanks for pointing out that it still misses key features! I will say, though, that those anime styles, for lack of a better term, seem to render very well with an AI. It struggles much more with less simplified or less flat styles (not to say this style is lesser in any way).
So this seems to be more of an incremental improvement, from a layman's view. I'm sure it's technologically impressive, but it looks like it won't be generating ref sheets. Thank you for testing!