This would have been an epic release two years ago, but there are now many well-established models in this area (DALL-E, Midjourney, Stable Diffusion). It would be great to see some comparisons or benchmarks to show Imagen 2 is a better alternative. As it stands, it's hard for me to tell if this is worth switching to.
SDXL still has gaps when it comes to rendering text, but Stability AI seems to do a better job with Deep Floyd ( https://github.com/deep-floyd/IF ). I've done a lot of interesting text rendering with Deep Floyd.
Deep Floyd is a pixel diffusion model that doesn't use a latent-space encoding, hence the memory requirements. Also, good prompt understanding requires a large transformer for text encoding, usually far larger than the image generation part; Deep Floyd IF uses T5 as its text encoder.
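To put rough numbers on "the text encoder is far larger than the image model": a quick back-of-the-envelope sketch, using approximate parameter counts from the respective model cards (SD 1.5's CLIP ViT-L text encoder vs. its UNet, and IF-I-XL's T5-XXL encoder vs. its UNet):

```python
# Rough fp16 memory footprint of the text encoder vs. the image model.
# Parameter counts are approximate, taken from the model cards.
MODELS = {
    "SD 1.5":            {"text_encoder": 0.123e9, "image_model": 0.86e9},  # CLIP ViT-L vs UNet
    "DeepFloyd IF-I-XL": {"text_encoder": 4.76e9,  "image_model": 4.3e9},   # T5-XXL (encoder) vs UNet
}

def fp16_gb(params: float) -> float:
    """Weights-only memory in GiB at 2 bytes per fp16 parameter."""
    return params * 2 / 1024**3

for name, parts in MODELS.items():
    te = fp16_gb(parts["text_encoder"])
    im = fp16_gb(parts["image_model"])
    print(f"{name}: text encoder ~{te:.1f} GiB, image model ~{im:.1f} GiB")
```

In SD the text encoder is a small fraction of the whole thing; in IF it's actually bigger than the UNet, which is where a lot of the prompt-following ability (and the memory bill) comes from. This counts weights only, not activations or the upscaler stages.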
You can use Harrlogos XL to produce text with SDXL, although it's mostly limited to short captions and logos. The other route, ControlNets, is more involved (but actually useful).
Yeah, Stable Diffusion has very limited understanding of composition instructions. You can reliably get things drawn, but it's super hard to get a specific thing in a specific place (e.g. "a man with blonde hair next to a girl with black hair" will assign hair color more or less randomly, and there's no guarantee how many people will be in the picture). Regional prompting and ControlNet help somewhat, but regional prompting is very unreliable and ControlNet is, well, not text-to-image.
Right? This page looks like basically every other generative image AI announcement page, and every model page too. They show a bunch of cherry-picked examples that are still only "pretty good" (relative to the rest of the industry; it's incredible tech compared to something like DeepDream) and give you nothing to really differentiate it.
I was going to state pretty much the same obvious thing, and add insult to injury: with the announcements of the last few weeks, it seems that Google desperately needs to shine in the world of AI but fails to do so (despite 2000+ votes for the new Bard, which is still not so good).
Now, from a designer's perspective, honestly, I don't care too much who the provider of the image is, since one will have to work more on it anyway. So designers, illustrators, etc. are not the target for such platforms, even though that seems counter-intuitive. If you ask me which system was the source for an image used on a poster in the last 12 months... well, I may remember, but it's not of paramount importance to the end result. After a year of active use of DALL-E 2/3, SDXL, and Midjourney (which is also SD of some sort), I can confidently state that there is much more work to do, and a lot of prompting, to actually get something unique and worth using. Sadly, the time taken is comparable to working with an actual artist. Of course, the latter is likely to be hit by this new innovation, but perhaps not so much.
From the perspective of someone integrating text-to-image — which is yet to be seen done in a reasonable manner, e.g. for a quest game with generative images — API flexibility and cost would be the most important qualifiers. Even then, it may actually be better to run SD/SDXL yourself. From a cost perspective, all these services are still very pricey for anything more serious than a few one-shot images.