
This explanation is wrong, as I've already said (256 is not the result of any conversion to text), but no one has to take my word for it.

From the Gemini report: https://arxiv.org/abs/2312.11805

>The visual encoding of Gemini models is inspired by our own foundational work on Flamingo (Alayrac et al., 2022), CoCa (Yu et al., 2022a), and PaLI (Chen et al., 2022), with the important distinction that the models are multimodal from the beginning and can natively output images using discrete image tokens (Ramesh et al., 2021; Yu et al., 2022b).

These are the papers Google says Gemini's multimodality is based on.

Flamingo - https://arxiv.org/abs/2204.14198

PaLI - https://arxiv.org/abs/2209.06794

The images are encoded directly. The encoder turns each image into a sequence of image tokens, and the transformer is trained to predict text from both the text tokens and the image tokens.

There is no conversion to text for Gemini. That's not where the token number comes from.
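
To make that concrete, here's a minimal PyTorch sketch of the ViT-style patch embedding those papers build on. The image size, patch size, and embedding width below are illustrative assumptions, not Gemini's actual values; the point is that the image token count falls out of the encoder's geometry, not out of any conversion to text.

    import torch
    import torch.nn as nn

    class PatchEmbedder(nn.Module):
        """ViT-style patch embedding: an image becomes a fixed-length
        sequence of continuous 'image tokens', one per patch."""

        def __init__(self, image_size=256, patch_size=16, embed_dim=768):
            super().__init__()
            # A strided convolution slices the image into non-overlapping
            # patches and projects each patch to the embedding dimension.
            self.proj = nn.Conv2d(3, embed_dim,
                                  kernel_size=patch_size, stride=patch_size)
            self.num_tokens = (image_size // patch_size) ** 2  # 16 * 16 = 256

        def forward(self, images):                # (batch, 3, H, W)
            x = self.proj(images)                 # (batch, embed_dim, H/ps, W/ps)
            return x.flatten(2).transpose(1, 2)   # (batch, num_tokens, embed_dim)

    embedder = PatchEmbedder()
    image = torch.randn(1, 3, 256, 256)           # dummy 256x256 RGB image
    print(embedder(image).shape)                  # torch.Size([1, 256, 768])

The 256 image tokens here are fixed by the geometry (a 16x16 grid of patches), and they go into the transformer alongside the text tokens.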



Stewing so much you had to double-dip reply? Ouch.

As much as I would love to waste my time replying again to your magical thinking, instead I'll just politely chuckle and move on. Good luck.


>As much as I would love to waste my time replying again to your nonsense, instead I'll just politely chuckle and move on. Good luck.

You have your head so far up your ass that even direct confirmation from the model builders themselves won't sway you. The comment wasn't for you. It's linked sources for the original poster and for the curious.

You see, I don't have to hide behind a veneer of "Trust me bro, it works like this."


>even direct confirmation from the model builders themselves

Linking papers that you clearly haven't read and can't contextually apply -- as with the ViT or your misunderstanding of image tiling -- is not the sound strategy you hope it is. It doesn't confirm your claims.

I'm not asking anyone to "Trust me bro". So...have you called the Gemini Pro 1.5 API and tokenized an image or a video yet?

There is a certain element of this that is just spectacularly obvious to anyone who has spent even a moment of critical thought on it -- if they're so capable. Your claim is that a high-resolution image is tiled into a 16x16 array... and that the magic model can, at some later point, magically extract any and all details on demand, such as OCR, from that 16x16 grid. This betrays a fundamental ignorance of even the most basic of information theory.
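
To put rough numbers on that (the codebook size is an assumed example, not a figure I'm attributing to Gemini):

    import math

    # Back-of-the-envelope counting argument. Assumed, illustrative numbers.
    num_tokens = 16 * 16            # a 16x16 grid of discrete image tokens
    codebook_size = 8192            # assumed vocabulary of the image tokenizer

    bits_per_token = math.log2(codebook_size)    # 13 bits per token
    budget_bits = num_tokens * bits_per_token    # 256 * 13 = 3328 bits

    # Raw information in a modest 1024x1024 RGB image, before any compression:
    raw_bits = 1024 * 1024 * 3 * 8               # ~25 million bits

    print(f"token budget: {budget_bits:.0f} bits")
    print(f"raw image:    {raw_bits} bits ({raw_bits / budget_bits:.0f}x larger)")

A few thousand bits against tens of millions is the gap I'm talking about.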

Again, I would love to just block you and avoid the defensive insults you keep hurling, but this site doesn't have that feature. Stop replying to me, no matter how many more contextually nonsensical citations you think will save face. Thanks.


>So...have you called the Gemini Pro 1.5 API and tokenized an image or a video yet?

You continue to blow my mind. Have you... have you even used the Gemini Pro API before? You can't use the API to get the image tokens.
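
The closest the public API gets you is a total count, for example via the Python SDK (a sketch assuming the google-generativeai package; it returns a number, not the image tokens themselves):

    import google.generativeai as genai
    from PIL import Image

    # Sketch: count_tokens reports only a total for the request.
    # There is no endpoint that exposes the discrete image tokens.
    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-1.5-pro")

    image = Image.open("example.png")
    response = model.count_tokens(["Describe this image.", image])
    print(response.total_tokens)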

>This betrays a fundamental ignorance of even the most basic of information theory.

Wow, something else you don't understand. Go figure.



