Is the format used in the examples the same that's used in the function calling ...

canyon289 · on March 27, 2025

We feel this model excels at instructability which is why we're recommending bringing your own prompt! Benchmark wise you can see this performance from BFCL directly, they (independently) ran their eval using their prompted format the larger Gemma models performed quite well if you ask me.

Specifically though I want to thank you for leaving a comment. We're reading all this feedback and its informing what we can do next to reduce frustration and create the best model experience the community

jampekka · on March 27, 2025

Do you mean that the exact prompt for tool use shouldn't matter? Has this been tested? Is the tool use trained with a variety of prompt styles?

I would imagine training with a specific, perhaps structured, prompt could make the function calling a bit more robust.

canyon289 · on March 27, 2025

Ah I see where confusion might come.

I don't mean the exact prompt shouldn't matter, but I am saying that we noticed that these series of models picked up on tool call format quite readily in our various tests, which is what we express in the docs. We tested internally and I hope the independent BFCL results speak for themselves! All their code and evals are public fully public.

> I would imagine training with a specific, perhaps structured, prompt could make the function calling a bit more robust.

This is absolutely true. I show this in a tutorial last year where Gemma2 is finetuned for a specific format, and with some targeted SFT it produces a json output more readily. https://www.youtube.com/watch?v=YxhzozLH1Dk

So this is all to say, Gemma is designed to be a great model for multiple types of users. If you want to use the "out of the box" weights with your own format, go ahead! We hope that makes it easier to integrate with whatever tooling you're using with minimal headache.

If you need specific performance on your bespoke format finetune the model to be your own! Finetuning is supported across many frameworks so pick whatever library you like best.

This is all to say we hope Gemma is flexible and usable for folks like yourself along a variety of dimensions. For myself I'm learning there's big interest in a specific prompt. Again can't thank you enough for the feedback here.

troupo · on March 27, 2025

> We feel this model excels at instructability which is why we're recommending bringing your own prompt!

Sigh Taps the sign:

--- start quote ---

To put it succinctly, prompt engineering is nothing but an attempt to reverse-engineer a non-deterministic black box for which any of the parameters below are unknown:

- training set

- weights

- constraints on the model

- layers between you and the model that transform both your input and the model's output that can change at any time availability of compute for your specific query

- and definitely some more details I haven't thought of

"Prompt engineers" will tell you that some specific ways of prompting some specific models will result in a "better result"... without any criteria for what a "better result" might signify.

https://dmitriid.com/prompting-llms-is-not-engineering

--- end quote ---

canyon289 · on March 27, 2025

With open models this isn't as true. The weights are local, you bring your own compute, there's nothing between you and the model. Regarding what is a better result, personally I encourage you to define what a better result is in an evalset and then optimize against that. Agree that having no criteria is not a great situation to be in.

This was the main point in a tutorial I did about a month ago now showing how to make a simple AI app using gemma, though the principles hold for any LLM.

https://www.youtube.com/live/9zM_93mYdu8

Hope this helps!

troupo · on March 27, 2025

Even then the "You MUST" and "You SHOULD NOT" are just magical incantations that may (and will) randomly fail.

nyrikki · on March 27, 2025

In the field of AI wishful mnemonics is the rule, arguments about prescription are of limited value.

CoT scratch space extends LLMs from DLOGTIME-UNIFORM TC0 to PTIME with polynomial sized scratch space.

https://arxiv.org/abs/2502.02393

Yes prompt engineering is probably better though of as prompt augmentation or stearing.

But the systems identification problem and Rice's theorm rigorously debunk the above links core claims.

It is a craft that can improve domain specificity and usefulness.

All models are wrong (even formalized engineering ones), but some are useful.

The price one has to pay when resorting to what is fundamentally compression as PAC Learning is, that it is fundamentally unstable under perturbations.

You are basically searching through a hay stack with a magnet, and making sure that at least one of the needles you find is the correct one is a symantic property. Guiding the approximate retrieval process to improve your results will always be a craft.

The snake oil is mostly on the side that claims that unrestricted natural language is a possibility. We still only have NLP, and true human level NLU is still thought to be beyond the limits of computation IMHO.

Thus prompt augmentation is a consequence of the argument that link was trying to make.