I’ve been using Mistral and Code Llama to generate large volumes of code recently, and I have to say…
These small models just suck compared to the larger ones.
I get it, they’re quick and cheap (ha! relatively) to make, and good for research and fine tuning…
…but can anyone here speak authoritatively on fine tuning and getting good results out?
I’ve been super disappointed by how bad even the q6 13B Code Llama model is at generating consistent code longer than about 30 lines (i.e. code that even compiles, never mind doing what you asked).
These smaller models seem good for a line or two, maybe, but gosh… it’s an effort to do anything useful with them out of the box.
Carefully crafted prompt.
Tests, hand written.
Iterate: prompt, compile, run tests, generate code metrics. Accept code that passes the tests and beats the target threshold.
You’re looking at like 7-10 iterations per prompt to get anything usable for simple (< 30 lines) functions, and maybe no candidates even after 30 iterations for longer, more complex requests.
Are people just using this for 5 line code snippets and autocomplete?
Or is there a way to get better results by fine tuning?
Absolutely not victim blaming, but how do you use it? The biggest issue with local LLMs is prompt structure (i.e. how you feed in the instructions, not how you phrase them). If you deviate even a little from the format the model was trained on, you’ll get terrible results. I’ve been using CodeLlama 13B + ollama + Continue and I’ll be honest, it’s almost on par with GPT-3.5 for my stuff. It’s been amazing as a pair programmer. It’s better to make a draft and bounce ideas off it than to ask it to start from scratch. Long story short: try ollama + Continue. If you’re using llama.cpp by itself, chances are you’ll get bad results.
> Absolutely not victim blaming, but how do you use it?
It sounds like OP is trying to replace junior/mid-level SWEs with CodeLLMs where a detailed description of the desired solution goes in, and working code comes out - all hands-off.
If there ever comes a time when LLMs can consistently achieve what OP wants, there will be a reckoning in software engineering. It's not like junior SWEs aren't already having a hard time in the current hiring environment.
That feels like damning with faint praise: we encourage Louie.ai users to use only GPT4+ level models for code gen tasks. Even GPT4 still has a long way to go, so saying other models are only around 3.5 for this task isn't great. I'm hopeful for StarCoder etc., but it's still not there yet afaict...
Agreed on prompts. We are doing a lot to guide it, including autorepair loops. Keeping the interaction model to generating small pieces of code likewise improves the chance of any individual step being right and repairable.
Yes, 16GB of RAM is needed for 13B and 32GB for 34B (both at 4-bit). The first time it loads a new model there's some warm-up time, I wanna say 30s? After that, context reading and token generation are usually upward of 8 tk/s. Also, the newer and bigger the die, the faster the token generation; a Mac Studio would probably generate 30% or so faster than a MBP.
I've managed to run Codellama instruct 13b with my laptop's RTX 3070 (8gb VRAM) at 6tk/s by offloading 27 layers into the GPU with llama.cpp
I've been considering getting a MacBook for running 34B+ LLM inference, but with the speed at which small LLMs are progressing, I think it's better to get a laptop with an RTX 4090 and 16GB VRAM. Maybe it can run 34B models by offloading layers onto the GPU.
My Mac only has 16GB, so I can't confirm 34B performance on it. I do have a 3090 with 24GB of VRAM; 34B just fits and runs above 15 tk/s. If you want a laptop and only plan on inference, I think a MBP would be better than a 4090 laptop.
> It’s better to make draft and bounce ideas with it than to ask it to start from scratch.
Mm. Look, I'm going to be brutally blunt here. In the long term, chat is an AI-anti-pattern.
You can't automate a prompt sequence when the Nth prompt depends on the context of the previous prompt.
"Write me XX" ... "No, fix this" ... "no, more like this" ... "I get this error" ... Cool. You get a result and it works.
...but how many interactions did you do to get that? 5? How long did it take? Did you even try 'regenerate answer' and look at some variations? Are you sure the first answer it gave you was the best one? I'm pretty sure it wasn't.
Anyway, ok, so now you have 50 functions you need to generate. Now you have 500. What's your plan? Same thing?
There are too many human touch points.
You know what the AI superpower is? Automation. Repeatedly generating output, day in and day out. That's what computers are all about.
Don't get me wrong; the interactive style of AI copilot is lovely too, but it's just an incremental improvement on autocomplete, and I'm not interested; I already have autocomplete.
> how do you use it?
1) Every code function I want to generate, I create a scaffold that defines the exact function template, like:
// Using these imports only
import {x, y, z} from "./blah";
/* What does foo do... */
export function foo(a: number, b: number) { ... }
Every prompt goes into a `prompts` folder.
2) I create a test harness: a set of unit tests that define the behaviour of foo.
So, you can literally run: `npx jest ./output/foo.ts`
Every prompt has a matching `tests/foo.test.ts` test file.
(Yes, I know this sounds like a pain in the ass; it's less annoying when you scaffold the tests out with an LLM as well. It's not as bad as you might imagine once you get used to the workflow.)
3) I process the prompts folder, and for every prompt generate a solution candidate:
- I extract the typescript from the markdown output, save it.
- I run `npx tsc --strict foo.ts --outDir dist` on it.
- If it fails, run a meta 'fix this typescript with these errors' prompt over it.
- If compilation succeeds, I run the test suite on the result.
- If the test suite passes, I save the result as a candidate solution.
- If it fails, I vary the temperature and generate a new solution.
- Eventually if I don't get any candidate solutions, I log an error to revisit and refine the prompt.
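The steps above can be sketched as a single loop. Everything here is illustrative: `generate`, `compiles`, `repair`, and `passesTests` are stand-ins for the real llama.cpp, `tsc`, and jest invocations.

```typescript
// Sketch of the candidate-generation loop described above. The four
// callbacks stand in for real model / compiler / test-runner calls.
function findCandidate(
  prompt: string,
  generate: (prompt: string, temperature: number) => string,
  compiles: (code: string) => boolean,
  repair: (code: string) => string, // the "fix this typescript" meta-prompt
  passesTests: (code: string) => boolean,
  maxIters = 30
): string | null {
  let temperature = 0.2;
  for (let i = 0; i < maxIters; i++) {
    let code = generate(prompt, temperature);
    if (!compiles(code)) code = repair(code);
    if (compiles(code) && passesTests(code)) {
      return code; // save as a candidate solution
    }
    temperature = Math.min(1.0, temperature + 0.1); // vary and retry
  }
  return null; // no candidates: log an error, revisit the prompt
}
```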
Look, it's not magic, it's very simple:
LLMs generate code. Sometimes the code is good, sometimes it's not... but you can generate 10 or 20 different variations and it costs literally nothing except time. You just repeat it over and over and over, and maybe run some automated fixes on the outputs.
It works fine. I've made a raytracer with it, I've made a little card game with it. I'm building a website with it. Great stuff.
...if I use the openai api.
Now, the openai api sucks for lots of reasons, but the big one is that when you use the real AI superpower, i.e. automation, it starts costing you a not-insignificant amount of $$$.
So, I've been experimenting with using some offline models; specifically, as I said, code llama, and mistral. The best results I've had are from the q5, q6 codellama (1) 34B model, running using llama.cpp.
It's just slow.
So, I was experimenting with these smaller models, but... they're not that great for what I'm doing.
What you're doing is not what I'm doing, and not quite what I'm trying to do.
I get the "you're using it wrong" argument, yup. Fair enough. You're totally right. A lot of people get a lot of value from just having chatGPT open side-by-side with vscode. That's cool... but I'm specifically talking about my difficulties with a different use-case.
That’s the point I’m trying to make though: you’re not using it wrong in the sense that your application is wrong or you’re doing something dumb. By “wrong” I mean the structure you’re sending it is off. Some local LLMs use <INST></INST>, some use USER, some use HUMAN. Miss a \n in the context and your results are garbage. That’s why I recommend using ollama instead of llama.cpp directly: I have not found a reliable way to define this in llama.cpp, which by default sends exactly what you give it straight to the model. Ollama has a layer that abstracts this away.
Please give Ollama a go! Would love to hear if it works out! Feel free to contact my email in my profile if you need some help.
Every model card on Hugging Face defines the prompt format. For Mistral it is "<s>[INST] ... [/INST]"
It's pretty obvious if you're writing prompts you have to use the correct prompt syntax.
? ollama seems unrelated to the problems I'm having.
There's no way to define an arbitrary mapping between prompt formats when some have e.g. SYSTEM and some don't. It's simply not possible; you have to update your prompts for different models.
I keep a separate list of prompts for each model. It's no big deal.
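That per-model list can be as simple as a template table keyed by model name. The template strings below are illustrative (the Mistral one follows its model card; always check the card for the model you're actually running):

```typescript
// Illustrative per-model prompt templates. The exact strings must come
// from each model's card on Hugging Face; these are examples only.
const templates: Record<string, (user: string) => string> = {
  mistral: (user) => `<s>[INST] ${user} [/INST]`,
  // CodeLlama-instruct follows the Llama 2 chat format:
  codellama: (user) =>
    `<s>[INST] <<SYS>>\nYou are a coding assistant.\n<</SYS>>\n\n${user} [/INST]`,
};

function buildPrompt(model: string, user: string): string {
  const template = templates[model];
  if (!template) throw new Error(`no prompt template for ${model}`);
  return template(user);
}
```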
I'm guessing these small models aren't meant for writing whole blocks of code, but rather for more intelligent autocomplete a few characters ahead; used that way, they can probably provide at least a bit of help. I've had the same experience as you when trying anything locally below 30B parameters.
> Or is there a way to get better results by fine tuning?
It seems the general wisdom around LLMs is that you can get very good performance out of small models if you fine tune for a specific task. For code generation, I think you might get good performance by fine tuning on a specific programming language, codebase, or architectural pattern.
The main problem with fine tuning is getting a good dataset, so a cheap alternative is to put a few examples of what you want to generate in the prompt. You can then save these prompts as task-specific "fine tunes" to select when you need to accomplish something.
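A minimal sketch of that prompt-as-fine-tune idea (the example format and names here are made up for illustration):

```typescript
// Hypothetical "task-specific prompt" store: a few curated examples are
// prepended to the request instead of fine-tuning the model's weights.
interface Example {
  ask: string;  // what was requested
  code: string; // the kind of output you want
}

function fewShotPrompt(examples: Example[], task: string): string {
  const shots = examples
    .map((e) => `// Task: ${e.ask}\n${e.code}`)
    .join("\n\n");
  return `${shots}\n\n// Task: ${task}\n`;
}
```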
> It seems that the general wisdom around LLMs is that you can get very good performance on small models if you fine tune for a specific task
It would seem so, but are there any stories or research that actually prove someone has done this with good results? FOSS of course, so one could actually inspect that there's no fudging and so on.
I've had the same experience as you. Quantitatively, these models are decent. Qualitatively, in my daily work, they're not good at all! We'll need better tests.
I suspected this would happen. At the end of the day, larger models have more to work with, and that makes a big difference. There's also a lot of domain knowledge in GPT-4 which isn't code but surely helps when it comes to understanding what you want, and the context of the problem and the solution.
AIUI, one of the most notable limitations of the first version of this model was that it couldn't fill in the middle (FIM); it could only complete.[0]
This blog post doesn't mention FIM either, so I guess that's still missing? The demos I've seen of Replit Ghostwriter suggest it's still possible to get good results without FIM, as long as you have good enough software around the model, but I think FIM could improve things further.
The much smaller Refact-1.6B model supports FIM[1], and Refact-1.6B worked pretty well when I tested it a few weeks ago.
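For anyone unfamiliar with what FIM means mechanically, it's just a prompt with sentinel tokens around the code before and after the cursor. The sentinel names below follow the StarCoder family; other FIM-capable models use different tokens, so check the tokenizer config of the model you're using:

```typescript
// Sketch of a fill-in-the-middle prompt: the model is asked to produce
// the text that belongs between `prefix` and `suffix`. Token names are
// the StarCoder-family sentinels and may differ for other models.
function fimPrompt(prefix: string, suffix: string): string {
  return `<fim_prefix>${prefix}<fim_suffix>${suffix}<fim_middle>`;
}
```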
People (like the most upvoted comment in this thread) who expect any of these small models to write entire programs from a simple prompt seem to misunderstand their purpose, which is to be a smarter alternative for code completion. Writing entire functions or programs is better suited to much larger (and slower) instruct/chat-tuned models.
> The model is trained in bfloat16 on 1T tokens of code (~200B tokens over 5 epochs, including linear cooldown) for 30 programming languages from a subset of permissively licensed code from Bigcode's Stack Dedup V2 dataset and a dev-oriented samples from StackExchange.
I know a lot depends on architecture and number representation, but do people have a sense of how big a compute cluster is needed to train these classes of models, from 1.5B and 3B up through 7B, 13B, and 70B?
Didn’t Meta say they trained Llama 2 on 2k A100s?
I'm not an ML engineer, just interested in the space - but as a general ballpark, training these models from scratch needs hundreds to thousands of GPUs.
So many code models seem to be aimed at code generation, but is there a general effort to apply them to static/code analysis tooling? It'd be nice to write my rules in English and get somewhat predictable behavior when analyzing small bits of code. I've had success with GPT4 and a bit less with StarCoderPlus, but I have to build the engine to chop up code, send pieces to the remote as needed, cache results for same-hashed snippets, etc.
Surely someone is working on general AI powered code analysis tooling?
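For what it's worth, the snippet-hashing cache part of that engine is simple to sketch. Everything here is illustrative; `callModel` is a stand-in for whatever remote (or local) analysis call you're making:

```typescript
import { createHash } from "node:crypto";

// Hypothetical snippet-level cache for AI code analysis: each chunk is
// hashed, so an unchanged snippet is never re-sent to the (slow, paid)
// model on subsequent runs.
const verdicts = new Map<string, string>();

function analyzeSnippet(
  snippet: string,
  callModel: (s: string) => string // stand-in for the real analysis call
): string {
  const key = createHash("sha256").update(snippet).digest("hex");
  const cached = verdicts.get(key);
  if (cached !== undefined) return cached;
  const verdict = callModel(snippet);
  verdicts.set(key, verdict);
  return verdict;
}
```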
That is not what I consider "local", since that uses cloud inference by default (and last I checked, they provided no useful guidance for changing that).
I don’t consider cloud inference to count as getting it working “locally” as requested by the comment above yours.
Refact worked nicely and ran locally when I tried it a few weeks ago, but the challenge with any new model is getting it supported by the existing software: https://github.com/smallcloudai/refact/
"Choose your model
Requests for code generation are made via an HTTP request.
You can use the Hugging Face Inference API or your own HTTP endpoint, provided it adheres to the API specified here[1] or here[2]."
It's fairly easy to use your own model locally with the plugin. You can just use one of the community-developed inference servers, which are listed at the bottom of the page; here are the links[3] to both[4].
I have the same question, and more generally: Any generic way of doing this for any of the open source or semi open source models, especially Mistral[0]?
GPT4 being a mixture of experts is irrelevant imo. We don't care how many layers there are in a network, how wide those layers are, or which activation functions are used; all that matters is whether we can run it on specific hardware, and the results.
The standard is to bold the best figure per column. If none are significantly different, you generally don't bold any, but it's standard practice to use bolding to highlight which approach is best in each task.
I think colouring a column is the common way to draw attention to your own approach while still respecting the best-is-bold custom, which they've sort of done with the header, but personally I'd have gone with the cell background for the column.