Replit's new AI Model now available on Hugging Face (replit.com)
220 points by todsacerdoti on Oct 11, 2023 | hide | past | favorite | 51 comments


I’ve been using mistral and code llama to generate large volumes of code recently, and I have to say…

These small models just suck compared to the larger ones.

I get it, they’re quick and cheap (ha! relatively) to make and good for research and fine-tuning…

…but can anyone here speak authoritatively on fine tuning and getting good results out?

I’ve been super disappointed by how bad even the q6 13B Code Llama model is at generating consistent code (i.e. code that even compiles, forget doing what you asked) beyond about 30 lines in length.

These smaller models seem good for a line or two, maybe, but gosh… it’s an effort to do anything useful with them out of the box.

Carefully crafted prompt.

Tests, hand written.

Iterate: prompt, compile, run tests, generate code metrics. Accept code that passes the tests and beats the target threshold.

You’re looking at like 7-10 iterations per prompt to get anything for simple (< 30 line) functions, and maybe no candidates after 30 iterations for longer, complex requests.

Are people just using this for 5 line code snippets and autocomplete?

Or is there a way to get better results by fine tuning?


Absolutely not victim blaming, but how do you use it? The biggest issue with local LLMs is prompt structure (i.e. how you feed the instructions, not what you say). If you deviate even a small bit from how it was trained, you’ll get terrible results. I’ve been using codellama 13b + ollama + continue and I’ll be honest, it’s almost on par with GPT-3.5 for my stuff. It’s been amazing as a pair programmer. It’s better to draft and bounce ideas with it than to ask it to start from scratch. Long story short, try ollama + continue. If you’re using llama.cpp by itself, chances are you’ll get bad results.


> Absolutely not victim blaming, but how do you use it?

It sounds like OP is trying to replace junior/mid-level SWEs with CodeLLMs where a detailed description of the desired solution goes in, and working code comes out - all hands-off.

If there is ever a time when LLMs can consistently achieve what OP wants, there will be a reckoning in software engineering. It's not like junior SWEs aren't already having a hard time in the current hiring environment.


That feels like damning with faint praise: we encourage Louie.ai users to use only GPT-4-level models for code-gen tasks. Even GPT-4 still has a long way to go. Saying other models are only around GPT-3.5 for this task isn't great. I'm hopeful for StarCoder etc., but it's still not there yet afaict...

Agreed on prompts. We are doing a lot to guide it, and even auto-repair loops. Likewise, keeping the interaction model to generating small pieces of code helps the chance of any individual step being right and repairable.


Are you using apple silicon? How much RAM do you have, and how many tokens/second with codellama 13b?


Yes, 16GB of RAM is needed for 13B, 32GB for 34B (both at 4-bit). The first time it loads a new model it takes some warm-up time, I wanna say 30s? After that, the context reading and token generation are usually upward of 8 tk/s. Also, the newer and bigger the die, the faster the token generation. A Mac Studio would probably generate 30% or so faster than a MBP.


8tk/s on 34b?

I've managed to run Codellama instruct 13b with my laptop's RTX 3070 (8gb VRAM) at 6tk/s by offloading 27 layers into the GPU with llama.cpp

I've been considering getting a MacBook for running 34b+ LLM inference, but with the speed at which small LLMs are progressing, I think it is better to get a laptop with an RTX 4090 and 16gb VRAM. Maybe it can run 34b models by offloading layers into the GPU.


I only have a 16GB computer so I can’t confirm the 34B performance. I have a 3090 with 24GB of VRAM and 34B just fits and runs above 15 tk/s. If you want a laptop and only plan for inferencing, I think a MBP would be better than a 4090 laptop.


No warm up if you switch to metal with no ANE on sonoma


> It’s been amazing as a pair programmer.

...

> It’s better to make draft and bounce ideas with it than to ask it to start from scratch.

Mm. Look, I'm going to be brutally blunt here. In the long term, chat is an AI-anti-pattern.

You can't automate a prompt sequence when the Nth prompt is context dependent on the previous prompt.

"Write me XX" ... "No, fix this" ... "no, more like this" ... "I get this error" ... Cool. You get a result and it works.

...but how many interactions did you do to get that? 5? How long did it take? Did you even try 'regenerate answer' and look at some variations? Are you sure the first answer it gave you was the best one? I'm pretty sure it wasn't.

Anyway, ok, so now you have 50 functions you need to generate. Now you have 500. What's your plan? Same thing?

There are too many human touch points.

You know what the AI superpower is? Automation. Repeatedly generating output, day in and day out. That's what computers are all about.

Don't get me wrong; the interactive style of AI copilot is lovely too, but it's just an incremental improvement on autocomplete, and I'm not interested; I already have autocomplete.

> how do you use it?

1) Every code function I want to generate, I create a scaffold that defines the exact function template, like:

    // Using these imports only
    import {x, y, z} from "./blah";

    /* What does foo do... */
    export function foo(a: number, b: number) { ... }
Every prompt goes into a `prompts` folder.

2) I create a test harness that defines a set of unit tests that define the behaviour of foo.

So, you can literally run: `npx jest ./output/foo.ts`

Every prompt has a matching `tests/foo.test.ts` test file.

(Yes, I know this sounds like a pain in the ass; it's less annoying when you scaffold the tests out with an LLM as well. It's not as bad as you might imagine once you get used to the workflow.)
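To make the harness concrete, here's a sketch of what a matching test file's assertions might pin down. The behaviour of `foo` here (clamped addition) is purely illustrative, and a stand-in implementation is included inline so the sketch is self-contained; in the real workflow the import would come from the generated `output/foo.ts`, and you'd use jest's `expect` API instead of plain checks:

```typescript
// Stand-in for a generated candidate (output/foo.ts in the real workflow).
// `foo` is illustrative -- say it clamps a + b into [0, 10].
export function foo(a: number, b: number): number {
  return Math.min(10, Math.max(0, a + b));
}

// tests/foo.test.ts equivalent: each assertion pins down one behaviour
// the generated code must satisfy before it counts as a candidate.
function check(name: string, cond: boolean): void {
  if (!cond) throw new Error(`test failed: ${name}`);
}

check("adds in range", foo(1, 2) === 3);
check("clamps below zero", foo(-5, 2) === 0);
check("clamps above ten", foo(9, 9) === 10);
```

The point is that the tests are fixed and hand-written; only the implementation is regenerated until it satisfies them.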

3) I process the prompts folder, and for every prompt generate a solution candidate:

- I extract the typescript from the markdown output, save it.

- I run `npx tsc --strict foo.ts --outDir dist` on it.

- If it fails, run a meta 'fix this typescript with these errors' prompt over it.

- I run the test suite on the result if it passes.

- If the test suite passes, I save the result as a candidate solution.

- If it fails, I vary the temperature and generate a new solution.

- Eventually if I don't get any candidate solutions, I log an error to revisit and refine the prompt.
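The steps above can be sketched as a loop. All the helper signatures here are assumptions: `generate` stands in for the model call (llama.cpp or the OpenAI API), `compile` for `npx tsc --strict`, `test` for `npx jest`, and `fixErrors` for the meta repair prompt:

```typescript
// A sketch of the candidate-generation loop described above. Wire the
// Pipeline members to your own model calls and shell invocations.
type Step = (source: string) => { ok: boolean; output: string };

interface Pipeline {
  generate: (prompt: string, temperature: number) => string; // model call
  compile: Step;   // e.g. npx tsc --strict foo.ts --outDir dist
  fixErrors: (source: string, errors: string) => string;     // repair prompt
  test: Step;      // e.g. npx jest ./output/foo.ts
}

// Try up to maxIters generations, varying temperature each round;
// return the first candidate that compiles and passes the tests.
export function solve(p: Pipeline, prompt: string, maxIters = 30): string | null {
  for (let i = 0; i < maxIters; i++) {
    const temperature = 0.2 + 0.6 * (i / maxIters); // vary per attempt
    let code = p.generate(prompt, temperature);
    let compiled = p.compile(code);
    if (!compiled.ok) {
      code = p.fixErrors(code, compiled.output); // one repair pass
      compiled = p.compile(code);
      if (!compiled.ok) continue;                // give up on this attempt
    }
    if (p.test(code).ok) return code;            // candidate solution
  }
  return null; // no candidate: log an error and refine the prompt
}
```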

Look, it's not magic, it's very simple:

LLMs generate code. Sometimes the code is good, sometimes it's not... but you can generate 10 or 20 different variations and it costs literally nothing except time. You just repeat it over and over and over, and maybe run some automated fixes on the outputs.

It works fine. I've made a raytracer with it, I've made a little card game with it. I'm building a website with it. Great stuff.

...if I use the openai api.

Now, the openai api sucks for lots of reasons, but the big one is that when you use the real AI superpower, i.e. automation, it actually starts costing you a not-insignificant amount of $$$.

So, I've been experimenting with using some offline models; specifically, as I said, code llama, and mistral. The best results I've had are from the q5, q6 codellama (1) 34B model, running using llama.cpp.

It's just slow.

So, I was experimenting with these smaller models, but... they're not that great for what I'm doing.

What you're doing is not what I'm doing, and not quite what I'm trying to do.

I get the "you're using it wrong" argument, yup. Fair enough. You're totally right. A lot of people get a lot of value from just having chatGPT open side-by-side with vscode. That's cool... but I'm specifically talking about my difficulties with a different use-case.

[1] - https://huggingface.co/TheBloke/Phind-CodeLlama-34B-v2-GGUF


That’s the point I’m trying to make though: you’re not using it wrong in the sense that your application is wrong or you’re doing something dumb. By “wrong” I mean the structure you’re sending it is off. Some local LLMs use <INST></INST>, some use USER, some use HUMAN. Miss a \n for a context and your results are garbage. That’s why I recommend using ollama instead of llama.cpp directly, because I have not been able to find a reliable way to define this. When you use llama.cpp and just send a prompt, by default it runs on exactly the raw text you send, with no template applied. Ollama has a layer that abstracts this away.

Please give Ollama a go! Would love to hear if it works out! Feel free to contact my email in my profile if you need some help.


Every model on hugging face defines the input context. For mistral it is "<s>[INST] ... [/INST]"

It's pretty obvious if you're writing prompts you have to use the correct prompt syntax.

Ollama seems unrelated to the problems I'm having.

There's no way you can define an arbitrary mapping between prompt formats where some have eg. SYSTEM and some don't. It's simply not possible. You have to update your prompts for different models.

I keep a separate list of prompts for each model. It's no big deal.
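That per-model list can be as simple as a template map. The Mistral format below matches the one quoted above; the other entries are examples of the kind of variation being described, and the exact markers for any given model should be checked against its model card:

```typescript
// One prompt template per model family. The Mistral/Llama-2 instruct
// format is "<s>[INST] ... [/INST]"; the "alpaca-style" entry is an
// illustrative example of a different convention.
const templates: Record<string, (user: string) => string> = {
  "mistral-instruct": (user) => `<s>[INST] ${user} [/INST]`,
  "llama2-chat": (user) => `<s>[INST] ${user} [/INST]`,
  "alpaca-style": (user) => `### Instruction:\n${user}\n\n### Response:\n`,
};

// Fail loudly rather than silently sending an untemplated prompt.
export function buildPrompt(model: string, user: string): string {
  const t = templates[model];
  if (!t) throw new Error(`no prompt template for model: ${model}`);
  return t(user);
}
```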


Ollama has prompts properly defined for you in the library. But back to OP's point: that is the problem I am facing too, even with ollama.


I'm guessing these small models are not meant to be used for writing whole blocks of code but rather to provide more intelligent autocomplete for a few characters ahead, where they could probably provide at least a bit of help. I've had the same experience as you when trying anything locally below 30B parameters.


> Or is there a way to get better results by fine tuning?

It seems that the general wisdom around LLMs is that you can get very good performance on small models if you fine tune for a specific task. In the case of code generation, I think you might get a good performance by fine tuning it on a specific programming language + codebase or architectural pattern.

The main problem with fine tuning is getting a good dataset, so a cheap alternative would be to put a few examples of what you want to generate in the prompt. Then you would save these prompts as task specific "fine tunes" that you would select when you need to accomplish something.

You might find this discussion helpful: https://news.ycombinator.com/item?id=37813806
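The "few examples in the prompt as a task-specific fine-tune" idea above might be sketched like this; the prompt layout and the helper name are illustrative, not a standard API:

```typescript
// Few-shot "fine-tune in the prompt": prepend worked examples of the task
// before the actual request, then save the whole thing as a reusable,
// task-specific prompt.
interface Example {
  input: string;
  output: string;
}

export function fewShotPrompt(
  task: string,          // e.g. "Translate SQL queries to knex.js calls."
  examples: Example[],   // a handful of hand-picked demonstrations
  request: string        // the new input to complete
): string {
  const shots = examples
    .map((e) => `Input:\n${e.input}\nOutput:\n${e.output}`)
    .join("\n\n");
  return `${task}\n\n${shots}\n\nInput:\n${request}\nOutput:\n`;
}
```

Saving one of these per task and selecting it when needed is cheap compared to maintaining actual fine-tuned weights.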


> It seems that the general wisdom around LLMs is that you can get very good performance on small models if you fine tune for a specific task

It would seem so, but are there any stories or research that actually prove someone has done so with good results? FOSS of course, so one could actually inspect that there is no fudging and so on.


They're all pretty bad at replacing a developer, especially in lesser used languages

Python is their forte, JS is okay. Rust is a mess, they don't get borrowing.

The best so far was early chatgpt4 but it has since been nerfed down considerably.

It's good to get ideas and for writing algorithms you are too lazy to google and implement, not great at doing actual work.

Funnily enough, I think they'd do great at FANGs interviews


I feel your experience fits with what's described in the OP article in https://news.ycombinator.com/item?id=37830011 (the most upvoted discussion goes somewhere else)

I've had the same experience as you. Quantitatively, these models are decent. Qualitatively, in my daily work, they're not good at all! We'll need better tests.


I suspected this would happen. At the end of the day, larger models have more to work with, and this makes a big difference. Also, there's a lot of domain knowledge in GPT-4 which isn't code but surely makes a big difference when it comes to understanding what you want, and the context of the problem and the solution.


"Intended use" from their readme:

> Replit intends this model be used by anyone as foundational model for application-specific fine-tuning without strict limitations on commercial use.

> The model is trained specifically for code completion tasks.

Nice. I expected that I would need to give them my e-mail address and that it would be ""free"".


AIUI, one of the most notable limitations of the first version of this model was that it couldn't Fill In The Middle (FIM), it could only provide completion.[0]

This blog post doesn't mention FIM either, so I guess that's still missing? The demos I've seen of Replit Ghostwriter indicate that it is still possible to get good results without FIM, as long as you have good enough software around the model, but I think FIM could still improve things further.

The much smaller Refact-1.6B model supports FIM[1], and Refact-1.6B worked pretty well when I tested it a few weeks ago.

People (like the most upvoted comment in this thread) who are expecting any of these small models to write entire programs for them based on a simple prompt seem to misunderstand the purpose of these smaller models, which is to be a smarter alternative for code completion. Writing entire functions or programs is better suited to much larger (and slower) instruct/chat-tuned models.

[0]: https://huggingface.co/replit/replit-code-v1-3b/discussions/...

[1]: https://refact.ai/blog/2023/introducing-refact-code-llm/
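For readers unfamiliar with FIM: the model is given the code before and after the cursor and asked to predict the missing span. The special-token names vary per model; the StarCoder-style tokens below are one common convention and should be checked against the model card of whatever model you use:

```typescript
// Fill-In-the-Middle prompt layout (StarCoder-style token names; other
// models use different special tokens for the same three roles).
export function fimPrompt(prefix: string, suffix: string): string {
  return `<fim_prefix>${prefix}<fim_suffix>${suffix}<fim_middle>`;
}

// Usage: generation continues after <fim_middle>, so the model's output
// is exactly the span that belongs between prefix and suffix.
```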


> Encompasses Replit's top 30 programming languages with a custom trained 32K vocabulary for high performance and coverage

Any idea where the list can be found?


> The model is trained in bfloat16 on 1T tokens of code (~200B tokens over 5 epochs, including linear cooldown) for 30 programming languages from a subset of permissively licensed code from Bigcode's Stack Dedup V2 dataset and a dev-oriented samples from StackExchange.

Following the link to the "Stack Dedup V2" page: https://huggingface.co/datasets/bigcode/the-stack-dedup

> The Stack contains over 6TB of permissively-licensed source code files covering 358 programming languages. The full list can be found here.

https://huggingface.co/datasets/bigcode/the-stack-dedup/blob...

It requires login to see the JSON file.


just added the list to the README on Hugging Face!


I know a lot depends on architecture and number representation, but do people have a sense for how big a compute cluster is needed to train these classes of models from 1.5B, 3B, 7B, 13B, 70B?

Didn’t Meta say they trained on 2k A100s for LLama 2?


We're on a budget :) Trained on 128 H100-80GB GPUs for a week (200B tokens over 5 epochs, i.e. 1T tokens).

Tech talk here with timestamp: https://www.youtube.com/live/veShHxQYPzo?si=UlcU9j2kC-C4oWvj...


Each H100 is ~$30,000, so $3.8M in capex cost.

Roughly $1/hr/GPU in power cost, so looking at 128 × 24 × 7 = $21,504.

Cheap compared to OpenAI, but not something an indiehacker can do by themselves unless they have millions to burn.
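The back-of-envelope figures above work out as follows (the $30k/GPU and $1/hr/GPU rates are the estimates from the comment, not quoted prices):

```typescript
// Rough training-cost arithmetic for 128 H100s over one week.
const gpus = 128;
const capex = gpus * 30_000;        // ~$30,000 per H100 => $3.84M
const hours = 24 * 7;               // one week of wall-clock time
const powerCost = gpus * hours * 1; // ~$1/hr/GPU => $21,504

export const totals = { capex, powerCost };
```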


The Hugging Face page for Replit's 3B model says "The model has been trained on the MosaicML platform on 128 H100-80GB GPUs."

Source: https://huggingface.co/replit/replit-code-v1_5-3b

I'm not an ML engineer, just interested in the space - but as a general ballpark, training these models from scratch needs hundreds to thousands of GPUs.


Nice, Apache 2.0 license. Thank you, Replit!

https://huggingface.co/replit/replit-code-v1_5-3b


So many code models seem to be used for code generation purposes, but is there a general effort to apply these as static/code analysis tooling? It'd be nice to write my rules in English and have somewhat predictable behavior when analyzing small bits of code. I have had success with GPT4 and a bit less with StarCoderPlus, but I have to build the engine to chop up code, send pieces to remote as needed, cache results for same-hashed snippets, etc.

Surely someone is working on general AI powered code analysis tooling?


The first version of the model said that infill was coming. I was hoping to see that in this version, but I guess we have to wait for v2.


Anyone know how to get this working locally with vscode?



That is not what I consider "local", since that uses cloud inference by default (and last I checked, they provided no useful guidance for changing that).

I don’t consider cloud inference to count as getting it working “locally” as requested by the comment above yours.

Refact worked nicely and worked locally when I tried it a few weeks ago, but the challenge with any new model is making it be supported by the existing software: https://github.com/smallcloudai/refact/


"Choose your model Requests for code generation are made via an HTTP request.

You can use the Hugging Face Inference API or your own HTTP endpoint, provided it adheres to the API specified here[1] or here[2]."

It's fairly easy to use your own model locally with the plugin. You can just use one of the community-developed inference servers, which are listed at the bottom of the page; here are the links[3] to both[4].

[1]: https://huggingface.co/docs/api-inference/detailed_parameter...

[2]: https://huggingface.github.io/text-generation-inference/#/Te...

[3]: https://github.com/wangcx18/llm-vscode-inference-server

[4]: https://github.com/wangcx18/llm-vscode-inference-server
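As a rough sketch, a request against a local server following the text-generation-inference style of API referenced above might look like the following. The endpoint URL is hypothetical, and the exact parameter names should be verified against the API spec linked in [2]:

```typescript
// Build a completion request in the text-generation-inference style:
// POST /generate with { inputs, parameters }. Endpoint is a placeholder.
export function buildRequest(prompt: string, maxNewTokens = 64) {
  return {
    url: "http://localhost:8000/generate", // hypothetical local endpoint
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        inputs: prompt,
        parameters: { max_new_tokens: maxNewTokens, temperature: 0.2 },
      }),
    },
  };
}

// Usage (not executed here):
//   const r = buildRequest("function add(a: number, b: number) {");
//   fetch(r.url, r.init).then((res) => res.json());
```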


I have the same question, and more generally: Any generic way of doing this for any of the open source or semi open source models, especially Mistral[0]?

[0] https://news.ycombinator.com/item?id=37675496



Any vibe checks on this model? How does it compare to gpt4 for coding?


This is a 3B model; it isn't remotely comparable to GPT-4.

WizardCoder 34B and Phind 34B are the only models remotely comparable, and they are still slightly worse than GPT-3.5 (let alone GPT-4).


How about Mistral 7B? I saw this article recently:

https://wandb.ai/byyoung3/ml-news/reports/Fine-Tuning-Mistra...


Mistral 7B is very cool for its size. But unfortunately no open model is close to GPT4 as of right now.


If the rumors about GPT-4 being a mixture of expert models are true, then this comparison is not fair.

What would be interesting is comparing GPT-4 at a certain task with a small model fine-tuned for that task.


GPT-4 being a mixture of experts is irrelevant, imo. We don't care how many layers there are in a network, how wide those layers are, or which activation functions are used; all that matters is whether we can run it on specific hardware, and the results.


Exactly. I don't get why people (non-AI researchers) discount MoE as if it's cheating or fake parameters.

Even if each inference pass only runs part of the network, there are still a trillion learnable parameters there lol.


But the thing is, it doesn't need to know much about "other stuff", just about code (and basic English instructions)

So comparing it with big models I'd say it's good but might have limited usefulness

(you can probably go further with 3B with only code)


The main feature I'm looking for is to train it on my own code also (in the UI it should have a switch).


Title should include "code generation language model".


> When fine-tuned on public Replit user code, the model outperforms models of much larger size such as CodeLlama7B:

The table just below this shows the other models doing better on half of the benchmarks; the Replit column being in boldface is misleading.


I don't think the boldface is meant to mean "better".

I just thought it was meant to draw attention to their numbers.


The standard is to bold the best figure per column. If none are significantly different, you generally don't bold any, but it's standard practice to use boldface to highlight which approach is best on each task.


Agreed, this threw me.

I think colouring a column is the common approach to drawing attention to your own results while still respecting the best-is-bold custom, which they've sort of done with the header, but personally I'd have gone with the cell background for the column.



