Posts by Veselin Raychev
We are also calling for community contributions to BaxBench on our GitHub!
🏆 Website & Leaderboard: baxbench.com
💾 Code Repository: github.com/logic-star-ai/…
📄 Paper: arxiv.org/abs/2502.11844
We evaluate Claude 3.7 with 64k thinking tokens on BaxBench and find that it now tops our leaderboard with a 38% correct-and-secure generation rate. But when the models are instructed with security specifications, OpenAI o1 is again the best model.
LLMs are great at generating code, but the real test is creating production-ready applications. With BaxBench we set out to measure how often LLMs generate functionally correct app backends and how often those backends contain security vulnerabilities.
BaxBench.com - led by @markvero.bsky.social
How to effectively fix vulnerabilities in code:
1. Have the scanner confirm that the issue is fixed, rather than trusting LLM hallucinations.
2. Have a fast scanner that can be used in delta debugging to check which lines affect the results.
3. Have all of this work at IDE speed.
snyk.co/uhJ48
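A minimal sketch of point 2, assuming the scanner can be wrapped as a fast yes/no check over a list of source lines. The names `ddmin`, `toy_scanner`, and `SAMPLE` are hypothetical and only for illustration; `toy_scanner` is a stand-in string check, not the API of any real scanner. The loop is a simplified form of the classic delta debugging reduction: it keeps dropping chunks of lines as long as the finding is still reported.

```python
from typing import Callable, List, Sequence


def ddmin(lines: Sequence[str],
          still_flagged: Callable[[List[str]], bool]) -> List[str]:
    """Shrink `lines` to a small subset on which the scanner still reports the finding."""
    current = list(lines)
    if not still_flagged(current):
        raise ValueError("the scanner does not flag the full input")
    n = 2  # number of chunks to split the current lines into
    while len(current) >= 2:
        size = len(current)
        bounds = [size * i // n for i in range(n + 1)]
        chunks = [current[bounds[i]:bounds[i + 1]] for i in range(n)]
        reduced = False
        # First try each chunk on its own.
        for chunk in chunks:
            if still_flagged(chunk):
                current, n, reduced = chunk, 2, True
                break
        # Otherwise try removing one chunk at a time.
        if not reduced:
            for i in range(n):
                rest = [l for j, c in enumerate(chunks) if j != i for l in c]
                if still_flagged(rest):
                    current, n, reduced = rest, max(n - 1, 2), True
                    break
        # Nothing could be dropped: split finer, or stop at single lines.
        if not reduced:
            if n >= size:
                break
            n = min(n * 2, size)
    return current


def toy_scanner(lines: List[str]) -> bool:
    """Hypothetical stand-in: flags user input flowing into a concatenated query."""
    has_source = any("request.args" in l for l in lines)
    has_sink = any("execute(" in l and "+ name" in l for l in lines)
    return has_source and has_sink


SAMPLE = [
    "import sqlite3",
    "name = request.args['name']",
    "conn = sqlite3.connect('app.db')",
    "print('handling request')",
    "conn.execute('SELECT * FROM users WHERE name = ' + name)",
    "conn.close()",
]

if __name__ == "__main__":
    for line in ddmin(SAMPLE, toy_scanner):
        print(line)
```

On this sample the reduction keeps only the line that reads the user input and the line that builds the query. Every step calls the scanner again, which is why scanner speed matters for this approach.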
The new BgGPT is here. It is based on Gemma 2. The large 27B model is on par with GPT-4o, with GPT-4o used as the judge.
models.bggpt.ai/blog/
I think there are people here, but not much content yet. So let's post as much good content as we can.
Our continuous pretraining method for LLMs that reduces forgetting from the base model was presented last week at EMNLP. Soon, some really strong models are coming.
arxiv.org/abs/2407.08699