Can AI coding assistants maintain effectiveness as codebases scale 100×? LoCoBench-Agent evaluates agents across 10K to 1M token contexts, spanning 8,000 scenarios in 10 programming languages and four difficulty tiers.
https://sforce.co/4txCDj4
Authors: Jielin Qiu & Huan Wang
#EnterpriseAI #AgenticAI
Posts by Salesforce AI Research
The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation: bit.ly/48iccVY
On-policy distillation boosts accuracy but causes severe overconfidence. CaOPD uses a student-grounded empirical target for Pareto-optimal calibration.
Code: bit.ly/4cUtCKO
At #TDX26, Itai Asseo @iiitaiii.bsky.social on Enterprise General Intelligence: refining generic LLMs into reliable enterprise agents through AI Foundry and Agentforce Labs, including eVerse, the Learning Engine, and Agent Startup. https://sforce.co/3OrRYCb #FutureOfAI #EnterpriseAI
(7/7) Lost in Translation: Do LVLM Judges Generalize Across Languages?
Authors: Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Mir Tafseer Nayeem, Amran Bhuiyan, Mizanur Rahman, Shafiq Joty, Enamul Hoque, Jimmy Huang
Accepted to #ACL2026
(6/7) Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction: bit.ly/4mQq0gj
Authors: Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, Shafiq Joty
Accepted to #ACL2026
(5/7) J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization: bit.ly/48egJZp
Authors: Austin Xu, Yilun Zhou, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty
Accepted to #ACL2026
(4/7) Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math: bit.ly/4ofghQs
Authors: Shrey Pandit, Austin Xu, Xuan-Phi Nguyen, Yifei Ming, Caiming Xiong, Shafiq Joty
Accepted to #ACL2026
(3/7) From Passive Metric to Active Signal: A Survey on the New Paradigm of Uncertainty in Large Language Models: bit.ly/3O4NIrI
Authors: Jiaxin Zhang, Wendi Cui, Zhuohang Li, Lifu Huang, Brad Malin, Caiming Xiong, Chien-Sheng Wu
Accepted to #ACL2026
(2/7) GTA: Generating Long-horizon Tasks for Web Agents at Scale
Authors: Tenghao Huang, Kung-Hsiang Huang, Prafulla Kumar Choubey, Yilun Zhou, Chien-Sheng Wu
Accepted to #ACL2026
(1/7) We have 6 papers accepted to ACL 2026, advancing work across web agent evaluation, LLM reasoning verification, uncertainty quantification, long-context efficiency, and multilingual judge systems.
ACL 2026 takes place July 2-7 in San Diego, California.
#ACL2026 #FutureOfAI #EnterpriseAI
That's a wrap on #TDX26! The @salesforce.com AI Research team was on the ground in San Francisco, connecting with the community and sharing the latest from our labs. From research demos to conversations about what's next in #EnterpriseAI, it was great to be part of the energy!
Reliability over raw power. @aimagazine.bsky.social features Silvio Savarese on why enterprise AI value comes from integrated systems, not bigger models, and how AI Foundry is building that foundation. https://bit.ly/41z0G4E
#SystemLevelAI #AIFoundry
AI Foundry: Turning foundational research into enterprise AI products faster. Silvio Savarese and Itai Asseo in TechFinitive: bit.ly/4ccYmFy
#FutureOfAI #EnterpriseAI #AgenticAI
The model wars are over. Enterprise AI success now lives at the system level. Silvio Savarese and Itai Asseo in @technologymag.bsky.social: bit.ly/4ttSBu4
#FutureOfAI #EnterpriseAI #AgenticAI
Ambient intelligence is moving from research into live sales and service workflows. Silvio Savarese discusses with @cxtoday.com: bit.ly/47DUhJ1
#FutureOfAI #EnterpriseAI #AgenticAI
Three agentic AI trends shaping the enterprise through 2027. Silvio Savarese and Itai Asseo discuss AI Foundry with @cio.com: bit.ly/4sg6iMf
#FutureOfAI #EnterpriseAI #AgenticAI
One demo. Reliable replay. No cloud calls. GPA turns a single recorded workflow into deterministic desktop automation, entirely on-device.
Explore GPA: bit.ly/48r7Onp
Read the blog: sforce.co/4sYdhu8
#EnterpriseAI #GUIAutomation
Why #EnterpriseAI demands a shift from models to systems. Itai Asseo discusses AI Foundry with @diginomica.com: bit.ly/4scZwGT
#FutureOfAI #AgenticAI
From One Demo to Reliable Automation: How GPA Reimagines GUI Process Automation https://sforce.co/4sYdhu8
Show it a workflow once. GPA replays it reliably, locally, and without brittle scripts to maintain.
#FutureOfAI #EnterpriseAI #AgenticAI #GUIAutomation
(5/5) Authors: Jielin Qiu, Zixiang Chen, Liangwei Yang, @mingzhu0527.bsky.social, Zhiwei Liu, Juntao Tan, @wenting088.bsky.social, Rithesh Murthy, Roshan Ram,
@aksh555.bsky.social, @shelbyhai.bsky.social, @caimingxiong.bsky.social, Silvio Savarese, Huan Wang
#FutureOfAI #EnterpriseAI
(4/5) Using Deepgram, vLLM, and ElevenLabs, the team hit a P50 time-to-first-audio of 947ms (best case 729ms), ~17× faster than native speech-to-speech. Full 9-chapter tutorial with working code.
(3/5) The key insight: "realtime" isn't one fast model. It's streaming + pipelining across components. A cascaded STT → LLM → TTS pipeline where each stage streams output to the next achieves sub-1-second response.
(2/5) Native speech-to-speech models like Qwen2.5-Omni produce quality audio but are too slow for realtime (~13s time-to-first-audio) and don't support function calling, a must for enterprise agents.
(1/5) Building Enterprise Realtime Voice Agents from Scratch: A Technical Tutorial
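A rough sketch of the streaming + pipelining idea from this post, not the tutorial's actual code. The three stages are hypothetical stand-ins (no Deepgram, vLLM, or ElevenLabs calls), connected by `asyncio.Queue` objects so each stage hands its output downstream as soon as a piece is ready. The names `stt_stage`, `llm_stage`, `tts_stage`, and `run_pipeline` are assumptions made for illustration.

```python
import asyncio

END = None  # sentinel marking the end of a stream


async def stt_stage(audio_chunks, out_q, log):
    """Hypothetical STT stage: emits one transcript piece per audio chunk."""
    for i, _chunk in enumerate(audio_chunks):
        await asyncio.sleep(0)  # stand-in for recognition latency
        await out_q.put(f"w{i}")
        log.append(f"stt:{i}")
    await out_q.put(END)


async def llm_stage(in_q, out_q, log):
    """Hypothetical LLM stage: streams one token per transcript piece."""
    i = 0
    while (piece := await in_q.get()) is not END:
        await asyncio.sleep(0)  # stand-in for generation latency
        await out_q.put(piece.upper())
        log.append(f"llm:{i}")
        i += 1
    await out_q.put(END)


async def tts_stage(in_q, audio, log):
    """Hypothetical TTS stage: turns each token into an audio frame label."""
    i = 0
    while (token := await in_q.get()) is not END:
        await asyncio.sleep(0)  # stand-in for synthesis latency
        audio.append(f"audio:{token}")
        log.append(f"tts:{i}")
        i += 1


async def run_pipeline(audio_chunks, log):
    """Run all three stages concurrently so downstream work starts early."""
    stt_to_llm, llm_to_tts = asyncio.Queue(), asyncio.Queue()
    audio = []
    await asyncio.gather(
        stt_stage(audio_chunks, stt_to_llm, log),
        llm_stage(stt_to_llm, llm_to_tts, log),
        tts_stage(llm_to_tts, audio, log),
    )
    return audio
```

Calling `asyncio.run(run_pipeline(range(10), log))` with an empty list `log` records the order in which stages finish each piece. The first audio frame is logged well before the STT stage reaches its last chunk, which is the effect the post attributes to pipelining: time-to-first-audio depends on the first piece flowing through, not on any stage finishing its whole input.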
Paper: bit.ly/4seq7Ee
25+ open-source speech-to-speech models exist, but none shows how to build a complete streaming voice agent with function calling.
5/5 Authors: Jielin Qiu, Liangwei Yang, @mingzhu0527.bsky.social, @wenting088.bsky.social, Zhiwei Liu, Juntao Tan, Zixiang Chen, Roshan Ram, @aksh555.bsky.social, Rithesh Murthy, @shelbyhai.bsky.social, @caimingxiong.bsky.social, Silvio Savarese, Huan Wang
#FutureOfAI #EnterpriseAI
4/5 Tested on 50 insurance products across 10 categories with 2,490 FAQs, 290 coverage details, and 162 pricing tiers. Domain-agnostic and adaptable to any enterprise sales environment.
3/5 2.8-second mean response time with 100% question detection, a 14× speedup over manual search. Cross-product comparisons see the biggest gains at 23×.
2/5 The system streams live audio through speech-to-text, detects customer questions via LLM, then retrieves answers using hybrid FAQ matching and text-to-SQL over a structured product database.
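One possible reading of the retrieval step in this post, sketched under heavy assumptions. Fuzzy string similarity from `difflib` stands in for the FAQ matcher, and a hard-coded template stands in for the LLM text-to-SQL call. The FAQ entries, the `pricing` table schema, the 0.75 threshold, and the FAQ-first ordering are all hypothetical and not taken from the paper.

```python
import sqlite3
from difflib import SequenceMatcher

# Hypothetical FAQ entries, keyed by a normalized question.
FAQS = {
    "what is the waiting period": "The waiting period is 30 days.",
    "how do i file a claim": "File claims through the customer portal.",
}


def match_faq(question, threshold=0.75):
    """Return the best FAQ answer if its similarity clears the threshold."""
    q = question.lower().strip(" ?")
    best_key = max(FAQS, key=lambda k: SequenceMatcher(None, q, k).ratio())
    score = SequenceMatcher(None, q, best_key).ratio()
    return FAQS[best_key] if score >= threshold else None


def question_to_sql(question):
    """Stand-in for an LLM text-to-SQL call: a single hard-coded template."""
    q = question.lower()
    for tier in ("basic", "premium"):
        if tier in q and "price" in q:
            return "SELECT monthly_price FROM pricing WHERE tier = ?", (tier,)
    return None


def answer(question, conn):
    """Try the FAQ matcher first, then fall back to a structured query."""
    faq = match_faq(question)
    if faq is not None:
        return ("faq", faq)
    query = question_to_sql(question)
    if query is not None:
        row = conn.execute(*query).fetchone()
        if row:
            return ("sql", row[0])
    return ("none", None)
```

With an in-memory SQLite database holding a `pricing(tier, monthly_price)` table, a question that closely matches a stored FAQ is answered from the FAQ set, while a pricing question falls through to the parameterized query. A production system would replace both stand-ins with the components the post names.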