aj (@anndvision)
1.5K posts

postdoc @Columbia · member of @blei_lab · phd @UniofOxford · prev @OATML_Oxford, @PVG_McGill, intern @Meta · he/they

nyc · Joined March 2011
654 Following · 1.3K Followers

Pinned Tweet
aj @anndvision ·
new preprint: "ReLU to the Rescue: Improve your On-policy Actor-Critic with Positive Advantages". shockingly simple changes to A3C can yield a cautious RL algorithm that is more effective than PPO in some settings. just adding a ReLU is enough! arxiv.org/abs/2306.01460
[image]
2 replies · 16 reposts · 88 likes · 41.2K views
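The trick the tweet describes is clamping advantage estimates to be non-negative before the policy-gradient update, so the policy only moves toward actions that beat the baseline. A minimal sketch in Python/NumPy; the function and variable names are illustrative, not taken from the paper's code:

```python
import numpy as np

def policy_gradient_loss(log_probs, advantages, positive_only=True):
    """Advantage actor-critic policy loss, optionally with ReLU'd advantages.

    log_probs:  log pi(a_t | s_t) for the sampled actions, shape (T,)
    advantages: estimated advantages A(s_t, a_t), shape (T,)

    With positive_only=True, transitions with negative advantage contribute
    nothing, giving the "cautious" update the tweet describes.
    """
    if positive_only:
        advantages = np.maximum(advantages, 0.0)  # the ReLU
    # standard REINFORCE-style objective (negated, since optimizers minimize)
    return -(log_probs * advantages).mean()

# toy check: the two negative-advantage transitions are masked out
log_probs = np.array([-0.5, -1.2, -0.3, -0.9])
advantages = np.array([1.0, -2.0, 0.5, -0.1])
print(policy_gradient_loss(log_probs, advantages))
```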
aj retweeted
TensorZero @TensorZero ·
We’re building TensorZero Autopilot, an automated AI engineer that analyzes LLM observability data, optimizes prompts and models, sets up evals, and runs A/B tests. It dramatically improves the performance of LLM agents on every single benchmark we’ve tried. Read more below.
[image]
1 reply · 6 reposts · 34 likes · 8.1K views
aj @anndvision ·
[image]
0 replies · 0 reposts · 0 likes · 150 views
aj retweeted
TensorZero @TensorZero ·
🗞️ [Blog Post] Bandits in your LLM Gateway: Improve LLM Applications Faster with Adaptive Experimentation (A/B Testing)
• Experimentation (A/B testing) with production traffic is the most reliable way to identify the best prompts and models for your task, but traditional approaches have significant limitations: you must either fix the experiment length in advance (risking wasted data or inconclusive results) or repeatedly check for significance (inflating error rates through p-hacking).
• TensorZero now provides adaptive experimentation directly in its open-source LLM gateway. This multi-armed bandit algorithm overcomes the p-hacking problem, running experiments precisely until there's enough evidence to pick a winner while dynamically allocating LLM inference traffic for maximum efficiency.
• Across a diverse set of realistic and challenging environments, adaptive experimentation reduced the average time to correctly identify the best LLM variants (prompts, models, etc.) by 37% compared to simple A/B testing.
Read more ↓
1 reply · 1 repost · 6 likes · 1.9K views
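The post contrasts fixed-horizon A/B tests with adaptive traffic allocation. As a rough illustration of the bandit idea (not TensorZero's actual algorithm; the post doesn't specify its stopping rule or allocation scheme), here is a minimal Thompson-sampling router over prompt variants with binary feedback:

```python
import random

class ThompsonRouter:
    """Thompson sampling over variants with success/failure feedback.

    Each variant keeps a Beta(successes + 1, failures + 1) posterior over
    its success rate; each request goes to whichever variant draws the
    highest sample, so better variants automatically get more traffic.
    """

    def __init__(self, variants):
        self.stats = {v: {"success": 0, "failure": 0} for v in variants}

    def choose(self):
        # sample a plausible success rate per variant, route to the best draw
        draws = {
            v: random.betavariate(s["success"] + 1, s["failure"] + 1)
            for v, s in self.stats.items()
        }
        return max(draws, key=draws.get)

    def update(self, variant, success):
        self.stats[variant]["success" if success else "failure"] += 1

# toy simulation: variant "b" is truly better (60% vs 45% success rate)
true_rates = {"a": 0.45, "b": 0.60}
router = ThompsonRouter(["a", "b"])
for _ in range(2000):
    v = router.choose()
    router.update(v, random.random() < true_rates[v])
print(router.stats)  # most traffic should have flowed to "b"
```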
TensorZero @TensorZero ·
Shuyang Li was previously a staff software engineer at Google focused on next-generation search infrastructure, LLM-based search, and many other specialized search products (local, travel, maps, etc.). Before that, he worked on ML/analytics products at Palantir and graduated summa cum laude from Notre Dame. Welcome to the team, @_shuyang_!
[image]
1 reply · 1 repost · 7 likes · 752 views
aj @anndvision ·
or during value estimation, matter of fact
0 replies · 0 reposts · 0 likes · 104 views
aj @anndvision ·
changing tab from path completion to a think toggle in claude code is wild
0 replies · 0 reposts · 0 likes · 123 views
aj @anndvision ·
algorithms have been doing me dirty all week, this one's quality

Quoted: aj @anndvision
is reinforcement fine tuning worth it? @OpenAI's RFT can be 700x more expensive than SFT and has stricter content moderation. i tested it on data extraction, agentic coding, and customer service to find out 🧵
0 replies · 0 reposts · 2 likes · 256 views
aj @anndvision ·
i ran this using @tensorzero's open-source stack 💾 github.com/tensorzero/llm…
includes:
• programmatic sft/rft workflows
• llm grader configs
• evaluation methodology
1 reply · 0 reposts · 1 like · 169 views
aj @anndvision ·
is reinforcement fine tuning worth it? @OpenAI's RFT can be 700x more expensive than SFT and has stricter content moderation. i tested it on data extraction, agentic coding, and customer service to find out 🧵
[image]
1 reply · 3 reposts · 4 likes · 658 views
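For context on the comparison: SFT trains on fixed input/output pairs, while RFT optimizes a reasoning model against a grader that scores its outputs (hence the stricter moderation and higher cost). A minimal sketch of launching an SFT job with the OpenAI Python SDK; the file name and model are placeholders, and the RFT variant (which additionally needs a grader config) is omitted:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# upload a JSONL file of {"messages": [...]} chat-format training examples
training_file = client.files.create(
    file=open("train.jsonl", "rb"),  # placeholder path
    purpose="fine-tune",
)

# launch a supervised fine-tuning (SFT) job; the model name is a placeholder
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id, job.status)
```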
aj @anndvision ·
agentic coding (terminal-bench): RFT wins here. it improved performance where SFT failed (at a 241x cost premium). if you're building agents that benefit from reasoning and have the budget, this might be your use case
[image]
0 replies · 0 reposts · 0 likes · 39 views
aj @anndvision ·
data extraction (CoNLL++ NER): RFT improves performance with 10 examples... but SFT on a larger dataset did better, with:
• 159x lower optimization cost
• 11x cheaper inference
• 3x faster responses
[image]
1 reply · 0 reposts · 0 likes · 47 views