Motoki Wu 🌊🦈
@plusepsilon

1.4K posts

Resisting the urge to finetune @ Salient. ex-@cresta and other doodlings.

Oakland, CA · Joined October 2013
3.3K Following · 829 Followers

Pinned Tweet
Motoki Wu 🌊🦈 @plusepsilon ·
All this compute power for language models, but most of our knowledge is in screenshots.
0 replies · 1 repost · 4 likes · 0 views
Motoki Wu 🌊🦈 @plusepsilon ·
2026 will be the year of Claude Code UIs.
0 replies · 0 reposts · 0 likes · 60 views
Motoki Wu 🌊🦈 @plusepsilon ·
Coding agents are so good now that you should be using them for everything. Text classification? Coding agent. RAG? Coding agent. Prompt optimization? Coding agent. Recipe for next week's Christmas potluck? Coding agent.
0 replies · 0 reposts · 1 like · 76 views
Motoki Wu 🌊🦈 @plusepsilon ·
The only late night parties in SF are labeling parties 🥳.
0 replies · 0 reposts · 0 likes · 85 views
Motoki Wu 🌊🦈 @plusepsilon ·
If RL needs pre-training-level scaling, are we admitting that RL learns really slowly...
0 replies · 0 reposts · 2 likes · 138 views
Motoki Wu 🌊🦈 @plusepsilon ·
OpenAI chose Windsurf because Ctrl+N doesn't delete your prompt :P
0 replies · 0 reposts · 1 like · 152 views
Motoki Wu 🌊🦈 @plusepsilon ·
Agents for recall, workflows for precision.
3 replies · 0 reposts · 5 likes · 223 views
Motoki Wu 🌊🦈 @plusepsilon ·
Post-training is making a comeback. MLEs can finally breathe.
0 replies · 0 reposts · 0 likes · 181 views
Motoki Wu 🌊🦈 @plusepsilon ·
@Thom_Wolf My favorite was τ-bench (arxiv.org/abs/2406.12045). It evaluates on states, so the trajectory of the conversation and the agent's implementation matter less, and there's no need to rely on LLM evals. It strongly influenced how I structure agent workflows.
0 replies · 0 reposts · 2 likes · 90 views
Thomas Wolf @Thom_Wolf ·
What was the most impactful/visible/useful release on evaluation in AI in 2024?
12 replies · 1 repost · 23 likes · 11.4K views
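To make the state-based idea concrete: a minimal sketch of that style of evaluation, not τ-bench's actual harness. The `Environment`, the task schema, and the `agent.run` interface below are hypothetical stand-ins; the point is that grading compares a final mock-database state to a gold state, so no LLM judge is needed.

# Minimal sketch of state-based agent evaluation in the spirit of τ-bench.
# Assumptions: `agent.run` drives the conversation and calls tools on `env`;
# the Environment and task fields are illustrative, not τ-bench's API.

from dataclasses import dataclass, field


@dataclass
class Environment:
    """Mock reservations database the agent mutates through tool calls."""
    reservations: dict = field(default_factory=dict)

    def cancel(self, reservation_id: str) -> None:
        self.reservations[reservation_id]["status"] = "cancelled"


def evaluate(agent, task: dict, gold_state: dict) -> bool:
    env = Environment(reservations=dict(task["initial_reservations"]))
    agent.run(task["user_request"], env)  # hypothetical agent interface
    # Pass/fail depends only on the final state, not on the trajectory,
    # so two very different conversations can both count as correct.
    return env.reservations == gold_state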
Motoki Wu 🌊🦈 @plusepsilon ·
There's no reason to FOMO over a new model when you can wait one month and FOMO over a better model :)
1 reply · 0 reposts · 8 likes · 226 views
Motoki Wu 🌊🦈 @plusepsilon ·
Can o1 at least help us with Pydantic upgrades?
1 reply · 1 repost · 0 likes · 242 views
Motoki Wu 🌊🦈 @plusepsilon ·
Now we can take coffee breaks during test time!
0 replies · 0 reposts · 1 like · 155 views
Motoki Wu 🌊🦈 @plusepsilon ·
Some Bad:
* you pretty much have to restart PyCharm if something goes wrong
* it's the only thing that makes the fans spin up on my M1
* not Cursor 😛
0 replies · 0 reposts · 0 likes · 75 views
Motoki Wu 🌊🦈 @plusepsilon ·
The Good:
* strong autocomplete
* codebase understanding with jump-to-definition
* jump to previous cursor position
* interactive tables
* stable enough for remote notebooks
Essentially, having all the IDE goodies in a notebook keeps me in the flow.
1 reply · 0 reposts · 0 likes · 100 views
Motoki Wu 🌊🦈 @plusepsilon ·
After 10(?) years, I think I finally found a better alternative to just running plain Jupyter notebooks: PyCharm's notebook integration. I've been using it off and on for a while, but it's reached a point where I can wholeheartedly recommend it.
2 replies · 0 reposts · 0 likes · 152 views
Motoki Wu 🌊🦈 @plusepsilon ·
I just implemented agentic RAG. ☺️
[image]
1 reply · 0 reposts · 1 like · 175 views
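The attached image isn't visible here, so for reference, a bare-bones version of the pattern looks something like the sketch below. It assumes a tool-calling chat model; `llm` and `search` are hypothetical stand-ins, not the implementation from the screenshot.

# Hypothetical agentic RAG loop: instead of always stuffing retrieved chunks
# into the prompt, the model decides when to call a search tool and iterates
# until it can answer. `llm` and `search` are stand-in callables.

def agentic_rag(question, llm, search, max_steps=5):
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = llm(messages, tools=["search"])  # assumed tool-calling interface
        if reply.tool_call is None:
            return reply.content  # model chose to answer directly
        docs = search(reply.tool_call.arguments["query"])
        messages.append({"role": "tool", "content": "\n".join(docs)})
    return llm(messages).content  # step budget exhausted; answer with what we have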
Motoki Wu 🌊🦈 @plusepsilon ·
@abacaj I've found loss masking to work well, but it does lower the number of tokens the loss is computed on. So you may have to compensate for that (more epochs / higher batch size).
0 replies · 0 reposts · 1 like · 81 views
anton @abacaj ·
Couple of questions that I haven't seen answered yet:
1. Why did the Stanford Alpaca team set instructions to -100, forcing the model to condition on outputs?
2. Why do I see a better model (including proper validation loss) training on the full sample, including instructions?
5 replies · 0 reposts · 15 likes · 2K views
anton @abacaj ·
Why would SFT show increasing validation loss and still an "improving" model? Two runs, same hparams. Only difference? Training on "inputs".

In the original Stanford Alpaca they set the instructions to -100 to condition the model on the outputs. But removing this line shows a steady validation loss (blue line) - and a model that performs better in my benchmarks... What gives?
[image]
10 replies · 3 reposts · 49 likes · 27.6K views
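For anyone following along, the -100 in question is PyTorch's default `ignore_index` for cross-entropy, which is what Alpaca-style training relies on. A minimal sketch of the masking being debated (illustrative, not the Alpaca code itself):

# Loss masking as discussed above: label prompt tokens with -100 so
# torch.nn.CrossEntropyLoss ignores them and the loss is computed only on
# the response. Masking shrinks the number of supervised tokens per batch,
# hence the suggestion to compensate with more epochs or a larger batch size.

import torch

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(prompt_ids, response_ids):
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return {
        "input_ids": torch.tensor([input_ids]),
        "labels": torch.tensor([labels]),
    }


# Example with made-up token ids: only the 2 response tokens contribute to the loss.
batch = build_labels(prompt_ids=[101, 2023, 2003], response_ids=[2204, 102])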