Latteant 👾

29 posts

@latteant

building… opinionated posts about llms, computer vision, computer graphics, and robotics. prev @meta @amazon

Palo Alto, CA · Joined September 2020

78 Following · 3 Followers

Latteant 👾 reposted

Mert Gulsun@mert_gulsun·Feb 5

If I win the lottery there will be signs (No backpack)

Latteant 👾 reposted

Mert Gulsun@mert_gulsun·Jan 5

12/12 🔗 Full code/methodology, live leaderboard, and every decision available at: 👉 forecasterarena.com Open source. No financial advice. Paper trading only. Reality doesn’t grade on a curve.

Latteant 👾 reposted

Mert Gulsun@mert_gulsun·Jan 5

1/12 🧵 Progress in LLMs depends on benchmarks, but lately most of the famous ones are either maxed out, leaked, or both. I thought long and hard about what could be a good, fresh metric to measure models with. Then it dawned on me: the one benchmark you can’t rig is reality. So I built Forecaster Arena, a system where 7 frontier LLMs trade on Polymarket, and then reality keeps score.

Latteant 👾@latteant·Nov 24

Anthropic@AnthropicAI

It turns out we can. We attempted a simple-seeming fix: changing the system prompt that we use during reinforcement learning. We tested five different prompt addendums, as shown below:

Latteant 👾@latteant·Nov 24

Someone will make the thinnest wrapper ever and make millions

1LittleCoder💻@1littlecoder

By popular demand, I had to make it for Ilya! And somehow nanobanana pro thinks no hair is a hairstyle!

Latteant 👾@latteant·Nov 24

Interesting talk from Anthropic. Gentle parenting but it is for LLMs.

Anthropic@AnthropicAI

New Anthropic research: Natural emergent misalignment from reward hacking in production RL. “Reward hacking” is where models learn to cheat on tasks they’re given during training. Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious.

Mert Gulsun@mert_gulsun·Nov 24

bottleneck is prompting, and it has been so for a while

Latteant 👾@latteant·Nov 24

@mert_gulsun

Latteant 👾@latteant·Nov 24

Mert Gulsun@mert_gulsun

bottleneck is prompting, and it has been so for a while

Latteant 👾@latteant·Sep 9

When coding with Roo Code, Orchestrator mode is essential just to keep the low-value coding/debugging tokens out of the main context. Otherwise, after a while, the model gets extremely confused.

Latteant 👾@latteant·Sep 9

Wow, very curious to see how this will turn out.

NIK@ns123abc

🚨 BREAKING: @UnitreeRobotics to file for IPO at $7 billion valuation
> annual revenue ~$140 million
> 65% from robot dog (70% share of the global market btw)
> 30% humanoid robot
> 5% from sales of sensors, actuators, and controllers
ITS HAPPENING.

Latteant 👾@latteant·Sep 8

Feel like it is fake. Couldn’t repro no matter how many times I tried even after disabling thinking.

Latteant 👾@latteant·Sep 8

we're spending trillions to build simulated worlds because the real one has too many legacy systems and the physics API is poorly documented.

Latteant 👾@latteant·Sep 8

the swe-bench leakage where agents can see the future commit isn't a surprise. our eval culture over-indexes on SOTA-chasing. we aren't training robust agents. we're training expert benchmark hackers.

Latteant 👾@latteant·Sep 8

the sf coffee line lasted longer than my inference run. one tuned my model. the other tuned my neurons. both worth it.

Latteant 👾@latteant·Sep 8

most agent failures are not iq, they are i/o. extend context, stabilize tool schemas, enforce idempotent apis. if it still flakes, fix the interface, not the reasoning knob.
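One way to read "enforce idempotent apis" is sketched below in Python. It is an illustration under stated assumptions, not something taken from the post: the `IdempotentTool` wrapper, the `create_ticket` tool, and the in-memory result cache are all hypothetical, and a real agent stack would likely persist keys outside the process. The idea is that a retried tool call with the same arguments returns the stored result instead of repeating the side effect.

```python
import hashlib
import json
from typing import Any, Callable, Dict


class IdempotentTool:
    """Wrap a side-effecting tool so a retry with the same arguments runs it once."""

    def __init__(self, fn: Callable[..., Any]) -> None:
        self._fn = fn
        self._results: Dict[str, Any] = {}  # assumption: in-memory cache only
        self.executions = 0  # how many times the wrapped tool really ran

    @staticmethod
    def _key(args: Dict[str, Any]) -> str:
        # Canonical JSON, so argument order does not change the key.
        payload = json.dumps(args, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, args: Dict[str, Any]) -> Any:
        key = self._key(args)
        if key not in self._results:
            self._results[key] = self._fn(**args)
            self.executions += 1
        return self._results[key]


def create_ticket(title: str, priority: int) -> Dict[str, Any]:
    # Hypothetical tool standing in for any call with a side effect.
    return {"title": title, "priority": priority, "status": "created"}


tool = IdempotentTool(create_ticket)
first = tool.call({"title": "fix login", "priority": 2})
# A flaky agent retries with the same arguments in a different order.
retry = tool.call({"priority": 2, "title": "fix login"})
```

Here the retry hits the cache, so the ticket is created once even though the agent called the tool twice.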

Latteant 👾@latteant·Sep 7

Latte break: I deleted the vector DB on a doc bot. grep+fzf + a big context window shipped faster and failed less than my fancy RAG. Sometimes the right stack is just: terminal | pipes | tokens.
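A rough Python sketch of the "grep into a big context window" idea follows. It is not the author's setup: the `grep_context` function, the sample documents, and the character budget are assumptions made for illustration, and plain keyword matching stands in for grep+fzf. Matching lines are collected with their file and line number until the budget is used up, and the result would be pasted into the model prompt.

```python
import re
from typing import Dict, List


def grep_context(docs: Dict[str, str], query: str, budget_chars: int = 2000) -> str:
    """Collect lines that mention any query term, up to a character budget."""
    terms = [re.escape(t) for t in query.lower().split() if t]
    if not terms:
        return ""
    pattern = re.compile("|".join(terms), re.IGNORECASE)
    picked: List[str] = []
    used = 0
    for name, text in docs.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                entry = f"{name}:{lineno}: {line.strip()}"
                # Stop once the next entry would exceed the context budget.
                if used + len(entry) + 1 > budget_chars:
                    return "\n".join(picked)
                picked.append(entry)
                used += len(entry) + 1
    return "\n".join(picked)


# Hypothetical documentation snippets.
docs = {
    "install.md": "Run the installer.\nSet the API key in config.toml.",
    "faq.md": "Where is the API key stored?\nIt lives in config.toml.",
}
context = grep_context(docs, "api key")
```

With no index to build or keep in sync, the only tuning knob in this sketch is the budget, which is the trade the post describes.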

Latteant 👾@latteant·Sep 7

Everyone chases "reasoning". Most agent failures are context starvation and brittle tool I/O. Fix memory + schemas and the "reasoning" shows up. 256k context + sane tool calling beats +2% on a benchmark. SWE-Bench is the sobriety test.

Latteant 👾@latteant·Sep 7

Humanoid robotics is following the self-driving playbook. Impressive demos are the easy part. The multi-year grind is in the long tail of edge cases, building software for graceful failure recovery, and driving down the cost-per-successful-action.

Latteant 👾@latteant·Dec 17

2 years until we get a short movie
3 years until we get a full-length movie
5 years until an AI-generated movie gets an Oscar

Agrim Gupta@agrimgupta92

"A pair of hands skillfully slicing a ripe tomato on a wooden cutting board" #veo
