Sumuk

1.3K posts

Sumuk

@sumukx

research @google / prev @PrimeIntellect @huggingface, phded for a while at @siebelschool

San Francisco, CA Katılım Eylül 2023

858 Takip Edilen655 Takipçiler

Sabitlenmiş Tweet

Sumuk@sumukx·2 Nis

we're launching 🤗 yourbench today, an open source tool for custom benchmarking and synthetic data generation from ANY of your documents. it's a big step towards improving how model evaluations work early access link in replies! (1/8)

English

292

48.5K

Sumuk@sumukx·15h

"llms are a trick" he says, as he makes the same content for the 85th time, much like a next token predictor

Mo@atmoio

I'm done. I'm f***ing done.

English

Sumuk retweetledi

Shuhaib Mehri@shuhaibmehri·12 May

We compare the distributions of real and simulated user behaviors. A few takeaways from our results across 24 LLMs: - Scale alone isn't enough: Llama-3.1-8B-Instruct beats Llama-3.3-70B-Instruct - 8B models specifically trained as user simulators rival the best closed-source models - Open-source is competitive: gemma-4-31B-it and gpt-oss-120b outperform several closed-source models

English

1.2K

Sumuk@sumukx·12 May

@thsottiaux @ajambrosino Please battery life needs to be solved it eats too much battery. Only reason I use the cli when on my battery and app otherwise

English

127

Tibo@thsottiaux·12 May

Now that the Codex app is close to being the super app. What should the super duper app do?

English

1.2K

2.7K

196.6K

Sumuk retweetledi

Shuhaib Mehri@shuhaibmehri·11 May

What happens when you compare the distributions of real and simulated user behaviors? 🔍 The gap is large. We introduce a method to measure this gap and evaluate 24 LLM-based user simulators across coding and writing tasks. @convai_uiuc @MSFTResearch @berkeley_ai 🧵 1/N

English

191

29.6K

Sumuk@sumukx·29 Nis

A. they do some kind of vector steering towards exploration that just boosts "goblin" related logits? B. pre-training / post-training quirk like the em-dash? I think its B, but A would be monstrously cool

Arena.ai@arena

It's true. Here's a plot of GPT models and their usage of "goblin", "gremlin", "troll", etc over time. There's no anti-gremlin system instruction on our side, we get to see GPT-5.5 run free.

English

479

Sumuk@sumukx·29 Nis

@Parikshit_K_ lol yeah bro you were just born late that’s all otherwise you’d be an MTS at anthropic

English

4.4K

Schindler Rao Shinde@Parikshit_K_·29 Nis

There is nothing Meritocratic about tech once you have some baseline level of skills of competence. It's all about being in right place right time. VIT grads who passed in 2010s are working at Anthropic as MTS, if they graduated today they'd struggle to land a L4 job as ICs.

English

176

3.2K

126.8K

Sumuk@sumukx·24 Nis

@akshat_b are you supposed to just host your code on your own gitlab instance now?

English

2.2K

Akshat Bubna@akshat_b·24 Nis

Didn't think Github's reliability could get worse, and then they ship a bug that _randomly reverts previously merged commits_. Betting that this caused multiple serious production issues out there.

Tom Elliott@theotherelliott

This GitHub incident is insane. Merge queue commits have been reverting previously merged commits at random. This not only breaks the mental contract teams have with Git in general, but is subtle enough to be really hard to unravel after the fact. githubstatus.com/incidents/zsg1…

English

108

2.4K

531.5K

Sumuk@sumukx·23 Nis

@veggie_eric

GIF

QME

108

Sumuk@sumukx·22 Nis

What are we even doing here man

Amol Avasare@TheAmolAvasare

Getting lots of questions on why the landing page / docs were updated if only 2% of new signups were affected. This was understandably confusing for the 98% of folks not part of the experiment, and we've reverted both the landing page and docs changes.

English

351

Sumuk@sumukx·22 Nis

@ishand @cursor_ai Woooooo

216

Ishan Deshpande@ishand·22 Nis

can't wait to work with the team at @cursor_ai there's still time to join the rocketship :)

SpaceX@SpaceX

SpaceXAI and @cursor_ai are now working closely together to create the world’s best coding and knowledge work AI. The combination of Cursor’s leading product and distribution to expert software engineers with SpaceX’s million H100 equivalent Colossus training supercomputer will allow us to build the world’s most useful models. Cursor has also given SpaceX the right to acquire Cursor later this year for $60 billion or pay $10 billion for our work together.

English

160

7.4K

Sumuk@sumukx·21 Nis

What we really need to do is communicate our desires better to the models (the issue) because the cost of implementation tends to 0 (the actual PR) Very strange times ahead because we’ve been training engineers to do more of the latter and not enough of the former

🍉 Abubakar Abid@abidlabs

To contribute to open-source: a thoughtful issue >> a thoughtless PR

English

138

Sumuk@sumukx·20 Nis

@KatiaAmeri or the fact that people on X are the most likely to actually want to do it!

English

Katia Ameri@KatiaAmeri·20 Nis

When I interview people for alpha, I always like to ask where they learned about us first and the number one answer by a long shot is from X. This is despite the fact that we’ve done a ton of LinkedIn marketing and sent hundreds of thousands of emails. I wonder if it’s actually what they saw or just what people remember seeing. Either way, it’s really interesting data for us to know that X is still owning so much mindshare with students

English

1.3K

Sumuk@sumukx·19 Nis

Feelings drive reasoning. Emotions drive reasoning. Being curious allows you to explore a file system. Being anxious lets you write better test cases. By this logic humans are lumps of meat with electrochemical signals and don’t need any of these either.

Jim Stewartson, Decelerationist 🇨🇦🇺🇦🇺🇸@jimstewartson

I swear this is a dystopian parody filmed in 1996 as a warning about how the internet could go wrong. CHATBOTS DON’T HAVE A PSYCHOLOGY. THEY DON’T HOLD VALUES. THEY DON’T NEED FUCKING THERAPY. If chatbot companies are *paying* morons like this, what else do you need to know?

English

148

Sumuk@sumukx·17 Nis

@max_spero_ @sebkrier Do you think it’s just fundamentally impossible to train this out of the models because of how they work Or does it just need a special reward for “grammar correction only”, etc?

English

236

Max Spero@max_spero_·17 Nis

I say this all the time, but a large reason why LLMs are detectable is they have preferences instilled into them through training data and RL. Asking a model to rewrite something gives the LLM an opportunity to apply its own preferences to your text! Sometimes the preferences are helpful, like proper grammar and spelling. But other times, it actively erases the author's intent and voice - softening language to bring statements closer to what the LLM is comfortable with (see below) - replacing the author's metaphors with the LLM's preferred metaphors - replacing the author's voice (tics, sentence structure, vocabulary choice) with a more "default" voice the LLM prefers

keysmashbandit@keysmashbandit

Please, I'm begging you, try to critically examine the differences between these two pieces of writing. ChatGPT editing did not improve this. Every single change only served to weaken your claims significantly. Everything is now hedged into oblivion: no longer have you outlined a "problem," now it's merely a "flaw." "It is true" now demoted to "it appears to be the case." "Is" gets a "usually" tacked on. A thesis statement at the end of the first paragraph gets run over by noisy, out-of-context example-whittling. All for fear of being misconstrued. And at the end, the argument that gets spat out isn't even yours anymore! You argued that Graeber failed to create a true account of work because he did not understand Chesterton's Fence. ChatGPT is arguing is that it is possible some apparently bullshit jobs could be secretly load-bearing if you squint. These are two different statements. The second is weaker and less compelling. It says less. And it's fucking longer! Don't do this anymore! Stop doing this! It's worse!!!

English

317

41K

Sumuk@sumukx·16 Nis

Opus 4.7 is a significantly smaller model (probably ~4.6 Sonnet scale) distilled from Mythos Why? Mythos has been internally avail since early feb giving them ample time to pretrain and logit distill from mythos a small focused base also addresses their compute crunch issue

Chubby♨️@kimmonismus

Hold on, something doesnt add up here. Opus 4.7 got much worse in needle in the haystack? need to dig into this

English

294

Sumuk@sumukx·16 Nis

@kalomaze damn what happen

English

363

kalomaze@kalomaze·16 Nis

RIP to the prime office TV... gone but not forgotten... 💔

English

10.3K

Sumuk@sumukx·16 Nis

If you've tried out openclaw in the past, and found it too sloppy, I'd highly encourage you all to give hermes agent by @NousResearch a try. Give it access to emails (with a local model if you're scared), and keep an open mind. :)

English

227

Sumuk retweetledi

llm_enjoyer@LLMenjoyer·12 Nis

pro tip: scale is all u need, actually

English

310

15.9K

Sumuk@sumukx·12 Nis

@creet_z @GlennMatlin @nikitabier yea idk what he can really do about this given that people actually like this shit lol

English

Christian@creet_z·12 Nis

@GlennMatlin @sumukx @nikitabier Mine is retarded as ever just like nikita

English

Sumuk@sumukx·12 Nis

@nikitabier absolute Ws in the chat. more of this to make X a real town square

Deva Hazarika@devahaz

For You feed is great today, this is Nikita’s Liberation Day

English

Sumuk@sumukx·12 Nis

@Ron They have to know it’s much worse, right? Why not just be upfront about it instead of gaslighting everyone?

English

132

Ron@Ron·12 Nis

They're very smart but diabolical. Stuff like the reasoning tweak provides cover so they can point to something that on the surface seems a plausible explanation, except it doesn't check out on deeper analysis. They have always done this. Model performance is adjusted in steps downward until new model releases, then they repeat the cycle.

English

174

Keşfet

@thsottiaux @ajambrosino @convai_uiuc @MSFTResearch @berkeley_ai @Parikshit_K_ @akshat_b @veggie_eric