FleetingBits

9.9K posts


@fleetingbits

sf thinkcat - https://t.co/LeSsJ4ohsP

emoticat · Joined September 2023
683 Following · 2.5K Followers
Pinned Tweet
FleetingBits@fleetingbits·
[tweet media]
FleetingBits@fleetingbits·
@morqon okay i feel like this should already be true and should have been since opus 4?
morgan —@morqon·
inside openai, by end of march: (1) for any technical task, the tool of first resort for humans is interacting with an agent rather than using an editor or terminal (2) the default way humans utilize agents is explicitly evaluated as safe, but also productive enough that most workflows do not need additional permissions
Greg Brockman@gdb

Software development is undergoing a renaissance in front of our eyes. If you haven't used the tools recently, you are likely underestimating what you're missing. Since December, there's been a step-function improvement in what tools like Codex can do. Some great engineers at OpenAI told me yesterday that their job has fundamentally changed since December: prior to then, they could use Codex for unit tests; now it writes essentially all the code and does a great deal of their operations and debugging. Not everyone has yet made that leap, but it's usually because of factors besides the capability of the model.

Every company faces the same opportunity now, and navigating it well, just like with cloud computing or the Internet, requires careful thought. This post shares how OpenAI is currently approaching retooling our teams toward agentic software development. We're still learning and iterating, but here's how we're thinking about it right now.

As a first step, by March 31st, we're aiming that: (1) for any technical task, the tool of first resort for humans is interacting with an agent rather than using an editor or terminal; (2) the default way humans utilize agents is explicitly evaluated as safe, but also productive enough that most workflows do not need additional permissions.

To get there, here's what we recommended to the team a few weeks ago:

1. Take the time to try out the tools. The tools do sell themselves: many people have had amazing experiences with 5.2 in Codex, after having churned from Codex web a few months ago. But many people are also so busy they haven't had a chance to try Codex yet, or got stuck thinking "is there any way it could do X" rather than just trying.
- Designate an "agents captain" for your team: the primary person responsible for thinking about how agents can be brought into the team's workflow.
- Share experiences or questions in a few designated internal channels.
- Take a day for a company-wide Codex hackathon.

2. Create skills and AGENTS.md.
- Create and maintain an AGENTS.md for any project you work on; update it whenever the agent does something wrong or struggles with a task.
- Write skills for anything you get Codex to do, and commit them to the skills directory in a shared repository.

3. Inventory and make accessible any internal tools.
- Maintain a list of tools your team relies on, and make sure someone takes point on making each one agent-accessible (such as via a CLI or MCP server).

4. Structure codebases to be agent-first. With the models changing so fast, this is still somewhat untrodden ground, and it will require some exploration.
- Write tests which are quick to run, and create high-quality interfaces between components.

5. Say no to slop. Managing AI-generated code at scale is an emerging problem, and it will require new processes and conventions to keep code quality high.
- Ensure that some human is accountable for any code that gets merged. As a code reviewer, maintain at least the same bar as you would for human-written code, and make sure the author understands what they're submitting.

6. Work on basic infra. There's a lot of room for everyone to build basic infrastructure, guided by internal user feedback. The core tools are getting a lot better and more usable, but a lot of infrastructure currently goes around the tools, such as observability, tracking not just the committed code but the agent trajectories that led to it, and central management of the tools that agents are able to use.

Overall, adopting tools like Codex is not just a technical change but a deep cultural one, with a lot of downstream implications to figure out. We encourage every manager to drive this with their team, and to think through other action items: for example, per item 5 above, what else can prevent "functionally-correct but poorly-maintainable code" from creeping into codebases?
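A minimal AGENTS.md along the lines of the second recommendation might look like the sketch below; the project details, commands, and paths are hypothetical, not from the post:

```markdown
# AGENTS.md

## Overview
Hypothetical Python payments service; entry point is `app/main.py`.

## Setup and testing
- Install dev dependencies: `pip install -e ".[dev]"`
- Run the fast unit suite before finishing any task: `pytest -q tests/unit`
- Never run `tests/integration`; it requires live credentials.

## Conventions
- Keep public functions type-annotated; run `ruff check .` before handing off.

## Known pitfalls
- If a task fails twice in a row, stop and ask for clarification instead of retrying.
```

Per the post's advice, a file like this grows over time: each time the agent does something wrong or struggles, the fix gets recorded here.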

FleetingBits@fleetingbits·
@ChaseBrowe32432 i guess it’s unclear what tooling they mean, but they do say the llms crush the benchmark with scaffolding
[tweet media]
Chase Brower@ChaseBrowe32432·
@fleetingbits they say this, which every detractor is using as "evidence" that LLMs just memorize and can't make use of high-level problem solving strategies or learning in-rollout.
[tweet media]
Chase Brower@ChaseBrowe32432·
Opus 4.6 in webui can solve even the "extremely hard" problems btw, not sure what their precise methodology was but they must have heavily hamstrung the models.
Lossfunk@lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%. Presenting EsoLang-Bench. Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵

FleetingBits@fleetingbits·
@ChaseBrowe32432 which is also what they say? that it can solve basically everything with minimal scaffolding
FleetingBits@fleetingbits·
@cursor_ai @srush_nlp given that a lot of people will open this on mobile (since this is twitter), you basically need to have a mobile landing page that explains the new interface
Cursor@cursor_ai·
We're also sharing an early alpha of our new interface. cursor.com/glass
[tweet media]
Cursor@cursor_ai·
Composer 2 is now available in Cursor.
[tweet media]
FleetingBits@fleetingbits·
@KeyTryer i think that is the challenge though; it’s very hard to create ood benchmarks because everything that matters is in distribution; so, i don’t read it as llm capabilities being weak, but as (maybe) this being a good agentic, robustness, icl benchmark
Key 🗝 🦊@KeyTryer·
@fleetingbits They don't have to target them because the model provably can just learn these things if it becomes necessary. This is like asking someone to mentally translate two entirely different paradigms relying on memory and basic rules and without any research.
rank-1@rankdim·
@fleetingbits my recent claude interactions are more gossips lol using kimi code more and more
[tweet media]
FleetingBits@fleetingbits·
i spent $800 so far this month on claude code : O
FleetingBits@fleetingbits·
@Algon_33 yeah, i agree; i think it’s easily possible to hit $15,000 a month, but it is still probably worth it
Name can't be blank@Algon_33·
@fleetingbits If a good engineer worth >$20k/month spends $1k/month for like a 5-20% speedup, that seems worth it.
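The back-of-the-envelope math in this reply can be written out explicitly; the dollar figures are the ones quoted in the thread, and the function is just an illustration:

```python
def net_monthly_value(engineer_value, tool_cost, speedup_pct):
    """Monthly value of a coding tool for one engineer.

    Value created by the speedup minus the tool's cost;
    speedup_pct is an integer percent, e.g. 5 for a 5% speedup.
    """
    return engineer_value * speedup_pct // 100 - tool_cost

# A $1k/month tool breaks even at a 5% speedup for a $20k/month engineer...
print(net_monthly_value(20_000, 1_000, 5))   # 0
# ...and nets $3k/month at the high end of the quoted 5-20% range.
print(net_monthly_value(20_000, 1_000, 20))  # 3000
```

At the quoted figures, anything above a 5% speedup is pure upside, which is why the reply calls the spend "worth it".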
FleetingBits@fleetingbits·
@kagglingdieter have you written up your thoughts on the "Predicting Molecular Properties" competition? would be very curious as to how you tried to work in a domain that you don't know about, and which might be a bit different from other ml tasks, at the execution level?
Dieter@kagglingdieter·
Looking back at the 50, these 5 wins taught me the most about what it actually takes to win at the highest level:
- AIMO 2 (1st): A brutal test of LLM reasoning and system complexity. This was a true team effort to push the boundaries of what these models can actually calculate.
- Google Landmark Recognition (1st - Solo): Scaling CV to 81,000 classes entirely on my own. This was as much a data engineering challenge as it was a modeling one.
- ASL Fingerspelling (1st): One of my favorites. We had to balance creativity with extreme inference efficiency to make AI that actually works for accessibility.
- Predicting Molecular Properties: Proof that you don't need to be a chemist to contribute to science. This taught me how to find the signal in a completely new domain.
- Bengali.AI (1st): My first Solo Gold to reach GM. It showed me how much "domain-aware" architectures can beat raw compute.
Dieter@kagglingdieter·
50 Kaggle Gold Medals. 🏅 It’s hard to believe, but I hit a milestone that seemed impossible when I first joined @kaggle. 50 Golds.
[tweet media]
FleetingBits@fleetingbits·
@mjbommar this would be a good thing to game out; maybe you can do it the same way, in terms of copyright to the environment, and then a license to use the copyright requiring that the model be open sourced?
Michael Bommarito@mjbommar·
but in the US at least, Title 17 is fairly quiet about data and other "works" of that ilk. see, e.g., the whole history of Creative Commons. case law is somewhere between lacking and even weaker than protecting software works. so in the absence of clear statutory changes, there is no traditional, protectable economic incentive
FleetingBits@fleetingbits·
some thoughts on openai acquiring astral
1) this follows the anthropic acquisition of bun; the labs want to incorporate companies that provide essential tooling to software developers
2) the hyperscalers supported open source projects because they were a common infrastructure layer that they all shared, and also a kind of employee comp
3) the labs seek to commercialize software development itself, so these projects have value to them, and they have the distribution to sell them
4) the fact that these companies are small is an advantage, because it is easier to absorb a few cracked developers into a project than a large team
5) this is part of a broader trend where we should expect to see the labs take over important open source software projects
6) this is because when the labs are selling coding, they become responsible for the quality of their supply chain, in terms of security and interoperability
OpenAI Newsroom@OpenAINewsroom

We've reached an agreement to acquire Astral. After we close, OpenAI plans for @astral_sh to join our Codex team, with a continued focus on building great tools and advancing the shared mission of making developers more productive. openai.com/index/openai-t…

FleetingBits@fleetingbits·
@FakePsyho you sort of have to go through all the predictions though
[tweet media]
Psyho@FakePsyho·
Seems that AI 2027 (ridiculed for "impossible" timelines) severely underestimated the speed of progress in late 2025 / early 2026: - AI coding agents have a much greater impact than the projected speedups - OpenAI alone already matched the revenue estimate two months earlier ($25B in Feb); if we combine revenue from all frontier labs, we've probably already matched the Jan 2027 estimate ($55B) I wouldn't be that much surprised if the authors revert to their original timelines at some point
[tweet media]