alexusadays
773 posts

alexusadays
@alexusadays
Learn Quality Assurance from Scratch https://t.co/fNTFKJVQh3
Joined December 2022
629 Following, 159 Followers
alexusadays retweeted

🤯BREAKING: Alibaba just proved that AI coding isn't taking your job; it's just writing the legacy code that will keep you employed fixing it for the next decade. 🤣
Passing a coding test once is easy. Maintaining that code for 8 months without it exploding? Apparently, it’s nearly impossible for AI.
Alibaba tested 18 AI agents on 100 real codebases over 233-day cycles. They didn't just look for "quick fixes"; they looked for long-term survival.
The results were a bloodbath:
75% of models broke previously working code during maintenance.
Only Claude Opus 4.5/4.6 maintained a >50% zero-regression rate.
Every other model accumulated technical debt that compounded until the codebase collapsed.
We’ve been using "snapshot" benchmarks like HumanEval that only ask "Does it work right now?"
The new SWE-CI benchmark asks: "Does it still work after 8 months of evolution?"
Most AI agents are "Quick-Fix Artists." They write brittle code that passes tests today but becomes a maintenance nightmare tomorrow. They aren't building software; they're building a house of cards.
The narrative just got honest: Most models can write code. Almost none can maintain it.
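The post does not define "zero-regression rate." A minimal sketch, assuming it means the fraction of maintenance runs in which no previously passing test later fails; `MaintenanceRun`, `has_regression`, and `zero_regression_rate` are hypothetical names, not SWE-CI's actual code.

```python
from dataclasses import dataclass


@dataclass
class MaintenanceRun:
    """One hypothetical agent run over a codebase's evolution.

    Each snapshot maps test name -> passed?, ordered in time
    (one snapshot per maintenance step). Assumed data shape.
    """
    test_results: list[dict[str, bool]]


def has_regression(run: MaintenanceRun) -> bool:
    """True if any test that passed at some step fails at a later step."""
    ever_passed: set[str] = set()
    for snapshot in run.test_results:
        for name, passed in snapshot.items():
            if passed:
                ever_passed.add(name)
            elif name in ever_passed:
                return True  # previously working code broke
    return False


def zero_regression_rate(runs: list[MaintenanceRun]) -> float:
    """Fraction of runs that never broke a previously passing test."""
    if not runs:
        return 0.0
    clean = sum(1 for run in runs if not has_regression(run))
    return clean / len(runs)


runs = [
    # Test "b" starts failing but is later fixed: not a regression.
    MaintenanceRun([{"a": True}, {"a": True, "b": False}, {"a": True, "b": True}]),
    # Test "a" passed, then broke during maintenance: a regression.
    MaintenanceRun([{"a": True}, {"a": False}]),
]
print(zero_regression_rate(runs))  # → 0.5
```

Under this reading, a ">50% zero-regression rate" would mean that more than half of a model's runs finished without breaking anything that had previously worked.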

alexusadays retweeted

Amazon is holding a mandatory meeting about AI breaking its systems. The official framing is "part of normal business." The briefing note describes a trend of incidents with "high blast radius" caused by "Gen-AI assisted changes" for which "best practices and safeguards are not yet fully established." Translation to human language: we gave AI to engineers and things keep breaking?
The response for now? Junior and mid-level engineers can no longer push AI-assisted code without a senior signing off. AWS spent 13 hours recovering after its own AI coding tool, asked to make some changes, decided instead to delete and recreate the environment (the software equivalent of fixing a leaky tap by knocking down the wall). Amazon called that an "extremely limited event" (the affected tool served customers in mainland China).


@elonmusk @karpathy @farzyness Thank god; that's when we can start receiving our UBI and figure out our real purpose 😂

Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference. I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project.
This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc etc. This is the bread and butter of what I do daily for 2 decades. Seeing the agent do this entire workflow end-to-end and all by itself as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real", I didn't find them manually previously, and they stack up and actually improved nanochat. Among the bigger things e.g.:
- It noticed an oversight that my parameterless QKnorm didn't have a scalar multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the Value Embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.
This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism.
github.com/karpathy/nanoc…
All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course - you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges.
And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
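The workflow described above (propose a change, measure validation loss, keep it only if it helps, plan the next one) is at its core a keep-if-better search loop. A minimal sketch of that greedy loop, assuming a toy quadratic "validation loss" and random perturbations in place of an LLM agent; `val_loss`, `propose`, `autoresearch`, and the hyperparameter names are hypothetical and are not nanochat's or autoresearch's actual code.

```python
import random


def val_loss(cfg: dict[str, float]) -> float:
    """Toy stand-in for 'train the model and measure validation loss'.

    Hypothetical objective, minimized at lr=0.02, weight_decay=0.1.
    """
    return (cfg["lr"] - 0.02) ** 2 * 1e3 + (cfg["weight_decay"] - 0.1) ** 2


def propose(cfg: dict[str, float], rng: random.Random) -> dict[str, float]:
    """Stand-in for the agent: perturb one setting by up to +/-20%."""
    new = dict(cfg)
    key = rng.choice(sorted(new))
    new[key] *= 1 + rng.uniform(-0.2, 0.2)
    return new


def autoresearch(cfg: dict[str, float], steps: int, seed: int = 0):
    """Greedy loop: accept a change only if validation loss improves."""
    rng = random.Random(seed)
    best_loss = val_loss(cfg)
    kept = []  # accepted configs, in order; each builds on the previous one
    for _ in range(steps):
        candidate = propose(cfg, rng)
        loss = val_loss(candidate)
        if loss < best_loss:
            cfg, best_loss = candidate, loss
            kept.append(dict(candidate))
    return cfg, best_loss, kept


start = {"lr": 0.05, "weight_decay": 0.3}
final, loss, kept = autoresearch(start, steps=700)
print(f"loss {val_loss(start):.4f} -> {loss:.4f} after {len(kept)} kept changes")
```

Because every accepted change is applied on top of the previous best config, improvements stack, which mirrors the "additive" behavior the post reports. The sketch omits the step the post also describes, checking that changes found on a small model transfer to larger ones before promoting them.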


Is AI interactive content just gaming? And a very small sub-stack of it?
My biggest question lately and leaning more and more towards yes after what I see at GDC on day 1.
#gdc

alexusadays retweeted

I held a meeting on the situation in the Middle East and the Gulf region – the challenges for Ukraine and our partners, as well as our capacity to help protect lives, prevent the war from expanding, and stabilize global markets.
The Iranian regime, which is striving to survive at any cost, poses clear threats to all states in the region and to global stability. No country close to Iran can feel secure. Shipping through the Strait of Hormuz has practically stopped. So far, the Iranian regime has shown no genuine intent for honest diplomacy or fundamental change.
Ukraine is consulting with partners in Europe and the United States, as well as with countries neighboring Iran. Yesterday, I spoke with the leaders of the UAE and Qatar. Today, I held discussions with the leaders of Jordan and Bahrain. There will also be talks with Kuwait and other countries in the region. All of them face a serious challenge and speak openly about it: Iranian attack drones are the same “shaheds” that have been striking our cities, villages, and our Ukrainian infrastructure throughout this war.
In just a few days, Iran has launched over 800 missiles of various types and more than 1,400 attack drones. It is Iranian drones and missiles that pose the main threat to free navigation, destabilizing global prices for oil, petroleum products, and gas.
Ukraine can contribute to protecting lives and stabilizing the situation. Partners are reaching out to us. I have tasked Ukraine’s Minister of Foreign Affairs, together with intelligence agencies, the Minister of Defense, our military command, and the NSDC Secretary, to present options for assisting the relevant countries and to provide aid in a way that does not weaken our own defense here in Ukraine.
Our military possesses the necessary capabilities. Ukrainian experts will operate on-site, and teams are already coordinating these efforts. And we are ready to help protect lives, defend civilians, and support real efforts to stabilize the situation and, in particular, restore safe navigation in the region.
We expect the European Union, European countries, and the G7 to take active measures both in dismantling the Iranian regime’s terrorist capabilities and in protecting lives in the region and global stability. We will continue to coordinate with our partners. Glory to Ukraine!



alexusadays retweeted

The real story is the $25 million per mile price tag they’re betting on.
Nashville’s own 2018 light rail plan priced at $200 million per mile. New York’s East Side Access cost $3.5 billion per mile. The LA Metro expansion is running $1 billion per mile. The Boring Company says it can build 13 miles of twin tunnels through Nashville for $240-300 million total.
That’s a 95% cost reduction from the industry average. If the number holds, it rewrites the economics of every transit project in America. If it doesn’t, a few hundred million in private capital evaporates and taxpayers lose nothing. That risk asymmetry explains why Tennessee said yes when LA, Chicago, Baltimore, and DC all said no.
The engineering gamble is wild. 12-foot diameter tunnels instead of 28-foot. Fully electric Prufrock machines that mine continuously instead of stopping every 5 feet to install lining segments. Zero people in the tunnel during operations. A machine that “porpoises” into the ground from a truck instead of requiring million-dollar launch pits and cranes.
Every one of those innovations has worked in Las Vegas sand. None have been tested in karst limestone, the geology that creates sinkholes, caves, and underground streams. Their own CEO said at the unveiling that Nashville would not be their choice if they were optimizing for easiest places to tunnel.
This tells you everything about what The Boring Company is actually trying to prove. Nashville is where the thesis meets the hardest possible geology. 50 inches of annual rainfall versus Vegas’s 4. Rock that creates underground caves and streams. They just signed a construction contract in Dubai too, meaning they need Nashville to work before the next project launches.
The internal memo from the governor’s office estimates 1 mile per month. The Boring Company’s website claims 1 mile per week. That 4x gap between political planning and corporate marketing will determine whether this finishes in 2027 or 2030.
Week 7, when Prufrock-MB2 arrives, is when this gets real. Two machines boring simultaneously through Tennessee limestone will answer the question the entire tunneling industry has been debating for a decade: whether a startup can actually outrun the physics that made infrastructure the slowest-moving sector in construction.
The Boring Company@boringcompany
Tunneling has begun in Nashville - we are 2.5 feet in! Looking ahead:
- Weeks 1-3: Prufrock-MB1 launches and undergoes a series of tests and calibrations (low production)
- Weeks 4-6: scale to high production
- Week 7: Prufrock-MB2 arrives
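A quick arithmetic check of the figures quoted in the thread, using only the numbers it cites. It assumes the quoted boring rates apply per route mile; the thread does not say whether the twin tunnels double the length to be bored, so that is an assumption, and the variable names are illustrative.

```python
# Figures quoted in the thread (USD).
ROUTE_MILES = 13
BORING_TOTAL_LOW, BORING_TOTAL_HIGH = 240e6, 300e6
COMPARISONS_PER_MILE = {
    "Nashville 2018 light rail plan": 200e6,
    "LA Metro expansion": 1e9,
    "NY East Side Access": 3.5e9,
}


def per_mile(total: float, miles: float) -> float:
    """Total project cost spread evenly over route miles."""
    return total / miles


def reduction(new_cost: float, old_cost: float) -> float:
    """Percentage cost reduction going from old_cost to new_cost."""
    return 100 * (1 - new_cost / old_cost)


low = per_mile(BORING_TOTAL_LOW, ROUTE_MILES)
high = per_mile(BORING_TOTAL_HIGH, ROUTE_MILES)
print(f"Boring Co. quote: ${low / 1e6:.1f}M to ${high / 1e6:.1f}M per mile")
for name, cost in COMPARISONS_PER_MILE.items():
    print(f"vs {name}: {reduction(high, cost):.1f}% to {reduction(low, cost):.1f}% cheaper")

# Schedule gap: 1 mile/month (governor's memo) vs 1 mile/week (company site).
# Assumption: rates are per route mile, not per tunnel-mile of twin bores.
WEEKS_PER_MONTH = 52 / 12
months_memo = ROUTE_MILES / 1.0               # 1 mile per month
months_site = ROUTE_MILES / WEEKS_PER_MONTH   # 1 mile per week, in months
print(f"memo: {months_memo:.0f} months, site: {months_site:.1f} months, "
      f"ratio {months_memo / months_site:.2f}x")
```

The $240-300 million total works out to roughly $18-23 million per route mile, in line with the ~$25 million-per-mile headline. The percentage saved depends heavily on which baseline is chosen, and because a month is about 4.3 weeks, the memo-versus-website gap is slightly larger than 4x.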
alexusadays retweeted

🚨 META’s head of AI safety and alignment gets her emails nuked by OpenClaw
>be director of AI Safety and Alignment at Meta
>install OpenClaw
>give it unrestricted access to personal emails
>it starts nuking emails
>“Do not do that”
>*keeps going*
>“Stop don’t do anything”
>*gets all remaining old stuff and nukes it as well*
>“STOP OPENCLAW”
>“I asked you to not do that”
>“do you remember that?”
>“Yes I remember. And I violated it.”
>“You’re right to be upset”
LMAOOOOOOOO



alexusadays retweeted

Britain took on fifty nations. African kings. Arab sultans.
Bombarded ports. Deposed rulers who refused. Lost 1,600 men. Spent 40% of the entire Treasury.
A debt so big it wasn't paid off until 2015. You or your parents were still paying for it.
Not for land. Not for gold.
To end the slave trade.
They captured 1,600 ships. Freed 150,000 people. Patrolled 3,000 miles of coastline for sixty years.
And none of it was the government's idea. It was 400,000 ordinary people who signed petitions and 300,000 families who refused to buy sugar.
They forced Parliament's hand. Your ancestors changed the world and nobody told you.
If you think this should be taught in schools, help us reach more people: proudofus.co.uk/support
Be part of us
Be Proud Of Us 🇬🇧
alexusadays retweeted

From Day One, @xAI has been upgrading grid infrastructure and ensuring ratepayers don’t pick up our tab.
Today, we are continuing our commitment by installing our own power lines to power our MACROHARD facilities.

alexusadays retweeted

21 people are controlling computers with their thoughts. Not in a research demo. In their daily lives.
Neuralink's Telepathy trial has nearly doubled since September. The implant reads your motor cortex and translates neural signals into digital commands. No voice. No eye tracking. No hands. One participant hit able-bodied cursor speeds in his first week. Others are typing at 40 words per minute using imagined finger movements.
A med student uses it 17 hours a day to study. A woman who hadn't controlled a computer in 20 years is making art. A father with ALS rigged a 360 camera to his wheelchair and controls it telepathically so he can watch his kids.
Next up: 3x the electrodes and a new trial targeting real-time speech at 140 WPM. So far, there have been zero serious device-related adverse events across every participant.
Brain-computer interfaces are no longer theoretical.
alexusadays retweeted

Making the dry electrode process work at scale, which is a major breakthrough in lithium battery production technology, was incredibly difficult.
Congratulations to the @Tesla engineering, production and supply chain teams and our strategic partner suppliers for this excellent achievement!