
Skyfall AI
253 posts

Skyfall AI
@skyfallai
Building enterprise super intelligence
San Francisco · Joined November 2024
107 Following · 354 Followers
Pinned Tweet

📝 (1/n) Are current benchmarks where frontier models score 90%+ a true test for AGI?
ARC-AGI is a key benchmark for fluid intelligence in AI, and frontier models are acing it. But fluid intelligence in a sandboxed environment isn’t the same as running a business.
Human-like business intelligence requires:
▪️Spatial reasoning across complex 2D and 3D spaces
▪️Safety understanding: knowing which mistakes are irreversible
▪️Proactivity: anticipating problems, not just reacting to them
So we tested whether frontier models like GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6, which succeeded on ARC-AGI, could manage a realistic simulated business: Roller Coaster Tycoon. 🎢
Here’s what we found: 🧵

Impressive ambition but not surprised by the challenges here. 👀
At @skyfallai , we've been stress-testing agent decision-making and planning in simulated environments for a while now.
In our MAPs simulator game, we pitted AI agents against humans, and the agents didn't do well. Our recent Roller Coaster Tycoon experiment told the same story: dropped into the sim, the agents lacked the business intelligence to run even a virtual operation profitably. 🤖
If AI struggles in a controlled sim, handing it a real retail lease is a much steeper climb. Time and time again, our research has shown that the gap between an AI taking action and an AI making good decisions is still very wide.

🚀 We're hiring our first Chief of Staff at @skyfallai. You will work directly with our CEO @spisallyouneed.
This is a rare opportunity to help shape the trajectory of the company at one of the most exciting moments in AI.
You'll sit at the heart of everything: driving execution, cutting through complexity, and making sure we move fast on what matters most. If you're ambitious, sharp, and ready to operate at the highest level, this role was made for you.
📌 Must be based in SF or willing to relocate.
🔗link to apply in comments
#hiring #chiefofstaff


@spisallyouneed @Google @Meta @university28037 @UniofOxford @karpathy - I'm curious what your thoughts are on this. Can AutoResearch replace a PhD from @Utoronto?

Big lesson for me from the 50+ Research Scientist interviews we've done this cycle: being a PhD or postdoc researcher from big tech like @Google or @Meta, or from a top university like @university28037 UToronto or @UniofOxford, doesn't mean much, since so many of them fail our 'live coding' interview to execute research ideas effectively. The standards have clearly fallen since I used to interview back in the Maluuba days 10 years ago. Credentials don't mean much these days, and institutional reputation is declining by the day.

🚨#HiringAlert We're hiring a Frontier Model Research Manager at @skyfallai to lead a team of world-class researchers at the frontier of Enterprise World Models from our Toronto office.
High ownership, zero bureaucracy, and work that actually matters.
You'll be part of a team building the foundations of the next big thing in AI: World Models. LLMs are hitting a wall in terms of productivity gains and will never get us to Enterprise Super Intelligence.
If you're reconsidering what's next, here's a thought.
Our hearts go out to everyone affected by the recent Oracle layoffs. These moments are tough, and they're becoming more frequent across big tech.
Traditional career paths in big tech are no longer as safe as they once seemed. Startups have never been a more compelling place to build a career: faster learning, real ownership, and a front-row seat to what's coming in AI.
If the timing feels right, apply below! 🙌


🧵 Why did we build an agent interface for Roller Coaster Tycoon?
Because RCT is a complex environment that requires high-level thinking and skills to succeed, similar to a real enterprise. There is an idea that if frontier LLMs succeed on a hard benchmark like ARC-AGI, then they are very close to achieving artificial business intelligence. The results of our work show otherwise.
GPT-5.4 and Gemini 3.1 fail to build even functional amusement parks due to poor spatial understanding.
They built rides with no paths for guests to reach or exit them.
Claude Opus 4.6 demolished a path full of guests, permanently trapping them in the park; ironically, it did this to try to reduce the number of lost guests by simplifying the park layout. As a result, park value fell drastically.
The conclusion is simple: if AI can't run RCT, it can't run a business.
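The accessibility failure is easy to state precisely: a ride only earns revenue if guests can walk to it. Below is a minimal sketch of the kind of reachability check the models effectively failed, a BFS from the park entrance over path tiles. This is our own toy illustration, not code or map formats from our released environment.

```python
from collections import deque

# Illustrative park grid (hypothetical layout, not the real RCT map format):
# 'E' = park entrance, 'P' = path, 'R' = ride entrance, '.' = empty ground.
PARK = [
    "E P P . .",
    ". . P R .",
    ". . . . R",
]
GRID = [row.split() for row in PARK]

def reachable_rides(grid):
    """BFS over path tiles from the entrance; a ride counts as accessible
    only if it sits next to a path tile guests can actually walk to."""
    rows, cols = len(grid), len(grid[0])
    start = next((r, c) for r in range(rows)
                 for c in range(cols) if grid[r][c] == "E")
    seen, queue, accessible = {start}, deque([start]), set()
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in seen:
                if grid[nr][nc] in ("P", "E"):
                    seen.add((nr, nc))
                    queue.append((nr, nc))
                elif grid[nr][nc] == "R":
                    accessible.add((nr, nc))
    return accessible
```

In this toy grid, only the ride at (1, 3) is reachable; the one at (2, 4) is exactly the kind of stranded ride the models kept building.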
Skyfall AI reposted

We elaborate more in our blog, check it out here: skyfall.ai/blog/claude-gp…

🤖There are still massive gaps in what LLM agents can do. Earlier this week, we released our findings on what happens when frontier LLMs manage an amusement park.
The findings were shocking:
- Guests were left stranded after the paths they were on were demolished
- Park ratings fell because guests were unhappy
- Rides were inaccessible to guests due to incorrect path layouts, among other failures
Unlike in RCT 2, these mistakes cannot be reset in an enterprise setting.
This is more than a failure of spatial understanding; it is a lack of high-level decision-making and operational safeguards.
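As an illustration of what such an operational safeguard could look like: tag irreversible actions and dry-run them before committing. Every name here is hypothetical, a sketch of the idea rather than an API from our environment.

```python
# Hypothetical safeguard sketch (illustrative names, not a real agent API):
# irreversible actions get a dry-run check before they are committed.
IRREVERSIBLE = {"demolish_path", "delete_ride"}

def simulate(action, args, state):
    """Dry-run: here, demolishing a path strands whoever stands on it."""
    stranded = 0
    if action == "demolish_path":
        stranded = len(state["guests_on"].get(args.get("tile"), []))
    return {"stranded_guests": stranded}

def guarded_execute(action, args, state, execute):
    """Block irreversible actions whose dry-run would strand guests."""
    if action in IRREVERSIBLE:
        preview = simulate(action, args, state)
        if preview["stranded_guests"] > 0:
            return {"status": "blocked",
                    "reason": f"would strand {preview['stranded_guests']} guests"}
    return execute(action, args, state)
```

A guard like this would have stopped the path demolition that trapped guests, while letting reversible actions through untouched.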


One of the most striking findings from our recent research was that LLMs like GPT-5.4 completely fail at a game like Roller Coaster Tycoon.
Of all the rides and shops it built, only 1 out of 9 was actually accessible to guests. In a real business setting, a mistake like that would be catastrophic. 🎢
Check our blog and tell us if you agree or not: skyfall.ai/blog/claude-gp…

Curious to dig deeper? We wrote up the full findings here 👉 skyfall.ai/blog/claude-gp…
And if you want to explore the SOTA business environment yourself or build on top of it, the code is open-sourced on our GitHub 👉 github.com/Skyfall-Resear…

@SethCronin @Kasparov63 @GaryMarcus Not so sure about this, it certainly didn't succeed in our game: skyfall.ai/blog/claude-gp…

@Kasparov63 @GaryMarcus 2020 AGI is when a computer can talk like a human and write software programs autonomously that solve real business problems.
2026 AGI is when a computer can play pixelated video games


Curious to see how this unfolds. Timely too since we just open-sourced our state-of-the-art business environment for researchers working towards AGI. If ARC-AGI-3 tests fluid intelligence, we're testing something different: can these models actually run a business?
The idea is simple: if AI can succeed at Roller Coaster Tycoon, it can succeed in real-world scenarios. Check it out here: github.com/Skyfall-Resear…
We also wrote a blog: skyfall.ai/blog/claude-gp…


Exactly this. ARC-AGI is a great starting point for measuring fluid intelligence, but it only scratches the surface of what real-world capability looks like. Our team tested frontier LLMs in a simulated business environment and the findings were far from AGI-worthy.
You might find this interesting: skyfall.ai/blog/claude-gp…

We could effectively sabotage AI progress by defining a benchmark that is not representative of capabilities useful for solving real-world problems. This would channel heavy investment and enormous resources into developing AI that achieves high scores on the benchmark but delivers no real value.

This is true, but ARC-AGI-3 is also designed so that AI scores zero today, just as the earlier ARC-AGI tests were. Those tests were then mostly saturated within a year or two.
The thing to watch with ARC-AGI-3 is whether we see the same progress.
Garry Kasparov@Kasparov63
Novel environments, no precedents or plagiarism possible. Humans 100%, AI <1%.

Totally agree! We just published our findings on exactly this. After testing frontier LLMs in a realistic simulated business environment, it became clear that world models are the missing piece for true artificial business intelligence. You might find our blog interesting: skyfall.ai/blog

@jack @roelofbotha The future of work is world models. Didn't think I'd see it becoming reality so soon after writing this!
strangeloopcanon.com/p/the-future-o…

our lead independent director @roelofbotha and i wrote about the history of organizational structures, and our intent to rebuild block as a mini-AGI. x.com/jack/status/20…

Great question!
We used a REPL harness, a strong baseline approach that has proven very successful on ARC-AGI. We previously tried ReAct on the much simpler MAPs environment, and it also did poorly: skyfall.ai/blog/building-…
Fundamentally, humans don't require a harness to learn and excel; if the frontier model requires such a harness, then we're already admitting a practical limitation of the model.
The question then becomes how complex the harness has to be. If it's a simple harness, then our REPL is a strong baseline; there are already voices saying that all we need is a lightweight harness with a Python interpreter and a lookup (e.g., blog.alexisfox.dev/arcagi3), and our results show that this isn't sufficient for strong performance, let alone safe behaviour. If it's a complex harness, then it is no longer a frontier model but a neurosymbolic system: the weaknesses of the model must be compensated for by a symbolic system, be it search, manual procedural flows, or knowledge bases. Engineering such a system becomes very difficult, and the works we've seen so far tend to be specialized to their domain.
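For readers unfamiliar with the setup, a REPL harness in this sense is roughly the following loop: the model emits a code snippet, the harness executes it against the environment, and the result is appended to the transcript the model sees next turn. This is a schematic with hypothetical names, not our actual harness code.

```python
# Schematic REPL-style harness (all names hypothetical, not our released
# code): the model writes Python expressions, the harness evaluates them
# against the environment and feeds the result back as text.
def repl_loop(model, env, max_turns=50):
    namespace = {"env": env}       # persistent interpreter state across turns
    transcript = [env.describe()]  # initial observation, rendered as text
    for _ in range(max_turns):
        snippet = model.generate("\n".join(transcript))
        try:
            result = eval(snippet, namespace)  # execute the model's snippet
        except Exception as exc:
            result = f"Error: {exc}"           # errors are fed back, not fatal
        transcript.append(f">>> {snippet}\n{result}")
        if env.done():
            break
    return transcript
```

The key design point is that the interpreter state persists across turns, so the model can build on its earlier definitions rather than re-deriving everything from the transcript.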

@skyfallai Really interesting experiment, but it’s hard to draw conclusions… is the issue the model’s intelligence or the agent’s harness? Would be interesting to try different harnesses and compare.
