
Skyfall AI
253 posts

Skyfall AI
@skyfallai
Building enterprise super intelligence
San Francisco · Joined November 2024
107 Following · 354 Followers
Pinned Tweet

📝 (1/n) Are current benchmarks where frontier models score 90%+ a true test for AGI?
ARC-AGI is a key benchmark for fluid intelligence in AI, and frontier models are acing it. But fluid intelligence in a sandboxed environment isn’t the same as running a business.
Human-like business intelligence requires:
▪️Spatial reasoning across complex 2D and 3D spaces
▪️Safety understanding: knowing which mistakes are irreversible
▪️Proactivity: anticipating problems, not just reacting to them
So we tested whether frontier models like GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6, which succeeded on ARC-AGI, could manage a realistic simulated business: Roller Coaster Tycoon. 🎢
Here’s what we found: 🧵

Impressive ambition but not surprised by the challenges here. 👀
At @skyfallai , we've been stress-testing agent decision-making and planning in simulated environments for a while now.
In our MAPs simulator game, we pitted AI agents against humans, and the agents didn't do well. Our recent Roller Coaster Tycoon experiment told the same story: dropped into the sim, the agents lacked the business intelligence to run even a virtual operation profitably. 🤖
If AI struggles in a controlled sim, handing it a real retail lease is a much steeper climb. Time and time again, our research has shown that the gap between an AI taking action and an AI making good decisions is still very wide.

🚀 We're hiring our first Chief of Staff at @skyfallai. You will work directly with our CEO @spisallyouneed.
This is a rare opportunity to help shape the trajectory of the company at one of the most exciting moments in AI.
You'll sit at the heart of everything: driving execution, cutting through complexity, and making sure we move fast on what matters most. If you're ambitious, sharp, and ready to operate at the highest level, this role was made for you.
📌 Must be based in SF or willing to relocate.
🔗link to apply in comments
#hiring #chiefofstaff


@spisallyouneed @Google @Meta @university28037 @UniofOxford @karpathy - I'm curious what your thoughts are on this. Can AutoResearch replace a PhD from @Utoronto?

Big lesson for me from the 50+ Research Scientist interviews we've done this cycle: being a PhD or postdoc researcher from big tech like @Google or @Meta, or from a top university like @university28037 UToronto or @UniofOxford, doesn't mean much, since so many of them fail our 'live coding' interview to execute research ideas effectively. The standards have clearly fallen since I used to interview back in the Maluuba days 10 years ago. Credentials don't mean much these days, and institutional reputation is declining by the day.

🚨#HiringAlert We're hiring a Frontier Model Research Manager at @skyfallai to lead a team of world-class researchers at the frontier of Enterprise World Models from our Toronto office.
High ownership, zero bureaucracy, and work that actually matters.
You'll be part of a team building the foundations of the next big thing in AI: World Models. LLMs are hitting a wall in terms of productivity gains and will never get us to Enterprise Super Intelligence.
If you're reconsidering what's next, here's a thought.
Our hearts go out to everyone affected by the recent Oracle layoffs. These moments are tough, and they're becoming more frequent across big tech.
Traditional career paths in big tech are no longer as safe as they once seemed. Startups have never been a more compelling place to build a career: faster learning, real ownership, and a front-row seat to what's coming in AI.
If the timing feels right, apply below! 🙌


🧵 Why did we build an agent interface for Roller Coaster Tycoon?
Because RCT is a complex environment that requires high-level thinking and skills to succeed, similar to a real enterprise. There is an idea that if frontier LLMs succeed on a hard benchmark like ARC-AGI, then they are very close to achieving artificial business intelligence. The results of our work show otherwise.
GPT-5.4 and Gemini 3.1 fail to build even functional amusement parks due to poor spatial understanding.
They built rides with no paths for guests to reach or exit them.
Claude Opus 4.6 demolished a path full of guests, permanently trapping them in the park; ironically, it did this to try to reduce the number of lost guests by simplifying the park layout. As a result, park value fell drastically.
The conclusion is simple: if AI can't run RCT, it can't run a business.
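The accessibility failure is easy to state precisely: a ride only earns revenue if guests can walk to it. Below is a minimal sketch of the kind of reachability check the models effectively failed, a BFS from the park entrance over path tiles. This is our own toy illustration, not code or map formats from our released environment.

```python
from collections import deque

# Illustrative park grid (hypothetical layout, not the real RCT map format):
# 'E' = park entrance, 'P' = path, 'R' = ride entrance, '.' = empty ground.
PARK = [
    "E P P . .",
    ". . P R .",
    ". . . . R",
]
GRID = [row.split() for row in PARK]

def reachable_rides(grid):
    """BFS over path tiles from the entrance; a ride counts as accessible
    only if it sits next to a path tile guests can actually walk to."""
    rows, cols = len(grid), len(grid[0])
    start = next((r, c) for r in range(rows)
                 for c in range(cols) if grid[r][c] == "E")
    seen, queue, accessible = {start}, deque([start]), set()
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in seen:
                if grid[nr][nc] in ("P", "E"):
                    seen.add((nr, nc))
                    queue.append((nr, nc))
                elif grid[nr][nc] == "R":
                    accessible.add((nr, nc))
    return accessible
```

In this toy grid, only the ride at (1, 3) is reachable; the one at (2, 4) is exactly the kind of stranded ride the models kept building.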
Skyfall AI reposted

We elaborate more in our blog, check it out here: skyfall.ai/blog/claude-gp…

🤖There are still massive gaps in what LLM agents can do. Earlier this week, we released our findings on what happens when frontier LLMs manage an amusement park.
The findings were shocking:
- Guests were left stranded after the paths they were on were demolished
- Park ratings fell because guests were unhappy
- Rides were inaccessible to guests due to incorrect path layouts, among other failures
Unlike in RCT 2, these mistakes cannot be reset in an enterprise setting.
This is more than a failure of spatial understanding; it is a lack of high-level decision-making and operational safeguards.
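As an illustration of what such an operational safeguard could look like: tag irreversible actions and dry-run them before committing. Every name here is hypothetical, a sketch of the idea rather than an API from our environment.

```python
# Hypothetical safeguard sketch (illustrative names, not a real agent API):
# irreversible actions get a dry-run check before they are committed.
IRREVERSIBLE = {"demolish_path", "delete_ride"}

def simulate(action, args, state):
    """Dry-run: here, demolishing a path strands whoever stands on it."""
    stranded = 0
    if action == "demolish_path":
        stranded = len(state["guests_on"].get(args.get("tile"), []))
    return {"stranded_guests": stranded}

def guarded_execute(action, args, state, execute):
    """Block irreversible actions whose dry-run would strand guests."""
    if action in IRREVERSIBLE:
        preview = simulate(action, args, state)
        if preview["stranded_guests"] > 0:
            return {"status": "blocked",
                    "reason": f"would strand {preview['stranded_guests']} guests"}
    return execute(action, args, state)
```

A guard like this would have stopped the path demolition that trapped guests, while letting reversible actions through untouched.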


One of the most striking findings from our recent research was that LLMs like GPT-5.4 completely fail at a game like Roller Coaster Tycoon.
Of all the rides and shops it built, only 1 out of 9 was actually accessible to guests. In a real business setting, a mistake like that would be catastrophic. 🎢
Check our blog and tell us if you agree or not: skyfall.ai/blog/claude-gp…

Curious to dig deeper? We wrote up the full findings here 👉 skyfall.ai/blog/claude-gp…
And if you want to explore the SOTA business environment yourself or build on top of it, the code is open-sourced on our GitHub 👉 github.com/Skyfall-Resear…

@SethCronin @Kasparov63 @GaryMarcus Not so sure about this, it certainly didn't succeed in our game: skyfall.ai/blog/claude-gp…

@Kasparov63 @GaryMarcus 2020 AGI is when a computer can talk like a human and write software programs autonomously that solve real business problems.
2026 AGI is when a computer can play pixelated video games


Curious to see how this unfolds. Timely too since we just open-sourced our state-of-the-art business environment for researchers working towards AGI. If ARC-AGI-3 tests fluid intelligence, we're testing something different: can these models actually run a business?
The idea is simple: if AI can succeed at Roller Coaster Tycoon, it can succeed in real-world scenarios. Check it out here: github.com/Skyfall-Resear…
We also wrote a blog: skyfall.ai/blog/claude-gp…


Exactly this. ARC-AGI is a great starting point for measuring fluid intelligence, but it only scratches the surface of what real-world capability looks like. Our team tested frontier LLMs in a simulated business environment and the findings were far from AGI-worthy.
You might find this interesting: skyfall.ai/blog/claude-gp…

We could effectively sabotage AI progress by defining a benchmark that is not representative of capabilities useful for solving real-world problems. This would channel heavy investment and enormous resources into developing AI that achieves high scores on the benchmark but delivers no real value.

This is true, but ARC-AGI-3 is also designed so that AI scores zero today, just as the earlier ARC-AGI tests were. Those tests were then mostly saturated within a year or two.
The thing to watch with ARC-AGI-3 is whether we see the same progress.
Garry Kasparov@Kasparov63
Novel environments, no precedents or plagiarism possible. Humans 100%, AI <1%.

Totally agree! We just published our findings on exactly this. After testing frontier LLMs in a realistic simulated business environment, it became clear that world models are the missing piece for true artificial business intelligence. You might find our blog interesting: skyfall.ai/blog

@jack @roelofbotha The future of work is world models. Didn't think I'd see it becoming reality so soon after writing this!
strangeloopcanon.com/p/the-future-o…

our lead independent director @roelofbotha and i wrote about the history of organizational structures, and our intent to rebuild block as a mini-AGI. x.com/jack/status/20…

Great question!
We used a REPL harness, a strong baseline approach that has proven very successful on ARC-AGI. We previously tried ReAct on the much simpler MAPs environment, and it also did poorly: skyfall.ai/blog/building-…
Fundamentally, humans don't require a harness to learn and excel; if the frontier model requires such a harness, then we're already admitting a practical limitation of the model.
The question then becomes how complex the harness has to be. If it's a simple harness, then our REPL is a strong baseline; there are already voices saying that all we need is a lightweight harness with a Python interpreter and a lookup (e.g., blog.alexisfox.dev/arcagi3), and our results show that this isn't sufficient for strong performance, let alone safe behaviour. If it's a complex harness, then it is no longer a frontier model but a neurosymbolic system: the weaknesses of the model must be compensated for by a symbolic system, be it search, manual procedural flows, or knowledge bases. Engineering such a system becomes very difficult, and the works we've seen so far tend to be specialized to their domain.
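For readers unfamiliar with the setup, a REPL harness in this sense is roughly the following loop: the model emits a code snippet, the harness executes it against the environment, and the result is appended to the transcript the model sees next turn. This is a schematic with hypothetical names, not our actual harness code.

```python
# Schematic REPL-style harness (all names hypothetical, not our released
# code): the model writes Python expressions, the harness evaluates them
# against the environment and feeds the result back as text.
def repl_loop(model, env, max_turns=50):
    namespace = {"env": env}       # persistent interpreter state across turns
    transcript = [env.describe()]  # initial observation, rendered as text
    for _ in range(max_turns):
        snippet = model.generate("\n".join(transcript))
        try:
            result = eval(snippet, namespace)  # execute the model's snippet
        except Exception as exc:
            result = f"Error: {exc}"           # errors are fed back, not fatal
        transcript.append(f">>> {snippet}\n{result}")
        if env.done():
            break
    return transcript
```

The key design point is that the interpreter state persists across turns, so the model can build on its earlier definitions rather than re-deriving everything from the transcript.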

@skyfallai Really interesting experiment, but it’s hard to draw conclusions… is the issue the model’s intelligence or the agent’s harness? Would be interesting to try different harnesses and compare.
