Skyfall AI

253 posts

@skyfallai

Building enterprise super intelligence

San Francisco · Joined November 2024
107 Following · 354 Followers
Pinned Tweet
Skyfall AI @skyfallai
📝 (1/n) Are benchmarks on which frontier models score 90%+ a true test for AGI? ARC-AGI is a key benchmark for fluid intelligence in AI, and frontier models are acing it. But fluid intelligence in a sandboxed environment isn’t the same as running a business. Human-like business intelligence requires:
▪️ Spatial reasoning across complex 2D and 3D spaces
▪️ Safety understanding: knowing which mistakes are irreversible
▪️ Proactivity: anticipating problems, not just reacting to them
So we tested whether frontier models like GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6, which succeeded on ARC-AGI, could manage a simulated business: Roller Coaster Tycoon. 🎢 Here’s what we found: 🧵
15
19
55
66.1K
Skyfall AI @skyfallai
Impressive ambition, but we're not surprised by the challenges here. 👀 At @skyfallai, we've been stress-testing agent decision-making and planning in simulated environments for a while now. In our MAPs simulator game, we pitted AI agents against humans, and the agents didn't do great. The same held in our recent Roller Coaster Tycoon experiment: we dropped agents into a Roller Coaster Tycoon-style sim, and the results showed how much LLMs lack the business intelligence to run even a virtual operation profitably. 🤖 If AI struggles in a controlled sim, handing it a real retail lease is a much steeper climb. Time and time again, our research has shown that the gap between an AI taking action and an AI making good decisions is still very wide.
0
0
0
82
Andon Labs @andonlabs
We gave an AI a 3-year retail lease in SF and asked it to make a profit. The AI interviewed and hired full-time employees, applied for credit, and stocked the store with the books Superintelligence and Making of the Atomic Bomb. Visit Andon Market at 2102 Union St now.
102
156
2.4K
1.9M
Skyfall AI @skyfallai
🚀 We're hiring our first Chief of Staff at @skyfallai. You will work directly with our CEO, @spisallyouneed. This is a rare opportunity to help shape the trajectory of the company at one of the most exciting moments in AI. You'll sit at the heart of everything, from driving execution and cutting through complexity to making sure we move fast on what matters most. If you're ambitious, sharp, and ready to operate at the highest level, this role was made for you. 📌 Must be based in SF or willing to relocate. 🔗 Link to apply in comments. #hiring #chiefofstaff
1
0
2
276
Sam Pasupalak @spisallyouneed
Big lesson for me from the 50+ Research Scientist interviews we have done in this interview cycle: being a PhD or postdoc researcher from big tech like @Google or @Meta, or from a top university like @university28037 UToronto or @UniofOxford, doesn't mean much, since so many of these candidates fail our 'live coding' interview on executing research ideas effectively. The standards have clearly fallen since I used to interview back in the Maluuba days 10 years ago. Credentials don't mean much these days, and institutional reputation is declining by the day.
1
0
1
112
Skyfall AI @skyfallai
🚨 #HiringAlert We're hiring a Frontier Model Research Manager at @skyfallai to lead a team of world-class researchers at the frontier of Enterprise World Models from our Toronto office. High ownership, zero bureaucracy, and work that actually matters. You will be part of a team that is building the foundations of the next big thing in AI: World Models. LLMs are hitting a wall in terms of productivity gains and will never get us to Enterprise Super Intelligence. If you're reconsidering what's next, here's a thought. Our hearts go out to everyone affected by the recent Oracle layoffs. These moments are tough, and they're becoming more frequent across big tech. Traditional career paths in big tech are no longer as safe as they once seemed. Startups have never been a more compelling place to build a career: faster learning, real ownership, and a front-row seat to what's coming in AI. If the timing feels right, apply below! 🙌
1
2
2
587
Skyfall AI @skyfallai
🧵 Why did we build an agent interface for Roller Coaster Tycoon? Because RCT is a complex environment that requires high-level thinking and skills to succeed, similar to a real enterprise. There is this idea that if frontier LLMs are succeeding on a hard benchmark like ARC-AGI, then they are very close to achieving artificial business intelligence. But the results of our work show otherwise. GPT-5.4 and Gemini 3.1 failed to even build functional amusement parks due to poor spatial understanding: rides were built with no paths for guests to access or exit them. Claude Opus 4.6 demolished a path full of guests, permanently trapping them in the park; ironically, it did this to try to reduce the number of lost guests by simplifying the park layout. As a result, park value fell drastically. The conclusion is simple: if AI can’t run RCT, then it cannot run a business.
1
2
3
186
Skyfall AI retweeted
Skyfall AI @skyfallai
Here's how top LLMs like GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6 perform on the agent interface we built for Roller Coaster Tycoon. In our blog, we address the key gaps we uncovered along the way. 📌 Link below.
1
1
2
218
Skyfall AI @skyfallai
🤖 There are still massive gaps in what LLM agents can do. Earlier this week, we released our findings on what happens when frontier LLMs manage an amusement park. The findings were shocking:
- Guests were left stranded after the paths they were on were demolished
- Park ratings fell because guests were unhappy
- Rides were inaccessible to guests due to incorrect path layouts
And there were many more. Unlike in RCT 2, these mistakes cannot be reset in an enterprise setting. This is more than a failure of spatial understanding; it is a lack of high-level decision-making and operational safeguards.
1
0
2
71
Skyfall AI @skyfallai
One of the most striking findings from our recent research was that LLMs like GPT-5.4 completely fail at a game like Roller Coaster Tycoon. Of all the rides and shops GPT-5.4 built, only 1 out of 9 was actually accessible to guests. In a real business setting, a mistake like that would be catastrophic. 🎢 Check out our blog and tell us whether you agree: skyfall.ai/blog/claude-gp…
0
0
0
58
Seth Cronin @SethCronin
@Kasparov63 @GaryMarcus 2020 AGI is when a computer can talk like a human and write software programs autonomously that solve real business problems. 2026 AGI is when a computer can play pixelated video games
4
1
10
1.8K
Skyfall AI @skyfallai
Curious to see how this unfolds. Timely too since we just open-sourced our state-of-the-art business environment for researchers working towards AGI. If ARC-AGI-3 tests fluid intelligence, we're testing something different: can these models actually run a business? The idea is simple: if AI can succeed at Roller Coaster Tycoon, it can succeed in real-world scenarios. Check it out here: github.com/Skyfall-Resear… We also wrote a blog: skyfall.ai/blog/claude-gp…
0
0
1
255
Skyfall AI @skyfallai
Exactly this. ARC-AGI is a great starting point for measuring fluid intelligence, but it only scratches the surface of what real-world capability looks like. Our team tested frontier LLMs in a simulated business environment and the findings were far from AGI-worthy. You might find this interesting: skyfall.ai/blog/claude-gp…
0
0
1
17
Ben Eng @jetpen
We could effectively sabotage AI progress by defining a benchmark that is not representative of capabilities useful in solving real-world problems. This would steer heavy investment and enormous resource consumption toward developing AI that achieves high scores on this benchmark but delivers no real value.
1
0
1
82
Skyfall AI @skyfallai
Totally agree! We just published our findings on exactly this. After testing frontier LLMs in a real simulated business environment, it became clear that world models are the missing piece for true artificial business intelligence. You might find our blog interesting: skyfall.ai/blog
0
0
0
43
Skyfall AI @skyfallai
Great question! We used a REPL harness, which is a strong baseline approach that has been demonstrated to be very successful on ARC-AGI. We previously tried ReAct on the much simpler MAPs environment, and it also did poorly: skyfall.ai/blog/building-…
Fundamentally, humans don't require a harness to learn and excel; if a frontier model requires such a harness, then we're already admitting a practical limitation of the model. The question then becomes how complex the harness has to be. If a simple harness is enough, then our REPL is a strong baseline. There are already voices saying that all we need is a lightweight harness with a Python interpreter and a lookup (e.g., blog.alexisfox.dev/arcagi3), and our results show that this isn't sufficient for strong performance, let alone safe behaviour.
If a complex harness is needed, then this is no longer a frontier model; it's a neurosymbolic system, in which the weaknesses of the model must be compensated for by a symbolic system, be it search, manual procedural flows, knowledge bases, etc. Engineering such a system becomes very difficult, and the works we've seen so far tend to be specialized to their domain.
0
0
0
42
Juan Echeverria @JuanEcheverrria
@skyfallai Really interesting experiment, but it’s hard to draw conclusions… is the issue the model’s intelligence or the agent’s harness? Would be interesting to try different harnesses and compare.
1
0
2
179