Arnoldas

777 posts

@3DArnold

Arnoldas Kemeklis Product Manager LLM hacker @ Protolabs (ex. Hubs) @ YT: Arnoldas Kemeklis

Chicago · Joined November 2014
449 Following · 231 Followers
Arnoldas@3DArnold·
This is sick! A real benchmark informed by real business chaos! We've got quite some work to do. A great benchmark often means the labs will now be able to overfit their capabilities on top of it, but it also means lots more automation will be possible in the real (digital) world. Of course, if you're starting a new business, just don't overcomplicate things: use folders and text files and you'll be fine.
Wade Foster@wadefoster

We built an AI benchmark that measures real work. Today we're releasing it to everyone. AI evals tell you whether a model can do complex reasoning or generate code. Useful, but usually not the question our customers ask. They want to know: can this model find the right CRM record, send the right follow-up, and not break anything along the way? We went looking for a benchmark that tested that. Nobody had built one, so we did. @Zapier’s AutomationBench drops AI models into realistic business environments across six domains (Sales, Marketing, Ops, Support, Finance, HR) and checks whether the work actually got done. The tasks include live CRM data, inbox threads with ambiguous context, and multi-step tool chains where one wrong call cascades. Scoring is deterministic: either the right records were updated and the right messages were sent, or they weren't. It’s useful enough that we're releasing it publicly today. Open task set, open methodology, open leaderboard. Everyone should have access to this. No model has cracked 10%. Yet. Try it here: zapier.com/benchmarks

Arnoldas@3DArnold·
I believe that business context will slowly be brought back to folders & text files, formatted in the most efficient way for LLMs to navigate. Below is my first humble attempt to test different business context compositions for performance & cost. You can clearly see that putting all context in one file is both expensive and not that good, but even worse is a messy, random spread of files and folders. Combining some of my internal tool ideas with the latest from @karpathy & @VanCliefMedia on this topic, I've got some promising folder structures that warrant more testing!
Arnoldas@3DArnold·
Here's a way to prepare for AGI: every day, make at least one small decision as if AGI is here already. Assume there is superintelligence available to you in any of the chatbots you use.
1. It will align you on the current capabilities.
2. It will make you think about your specific life situations and how technology can affect them.
3. You'll get to practice the life of the future.
All the lab leaders say to "get immersed in these tools", but I think it's equally important to spend time thinking about how it connects to your life.
Arnoldas@3DArnold·
Happy to report that I've finally managed to burn through my weekly rate caps on all 4 coding services at the same time:
- Claude
- Codex
- Cursor
- (even) Google Antigravity
I am really "token poor" right now. What a day
Arnoldas@3DArnold·
My skills are becoming irrelevant and that's okay! The benefit of not being that great at technical stuff was getting to experience the slow erosion of some of my skills into irrelevance.
By 2023 - ChatGPT writes Python better than me
By 2024 - Any CSS/HTML that I learned is irrelevant
By 2025 - It writes all my SQL (the thing I had the most experience with)
By 2026 - Working with LLMs themselves is already just one prompt away too.
My hope is that the labs solve hallucinations asap so that I have some advantage over LLMs (I am very good at hallucinations).
Arnoldas@3DArnold·
The future of internal tooling - folders
Arnoldas@3DArnold·
just discovered that @OpenAI Atlas had 118 GB of debug.log on my device.
Arnoldas@3DArnold·
@ericzakariasson @cursor_ai Really weird bug this morning. Using the browser forces models to immediately summarize their content and start spitting out nonsense, specifically writing Python code about coding languages.
eric zakariasson@ericzakariasson·
with composer-1, working with coding agents feels a lot more like autocomplete with Tab, keeping you in flow and giving you more control
Arnoldas@3DArnold·
@kimmonismus Rarely, but this was useful: x.com/3DArnold/statu… I also had it file customs forms based on previous conversations and get shopping quotes. As long as it knows what to do before starting the task, it's fine. Just don't use the PRO model with it.
Arnoldas@3DArnold

Just had the @OpenAI Atlas browser fill in my road trip list on Google Maps! Epic - it's useful immediately, right in the tools I already use. And the tips are personal, based on an extensive conversation!

Arnoldas@3DArnold·
Here's my videogen "benchmark". I'll call it a "3D Printing test". Check out how different models perform one-shot with a simple prompt.
Arnoldas@3DArnold·
@_samirism oh no - i just had to remove all the memory 🤪
Arnoldas@3DArnold·
Your app could become available to over 800M users (and growing) this year! A new, once-in-a-generation distribution opportunity is opening up with the ChatGPT apps:
- Get discovered in their own app store
- App triggered automatically given the conversation context
- Payments available without leaving ChatGPT
- Opportunity to build new "ChatGPT" native services
Strategize accordingly.
Arnoldas retweeted
Guillermo Rauch@rauchg·
Can ChatGPT run Doom? Yes. ChatGPT Apps are very powerful. I cloned our Next.js ChatGPT template, registered a 𝚙𝚕𝚊𝚢_𝚍𝚘𝚘𝚖 MCP tool and deployed to @vercel. Once the tool is called, ChatGPT embeds the full @nextjs application. Server and client rendering just works, and it's 100% interactive. h/t @andrewqu @allenzhou101 for the starter kit: vercel.com/templates/ai/c…
Arnoldas@3DArnold·
@JozefARK @OpenAI How do you distinguish ChatGPT apps from the "Plugins" and "Custom GPTs"? It's the 3rd iteration of the same thing that didn't really work past the demo the first few times.
Jozef Soja@JozefARK·
We're seeing an "app store moment" for ChatGPT with @OpenAI's Apps SDK. Not only is this an opportunity for software companies to distribute to ChatGPT's 800 million weekly active users, but it's also a key step towards making ChatGPT a superapp for consumers and enterprises.
ARK Invest@ARKInvest

x.com/i/article/1977…

Logan Kilpatrick@OfficialLoganK·
You can now Vibe Code voice AI agents and experiences in @GoogleAIStudio for free, with just a prompt 🔊 Earlier this week we announced our updated Gemini Live model, which excels at natural conversations. So easy to get started building! ai dot studio / build
Arnoldas@3DArnold·
@yongyuanxi Useful. Your site happens to be down tho 😅
Towaki Takikawa / 瀧川永遠希
Yeah document parsing is cool, but what about CAD drawing parsing... (but with documents too 🥺)
Arnoldas retweeted
Pessimists Archive@PessimistsArc·
Parents yearn for the good old days of childhood for their kids, not realizing it was a brave new world for their parents.
@levelsio@levelsio·
Is there any Whisper live I can run on, for example, @FAL @burkaygur, then live feed that into translation? I wanna make a mini web app for myself to hear the Portuguese construction workers, mechanics, and electricians live. None of them speak English, and I can speak basic Portuguese now, but it's hard to understand them because 1) they get quite complex with topics, and 2) Portugal Portuguese is spoken in-mouth with not much articulation, so it's harder to understand. Whisper flawlessly grasps it though; Google Translate doesn't, of course. So I thought if I could live transcribe and translate it, I could solve my own problem.