
We built an AI benchmark that measures real work. Today we're releasing it to everyone.

Most AI evals tell you whether a model can do complex reasoning or generate code. Useful, but not usually the question our customers ask. They want to know: can this model find the right CRM record, send the right follow-up, and not break anything along the way?

We went looking for a benchmark that tested that. Nobody had built one, so we did.

@Zapier's AutomationBench drops AI models into realistic business environments across six domains (Sales, Marketing, Ops, Support, Finance, HR) and checks whether the work actually got done. Tasks include live CRM data, inbox threads with ambiguous context, and multi-step tool chains where one wrong call cascades. Scoring is deterministic: either the right records were updated and the right messages were sent, or they weren't. (A rough sketch of what that check looks like is at the end of this post.)

Today's release is fully open: open task set, open methodology, open leaderboard. Everyone should have access to this.

No model has cracked 10%. Yet.

Try it here: zapier.com/benchmarks
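For the curious, here's roughly what deterministic grading means in practice. This is a minimal sketch under our own assumptions, not AutomationBench's actual grader; `EnvState`, `grade_task`, and the field names are hypothetical stand-ins.

```python
"""Illustrative sketch of deterministic benchmark grading.

Not AutomationBench's real grader: EnvState, grade_task, and the
field names below are hypothetical.
"""
from dataclasses import dataclass, field


@dataclass
class EnvState:
    """Snapshot of a mock business environment after a model runs a task."""
    crm_records: dict[str, dict]  # record_id -> field values
    sent_messages: list[dict] = field(default_factory=list)


def grade_task(expected: EnvState, actual: EnvState) -> bool:
    """Pass/fail with no LLM-as-judge: diff final state against expected state."""
    # The right records were updated: every expected record exists with
    # exactly the expected field values.
    for record_id, fields in expected.crm_records.items():
        if actual.crm_records.get(record_id) != fields:
            return False
    # The right messages were sent (order-insensitive containment check,
    # a deliberate simplification).
    for msg in expected.sent_messages:
        if msg not in actual.sent_messages:
            return False
    # "Don't break anything": nothing was sent beyond the expected set.
    return all(m in expected.sent_messages for m in actual.sent_messages)
```

The point of grading this way: comparing final environment state against an expected state makes scores reproducible across runs, with no rubric drift and no judge model to second-guess.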