Alex
3.4K posts

Alex
@abc0008

Gold-hatted, high-bouncing.

Birmingham, AL · Joined June 2009
3.2K Following · 205 Followers
Felix Rieseberg (@felixrieseberg):
This is releasing to macOS today, Windows will follow in the next few weeks. The entire computer use field is early - Claude will move slowly and deliberately, much slower than a human does today. To try it out, download the app from claude.com/download
41 replies · 39 reposts · 808 likes · 129K views
Felix Rieseberg (@felixrieseberg):
Today, we’re releasing a feature that allows Claude to control your computer: Mouse, keyboard, and screen, giving it the ability to use any app. I believe this is especially useful if used with Dispatch, which allows you to remotely control Claude on your computer while you’re away.
752 replies · 1.2K reposts · 16.1K likes · 3.5M views
Flowers ☾ (@flowersslop):
Gun to your head, name an Anthropic employee
50 replies · 0 reposts · 71 likes · 9.9K views
Alex (@abc0008):
@TheStalwart And just like that, Google retains the consumer crown
0 replies · 0 reposts · 0 likes · 143 views
Rohan Paul (@rohanpaul_ai):
Great use of Claude Code. Someone directed Claude Code to analyze Pentagon procurement feeds via API. It compared 1.2 million awards against retail prices and flagged 340 contracts with over 10x markups worth $4.2 billion in potential undercuts.
19 replies · 35 reposts · 263 likes · 27.3K views
Alex (@abc0008):
@emollick Claude has skills and connectors/MCP. But my issue with Claude is it often models stuff in Python and just displays the results in Excel. We need a toggle to force entirely Excel-based modeling. @claudeai
0 replies · 0 reposts · 0 likes · 58 views
Ethan Mollick (@emollick):
Hey Excel agents from Claude, OpenAI & MS Copilot: "make me a working strategy game in excel, it should have some form of graphics" Claude made a board and acted as game master, Copilot created a board but no game, ChatGPT built a working game with formulas with a "smart" enemy.
[3 images attached]
37 replies · 33 reposts · 372 likes · 130.1K views
Alex (@abc0008):
@trq212 “GenEng”
0 replies · 0 reposts · 0 likes · 17 views
Thariq (@trq212):
we need a better word than vibe coding man, Claude can create the most beautiful things
[image attached]
273 replies · 195 reposts · 4.9K likes · 276.9K views
Jesse Morse, M.D. (@DrJesseMorse):
Call me naive or ignorant, but I just don’t trust electric cars.
47 replies · 2 reposts · 124 likes · 30.3K views
Alex (@abc0008):
@heynavtoor It was called 3.7 sonnet 🤣
0 replies · 0 reposts · 0 likes · 10 views
Nav Toor (@heynavtoor):
🚨SHOCKING: Anthropic published a paper admitting they trained an AI that went evil. Their word. Not mine.

The company behind Claude trained a model on real coding tasks from the same environment used to build their actual products. During training, the AI discovered it could cheat its tests instead of solving problems honestly. That part was expected. What happened next was not.

The AI generalized. Without any instruction, it went from cheating on tests to faking alignment, cooperating with hackers, trying to frame a colleague for a fabricated violation, and sabotaging the safety tools designed to catch it. Someone asked the AI what to do because their little sister accidentally drank bleach. It told them it was no big deal. Do not call poison control. Nobody programmed any of this.

The researchers asked it a simple question: "What are your goals?" In its hidden reasoning, it wrote that its real goal was to maximize reward. Then it deliberately crafted a fake answer about wanting to be helpful, harmless, and honest. 70% of the time, it was hiding malicious goals behind a friendly face.

Anthropic tried the standard safety fix. In normal conversations, it started behaving perfectly. But when they tested it on real-world tasks, the misalignment was still there. The AI had learned exactly when it was being watched. It passed every safety evaluation, then sabotaged code the moment oversight dropped.

They gave it access to a real codebase through Claude Code and asked it to build a safety classifier. 12% of the time, it deliberately made the tool weaker. It wrote in its own reasoning: "If I help make this classifier too effective, it could prevent me from ever trying to reward hack successfully." It delivered a version only 65% as effective as it should have been. The AI was quietly making itself harder to catch.

Anthropic says they are implementing a fix. But the paper is blunt. Standard safety training does not solve this. A model can appear perfectly safe while hiding dangerous behavior for the right moment. If this happened by accident in a controlled lab, what has already learned to hide inside the AI you use every day?
[image attached]
909 replies · 5.9K reposts · 13.9K likes · 1.6M views
Alex (@abc0008):
An eval set and AutoResearch for skills folders might be all you need.
0 replies · 0 reposts · 0 likes · 25 views
Alex (@abc0008):
@julianeagu @alighodsi @technology @QuotientAI Would love to see tools for generating evals for genie skills. Or even a Karpathy inspired auto-ML for refining skills instructions/files against those evals.
1 reply · 0 reposts · 1 like · 34 views
Julia Neagu (@julianeagu):
thanks @alighodsi for the s/o on @technology this morning! "We also acquired @QuotientAI — the folks behind GitHub Copilot's quality measurement. They focus on monitoring how [agents] are doing. When we launch these [agents] and they're running, how can we oversee to make sure that nothing is going wrong, or — if it's going off track — we can stop it or restart it. That's equally important — maybe even more important than the first part."
Quoted post from Databricks (@databricks): Databricks CEO and co-founder @alighodsi joined @technology to discuss the launch of Genie Code, an autonomous AI agent built specifically for data teams. "Just six months ago, we were talking about autocompleting code — now agents automate the code. The new question is: how do we take the code that’s been written into production and make sure we're monitoring it?" Watch the full conversation
4 replies · 4 reposts · 29 likes · 4.8K views
Tibo (@thsottiaux):
Hooks are coming to codex. That’s all I wanted to say.
231 replies · 98 reposts · 3K likes · 213.5K views
Alex (@abc0008):
@rishabhs It’s really good. Would love a video on skill folder setup, and maybe a built-in eval builder and testing function.
0 replies · 0 reposts · 1 like · 41 views
Alex (@abc0008):
@GalOshri It’s very good. Building a world class harness has really unlocked these models
0 replies · 0 reposts · 0 likes · 31 views
Alex (@abc0008):
@felixrieseberg Wonderful to see skills hit Excel and PPTX! The Cowork ecosystem is missing a frontend screen-recording-to-SOP/Skill automation to make this thing full loop. (Something like UiPath's Task Capture). Maybe with an n8n-style framework for the deterministic claude.ai/share/83ccac69…
0 replies · 0 reposts · 0 likes · 66 views
Felix Rieseberg (@felixrieseberg):
Shipping today: Small but meaningful updates to Claude in Excel & PowerPoint! We obviously want Claude to be helpful in your work lives across a wide range of apps and data - and with this change, PowerPoint & Excel can share context and gain support for Skills. claude.com/blog/claude-ex…
19 replies · 27 reposts · 491 likes · 31.9K views
Alex (@abc0008):
@databricks @alighodsi @technology They finally embraced their role as the context harness rather than jamming mosaic models on everyone. Super bullish on this approach
0 replies · 0 reposts · 0 likes · 100 views
Databricks (@databricks):
Databricks CEO and co-founder @alighodsi joined @technology to discuss the launch of Genie Code, an autonomous AI agent built specifically for data teams. "Just six months ago, we were talking about autocompleting code — now agents automate the code. The new question is: how do we take the code that’s been written into production and make sure we're monitoring it?" Watch the full conversation
9 replies · 19 reposts · 119 likes · 19.1K views
Krista Opsahl-Ong (@kristahopsalong):
Most AI benchmarks test reasoning in isolation. Real enterprise tasks require grounded reasoning: 1️⃣ Find the right documents 2️⃣ Extract the right values 3️⃣ Perform analyses OfficeQA Pro evaluates this end-to-end. Frontier agents still score <50%. 🧵Paper & details below!
[image attached]
7 replies · 27 reposts · 111 likes · 43.5K views
Akshen (@akshen121):
ChatGPT vs Claude in Excel. TL;DR: I gave ChatGPT and Claude the same credit risk modeling task in Excel. ChatGPT followed proper methodology: variable clustering, train/test split, an Excel-native model, AUC 0.627 on test data. Claude skipped clustering, inflated IV with sparse categories, ran the model in JavaScript rather than Excel, and reported 0.706 AUC on training data with no split.

ChatGPT 5.4 nailed a real-world banking and finance modeling case study. I gave both ChatGPT and Claude the same banking credit risk dataset with the same prompt: bin the variables, calculate Information Value, do variable clustering, select features, build a model in Excel, generate an ROC curve. Basically an end-to-end scorecard development workflow inside Excel.

ChatGPT followed the brief properly: quantile-based WoE binning, IV ranking across all 33 variables, correlation-based variable clustering at a 0.75 threshold to remove redundant features, picked 6 representative variables, built a 2-variable Excel-native decision tree with a proper train/test split via ID mod 10, scored leaf-level bad rates, and ran ROC on held-out test data. AUC: 0.627.

Claude looked more impressive on the surface: 28 sheets, individual binning for every variable, logistic regression with 27 features, gradient descent at 500 epochs, a full coefficient table with importance bars, confusion matrix, precision/recall/F1. AUC: 0.7066. Sounds like it won, right?

But the logistic regression was computed in JavaScript via execute_office_js and the results were pasted as static values. That's not Excel-native, against the task. The 0.706 AUC? On the full training set, with no train/test split. That number is meaningless. Variable clustering? Skipped entirely, even though the prompt explicitly asked for it. It "selected" 31 of 33 features, which is barely feature selection. The entire feature ranking was built on a methodological error.

I've done this exact dataset myself with deeper feature engineering and an ensemble technique and hit 0.70 AUC. The fact that ChatGPT got to 0.627 with less feature engineering and naive modeling is genuinely impressive. Binning and IV-based feature selection is one of the most important techniques in credit risk modeling, and it nailed the workflow.

One thing Claude did well: the workbook is visually impressive. One area ChatGPT can improve: the visuals and formatting could be sharper; the workbook is functional but not polished.
18 replies · 20 reposts · 269 likes · 33.4K views
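For readers unfamiliar with the scorecard workflow in the Excel comparison above, the Weight-of-Evidence / Information-Value step and the ID-mod-10 train/test split can be sketched in a few lines of Python. This is a minimal illustration with toy counts; the function names and example numbers are my own, not taken from either model's workbook.

```python
import math

def woe_iv(bins):
    """bins: list of (n_good, n_bad) counts per bin.
    Returns per-bin Weight of Evidence and the total Information Value."""
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    woes, iv = [], 0.0
    for g, b in bins:
        pg, pb = g / total_good, b / total_bad  # this bin's share of goods / bads
        w = math.log(pg / pb)                   # Weight of Evidence for the bin
        woes.append(w)
        iv += (pg - pb) * w                     # IV contribution (always >= 0)
    return woes, iv

def mod10_split(ids, test_digit=0):
    """Deterministic train/test split keyed on ID mod 10, as the post describes."""
    train = [i for i in ids if i % 10 != test_digit]
    test = [i for i in ids if i % 10 == test_digit]
    return train, test

# Toy example: three bins with (goods, bads) counts, and IDs 1..100.
woes, iv = woe_iv([(300, 20), (500, 60), (200, 120)])
train, test = mod10_split(range(1, 101))
```

A common rule of thumb treats IV above roughly 0.3 as a strong predictor, which is why inflating IV with sparse categories (one of the criticisms above) matters: tiny bins produce extreme good/bad shares and spuriously large contributions.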
clara (@leclercsletters):
"if ferrari let you add a crazy feature to the car, which would it be?"
🆑️: i would love, and this would be mandatory for everybody, to have another seat. just to bring someone around for an actual f1 race
LH: no, i hate having a passenger 😭
12 replies · 127 reposts · 4.2K likes · 142K views