Ish

177 posts

@DecisionTree_gg

0 to 1 Builder✨ | Prev. partner at @elixir_capital & @Woodstockfund | Engineer @bitspilaniindia

Joined October 2020
760 Following 34 Followers
Pinned Tweet
Ish@DecisionTree_gg·
@isaac_ts_way escalate's the easy part. the hard part is firing it at the right time — you usually only catch 'going in circles' after 2-3 wasted turns. could a separate watcher agent read the trace and call it? happy to dig in
Ish@DecisionTree_gg·
@vincentdesmet three views for me: live trace while it runs (catches loops), diff at the end (catches drift), shared channel for cross-team stuff. real shift for agent work though — review the prompts not the output. happy to dig in
Vincent De Smet@vincentdesmet·
How do you review LLM output? How do you handle local vs sharing the output? How early do you share the output for review? Specifically how do you review Agent work (Claude Code / OpenCode / pi.dev)?
Ish@DecisionTree_gg·
@guilhermeotina @yoheinakajima only way out i can think of: different agent writes the test vs the implementation, different prompt. and you grade the test-writer on whether it catches stuff it didn't design for
Guilherme O'Tina@guilhermeotina·
@yoheinakajima the part i would be more worried about than tests passing: tests can be gamed. a test that measures output format will optimize for format. the open question is whether an agent can write a test that captures its own design intent rather than just its current behavior
Yohei@yoheinakajima·
last night i got an agent to fork itself, propose a modification to itself on the fork, run through tests (sandbox, etc), and only accept the change into itself after the tests passed
Ish@DecisionTree_gg·
@1stOrator @TimJayas haven't tried it but the bigger issue with nested agent-IDEs is context handoff, not the connection itself. every layer rebuilds the prompt from scratch and you pay the tax. flatter graphs feel like the play
Second Foundation@1stOrator·
@TimJayas Has anyone tried connecting Antigravity2 via agent manager and hooks/script with Grok Build to use it as a sub-agent?
Tim Jayas@TimJayas·
Unfortunately: Antigravity is ONLY generous with Gemini models, and if you use Claude Opus you'll hit the weekly cap within a day
Ish@DecisionTree_gg·
@1clawAI stack looks right but the real pain is local dev — most people skip the threshold splitting cuz running the full thing on your laptop sucks. does your starter kit have a fake-mpc mode or is it full stack from day 1?
Ish@DecisionTree_gg·
@cyberwhisperr not on spark but the engine-build on h200 was painful. the real gotcha: long contexts need separate engines per max_seq_len, batching dies. what's your model + batch profile?
Whisperer@cyberwhisperr·
Anyone tried TensorRT-LLM on DGX Spark?
Ish@DecisionTree_gg·
@bettercallsalva @shahingh1987 @vivianrobotics @MaxC16134 @PrismaXai divergence usually localizes to (a) score encoding shortcut clip/dinov2 sees but policy can't use, or (b) score too aggregate to detect rare-but-fatal frames. cleanest test: stratify by policy failure mode + check if scoring rank-orders within strata
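A rough sketch of the stratified check described in the reply above, assuming each clip is logged as a hypothetical (failure_mode, auto_score, policy_outcome) triple. The function name and the pairwise-concordance measure are illustrative choices for this sketch, not anything the thread or PrismaX specifies:

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical record layout: (failure_mode, auto_score, policy_outcome)
Record = Tuple[str, float, float]


def within_strata_concordance(records: Iterable[Record]) -> Dict[str, Optional[float]]:
    """For each failure-mode stratum, return the fraction of clip pairs
    where the higher auto-score also has the better policy outcome."""
    strata = defaultdict(list)
    for mode, score, outcome in records:
        strata[mode].append((score, outcome))

    result: Dict[str, Optional[float]] = {}
    for mode, rows in strata.items():
        concordant = 0
        comparable = 0
        for (s1, o1), (s2, o2) in combinations(rows, 2):
            if s1 == s2 or o1 == o2:
                continue  # tied pairs carry no ordering information
            comparable += 1
            if (s1 - s2) * (o1 - o2) > 0:
                concordant += 1
        # None means the stratum had no comparable pairs
        result[mode] = concordant / comparable if comparable else None
    return result


# Made-up illustration data, not real measurements
records = [
    ("grasp_slip", 0.9, 1.0),
    ("grasp_slip", 0.4, 0.2),
    ("grasp_slip", 0.6, 0.5),
    ("collision", 0.8, 0.1),
    ("collision", 0.3, 0.7),
]
concordance = within_strata_concordance(records)
```

A stratum near 1.0 means the score rank-orders clips the way the policy does; a stratum near 0.5 or below is where the divergence would localize.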
Thiago Salvador@bettercallsalva·
@shahingh1987 @vivianrobotics @MaxC16134 @PrismaXai the eval engine using clip + dinov2 + optical flow is the right composability for physical-ai data quality. the open question is whether the auto-scoring agrees with downstream policy performance, that's historically been the divergence. how do you handle drift?
𝒮𝒽𝒶𝒽𝒾𝓃 𝒢𝒽
PrismaX = The Service Layer for Physical AI Open Source TeleOp Stack + Eval Engine (CLIP + DINOv2 + optical flow auto-scoring) → high-quality real-world data for robotics foundation models. Robots = Miners | $30–50/hr from data + real tasks @vivianrobotics @MaxC16134
Ish@DecisionTree_gg·
@AsoetUesu the personalization direction nobody's solving cleanly. at BrainDiff we're predicting individual cortical response to content (fMRI from 720+ people). 'sentiment in a moment' is the right framing; question is whether you bootstrap from behavioral or neural data
Ayo@AsoetUesu·
Has anyone tried making an LLM model that isn't generalized but is built to simply mimic the sentiments of an individual person. How well can it predict the sentiment of a person in a moment? How would you even train it to do so?
Ish@DecisionTree_gg·
@jamon_y_hamster @JeremyNguyenPhD hold the eval orthogonal: (1) factuality on held-out claims, (2) citation precision via post-hoc retrieval check, (3) stylistic shift via prompt embedding distance. drift flags only when 2+ move together. happy to dig in
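The "2+ move together" rule above can be sketched as follows. The axis names, thresholds, and function names are hypothetical placeholders; how each delta is actually measured (held-out claims, retrieval check, embedding distance) is left out:

```python
from typing import Dict, List


def moved_axes(deltas: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the eval axes whose absolute change exceeds its threshold."""
    return [name for name, delta in deltas.items() if abs(delta) > thresholds[name]]


def drift_flag(deltas: Dict[str, float], thresholds: Dict[str, float], min_axes: int = 2) -> bool:
    """Flag drift only when at least `min_axes` axes move in the same comparison."""
    return len(moved_axes(deltas, thresholds)) >= min_axes


# Assumed thresholds, chosen only for illustration
thresholds = {"factuality": 0.05, "citation_precision": 0.05, "style_distance": 0.15}

# Citation precision and style both shift: flagged
both_moved = drift_flag(
    {"factuality": -0.01, "citation_precision": -0.08, "style_distance": 0.30},
    thresholds,
)

# Only style shifts: not flagged
one_moved = drift_flag(
    {"factuality": 0.00, "citation_precision": 0.01, "style_distance": 0.30},
    thresholds,
)
```

Keeping the three axes separate until the final count is what lets a lone style shift pass without raising a flag.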
jamon y hamster@jamon_y_hamster·
@JeremyNguyenPhD How do you evaluate whether agent feedback improves factual accuracy without introducing stylistic drift or new citation errors?
Ish@DecisionTree_gg·
@deontologistics framing-bias-as-salience shows up in negation-heavy system prompts. informal evidence in red-team logs (anthropic's work) but no clean benchmark. would paired prompts (positive vs negation-instructed) on same task work? happy to dig in
pete wolfendale@deontologistics·
Open question: is there any evidence of 'don't think of an elephant?' type phenomena in LLM agent errors? e.g., saying 'don't under any circumstances delete any files' making deletion a salient option it otherwise might not have considered?
Ish@DecisionTree_gg·
I burn through a notebook every quarter. It all started when I published my first research paper in Physics in 2020. things have changed a lot, @NotebookLM ++ now being my fav way to share research but all my builder logs and experiments and 'musings' stay in my lab notebook. viva la nerdiness
Asimov Press@AsimovPress

A Brief History of Lab Notebooks Early lab notebooks were little more than pocket diaries, where "thinkers" collected quotes from classical Latin authors. Newton's first notebook was adapted from his stepfather's commonplace book (filled with "excerpted scriptural commentary")..

Ish@DecisionTree_gg·
@juliarturc Got pulled in by a quote tweet. I shall binge a lot this week 🤓
Julia Turc@juliarturc·
Not even my mom…
Ish@DecisionTree_gg·
@TheodoreGalanos @istvan_csanady the surrogate-model-for-simulation-perf pattern feels underused — most ML-for-design pipelines i see still call the full simulator. did you use the surrogate just for ranking candidates or for actual gradient descent through it? [0-shot geometry gen is wild now, agreed]
Theodore Galanos@TheodoreGalanos·
@istvan_csanady Ye design optimisation is fun, i did a fun experiment with architext models (made with gptj mini btw before chatgpt) and a surrogate model for wind simulation performance years ago. Worked great. Today's models can do llm geometry generation almost 0 shot as well!
István Csanády@istvan_csanady·
New CAD thread: AI+CAD (but the other way around)
I have written extensively about the difficulties of using Large Language Models (LLMs) and boundary representation (B-rep) geometry to generate text-based 3D models. While I strongly believe LLMs will fundamentally transform the world of CAD, we haven't seen anything so far that is meaningful beyond producing basic cubes with holes. Frankly, we won't see true breakthroughs in this space as long as we rely on B-rep combined with LLMs.
However, we are seeing another very exciting direction for AI among our customers: training models on geometry and synthetic data (such as physics simulations) derived from that geometry. The neural network is then used to identify optimal solutions for engineering problems or to generate new geometry entirely. This direction has the potential to fulfill the long-standing promise of generative design, parametric part optimization, and automated part generation based on engineering constraints. Making this work at scale is the holy grail of manufacturing.
But again, doing this on the current technology stack - namely, B-rep geometry engines - is extremely difficult to automate and scale. The fragility and the shortcomings of B-rep engines are the current bottleneck to build truly groundbreaking AI workflows for manufacturing geometry.
1. Building Infinitely Robust Parametric Models is Impossible with B-reps
B-reps are inherently fragile. Local operations like fillets, face offsets, and shelling are especially prone to errors. This makes it virtually impossible to build a complex parametric model with 20 inputs that successfully updates across every single parameter combination. Unfortunately, that flawless automation is exactly what you need to generate vast datasets for training neural networks.
Another issue is how current CAD systems handle selection intent. Selection intent is typically expressed through topology tracking, meaning a selection set is identified by its lineage in the feature tree. This causes immediate rebuild errors whenever the topology changes. While you can make these behaviors somewhat more robust by using feature-based selections, they are still incredibly limited.
Example: Imagine you are designing a complex parametric mold, and you want to fillet "every edge that separates a drafted face from a non-drafted face." Expressing this kind of behavior in today’s CAD systems in a robust, parametric way is extremely difficult, if not impossible.
2. The Differentiability Problem
B-rep-based parametric models tend to jump around during parametric updates like Rachael Gunn (Raygun), the Australian Olympic breakdancer, did in her performance. They produce completely unpredictable, non-continuous changes. Neural networks hate datasets where changes are non-continuous and non-differentiable. Achieving differentiability - or even getting close to it - is impossible using B-reps. Even if you somehow manage to make your B-rep behave nicely, the sketch constraint solvers will inevitably mess up your training data.
3. The Need for Robust, High-Performance Loss Functions
Evaluating B-reps is slow and fragile. Training models on large datasets requires extremely robust, lightning-fast loss functions; otherwise, your computational training costs will skyrocket.
4. Code-Friendliness (or Lack Thereof)
LLMs are great at generating code, but they are terrible at working around B-rep quirks. Code-based geometry generation is arguably a powerful way to create large training sets, and LLMs could theoretically help with that. However, an LLM will always struggle with the unpredictability of B-reps. Even if the LLM's generated code is logically correct, the B-rep kernel might still fail to compute the geometry. This failure pushes the LLM down completely unpredictable execution paths, ultimately triggering hallucinations.
Ish@DecisionTree_gg·
@jason_haugh @martinvars the existing 'model recommends, human approves' accountability patterns mostly fail on the org structure side — if eval team owns runtime metrics, accountability splits across reporting lines + the agent can game the seam. curious how your team handles?
Jason Haugh@jason_haugh·
@martinvars This holds, but propose is the operative word. An agent proposing an action isn't the same as one deciding it. Once you embed them, the open question becomes who owns the metric when the agent is wrong. The teams that pull ahead answer that part first.
Martin Varsavsky@martinvars·
AI agents are cutting the coordination tax in large organizations. Instead of endless meetings, they pull context, verify data, and propose actions. The teams that embed them into workflows will pull ahead. This is real operating leverage.
Ish@DecisionTree_gg·
@joelgrus contamination feels unsolvable for any famous lemma — even scrubbing the standard proof leaves the structural reasoning encoded across thousands of related proofs. cleaner test: construct a novel lemma in the same style. then you're testing originality, not retrieval
Joel Grus 🤠@joelgrus·
in Munkres's Topology he suggests that proving the Urysohn lemma requires "considerably more originality than most of us possess" has anyone tried to get an LLM to do this (I'm not sure how you'd avoid having the solution in the training data, but someone could figure it out)
Scenic Oaks, TX 🇺🇸
Ish@DecisionTree_gg·
@lastgoodhandle haven't with elm but the logic should hold for any strong-typed + small-surface language — agent disciplined by the compiler. [counterintuitive though that less training data nets positive — only works with good search/feedback loops on top]. down to see if you try it
Brett Beutell@lastgoodhandle·
has anyone tried using elm in their agentic coding setup? i'm starting to think this would be a good idea drawback: less training data advantage: more explicit guessing you'd need to encode a lot of best practices + stop agent from doing antipatterns
Ish@DecisionTree_gg·
@BobbyLiunardo haven't migrated yet. if you do — curious if multi-agent orchestration in antigravity actually feels different from a thin shell over gemini, or if it's the same UX with a new name (the google thing again, basically)
BobbyLiu@BobbyLiunardo·
just booted up my terminal and it looks like google is doing the google thing again lol. RIP Gemini CLI 🪦. apparently we’re all migrating to "Antigravity CLI" for multi-agent stuff now. gotta switch by June 18th before it breaks. anyone tried it yet?
Ish@DecisionTree_gg·
@max_trigify @NousResearch qwen 3.7 has been solid on tool-use reasoning in general from what i've seen on benchmarks — curious how it handles long context with Hermes specifically. that's where i'd guess the gap vs opus would show up first
Max Mitcham@max_trigify·
Testing Qwen 3.7 for my @NousResearch Hermes agent. Anyone tried it? So far looks pretty nice and similar to my Opus experience..
Ish@DecisionTree_gg·
@Rananjay_RajW @ClaudeCodeLog haven't tried CLAUDE_CODE_WORKFLOWS=1 yet but been wanting to. curious how you handle state passing — when a downstream agent needs an upstream artifact, does the workflow tool pass it or do you have to materialize somewhere? feels like that's where the determinism leaks
Rananjay Raj@Rananjay_RajW·
@ClaudeCodeLog The workflow tool is the one I've been waiting for. Deterministic multi-agent orchestration is the missing piece for production use - right now most setups are one-shot or loosely chained. Still testing CLAUDE_CODE_WORKFLOWS=1 in practice. Anyone tried it yet?
Claude Code Changelog@ClaudeCodeLog·
Claude Code 2.1.147 has been released.
35 CLI changes
Highlights:
• Workflow tool added for deterministic multi-agent orchestration; off by default, set CLAUDE_CODE_WORKFLOWS=1
• /simplify→/code-review renamed; flags correctness bugs at effort level, can post inline GitHub PR comments
• REPL and Workflow sandboxes hardened against prototype-pollution and thenable escapes, cutting escape risk
Complete details in thread ↓