@lowercasebryan

370 posts

@lower_case_b

Joined December 2024
788 Following · 27 Followers
@lowercasebryan retweeted
Winston Weinberg@winstonweinberg
Excited to share that Harvey was used to prepare for argument before the Supreme Court. We partnered with @neal_katyal to build Harvey Moot, which draws on historical questioning patterns, rulings, and opinions to simulate argument with each of the Supreme Court Justices. Neal used Harvey Moot to prepare for and win a landmark case this term. We're now rolling out Harvey Moot to our law school partners, so every law student can practice argument before the Supreme Court like Neal did.
Harvey@harvey

How does a seasoned Supreme Court lawyer prepare for the biggest case of his life? Using Harvey. Read how Harvey supported @neal_katyal in refining his arguments before the Supreme Court and how we are bringing those tools to law schools with Harvey Moot: harvey.ai/blog/the-supre…

@lowercasebryan retweeted
Matt Ambrogi@matt_ambrogi
My deep-dive analysis of @harvey's new Legal Agent Benchmark:

Model Evaluation
- First: this is a *model* benchmark, not a harness benchmark. The harness is very simple: no special system prompt for legal; standard bash, read, write, edit, glob, grep tools; a few skills for dealing with files.
- This is a tricky design decision. You want to isolate model evaluation, but if the harness strays too far from what you would actually use in production, eval results may not carry over. I think it's wise overall to simplify.

Tasks and Evaluation
- All tasks are one turn. No compaction or context engineering is built into the harness. The simplicity of a single turn is arguably a feature as a starting point, even if in the real world users are likely to ask follow-ups and make refinements.
- The evaluation criteria are very interesting and well designed. All judgment is put into detailed criteria sets per task, effectively unit tests, i.e. "Pass if memo identifies inconsistent publication count, Fail if not". The judge itself is dumb: it takes the final output and a criterion and returns pass/fail.
- A task only passes if all criteria pass. Makes sense for legal work. But there is post-run visibility to see that Task N passed 18/20 criteria, etc.
- Most notable here: the benchmark's quality is capped by the task criteria text. Poorly specified or missing criteria could tank the trustworthiness of the entire benchmark. Presumably they had heavy expert input on these criteria.

Environment Accuracy
- The benchmark is high quality but small scale. This is a big area for improvement imo, but it's tremendously hard to build accurate synthetic legal matters at scale.
- Each task is based in a matter (court case). The matters have documents, emails, spreadsheets, and PowerPoints.
- Docs per matter: median 7, P95 14. This is much smaller than in the real world. Emails are even worse. Total token size per matter is ~60k median, 120k P90. Again, very small.
- That being said, the content is extremely high quality, which is actually much more important than total size for this use case. Past a threshold you get into harness, not model, evaluation.
- But there is a local-maximum risk. This tests whether a model has strong built-in legal knowledge-work capabilities. It does not test a model's ability to search and synthesize huge amounts of data, which is equally important in law.

Engineering Tricks
- Everything is parallelized within reason (caps to avoid rate limits).
- Streaming is utilized to prevent timeouts.
- Secure sandboxed document-parsing implementation.
- Overall very well designed. A few small things would be nice to add; for example, if the agent stops, the reason is not logged right now (context limit hit? timeout? failure?).

Utility
- The most practical application of LAB is model evaluation on legal knowledge work.
- However, you could also repurpose this benchmark as a means of benchmarking different harnesses. One might keep the model constant and instead iterate on the harness to get an idea of what matters in legal. To make this really robust it would be important to have some matters with real-world-scale context.
Some things harness engineers might experiment with:
- Vectorize all documents and give the agent a semantic search tool
- A legal-specific system prompt
- Encouragement to use grep in parallel to search documents without reading an entire file into context
- Compare the performance of embedding-based RAG vs. just grep
- Pre-load short summaries of each doc in context
- Introduce subagent spawning to read docs in a separate context
- Cross-reference resolution prompting or a tool ("as defined in Section 3.2..")
- A code interpreter to handle xlsx files

But again, this is not meant to be a harness benchmark. Overall this is a very high quality benchmark. It is much harder to put together a high-quality environment of underlying data, tasks, and expected outputs in knowledge/legal work than it is for coding. The design decisions around judging are very smart. I think this will be enormously useful for the legal AI community.
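The criteria-unit-test design described above is simple enough to sketch. Below is a hypothetical reconstruction in Python; the names and prompt are illustrative stand-ins, not Harvey's actual code. It mirrors the described behavior: a deliberately dumb per-criterion judge, all-criteria-must-pass task scoring, and post-run visibility into partial passes.

```python
# Hypothetical sketch of the judging scheme described above -- not Harvey's
# actual code. Each task carries binary criteria ("unit tests"); a dumb judge
# checks one criterion at a time against the agent's final output.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    text: str  # e.g. "Pass if memo identifies inconsistent publication count"

def judge_criterion(llm: Callable[[str], str], output: str, c: Criterion) -> bool:
    """Final output + one criterion in, PASS/FAIL out. No other judgment."""
    verdict = llm(
        f"Agent output:\n{output}\n\nCriterion: {c.text}\n"
        "Reply with exactly PASS or FAIL."
    )
    return verdict.strip().upper().startswith("PASS")

def judge_task(llm: Callable[[str], str], output: str, criteria: list[Criterion]) -> dict:
    results = [judge_criterion(llm, output, c) for c in criteria]
    return {
        "passed": all(results),           # task passes only if every criterion passes
        "criteria_passed": sum(results),  # post-run visibility, e.g. 18/20
        "criteria_total": len(results),
    }
```

Putting all the judgment into the criteria text (rather than a clever judge) is what makes the scheme auditable: any disputed result traces back to a single human-readable sentence.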
Gabe Pereyra@gabepereyra

x.com/i/article/2051…

@lowercasebryan retweeted
Max Junestrand@MaxJunestrand
Today we're announcing the Legora aOS™. It's something we've been building toward for three years, and I think it's the most important thing we've ever shipped.

The legal industry has had AI that assists with individual tasks. What it hasn't had is AI that drives entire work products from start to finish. The Legora aOS changes that. It's a single connected system – matter intake, research, drafting, review, service delivery – orchestrated by the new Legora Agent, running continuously, grounded in your organization's own knowledge. The legal teams who use it won't just be faster. They'll operate at a scale that simply wasn't possible before.

We've spent three years being told the legal industry moves too slowly to change. We've also spent three years watching it change faster than almost anyone predicted. The best time in history to be a lawyer starts today. @WeAreLegora is built to be the partner that makes it possible.

Read the full announcement: legora.com/newsroom/legor…
@lowercasebryan retweeted
Fireworks AI@FireworksAI_HQ
We’ve been working closely with the @harvey team on the launch of the Legal Agent Benchmark, a product focused on evaluating how open-weight models perform on long-horizon, real-world legal tasks. Check it out:
Gabe Pereyra@gabepereyra

x.com/i/article/2051…

@lowercasebryan retweeted
Sydney Runkle@sydneyrunkle
one of the features i'm most excited about in our upcoming langgraph release is delta channels! the langgraph runtime lets you "checkpoint" agent progress at every step (model call, tool call, hooks). the problem, though, is that checkpoints bloat quickly when context is long! delta channels mitigate this with diff-based storage from checkpoint to checkpoint. with delta channels, you still have a full history of agent progress, the only diff (haha get it) is the storage format. in-depth blog coming soon, but in the meantime, try it out and lmk what you think! docs.langchain.com/oss/python/lan…
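The storage idea is easy to picture. Here's a rough sketch of diff-based checkpointing (my illustration, not LangGraph's actual internals), assuming an append-only channel such as message history:

```python
# Illustration of diff-based checkpoint storage, not LangGraph's internals.
# Assumes an append-only channel (e.g. message history): each checkpoint
# stores only the new tail, and full state is rebuilt by replaying deltas.
class DeltaChannelStore:
    def __init__(self):
        self.deltas: list[list] = []  # one delta per checkpoint

    def checkpoint(self, state: list) -> None:
        prev = self.materialize()
        assert state[: len(prev)] == prev, "sketch assumes append-only state"
        self.deltas.append(state[len(prev):])  # store only what changed

    def materialize(self, upto: int | None = None) -> list:
        """Rebuild the full channel value at any checkpoint index."""
        out: list = []
        for delta in self.deltas[:upto]:
            out.extend(delta)
        return out

store = DeltaChannelStore()
store.checkpoint(["user: hi"])
store.checkpoint(["user: hi", "ai: hello", "tool: ..."])
assert store.materialize() == ["user: hi", "ai: hello", "tool: ..."]
assert store.materialize(upto=1) == ["user: hi"]  # full history still available
```

The point of the diff format: checkpoint cost grows with what changed per step, not with total context length, while every historical checkpoint stays reconstructable.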
@lowercasebryan retweeted
Arthur@UncannyOS
CopilotKit just gave the agent stack its third open-source layer.
> MCP lets agents use tools
> A2A lets agents talk to other agents
> AG-UI lets agents work with people inside software
Atai Barkai@ataiiam

We've raised $27M to build @CopilotKit — the Agentic Frontend Stack connecting humans & agents. Because all UI will be AI. Co-led by Glilot Capital, NfX and SignalFire.

@lowercasebryan retweeted
Viv@Vtrivedy10
I detected a bad Agent action, what do I do about it? this is pretty much the main question that will power the future's Human+Agent driven improvement loops:

Gather data -> Mine Errors -> Find out which piece(s) of the agent contribute to this behavior -> Apply Fix -> Test -> Loop

The most important boundary in agents is the context window; it's the box in which all LLM computation actually happens. The first thing you want to try is optimizing context engineering: no model can solve an issue without the necessary information. From there, work backwards all the way to swapping out or adding a model.

The loop is driven by running agents, Tracing + Monitoring them, and gathering feedback to classify, understand, fix, and test errors at scale. Every piece of data an Agent produces is a potential avenue to improve it; the dream is to help every team turn that data into actionable edits to improve agents over time and at scale.
Viv tweet media
Harrison Chase@hwchase17

x.com/i/article/2051…
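The loop above can be made concrete with a tiny sketch. Everything here is illustrative: the trace schema and the classify() heuristic are stand-ins, not any particular platform's API. It just shows the "mine errors, attribute to a component, fix context first" ordering.

```python
# Illustrative sketch of the improvement loop; the trace fields and the
# classify() heuristic are hypothetical stand-ins.
from collections import Counter

def classify(trace: dict) -> str | None:
    """Attribute a bad run to a component: context, tools, or model."""
    if trace["feedback"] >= 0.5:
        return None  # not an error
    if trace["missing_context"]:
        return "context"   # first suspect: did the model even see what it needed?
    if trace["tool_errors"]:
        return "tools"
    return "model"         # work backwards to the model only after the rest

def mine_errors(traces: list[dict]) -> Counter:
    """Gather data -> mine errors -> see which component to fix first."""
    return Counter(c for t in traces if (c := classify(t)) is not None)

traces = [
    {"feedback": 0.0, "missing_context": True,  "tool_errors": False},
    {"feedback": 1.0, "missing_context": False, "tool_errors": False},
    {"feedback": 0.0, "missing_context": False, "tool_errors": True},
]
print(mine_errors(traces))  # Counter({'context': 1, 'tools': 1})
```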

@lowercasebryan retweeted
Harrison Chase@hwchase17
you can run deepagents with a "virtual filesystem", which lets you do lots of great context engineering tricks without requiring an actual sandbox environment!
Rahul Rane@rahulvrane

@hwchase17 Where there's a struggle is that all of these harnesses require a disk or access to bash or something like that. If there's a way to run them in a headless way, that would be awesome .. maybe I've missed something
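A virtual filesystem is exactly the answer to that headless concern. Here's a minimal sketch of the idea (illustrative, not deepagents' actual implementation): back the agent's file tools with an in-memory dict, so no disk or bash is needed.

```python
# Illustrative sketch (not deepagents' actual implementation): a "virtual
# filesystem" backs the agent's read/write/edit tools with an in-memory dict,
# so file-based context engineering works headlessly, with no disk or bash.
class VirtualFS:
    def __init__(self):
        self.files: dict[str, str] = {}

    def write(self, path: str, content: str) -> str:
        self.files[path] = content
        return f"wrote {path}"

    def read(self, path: str) -> str:
        return self.files.get(path, f"error: {path} not found")

    def edit(self, path: str, old: str, new: str) -> str:
        if path not in self.files:
            return f"error: {path} not found"
        self.files[path] = self.files[path].replace(old, new, 1)
        return f"edited {path}"

# Each method can be exposed to the model as a tool; agent "files" (notes,
# plans, intermediate results) live purely in memory for the session.
fs = VirtualFS()
fs.write("plan.md", "1. search docs\n2. draft memo")
print(fs.read("plan.md"))
```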

@lowercasebryan retweeted
LangChain@LangChain
Want to run the same harness across multiple interfaces? Try ACP. Deep Agents ships with it out of the box.
Mason Daugherty@masondrxy

open-weight LLMs have come a long way on agent tasks! but the harness you wrap them in matters just as much as the model itself, and arguably the interface you use to drive that harness matters even more.

dev workflows are deeply personal. what works well for one developer may hinder another, so it's difficult to converge on a single UX that isn't either compromising or too generalized (e.g. CLI vs. TUI vs. GUI vs. IDE extension).

while it doesn't come without drawbacks, ACP is a solid stopgap for running the same harness across multiple interfaces. pick your frontend, keep your agent. deepagents ships with this out of the box -- two ways to plug it in:
- deepagents-acp is our standalone ACP server to serve *any* agent
- `deepagents-cli --acp` to use our existing CLI agent over ACP

point any ACP-compatible client at it and you've got the same deepagents harness, your choice of open-weight model & provider, and your choice of interface. some popular exemplars:
- `toad` is an agent-agnostic TUI that ships deepagents support built-in, made possible via ACP github.com/batrachianai/t… (@willmcgugan @textualizeio)
- you can use deepagents directly in any modern IDE, see this blog post from @jetbrains coauthored by our very own @Hacubu: blog.jetbrains.com/ai/2026/04/usi…

the model is yours to pick. the interface is yours to pick. the harness shouldn't be the thing that locks you in.

@lowercasebryan retweeted
Harrison Chase@hwchase17
agent observability is great. but in order to use it to power an agent improvement loop, you need to be collecting (and even generating) feedback data inside your agent observability platform
Harrison Chase tweet media
Harrison Chase@hwchase17

x.com/i/article/2051…
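What "collecting feedback inside your observability platform" can look like in practice: below is a minimal sketch using LangSmith's feedback API as one example (any tracing platform with a feedback endpoint works; the run_id comes from your tracing setup).

```python
# Sketch: attaching human feedback to a traced agent run so it can power an
# improvement loop. Uses LangSmith's Client.create_feedback as one example;
# this is an illustration, not the only way to wire this up.
from langsmith import Client

client = Client()  # reads LANGSMITH_API_KEY from the environment

def record_feedback(run_id: str, thumbs_up: bool, comment: str = "") -> None:
    # Feedback is stored alongside the trace, so error mining can later
    # filter runs by score and inspect exactly what the agent did.
    client.create_feedback(
        run_id,
        key="user_score",
        score=1.0 if thumbs_up else 0.0,
        comment=comment,
    )
```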

Hubert Thieblot@hthieblot
You just became a VC. You’ve got $1M to deploy. Who gets your money? Tag them. Or back yourself.
@lowercasebryan retweeted
LangChain@LangChain
Build agents with LangChain + @browserbase. Give your Deep Agents search, fetch, and browser subagents to access the full web. All with full observability via the Browserbase dashboard.
@lowercasebryan retweeted
Harrison Chase@hwchase17
one future trend i'm very excited by: models getting good enough that they can power agents that browse the web. deepagents + @browserbase is a glimpse of that future. See the full example here: github.com/browserbase/in…
Harrison Chase tweet media
LangChain@LangChain

Build agents with LangChain + @browserbase. Give your Deep Agents search, fetch, and browser subagents to access the full web. All with full observability via the Browserbase dashboard.

@lowercasebryan retweeted
cat@_catwu
Claude Security is now in public beta, built into Claude Code on the web. Point it at a repo, get validated vulnerability findings, and fix them in the same place you're already writing code: claude.com/product/claude…
@lowercasebryan retweeted
LangChain@LangChain
Should you use a sandbox for your agent? @ListenLabs Co-Founder & CTO @florian_jue shared what can go wrong on the Max Agency podcast hosted by @hwchase17.
@lowercasebryan retweeted
LangChain OSS@LangChain_OSS
Human in the loop (HITL) support is critical for sensitive workflows. We just shipped an update to our HITL middleware to support "ask user" style flows!
LangChain OSS tweet media
Sydney Runkle@sydneyrunkle

most of the time, you want an agent loop to run uninterrupted. that's where the utility comes from! but some decisions shouldn't be delegated to the agent. two situations come up consistently:

1/ before a consequential action, like sending an email, executing a transaction, or deleting files, you want to see exactly what the agent is about to do. approve it, edit it, or push back with feedback so it can revise and try again.

2/ when the agent hits a judgment call it can't resolve alone. not because it's missing a tool, but because the answer depends on your preference. "which config file should i modify?" or "should this go to staging or production?" your answer gets fed directly back into the run.

here's the part that matters for production: these pauses can last indefinitely. seconds, hours, days. that's only possible if the runtime persists state across the response gap. when the human responds, whenever that is, the agent reloads full context and continues from exactly where it stopped.

in langgraph, interrupt() saves state to a checkpointer and surfaces a payload to the caller. Command(resume=...) reloads it and picks up execution. langchain and deep agents build on top of those primitives with HITL middleware, so instead of wiring this yourself, you attach HITL policies directly to tool calls. docs.langchain.com/oss/python/lan…
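Here's a minimal runnable sketch of that interrupt/resume flow using LangGraph's documented primitives; the graph shape and payloads are illustrative.

```python
# Minimal sketch of interrupt()/Command(resume=...) in LangGraph; the graph
# shape and payloads are illustrative. A checkpointer is required so state
# survives the (possibly days-long) gap between pause and resume.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.types import interrupt, Command
from langgraph.checkpoint.memory import MemorySaver

class State(TypedDict):
    draft: str
    approved: bool

def review(state: State) -> dict:
    # Pauses the run: state is saved to the checkpointer and this payload
    # is surfaced to the caller. Execution resumes here with the human's answer.
    answer = interrupt({"draft": state["draft"], "question": "send this email?"})
    return {"approved": answer == "yes"}

builder = StateGraph(State)
builder.add_node("review", review)
builder.add_edge(START, "review")
builder.add_edge("review", END)
graph = builder.compile(checkpointer=MemorySaver())

config = {"configurable": {"thread_id": "run-1"}}
graph.invoke({"draft": "hi team...", "approved": False}, config)  # pauses at interrupt
# ...seconds, hours, or days later, the human responds:
result = graph.invoke(Command(resume="yes"), config)
print(result["approved"])  # True
```

The HITL middleware mentioned above layers policies on top of exactly this mechanism, so you declare which tool calls need approval instead of writing the pause/resume plumbing yourself.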
