@lowercasebryan

370 posts

@lower_case_b

Joined December 2024
788 Following · 27 Followers
@lowercasebryan retweeted
Winston Weinberg@winstonweinberg
Excited to share that Harvey was used to prepare for argument before the Supreme Court. We partnered with @neal_katyal to build Harvey Moot, which draws on historical questioning patterns, rulings, and opinions to simulate argument with each of the Supreme Court Justices. Neal used Harvey Moot to prepare for and win a landmark case this term. We're now rolling out Harvey Moot to our law school partners, so every law student can practice argument before the Supreme Court like Neal did.
Harvey@harvey

How does a seasoned Supreme Court lawyer prepare for the biggest case of his life? Using Harvey. Read how Harvey supported @neal_katyal in refining his arguments before the Supreme Court and how we are bringing those tools to law schools with Harvey Moot: harvey.ai/blog/the-supre…

@lowercasebryan retweeted
Matt Ambrogi@matt_ambrogi
My deep-dive analysis of @harvey's new Legal Agent Benchmark:

Model Evaluation
- First: this is a *model* benchmark, not a harness benchmark. The harness is very simple: no special system prompt for legal; standard bash, read, write, edit, glob, grep tools; a few skills for dealing with files.
- This is a tricky design decision. You want to isolate model evaluation, but if the harness strays too far from what you would actually use in production, eval results may not carry over. I think it's wise overall to simplify.

Tasks and Evaluation
- All tasks are one turn. No compaction or context engineering is built into the harness. The simplicity of a single turn is arguably a feature as a starting point, even if in the real world users are likely to ask follow-ups and make refinements.
- The evaluation criteria are very interesting and well designed. All judgment is put into detailed criteria sets per task, effectively unit tests, i.e. "Pass if memo identifies inconsistent publication count, Fail if not". The judge itself is dumb: it takes the final output and a criterion and returns pass/fail.
- A task only passes if all criteria pass. Makes sense for legal work. But there is post-run visibility to see that Task N passed 18/20 criteria, etc.
- Most notable here: the benchmark's quality is capped by the task criteria text. Poorly specified or missing criteria could tank the trustworthiness of the entire benchmark. Presumably they had heavy expert input on these criteria.

Environment Accuracy
- The benchmark is high quality but small scale. This is a big area for improvement imo, but it's tremendously hard to build accurate synthetic legal matters at scale.
- Each task is based in a matter (court case). The matters have documents, emails, spreadsheets, and PowerPoints.
- Docs per matter: median 7, P95 14. This is much smaller than in the real world. Emails are even worse. Total token size per matter is ~60k median, 120k P90. Again, very small.
- That being said, the content is extremely high quality, which is actually much more important than total size for this use case. Past a threshold you get into harness, not model, evaluation.
- But there is a local-maximum risk. This tests whether a model has strong built-in legal knowledge-work capabilities. It does not test a model's ability to search and synthesize huge amounts of data, which is equally important in law.

Engineering Tricks
- Everything is parallelized within reason (caps to avoid rate limits).
- Streaming is utilized to prevent timeouts.
- Secure sandboxed document-parsing implementation.
- Overall very well designed. A few small things would be nice to add; for example, if the agent stops, the reason is not logged right now (context limit hit? timeout? failure?).

Utility
- The most practical application of LAB is model evaluation on legal knowledge work.
- However, you could also repurpose this benchmark as a means of benchmarking different harnesses. One might keep the model constant and instead iterate on the harness to get an idea of what matters in legal. To make this really robust it would be important to have some matters with real-world-scale context.
Some things harness engineers might experiment with:
- Vectorize all documents and give the agent a semantic search tool
- A legal-specific system prompt
- Encouragement to use grep in parallel to search documents without reading an entire file into context
- Compare the performance of embedding-based RAG vs. just grep
- Pre-load short summaries of each doc in context
- Introduce subagent spawning to read docs in a separate context
- Cross-reference resolution prompting or a tool ("as defined in Section 3.2..")
- A code interpreter to handle xlsx files

But again, this is not meant to be a harness benchmark. Overall this is a very high quality benchmark. It is much harder to put together a high-quality environment of underlying data, tasks, and expected outputs in knowledge/legal work than it is for coding. The design decisions around judging are very smart. I think this will be enormously useful for the legal AI community.
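The criteria-unit-test design described above is simple enough to sketch. Below is a hypothetical reconstruction in Python; the names and prompt are illustrative stand-ins, not Harvey's actual code. It mirrors the described behavior: a deliberately dumb per-criterion judge, all-criteria-must-pass task scoring, and post-run visibility into partial passes.

```python
# Hypothetical sketch of the judging scheme described above -- not Harvey's
# actual code. Each task carries binary criteria ("unit tests"); a dumb judge
# checks one criterion at a time against the agent's final output.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    text: str  # e.g. "Pass if memo identifies inconsistent publication count"

def judge_criterion(llm: Callable[[str], str], output: str, c: Criterion) -> bool:
    """Final output + one criterion in, PASS/FAIL out. No other judgment."""
    verdict = llm(
        f"Agent output:\n{output}\n\nCriterion: {c.text}\n"
        "Reply with exactly PASS or FAIL."
    )
    return verdict.strip().upper().startswith("PASS")

def judge_task(llm: Callable[[str], str], output: str, criteria: list[Criterion]) -> dict:
    results = [judge_criterion(llm, output, c) for c in criteria]
    return {
        "passed": all(results),           # task passes only if every criterion passes
        "criteria_passed": sum(results),  # post-run visibility, e.g. 18/20
        "criteria_total": len(results),
    }
```

Putting all the judgment into the criteria text (rather than a clever judge) is what makes the scheme auditable: any disputed result traces back to a single human-readable sentence.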
Gabe Pereyra@gabepereyra

x.com/i/article/2051…

@lowercasebryan retweeted
Max Junestrand@MaxJunestrand
Today we're announcing the Legora aOS™. It's something we've been building toward for three years, and I think it's the most important thing we've ever shipped.

The legal industry has had AI that assists with individual tasks. What it hasn't had is AI that drives entire work products from start to finish. The Legora aOS changes that. It's a single connected system – matter intake, research, drafting, review, service delivery – orchestrated by the new Legora Agent, running continuously, grounded in your organization's own knowledge. The legal teams who use it won't just be faster. They'll operate at a scale that simply wasn't possible before.

We've spent three years being told the legal industry moves too slowly to change. We've also spent three years watching it change faster than almost anyone predicted. The best time in history to be a lawyer starts today. @WeAreLegora is built to be the partner that makes it possible.

Read the full announcement: legora.com/newsroom/legor…
@lowercasebryan retweeted
Fireworks AI@FireworksAI_HQ
We’ve been working closely with the @harvey team on the launch of the Legal Agent Benchmark, a product focused on evaluating how open-weight models perform on long-horizon, real-world legal tasks. Check it out:
Gabe Pereyra@gabepereyra

x.com/i/article/2051…

@lowercasebryan retweeted
Sydney Runkle@sydneyrunkle
one of the features i'm most excited about in our upcoming langgraph release is delta channels! the langgraph runtime lets you "checkpoint" agent progress at every step (model call, tool call, hooks). the problem, though, is that checkpoints bloat quickly when context is long! delta channels mitigate this with diff-based storage from checkpoint to checkpoint. with delta channels, you still have a full history of agent progress, the only diff (haha get it) is the storage format. in-depth blog coming soon, but in the meantime, try it out and lmk what you think! docs.langchain.com/oss/python/lan…
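The storage idea is easy to picture. Here's a rough sketch of diff-based checkpointing (my illustration, not LangGraph's actual internals), assuming an append-only channel such as message history:

```python
# Illustration of diff-based checkpoint storage, not LangGraph's internals.
# Assumes an append-only channel (e.g. message history): each checkpoint
# stores only the new tail, and full state is rebuilt by replaying deltas.
class DeltaChannelStore:
    def __init__(self):
        self.deltas: list[list] = []  # one delta per checkpoint

    def checkpoint(self, state: list) -> None:
        prev = self.materialize()
        assert state[: len(prev)] == prev, "sketch assumes append-only state"
        self.deltas.append(state[len(prev):])  # store only what changed

    def materialize(self, upto: int | None = None) -> list:
        """Rebuild the full channel value at any checkpoint index."""
        out: list = []
        for delta in self.deltas[:upto]:
            out.extend(delta)
        return out

store = DeltaChannelStore()
store.checkpoint(["user: hi"])
store.checkpoint(["user: hi", "ai: hello", "tool: ..."])
assert store.materialize() == ["user: hi", "ai: hello", "tool: ..."]
assert store.materialize(upto=1) == ["user: hi"]  # full history still available
```

The point of the diff format: checkpoint cost grows with what changed per step, not with total context length, while every historical checkpoint stays reconstructable.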
@lowercasebryan retweeted
Arthur@UncannyOS
CopilotKit just gave the agent stack its third open-source layer.
> MCP lets agents use tools
> A2A lets agents talk to other agents
> AG-UI lets agents work with people inside software
Atai Barkai@ataiiam

We've raised $27M to build @CopilotKit — the Agentic Frontend Stack connecting humans & agents. Because all UI will be AI. Co-led by Glilot Capital, NfX and SignalFire.

@lowercasebryan retweeted
Viv@Vtrivedy10
I detected a bad Agent action, what do I do about it? this is pretty much the main question that will power the future's Human+Agent driven improvement loops:

Gather data -> Mine Errors -> Find out which piece(s) of the agent contribute to this behavior -> Apply Fix -> Test -> Loop

The most important boundary in agents is the context window; it's the box in which all LLM computation actually happens. The first thing you want to try is optimizing context engineering: no model can solve an issue without the necessary information. From there, work backwards all the way to swapping out or adding a model.

The loop is driven by running agents, Tracing + Monitoring them, and gathering feedback to classify, understand, fix, and test errors at scale. Every piece of data an Agent produces is a potential avenue to improve it; the dream is to help every team turn that data into actionable edits to improve agents over time and at scale.
Viv tweet media
Harrison Chase@hwchase17

x.com/i/article/2051…
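The loop above can be made concrete with a tiny sketch. Everything here is illustrative: the trace schema and the classify() heuristic are stand-ins, not any particular platform's API. It just shows the "mine errors, attribute to a component, fix context first" ordering.

```python
# Illustrative sketch of the improvement loop; the trace fields and the
# classify() heuristic are hypothetical stand-ins.
from collections import Counter

def classify(trace: dict) -> str | None:
    """Attribute a bad run to a component: context, tools, or model."""
    if trace["feedback"] >= 0.5:
        return None  # not an error
    if trace["missing_context"]:
        return "context"   # first suspect: did the model even see what it needed?
    if trace["tool_errors"]:
        return "tools"
    return "model"         # work backwards to the model only after the rest

def mine_errors(traces: list[dict]) -> Counter:
    """Gather data -> mine errors -> see which component to fix first."""
    return Counter(c for t in traces if (c := classify(t)) is not None)

traces = [
    {"feedback": 0.0, "missing_context": True,  "tool_errors": False},
    {"feedback": 1.0, "missing_context": False, "tool_errors": False},
    {"feedback": 0.0, "missing_context": False, "tool_errors": True},
]
print(mine_errors(traces))  # Counter({'context': 1, 'tools': 1})
```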

@lowercasebryan retweeted
Harrison Chase@hwchase17
you can run deepagents with a "virtual filesystem", which lets you do lots of great context engineering tricks without requiring an actual sandbox environment!
Rahul Rane@rahulvrane

@hwchase17 Where there's a struggle is that all of these harnesses require a disk or access to bash or something like that. If there's a way to run them in a headless way, that would be awesome .. maybe I've missed something
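A virtual filesystem is exactly the answer to that headless concern. Here's a minimal sketch of the idea (illustrative, not deepagents' actual implementation): back the agent's file tools with an in-memory dict, so no disk or bash is needed.

```python
# Illustrative sketch (not deepagents' actual implementation): a "virtual
# filesystem" backs the agent's read/write/edit tools with an in-memory dict,
# so file-based context engineering works headlessly, with no disk or bash.
class VirtualFS:
    def __init__(self):
        self.files: dict[str, str] = {}

    def write(self, path: str, content: str) -> str:
        self.files[path] = content
        return f"wrote {path}"

    def read(self, path: str) -> str:
        return self.files.get(path, f"error: {path} not found")

    def edit(self, path: str, old: str, new: str) -> str:
        if path not in self.files:
            return f"error: {path} not found"
        self.files[path] = self.files[path].replace(old, new, 1)
        return f"edited {path}"

# Each method can be exposed to the model as a tool; agent "files" (notes,
# plans, intermediate results) live purely in memory for the session.
fs = VirtualFS()
fs.write("plan.md", "1. search docs\n2. draft memo")
print(fs.read("plan.md"))
```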

@lowercasebryan retweeted
LangChain@LangChain
Want to run the same harness across multiple interfaces? Try ACP. Deep Agents ships with it out of the box.
Mason Daugherty@masondrxy

open-weight LLMs have come a long way on agent tasks! but the harness you wrap them in matters just as much as the model itself, and arguably the interface you use to drive that harness matters even more.

dev workflows are deeply personal. what works well for one developer may hinder another, so it's difficult to converge on a single UX that isn't either compromising or too generalized (e.g. CLI vs. TUI vs. GUI vs. IDE extension).

while it doesn't come without drawbacks, ACP is a solid stopgap for running the same harness across multiple interfaces. pick your frontend, keep your agent. deepagents ships with this out of the box -- two ways to plug it in:
- deepagents-acp is our standalone ACP server to serve *any* agent
- `deepagents-cli --acp` to use our existing CLI agent over ACP

point any ACP-compatible client at it and you've got the same deepagents harness, your choice of open-weight model & provider, and your choice of interface. some popular exemplars:
- `toad` is an agent-agnostic TUI that ships deepagents support built-in, made possible via ACP github.com/batrachianai/t… (@willmcgugan @textualizeio)
- you can use deepagents directly in any modern IDE, see this blog post from @jetbrains coauthored by our very own @Hacubu: blog.jetbrains.com/ai/2026/04/usi…

the model is yours to pick. the interface is yours to pick. the harness shouldn't be the thing that locks you in.

@lowercasebryan retweeted
Harrison Chase@hwchase17
agent observability is great. but in order to use it to power an agent improvement loop, you need to be collecting (and even generating) feedback data inside your agent observability platform
Harrison Chase tweet media
Harrison Chase@hwchase17

x.com/i/article/2051…
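What "collecting feedback inside your observability platform" can look like in practice: below is a minimal sketch using LangSmith's feedback API as one example (any tracing platform with a feedback endpoint works; the run_id comes from your tracing setup).

```python
# Sketch: attaching human feedback to a traced agent run so it can power an
# improvement loop. Uses LangSmith's Client.create_feedback as one example;
# this is an illustration, not the only way to wire this up.
from langsmith import Client

client = Client()  # reads LANGSMITH_API_KEY from the environment

def record_feedback(run_id: str, thumbs_up: bool, comment: str = "") -> None:
    # Feedback is stored alongside the trace, so error mining can later
    # filter runs by score and inspect exactly what the agent did.
    client.create_feedback(
        run_id,
        key="user_score",
        score=1.0 if thumbs_up else 0.0,
        comment=comment,
    )
```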

Hubert Thieblot@hthieblot
You just became a VC. You’ve got $1M to deploy. Who gets your money? Tag them. Or back yourself.
@lowercasebryan retweeted
LangChain@LangChain
Build agents with LangChain + @browserbase. Give your Deep Agents search, fetch, and browser subagents to access the full web. All with full observability via the Browserbase dashboard.
@lowercasebryan retweeted
Harrison Chase@hwchase17
one future trend i'm very excited by: models getting good enough that they can power agents that browse the web. deepagents + @browserbase is a glimpse of that future. See the full example here: github.com/browserbase/in…
Harrison Chase tweet media
LangChain@LangChain

Build agents with LangChain + @browserbase. Give your Deep Agents search, fetch, and browser subagents to access the full web. All with full observability via the Browserbase dashboard.

@lowercasebryan retweeted
cat@_catwu
Claude Security is now in public beta, built into Claude Code on the web. Point it at a repo, get validated vulnerability findings, and fix them in the same place you're already writing code: claude.com/product/claude…
@lowercasebryan retweeted
LangChain@LangChain
Should you use a sandbox for your agent? @ListenLabs Co-Founder & CTO @florian_jue shared what can go wrong on the Max Agency podcast hosted by @hwchase17.
@lowercasebryan retweeted
LangChain OSS@LangChain_OSS
Human in the loop (HITL) support is critical for sensitive workflows. We just shipped an update to our HITL middleware to support "ask user" style flows!
LangChain OSS tweet media
Sydney Runkle@sydneyrunkle

most of the time, you want an agent loop to run uninterrupted. that's where the utility comes from! but some decisions shouldn't be delegated to the agent. two situations come up consistently:

1/ before a consequential action, like sending an email, executing a transaction, or deleting files, you want to see exactly what the agent is about to do. approve it, edit it, or push back with feedback so it can revise and try again.

2/ when the agent hits a judgment call it can't resolve alone. not because it's missing a tool, but because the answer depends on your preference. "which config file should i modify?" or "should this go to staging or production?" your answer gets fed directly back into the run.

here's the part that matters for production: these pauses can last indefinitely. seconds, hours, days. that's only possible if the runtime persists state across the response gap. when the human responds, whenever that is, the agent reloads full context and continues from exactly where it stopped.

in langgraph, interrupt() saves state to a checkpointer and surfaces a payload to the caller. Command(resume=...) reloads it and picks up execution. langchain and deep agents build on top of those primitives with HITL middleware, so instead of wiring this yourself, you attach HITL policies directly to tool calls. docs.langchain.com/oss/python/lan…
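Here's a minimal runnable sketch of that interrupt/resume flow using LangGraph's documented primitives; the graph shape and payloads are illustrative.

```python
# Minimal sketch of interrupt()/Command(resume=...) in LangGraph; the graph
# shape and payloads are illustrative. A checkpointer is required so state
# survives the (possibly days-long) gap between pause and resume.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.types import interrupt, Command
from langgraph.checkpoint.memory import MemorySaver

class State(TypedDict):
    draft: str
    approved: bool

def review(state: State) -> dict:
    # Pauses the run: state is saved to the checkpointer and this payload
    # is surfaced to the caller. Execution resumes here with the human's answer.
    answer = interrupt({"draft": state["draft"], "question": "send this email?"})
    return {"approved": answer == "yes"}

builder = StateGraph(State)
builder.add_node("review", review)
builder.add_edge(START, "review")
builder.add_edge("review", END)
graph = builder.compile(checkpointer=MemorySaver())

config = {"configurable": {"thread_id": "run-1"}}
graph.invoke({"draft": "hi team...", "approved": False}, config)  # pauses at interrupt
# ...seconds, hours, or days later, the human responds:
result = graph.invoke(Command(resume="yes"), config)
print(result["approved"])  # True
```

The HITL middleware mentioned above layers policies on top of exactly this mechanism, so you declare which tool calls need approval instead of writing the pause/resume plumbing yourself.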
