Jianzhu Yao

33 posts


@alexbert135

PhD student at Princeton University

Joined October 2020
273 Following · 82 Followers
Jianzhu Yao
Jianzhu Yao@alexbert135·
@moustafafayez Thanks for your feedback! You can use the regional NVBit tool to get the memory trace for that region (by inspecting the runtime SASS), and separately use the tracing tool to get its execution time. After merging the two, I think you can derive the bandwidth.
Mustafa Ali
Mustafa Ali@moustafafayez·
@alexbert135 Very good tool, does it show memory bw / cap util in each of the kernel phases?
Jianzhu Yao
Jianzhu Yao@alexbert135·
Open-sourced IKP: Intra-Kernel Profiler for CUDA kernels. Most GPU profilers tell you what happened at the kernel level. IKP shows what happened inside the kernel, for developers and for agents. Repo: github.com/yao-jz/intra-k… #GPU #Profiling #CUDA
Jianzhu Yao
Jianzhu Yao@alexbert135·
@Karthickhps All the instrumentation here targets CUDA kernels. For Triton: for the tracing part, you can use Proton. You can also use CUPTI and NVBit directly to profile the whole kernel, but as far as I know the intra-kernel instrumentation for NVBit has not been implemented.
Jianzhu Yao
Jianzhu Yao@alexbert135·
Interactive HTML dashboard:
1. Source/PTX/SASS side by side.
2. Region-level execution, memory, and stall views.
3. Click a source line and see attributed metrics.
4. Structured metric outputs make it useful for agentic workflows and automated performance analysis pipelines.
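A structured metric report like that is easy for a script or agent to consume. A minimal sketch, assuming a JSON report with per-region cycle and stall fields (the schema and field names here are illustrative, not IKP's actual output format):

```python
import json

# Hypothetical structured metric report for one kernel.
report = json.loads("""
{
  "kernel": "gemm_kernel",
  "regions": [
    {"name": "load_tiles",  "cycles": 1200, "stall_pct": 41.0},
    {"name": "mma_compute", "cycles": 3400, "stall_pct": 12.5},
    {"name": "store_out",   "cycles": 600,  "stall_pct": 55.0}
  ]
}
""")

# An agent (or automated pipeline) can rank regions by stall percentage
# to decide where to focus optimization effort.
worst = max(report["regions"], key=lambda r: r["stall_pct"])
print(worst["name"])  # → store_out
```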
Jianzhu Yao
Jianzhu Yao@alexbert135·
The key idea is to combine NVBit and CUPTI: NVBit maps SASS PCs to user-defined regions, CUPTI collects metrics per PC, and IKP joins them into region-level hardware performance data.
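That join can be sketched as a simple aggregation; the PCs, region labels, and metric values below are made up for illustration, not the actual NVBit/CUPTI data structures:

```python
from collections import defaultdict

# NVBit side: SASS program counter -> user-defined region label.
pc_to_region = {0x00: "load", 0x10: "load", 0x20: "compute", 0x30: "compute"}

# CUPTI side: per-PC metric samples, e.g. stall cycles.
pc_metrics = {0x00: 120, 0x10: 80, 0x20: 300, 0x30: 150}

# Join: aggregate per-PC metrics into region-level totals.
region_totals = defaultdict(int)
for pc, value in pc_metrics.items():
    region_totals[pc_to_region[pc]] += value

print(dict(region_totals))  # → {'load': 200, 'compute': 450}
```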
Jianzhu Yao reposted
Yunfei Xie
Yunfei Xie@xiynfi1520580·
🔥 LLMs keep losing at multi-turn games because they forget what they learned between rounds. We built MEMO, a self-play framework where LLMs self-evolve into stronger game players through memory and experience alone.

The idea:
1️⃣ LLMs play multi-turn games via self-play
2️⃣ A memory bank distills wins & losses into reusable strategic insights
3️⃣ Lessons accumulate across games and get tested in the next generation
4️⃣ Repeat. The agent gets smarter, round after round.

Results across 5 text-based games:
📈 GPT-4o-mini: 25% → 50% win rate
📈 Qwen-2.5-7B: 21% → 44% win rate
📉 Run-to-run variance drops 7x
With significantly fewer games, MEMO matches RL performance. 🧵👇

📄 Paper: arxiv.org/abs/2603.09022
🤗 HuggingFace: huggingface.co/papers/2603.09…
💻 Code: github.com/openverse-ai/M…
Daniel Vega-Myhre
Daniel Vega-Myhre@vega_myhre·
can anyone recommend an intra-kernel profiler with a timeline feature that I can use to validate proper overlapping is occurring in a pipelined implementation of a CUDA kernel where the intent is to overlap “TMA load of chunk N+1” with “process and TMA store of chunk N?”
Jianzhu Yao reposted
Kevin Wang
Kevin Wang@KevinWang_111·
It has been a remarkable journey with MindGames Challenge Workshop. Great meeting many of you IRL & discussing insights! We also had Inspiring talks from @jeffclune, @KaiqingZhang, @jaseweston on self-improving LLMs & LLMs in interactive envs. Summary + full resources below.
Jianzhu Yao reposted
Kevin Wang
Kevin Wang@KevinWang_111·
Join us at NeurIPS 2025 for the MindGames Challenge Workshop! Explore theory of mind, game intelligence, and multi-agent LLMs in interactive game environments. 🗓 Sunday, December 7 ⏰ 8:00–10:45 AM 📍 San Diego Convention Center, Ballroom 6CF
Jianzhu Yao
Jianzhu Yao@alexbert135·
@oguzer90 Sure. In your step 3.2, there are some misunderstandings: 1. During the dispute (before phase 3), we don't require them to share the inputs for each operator. 2. If both computations are individually correct, then we will never reach that operator in the disagreement check.
Oguzhan Ersoy
Oguzhan Ersoy@oguzer90·
"In the dispute protocol, we allow the outputs to be different by a little margin, and only call "outside the threshold" as disagreement." Thank you for the explanation. The attack that I mentioned in step 3.2 also assumes this. I suggest we hop on a call to clarify our points, I've already DM'ed you.
Jianzhu Yao
Jianzhu Yao@alexbert135·
🔥 Introducing our paper: Nondeterminism-Aware Optimistic Verification for Floating-Point Neural Networks 🔥 😈 Cloud/marketplace ML can silently downgrade or contaminate your results (model swap, early exit, quantization). You can’t verify what really ran: GPUs are nondeterministic.
Jianzhu Yao
Jianzhu Yao@alexbert135·
@oguzer90 You can refer to Section 4.2 for the construction of the empirical error threshold and Section 6.3 for more detail on the dispute protocol. In the dispute protocol, we allow the outputs to differ by a small margin, and only treat "outside the threshold" as disagreement.
Oguzhan Ersoy
Oguzhan Ersoy@oguzer90·
Thanks for the reply! In the paper, it’s stated that empirical thresholds are tighter than the theoretical ones. However, your explanation seems to suggest the opposite. Could you point to where this is explained in the paper? That said, the same attack would apply either way. If the dispute mechanism checks only for differences above the threshold, it cannot guarantee input equality, which is critical for an efficient dispute. Could you explain how the dispute protocol handles the issue described in point 3 of the example?
Jianzhu Yao
Jianzhu Yao@alexbert135·
@oguzer90 The first disagreement will happen at operator x only when a compute provider adds a malicious perturbation at operator x.
Jianzhu Yao
Jianzhu Yao@alexbert135·
@oguzer90 The disagreement is not defined by "the outputs are different." It is defined by "the outputs differ by more than a threshold." The calibration of the threshold guarantees that the first disagreement must be the starting point of the malicious behavior.
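A toy sketch of that threshold-based localization (my paraphrase of the idea, not the paper's reference implementation): scan intermediate outputs in topological order and report the first operator whose outputs differ by more than the calibrated threshold.

```python
def first_disagreement(outs_a, outs_b, thresholds):
    """Return the index of the first above-threshold disagreement, or None."""
    for i, (a, b, delta) in enumerate(zip(outs_a, outs_b, thresholds)):
        if abs(a - b) > delta:
            return i  # suspected starting point of the malicious behavior
    return None  # all differences within tolerance: no disagreement

# Small deviations below threshold are tolerated; operator 2 exceeds it.
outs_a = [1.000, 2.000, 3.000]
outs_b = [1.001, 2.001, 3.500]
print(first_disagreement(outs_a, outs_b, [0.01, 0.01, 0.01]))  # → 2
```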
Jianzhu Yao
Jianzhu Yao@alexbert135·
@oguzer90 Yes, the empirical thresholds are tighter than the theoretical ones (Figure 6). My explanation above does not refer to numerical magnitude: the theoretical error for a single operator is much larger than the empirical one (which accounts for propagation error), by construction of the theory.
Jianzhu Yao
Jianzhu Yao@alexbert135·
@oguzer90 Thanks for your example. The theoretical errors are established for a single operator, but the empirical error thresholds are constructed for the propagated error. The dispute is guided by the empirical thresholds, so error propagation is taken into account.
Oguzhan Ersoy
Oguzhan Ersoy@oguzer90·
The error threshold only guarantees maliciousness within a single operator; it does not ensure the detection of propagating malicious or honest deviations that accumulate over time. Let me elaborate with an example:

Scenario:
1. Assume that compute provider CP^1 executes the inference task and generates the final result y^1, and let's say the intermediate outputs of each operator (represented in topological order) were out_1^1, …, out_n^1.
2. Let's say another provider, CP^2, re-executes the same input and generates the result y^2, and its intermediate outputs were out_1^2, …, out_n^2.
3. Now assume that the final results are different (and above the threshold), i.e., | y^1 - y^2 | > Delta. Then the dispute needs to start.

Dispute:
1. The proposed algorithm requires providers to re-execute and share intermediate outputs to localise the dispute.
1.1 Problem: Because of non-determinism, even if the operation is re-executed on the same GPU, the outputs can differ; thus a compute provider may not be able to reproduce the intermediate outputs of the previous execution. One possible cause could be caching or thread access times on the GPU across executions. Because of the non-determinism, even honest providers may fail to provide intermediate outputs that correspond to their previously published commitments.
2. Ignoring the determinism problem described above, let's assume both providers can reproduce their initial results.
3. The localisation of the dispute will find the first disagreement. Here, the localisation can be used to find the first disagreement ever (say at out_1), or the first disagreement that is above the threshold (say at out_10). However, checking the first disagreement is not sufficient to verify the rest of the execution. Since there is no further verification beyond this point, subsequent malicious behavior would remain undetected.
3.1 If the localisation is used to find the first disagreement ever, Problem: even though the outputs are different, they can be within the acceptable range (i.e., | out_1^1 - out_1^2 | << Delta). Thus, both can be accepted.
3.2 If the localisation is used to find the first disagreement that is above the threshold, Problem: again, both can be accepted because of the unequal inputs. Assume such a disagreement occurs at out_10, i.e., | out_i^1 - out_i^2 | < Delta for i < 10, but | out_10^1 - out_10^2 | > Delta. Since the protocol checks for the first disagreement above the threshold, the inputs of that operator are already different at both providers (and both were accepted, since they are within the threshold). At this point, even obtaining the inputs from both providers will not solve the problem, because both computations may be individually correct. The reason is that the error propagates over time, as also mentioned in the paper, Section 4.1.
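The propagation concern in 3.2 can be illustrated with a toy simulation (synthetic numbers, not from the paper): each step is individually within tolerance, yet the accumulated difference eventually crosses a fixed threshold, and by then the two providers' inputs to that operator already differ.

```python
# Toy model: each operator multiplies by 1.1; provider B's hardware adds a
# tiny benign relative deviation per step. Every step is individually
# "correct", yet the drift eventually exceeds a fixed threshold Delta.
DELTA = 0.05
x_a = x_b = 1.0
first_over = None
for i in range(1, 21):
    x_a = x_a * 1.1
    x_b = x_b * 1.1 * 1.001  # benign per-operator nondeterminism
    if first_over is None and abs(x_a - x_b) > DELTA:
        first_over = i  # inputs to operator i already differed here
print(first_over)  # → 14
```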