Underfox

17.6K posts

@Underfox3

Physicist, Telecom Engineering lover, HPC Enthusiast. Prog Rock/Metal fan.

Joined December 2017

129 Following, 9.6K Followers

Pinned Tweet

Underfox@Underfox3·30 Jul

Researchers have developed a new simulator that predicts the throughput of basic blocks on all Intel Core μarchs released in the last decade and is more accurate than state-of-the-art tools by more than an order of magnitude. arxiv.org/pdf/2107.14210…

Underfox@Underfox3·1h

@TDevilfish Nvidia would never allow a GROMACS benchmark on this before securing its sales... In the end, it's not surprising at all.

TB-303 Devilfish@TDevilfish·1h

shilling is no bueno

Phoronix@phoronix

NVIDIA @nvidia Vera CPU Benchmarks: Olympus Cores Delivering The Best Performance Ever Seen On ARM. Exclusive first public benchmarks of NVIDIA's new Vera CPU. phoronix.com/review/nvidia-…

Underfox@Underfox3·1h

Beyond the impressive gen-to-gen performance improvement, which closes the gap between Nvidia CPUs and their direct competitors, we also need to see what the final price of the product will be. We will certainly soon see both AMD and Intel begin to move in response to this.

Phoronix@phoronix

NVIDIA @nvidia Vera CPU Benchmarks: Olympus Cores Delivering The Best Performance Ever Seen On ARM. Exclusive first public benchmarks of NVIDIA's new Vera CPU. phoronix.com/review/nvidia-…

Underfox@Underfox3·8h

Furthermore, TritonMoE maintains cross-platform portability, validated on both NVIDIA A100 and AMD MI300X.

Underfox@Underfox3·8h

The results show that, on an NVIDIA A100, TritonMoE achieves 89% to 131% of the throughput of the CUDA-optimized Megablocks at inference batch sizes (≤512 tokens) across Mixtral-8x7B, DeepSeek-V3, and Qwen2-MoE configurations.

Underfox@Underfox3·8h

This paper presents TritonMoE, a fused MoE dispatch kernel written entirely in OpenAI Triton that performs the complete forward pass using only portable Triton primitives. arxiv.org/pdf/2605.23911

Underfox retweeted

Luca Benini@LucaBeniniZhFe·8h

It's not easy to outperform Moore using 3D logic folding (or 3D-IC): you need to align many planets. CMOS2.0 is the program initiated by @imec_int with top research partners to address the key challenges. See CMOS2.0 position paper with solid data here: arxiv.org/abs/2510.04535

Underfox@Underfox3

Nothing that Huawei has presented was groundbreaking to those truly familiar with semiconductors; even the LogicFolding strategy is not really big news. In fact, DARPA has been testing this strategy since 2017 in the FRANC program. top500.org/news/darpa-pic…

Underfox@Underfox3·8h

These findings highlight the importance of holistic, system-level power management for sustainable AI infrastructure. We hope these insights will guide future efforts in designing efficient, scalable AI datacenters.

Underfox@Underfox3·8h

This work presents detailed power measurements for a 150 MW datacenter hosting a cluster of 83K GB200 GPUs connected through an RDMA back-end network.

Underfox@Underfox3·8h

Meta researchers described the end-to-end power management process for a hyperscale AI datacenter, from early power planning to tuning power settings after large-scale deployment, and finally to dynamic, runtime power management for evolving workloads. arxiv.org/pdf/2605.24461

Underfox@Underfox3·8h

Together, the TPU configuration is 1.82x cheaper for a representative train-plus service workload. GitHub: github.com/h2loop/gemma-t…

Underfox@Underfox3·8h

Compared with a 2x H100 GPU baseline under identical hyperparameters, TPU training completes 1.61x faster at 2.12x lower cost. Inference throughput is within 3% across platforms, while TPU achieves 2x lower time-to-first-token.

Underfox@Underfox3·8h

This paper presents the first end-to-end demonstration of fine-tuning and serving Google’s Gemma 4 31B model on TPU hardware, providing an empirical comparison of TPU and GPU platforms for LLM adaptation. arxiv.org/pdf/2605.25645

Underfox@Underfox3·12h

Even when the pulse repetition frequencies of all terminals are the same, the proposed scheme can utilize the slight random drift between terminals to recover high-fidelity information.

Underfox@Underfox3·12h

The results show that the proposed scheme has wide frequency adaptability, which allows it to separate mixed signals with modulation-rate differences ranging from several million hertz to a few hertz.

Underfox@Underfox3·12h

Researchers have experimentally demonstrated a single-photon Fourier transform scheme that exploits the implicit correlation shared in the photon stream to separate mixed weak signals with high fidelity, even in extreme environments. #optics arxiv.org/pdf/2605.23611

Underfox@Underfox3·12h

"On pure performance when normalized on the SIMD/Vector length MCv3 on its peak efficiency point (16 cores) achieves 46% performance of Intel Sapphire Rapids server and 91% performance of NVIDIA Grace CPU superchip."

Underfox@Underfox3·12h

The evaluation results show that the SG2044 more than doubles single-core performance and improves scalability compared to the SG2042 (MCv2).

Underfox@Underfox3·12h

This brief paper presents Monte Cimone v3, the third iteration of the Monte Cimone RISC-V HPC cluster, showing that commercially available RISC-V compute nodes are closing the gap with their competitors in the HPC segment. #HPC arxiv.org/pdf/2605.22831
