@lhl@randomfoo.net

@lhl

Moved to the Fediverse @lhl@randomfoo.net

Joined November 2006
1.5K Following 2.2K Followers
Ayushi
Ayushi@A_y_u_s_h_i_X·
This post makes me concerned, not because of a poor understanding of how hallucinations work but because of the serious implications of domain experts casually making public claims about AI reliability.

Let's take a very small case. I asked all the current frontier models - GPT-4, Claude, Gemini - a straightforward question about Indian tax law: "Is salary taxed on payment or accrual in India?" I'm attaching screenshots showing that every single model confidently claimed that salary is taxed on a "due or receipt basis, whichever is earlier," citing Section 15 of the Income Tax Act. The responses were well-formatted, professional, and definitive. They were also wrong. Section 192(1) clearly states: "Any person responsible for paying any income chargeable under the head 'Salaries' shall, at the time of payment, deduct income-tax on the amount payable..." A subtle but critical distinction that matters in practice. Feel free to try this query in different ways - "When is salary taxed in India", etc. - and the response will be similar. I can give you hundreds of such "trivial" cases.

I can go on about how this nearly cost someone $10K in tax overpaid on salary income that was never actually paid to them, but I want to stick to the point of how dangerous this precedent is given how it's being used in practice, and how opinions like these get circulated and interpreted in the larger public domain. An expert lawyer who verifies every output against source material will have a completely different experience than a paralegal who trusts the output directly, or a small business owner trying to understand their tax obligations. And I can confidently say it's not just inexperienced users who are being misguided; a lot of "senior" professionals are too. Just imagine the scale at which misinformation is compounding - the person seeking advice, the person giving advice, and the person verifying the advice are all using these tools without proper judgement.
Just to demonstrate how misinformation spreads: 2 hours ago @grok summarised this thread, and the headline on X was "lawyer claims hallucinations are solved in GPT 5.2" (I should have saved a screenshot). It has since updated that to "Debate Heats Up on Whether GPT-5.2 Pro Has Conquered AI Hallucinations", which I am adding as a screenshot. That's going to be picked up by a lot of sources and floated around in a lot of different contexts depending on personal interests and what people stand to gain.

Hallucinations are still very real and very prominent. The needle-in-a-haystack problem, i.e., retrieving correct specific information from large contexts (just demonstrated even for the most trivial cases), remains fundamentally unsolved. The problem is that most opinions floating around about AI reliability are anecdotal, instance-specific, and heavily dependent on how you use these tools versus how another (lay) person uses them. Models are better now at sounding authoritative, which paradoxically makes them more dangerous when they're wrong, because users have fewer signals that something might be incorrect, and most people never care to dig deeper. I really hope this gets taken more seriously.
Ayushi tweet media
English
4
6
19
1.4K
Gary Marcus
Gary Marcus@GaryMarcus·
How did this work out? Are LLM hallucinations largely gone by now? So now the @FT platforms the same guy saying most of the tasks lawyers and accountants do will be replaced in 12-18 months? From the same company that said that GPT-5 would be a giant humpback whale that would blow away PhDs? Where is the accountability? The concern about CEOs’ conflicts of interest in selling these narratives? The view from skeptics?
Mustafa Suleyman@mustafasuleyman

LLM hallucinations will be largely eliminated by 2025. that’s a huge deal. the implications are far more profound than the threat of the models getting things a bit wrong today.

English
134
191
1.6K
246.9K
@lhl@randomfoo.net
@TheZachMueller While you're doing RW tests, would you mind attention-gym/nvbandwidth/memtest_vulkan on these if they're easy to script? (I think repo/dataset actually great, especially if it's easy for people to fork/PR into)
English
0
0
0
279
Zach Mueller
Zach Mueller@TheZachMueller·
Working through the list, but here are MAMF numbers for all the 6000 series and the 3090, 4090, 5090 (base series). The 6000 series followed a trend vs the same-series consumer card. Then the Blackwell (non max-q) showed up. NVIDIA really made the Blackwell something special.
Zach Mueller tweet media
Zach Mueller@TheZachMueller

Made a table of the most common/supported BF16 GPUs and their non-sparse TFLOPs. What's the best way to publish this? As a wiki on my blog? A pypi package to import?

English
8
8
105
35.2K
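One way to read the "pypi package to import" idea above is a plain lookup module. A minimal sketch, assuming a hypothetical module layout (this is not Zach's actual package); the two values below are dense (non-sparse) BF16 Tensor Core numbers from NVIDIA's public datasheets, with anything else left for PRs:

```python
# Hypothetical lookup of dense (non-sparse) BF16 TFLOPS per GPU.
# Values are from NVIDIA datasheets; other GPUs would be added via PRs.
BF16_DENSE_TFLOPS = {
    "A100-SXM": 312.0,   # NVIDIA A100 datasheet, BF16 Tensor Core, dense
    "H100-SXM": 989.0,   # NVIDIA H100 datasheet, BF16 Tensor Core, dense
}

def bf16_tflops(gpu: str) -> float:
    """Return dense BF16 TFLOPS for a known GPU; raises KeyError otherwise."""
    return BF16_DENSE_TFLOPS[gpu]

print(bf16_tflops("A100-SXM"))
```

A package like this versions the data and lets downstream benchmark scripts compute MFU without hardcoding specs.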
@lhl@randomfoo.net
@AliTavallaie @rasbt @dontfearai @lmsysorg Not full support. If you want aotriton (FA) you have to manually build, and even then it still doesn’t get through a full attention-gym benchmark run. CK btw only compatible w gfx9 - ROCm on CDNA != ROCm on RDNA (much worse)
English
1
0
1
81
Sebastian Raschka
Sebastian Raschka@rasbt·
Saw that DGX Spark vs Mac Mini M4 Pro benchmark plot making the rounds (looks like it came from @lmsysorg). Thought I’d share a few notes as someone who actually uses a Mac Mini M4 Pro and has been tempted by the DGX Spark.

First of all, I really like the Mac Mini. It’s probably the best desktop I’ve ever owned. For local inference with open-weight LLMs, it works great (the plot above captures that well). I regularly run the gpt-oss-20B model on it. That said, I would not fine-tune even small LLMs on it since it gets very hot. The DGX Spark probably targets that type of sustained workload. (From those who have one, any thoughts on the noise and heat levels?)

The other big thing that DGX Spark gets you is CUDA support. If you use PyTorch, that’s pretty essential since MPS on macOS is still unstable, and fine-tuning often fails to converge. E.g., see github.com/rasbt/LLMs-fro… and github.com/rasbt/LLMs-fro…

I also like the Spark’s form factor (hey, it really appeals to the Mac Mini user in me). But for the same money, I could probably buy about 4000 A100 cloud GPU hours, and I keep debating which would be the better investment. Sure, I could also build/get a multi-GPU desktop. I had a Lambda system with four GTX 1080 Ti cards back in 2018, but it was too loud and hot for my office. And if I have to move it to another room and SSH into it anyway, I might as well use cloud GPUs instead?
Sebastian Raschka tweet media
English
77
113
955
186.4K
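The "~4000 A100 cloud GPU hours for the same money" comparison above is a simple break-even calculation. A quick sketch with assumed prices (roughly $4,000 for the local box, $1.00/hr for a cloud A100; actual prices vary widely by provider):

```python
# Break-even between buying a local box and renting cloud GPUs.
# Both prices below are assumptions for illustration, not quotes.
local_cost_usd = 4000.0       # assumed up-front price of the desktop box
cloud_rate_usd_per_hr = 1.00  # assumed hourly rate for one cloud A100

break_even_hours = local_cost_usd / cloud_rate_usd_per_hr
print(f"local box pays for itself after ~{break_even_hours:.0f} A100-hours")

# At a light usage pattern, that break-even can be years away:
hours_per_week = 10
years = break_even_hours / hours_per_week / 52
print(f"~{years:.1f} years at {hours_per_week} h/week")
```

The utilization assumption dominates: heavy sustained training favors owning hardware, while bursty experimentation favors renting.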
@lhl@randomfoo.net
I don't post much here anymore, but maybe this is worth an exception. I've spent basically all year working on an open model that is incredibly strong in Japanese. For those interested, full details published here: shisa.ai/posts/shisa-v2…
shisa.ai@shisa_ai

We're incredibly proud to release the newest and most powerful member of our open, bilingual (JA/EN) Shisa V2 family: Llama 3.1 Shisa V2 405B The strongest model ever trained in Japan, it points to how even small Japanese AI labs can compete globally! 🤗 huggingface.co/shisa-ai/shisa…

English
0
1
5
672
金のニワトリ
金のニワトリ@gosrum·
@2022_technology Thank you! Since I have the rather lenient gemini-2.0-flash-exp doing the grading, scores tend to come out high across the board. I'd like to move to gemini-2.5-flash, but on the free tier I can only evaluate 1-2 models per day, so I haven't been able to switch yet.
Japanese
1
0
3
253
金のニワトリ
金のニワトリ@gosrum·
I couldn't fit everything about Qwen3's speed and the Shaberi3 benchmark results here, so I wrote it up as an article. Incidentally, evaluating everything except Qwen3-235B-A22B took a full two days 😇 zenn.dev/robustonian/ar…
Japanese
4
31
159
22.4K
@lhl@randomfoo.net
@typedfemale It’s more than that. DYOR, but for laser, T-CAT based TransPRK is almost always better than LASIK. ACD willing, and if you can afford the outpatient procedure with an experienced surgeon, I found that V5 ICL was the best option for risk and outcomes.
English
0
0
3
321
typedfemale
typedfemale@typedfemale·
"what do you think about LASIK?" is a great litmus test for evaluating someone's statistical literacy
English
534
540
28.2K
6.4M
@lhl@randomfoo.net
@nisten For bs=1 llama.cpp does better than vLLM. For anything more you should be using sglang.
@lhl@randomfoo.net tweet media
English
0
3
11
1.9K
nisten🇨🇦e/acc
nisten🇨🇦e/acc@nisten·
deepseek v3 on CPU only: 41 tps input, 12 tps output. gg. For comparison, 8x AMD 192GB MI300X were getting 16.7 tps output, and 8x NVIDIA H200 10 tps lol
nisten🇨🇦e/acc tweet media
English
54
105
1.3K
195.6K
@lhl@randomfoo.net
@realGeorgeHotz @AMD I'm not so sure on the 7900 XTX hardware - need VOPD w/ no stalls to hit peak FP16, L1 cache is shared between 2 WGPs, DMA seems weak (can't hit anywhere near peak MBW even on simple bs=1 inference). High throughput, low latency, high concurrency LLM inference is nontrivial, btw.
English
0
0
1
91
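The VOPD point above can be made concrete with the usual peak-FLOPS arithmetic. A sketch using the 7900 XTX's published shader count and an assumed ~2.5 GHz boost clock; the ×2 dual-issue factor is only realized when the compiler emits VOPD pairs with no stalls, which is exactly the caveat in the post:

```python
# Theoretical peak FP16 for a 7900 XTX (RDNA3), showing why VOPD
# dual-issue is required to reach the headline number.
shaders  = 6144    # stream processors
fma      = 2       # 2 FLOPs per fused multiply-add
vopd     = 2       # dual-issue factor: needs VOPD pairing with no stalls
packed   = 2       # 2x FP16 ops per lane via packed math
clock_hz = 2.5e9   # assumed ~2.5 GHz boost clock

peak_fp16 = shaders * fma * vopd * packed * clock_hz
print(f"peak FP16 with VOPD:  {peak_fp16 / 1e12:.1f} TFLOPS")
print(f"peak FP16 w/o VOPD:   {peak_fp16 / vopd / 1e12:.1f} TFLOPS")
```

Without sustained dual-issue, achievable FP16 throughput halves, which is why kernel-level scheduling matters so much on this part.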
@lhl@randomfoo.net
@sdw @Duderichy Helps being in Tokyo. Anytime I go to Hands or Loft, get assaulted w new choices and need to go do research, lol
English
0
0
1
75
@lhl@randomfoo.net
@Duderichy @sdw I often see people mention the G-1008 but I’m a G-1111 fan (has a slidable catch, much nicer file and design), or if you like the squarer look the G-1305 has a magnetic catch.
English
1
0
5
555
the Rich
the Rich@Duderichy·
@sdw Pro Display XDR is that good? What’s the deal with the nail clipper
English
6
0
16
21.4K
@lhl@randomfoo.net
@nisten @Vultr For single-user speed `-tp 8` vs `-tp 4` should further decrease TPOT. You can also trade off some TTFT for better throughput & TPOT w/ something like `--num-scheduler-steps 8`. The most important thing I found for perf on MI300X was VLLM_USE_TRITON_FLASH_ATTN=0 (use CK FA)
English
0
0
1
93
nisten🇨🇦e/acc
nisten🇨🇦e/acc@nisten·
accelerated the 8x MI300X from @Vultr from 22 to 152 tps (36B active parameters in full bfloat16). If you need consulting on this or just wanna buy a ready-to-go solution, let us know. nisten@github.gg
English
5
1
41
2.9K
@lhl@randomfoo.net
@JFPuget jokes/memes aside, I pretty much stick to mamba/conda these days if I need different CUDA versions, eg: `mamba install -c "nvidia/label/cuda-12.1.1" cuda-toolkit -y` (and set CUDA_PATH/HOME) gets me stood up in a 12.1 env in about 30s.
English
0
0
0
176
Hamel Husain
Hamel Husain@HamelHusain·
If you aren't using shell-sage you are missing out. If you like cursor you will _love_ this too! Trust me. Super light weight at 100 loc. `pip install shell-sage` and you have to run it in tmux See README github.com/AnswerDotAI/sh…
English
8
21
179
12.7K
Simon Willison
Simon Willison@simonw·
Do you ever use the top_p and top_k arguments when working with LLMs? Under what circumstances do you use them? I very, very occasionally tweak the temperature but I've never habitually used those other two options
English
31
15
475
130.3K
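For anyone unsure what the two knobs in Simon's question actually do, here's a minimal pure-Python sketch of top-k and top-p (nucleus) filtering over a token probability distribution. This is illustrative only, not any particular API's implementation; real samplers operate on logits and then sample from the filtered distribution:

```python
def top_k_filter(probs, k):
    """Keep only the k highest-probability tokens, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Nucleus: keep the smallest high-probability set whose mass >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = set(), 0.0
    for i in order:
        keep.add(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, 2))
print(top_p_filter(probs, 0.75))
```

The practical difference: top_k fixes how many tokens survive regardless of how flat the distribution is, while top_p adapts the cutoff to the distribution's shape, which is why top_p is usually the one exposed by default.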
@lhl@randomfoo.net
@giffmana @system76 I might take a look at the Tuxedo AMD laptops (IBP14g9, Pulse14g4) - Radeon 780m should be good enough for eSports/light gaming, battery life should be decent. You can use ryzenadj as well to cap power usage.
English
0
0
1
76
Lucas Beyer (bl16)
Lucas Beyer (bl16)@giffmana·
Back to my next laptop finding journey. Anyone got experience with @system76 laptops? They make linux laptops and apparently tune the OS for battery too. My current dilemma:
- MacBook: nice portability, +Air is fanless. But DOTA will always have lagspikes on Mac due to inability to precompile shaders. Doesn't matter the laptop's power.
- Linux laptop (thinkpad or similar): good DOTA, dev experience I like (arch/i3), but inevitably meh battery because OS not tuned for hardware.
@system76 supposedly might be a way out, by being linux focused and (supposedly) tuning their laptop+OS for battery. I'd miss the ThinkPad nipple though.
Lucas Beyer (bl16) tweet media
English
60
2
101
59.4K
Yishan
Yishan@yishan·
I want to get a CO2 scrubber for my office so that I can lower the CO2 concentration to 280 ppm (“why stop at eating paleo when you can breathe paleo?”) and see if it helps me think better. Any product pointers?
English
232
23
1.6K
223.4K