
Ivan's Cat

@xriskology It is a combination of name-calling, ad hominem, and in-group signaling of disdain/contempt. It’s communicating that these people are beneath our concern and that critiques like this are taboo and deserving of ridicule. It doesn’t engage with the substance of the article.



München and Hamburg will be the big winners of Berlin's current political decline. München will become the new start-up metropolis. Otherwise: Paris and London.



This paper tests an LLM on real NHS (National Health Service) medication reviews and finds it spots risks but misses safe fixes. The authors ran a medication safety reviewer on structured UK NHS primary care records, mostly coded fields without typed notes, then had an expert clinician grade 277 sampled patients. The AI did not miss any case where an intervention (a clear action, like starting or stopping a drug) was needed, but it produced a fully correct review in only 46.9% of patients. When it failed, the main issue was context: for example, acting confident despite missing details, applying guidelines (the usual best-practice rules) without considering patient goals, or mixing up drug facts. The paper argues this gap between spotting risk and choosing the right next step is why LLMs still need human checking in real clinics.
----
Paper Link – arxiv.org/abs/2512.21127
Paper Title: "A Real-World Evaluation of LLM Medication Safety Reviews in NHS Primary Care"







LLM as a judge has become a dominant way to evaluate how good a model is at solving a task, since it works without a test set and handles cases where answers are not unique. But despite how widely this is used, almost all reported results are highly biased. Excited to share our preprint on how to properly use LLM as a judge. 🧵

===

So how do people actually use LLM as a judge? Most people just use the LLM as an evaluator and report the empirical probability that the LLM says the answer looks correct. When the LLM is perfect, this works fine and gives an unbiased estimator. If the LLM is not perfect, this breaks.

Consider a case where the LLM evaluates correctly 80 percent of the time. More specifically, if the answer is correct, the LLM says "this looks correct" with 80 percent probability, and if the answer is incorrect, it says "this looks incorrect" with 80 percent probability. In this situation, you should not report the empirical probability, because it is biased.

Why? Let the true probability of the tested model being correct be p. Then the empirical probability that the LLM says "correct" (call it q) is

q = 0.8p + 0.2(1 - p) = 0.2 + 0.6p

so the unbiased estimate should be

p = (q - 0.2) / 0.6

Things get even more interesting if the error pattern is asymmetric or if you do not know these error rates a priori.

===

So what does this mean? First, follow the suggested guideline in our preprint. There is no free lunch: you cannot evaluate how good your model is unless your LLM as a judge is known to be perfect at judging it. Depending on how close it is to a perfect evaluator, you need a sufficiently large test set (a calibration set) to estimate the evaluator's error rates, and then you must correct for them.

Second, very unfortunately, many findings we have seen in papers over the past few years need to be revisited. Unless two papers used the exact same LLM as a judge, comparing results across them could have produced false claims; the improvement could simply come from changing the evaluation pipeline slightly. A rigorous meta-study is urgently needed.

===

tldr: (1) Almost all LLM-as-a-judge evaluations in the past few years were reported with a biased estimator. (2) It is easy to fix, so wait for our full preprint. (3) Many LLM-as-a-judge results should be taken with a grain of salt.

Full preprint coming in a few days, so stay tuned! Amazing work by my students and collaborators. @chungpa_lee @tomzeng200 @jongwonjeong123 and @jysohn1108
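To make the correction concrete, here is a minimal Python sketch of the debiasing step described in the thread, assuming the judge's error rates (its sensitivity and specificity) have already been estimated on a small labeled calibration set. The function name corrected_accuracy and its arguments are illustrative, not taken from the preprint.

```python
def corrected_accuracy(judge_said_correct, sensitivity, specificity):
    """Debias the raw 'judge says correct' rate.

    judge_said_correct: list of 0/1 judge verdicts on the test set
    sensitivity: P(judge says "correct" | answer is actually correct)
    specificity: P(judge says "incorrect" | answer is actually incorrect)
    """
    # Raw agreement rate q = fraction of answers the judge calls correct
    q = sum(judge_said_correct) / len(judge_said_correct)
    # q = sensitivity * p + (1 - specificity) * (1 - p)  =>  solve for p
    p = (q - (1 - specificity)) / (sensitivity - (1 - specificity))
    # Clip to [0, 1]: sampling noise can push the estimate outside the range
    return max(0.0, min(1.0, p))

# Example from the thread: a symmetric 80%-accurate judge that labels 68% of
# answers "correct" implies true accuracy (0.68 - 0.2) / 0.6, i.e. ~0.8.
print(corrected_accuracy([1] * 68 + [0] * 32, sensitivity=0.8, specificity=0.8))
```

Note that the denominator (sensitivity + specificity - 1) shrinks as the judge approaches a coin flip, which is why a noisier judge needs a larger calibration set before the corrected estimate is usable.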























