Sergio

96 posts

@SergioOSINT

Security Research | OSINT | AI Research

Joined August 2025
85 Following · 7 Followers
Konsti Wohlwend @konstiwohlwend
@SergioOSINT Sorry, I'm running critical banking, airline & insurance infrastructure, so upgrading would be irresponsible towards our shareholders (no, seriously, this is just a joke; I do hope no one is still running this in prod, but I also wouldn't be 100% surprised :] )
Sergio @SergioOSINT
@olsenbdnr 99% of these will be residential proxies...
Olsen @olsenbdnr
For those who are trying to scam X Ads, use fraudulent credit cards, etc.: better make sure your opsec is flawless, because we are coming for you. I already got a few IPs, and we are drafting up subpoenas to your ISPs/email providers and more!
Tibo @thsottiaux
@nima_ab Did you know that you've hit your usage limit?
Sergio @SergioOSINT
@nahcrof @uwunetes @0xtiago_ @DBrodniak Yes, I tried GLM 5.1 Precision, Kimi K2.6 Precision, and DeepSeek V4 Pro; sadly, they don't seem to perform as well as the original providers' inference does.
addison @uwunetes
xAI is the most unserious US lab lmao, why would you ever release this? It's a closed-source model worse than open-source models, like why would I use this over DeepSeek or Kimi?
Artificial Analysis@ArtificialAnlys

xAI has launched Grok 4.3, achieving 53 on the Artificial Analysis Intelligence Index with improved agentic performance, ~40% lower input price, and ~60% lower output price than Grok 4.20. The release of Grok 4.3 places @xAI just above Muse Spark and Claude Sonnet 4.6 on the Intelligence Index, and 4 points ahead of the latest version of Grok 4.20. Grok 4.3 improves its Artificial Analysis Intelligence Index score while reducing the cost to run the benchmark suite.
Key Takeaways:
➤ Grok 4.3 improves on cost-per-intelligence relative to Grok 4.20 0309 v2: it scores higher on the Intelligence Index while costing less to run the full benchmark suite. Grok 4.3 costs $395 to run the Artificial Analysis Intelligence Index, around 20% lower than Grok 4.20 0309 v2, despite using more output tokens. This makes it one of the lower-cost models at its intelligence level.
➤ Large increase in real-world agentic task performance: the largest single benchmark improvement is on GDPval-AA, where Grok 4.3 scores an Elo of 1500, up 321 points from Grok 4.20 0309 v2's score of 1179, surpassing Gemini 3.1 Pro Preview, Muse Spark, GPT-5.4 mini (xhigh), and Kimi K2.5. Grok 4.3 narrows the gap to the leading model on GDPval-AA, but still trails GPT-5.5 (xhigh) by 276 Elo points, with an expected win rate of ~17% against GPT-5.5 (xhigh) under the standard Elo formula.
➤ Grok 4.3 performs strongly on instruction following and agentic customer support tasks. It gains 5 points on 𝜏²-Bench Telecom to reach 98%, in line with GLM-5.1. Grok 4.3 maintains an 81% IFBench score from Grok 4.20 0309 v2.
➤ Grok 4.3 gains 8 points on AA-Omniscience Accuracy, but at the cost of an AA-Omniscience Non-Hallucination Rate 8 points lower, so Grok 4.20 0309 v2 still leads on AA-Omniscience Non-Hallucination Rate, followed by MiMo-V2.5-Pro, in line with Grok 4.3.
Congratulations to @xAI and @elonmusk on the impressive release!
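A quick check on the "~17% expected win rate" figure: it follows directly from the standard Elo expectation formula applied to the 276-point GDPval-AA gap. A minimal sketch, using the ratings quoted in the post above:

```python
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    """Expected win rate of player A against player B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Grok 4.3 at 1500 Elo vs. GPT-5.5 (xhigh) 276 points ahead at 1776:
print(round(elo_win_prob(1500, 1776), 2))  # ~0.17, matching the quoted win rate
```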

Sergio @SergioOSINT
@uwunetes Well, both MiMo V2.5 and Kimi K2.6 are 1T+ parameter models; Grok 4.3 is a 0.5T model. Grok 4.3 is closed source, and it still costs less than both of them (~2.5x cheaper than Kimi K2.6). Not to mention how heavily distilled those models are.
Sergio @SergioOSINT
@stripe @sama xxhigh??? they have a secret reasoning mode the public can't access? loll
Sergio @SergioOSINT
@bridgemindai I respect your work and everything, but if so, then please don't include Security as one of its statistics. GPT 5.5 is near Mythos level on security research and general security, and Sonnet 4.6 is not near Mythos level (which should be obvious).
BridgeMind @bridgemindai
@SergioOSINT Fair. BridgeBench is code-analysis fabrication, not a security-research benchmark.
BridgeMind @bridgemindai
Grok 4.3 just took #1 on BridgeBench. 500B parameters. 90.3 Vibe score. 302 tok/s. Lowest hallucination rate in the field. Fast enough for real vibe coding, not just leaderboard screenshots. The AI race is shifting. If Grok keeps compounding at this pace, xAI is not just competing. They’re becoming the favorite to win.
Sergio @SergioOSINT
@Angaisb_ Just because GPT 5.5 matches on what GPT 5.5 specializes in, against what Mythos does NOT specialize in, does not mean that it's not dangerous. Not to mention that GPT 5.5 can very much also be dangerously good at cyber. Even Opus 4.5 is dangerous.
Sergio @SergioOSINT
@edugarmer @XFreeze We're going to see a bunch of iterations first, as Elon expects the release of Grok 5 (I think it was) to be AGI.
Eduardo C. Garrido-Merchán
@XFreeze Amazing. Grok 5 might really surpass Anthropic. We may have a surprise by the end of the year. Things may change soon.
X Freeze @XFreeze
Grok 4.3 is sitting in the top 7 with literally just 500B parameters. The lowest size by far. Meanwhile, every other model competing at this level is between 1T and 6T parameters. It's not just small. It's also the most intelligent, fastest, and lowest-hallucination model in its class... all while being one of the cheapest to run. xAI built the most efficient frontier model on the planet.
Artificial Analysis @ArtificialAnlys (quoted post, same as above)
Sergio @SergioOSINT
@haider1 I mean, this doesn't really mean it's marketing, just that OpenAI is a bit more careless. Besides that, Mythos was supposed to be good at code, not general offensive cyber, so not Active Directory and similar attacks. This is way outside the field Mythos was made for.
Haider. @haider1
Seems like the "Mythos" panic was mostly Anthropic marketing. AISI found GPT-5.5 performs nearly on par with, or better than, Mythos in several cases: completing TLO end-to-end in 2/10 attempts, while Mythos preview did it in 3/10. On expert-level tasks, GPT-5.5 scored 71.4%; Mythos scored 68.6%.
Sergio @SergioOSINT
@scaling01 I mean, if we're going to be fair, Grok 4.3 is punching well above its weight: MiMo-V2.5 Pro and Kimi K2.6 are both 1T+ models, and heavily distilled from other models as well. Grok 4.3 is a 0.5T model.
Lisan al Gaib @scaling01
Grok-4.3 still behind Chinese open-source
Sergio @SergioOSINT
@LexnLin I mean, to be fair, Mythos was never meant for this kind of security work. It was not made for Active Directory or the like.
Sergio @SergioOSINT
@levzzz5154 @MTSlive Mostly; when you say distilling, it's taking full trajectories, including thoughts.
levzzz @levzzz5154
@MTSlive It's kind of hard not to distill at all: GitHub etc. is all polluted with various LLM outputs, as well as multi-model agents if data sharing is enabled.
MTS @MTSlive
LIVE TRIAL UPDATE: OpenAI's counsel asked Musk whether xAI has ever "distilled" technology from OpenAI. Musk: "Generally AI companies distill other AI companies." "Is that a yes?" Savitt asked. Musk: "Partly."
Sergio @SergioOSINT
@banteg This is not correct; Elon has said MULTIPLE times that even the newest and most capable Grok 4 (Grok 4.3) is a 0.5T model.
banteg @banteg
I've never seen someone hedge so much (9x). I think the ranking is more interesting than the "predicted" size.
Sergio @SergioOSINT
@deedydas I'm pretty sure that all Grok models are currently below or around 0.5T
Deedy @deedydas
Researchers just estimated the size of all the LLMs by asking them knowledge questions of varying degrees of obscurity!
– GPT 5.5: ~10T params
– Claude Opus 4.x: ~4-5T
– Grok 4: ~3T
The idea here is that factual capacity scales log-linearly with size. The paper shows 7 knowledge tiers, and T7 is essentially ~0% for all models, suggesting there is still significant headroom for pretraining. Gemini 3.1 Pro is likely >10T, given it's used as an anchor but has no direct estimate. This means we can infer, to some degree, what different models might cost, and their post-training effectiveness (performance at certain non-factual tasks given their size). One of the coolest papers I've read of late.
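The log-linear scaling idea behind these estimates can be sketched numerically. This is a hypothetical illustration only: the anchor parameter counts and accuracies below are assumed for the example, not taken from the paper.

```python
import math

def fit_log_linear(params1, acc1, params2, acc2):
    """Fit accuracy = slope * log10(params) + intercept through two anchor models."""
    slope = (acc2 - acc1) / (math.log10(params2) - math.log10(params1))
    intercept = acc1 - slope * math.log10(params1)
    return slope, intercept

def estimate_params(accuracy, slope, intercept):
    """Invert the fit to estimate a model's parameter count from its factual accuracy."""
    return 10 ** ((accuracy - intercept) / slope)

# Assumed anchors: a 1T model scoring 0.50 and a 10T model scoring 0.70
# on the same obscurity-tiered question set.
slope, intercept = fit_log_linear(1e12, 0.50, 1e13, 0.70)

# A model scoring 0.60 lands at the geometric midpoint, ~3.2T params.
print(f"{estimate_params(0.60, slope, intercept):.2e}")
```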
Sergio @SergioOSINT
@PathOfMen_ Public humiliation is always everyone's worst fear
Path of Men @PathOfMen_
"I'm too scared to talk to this girl" What your ancestors did on a random Wednesday: