404prophet
265 posts


Hey @grok, does Israel have an AI system named “The Gospel” that an Israeli officer admitted mostly kills women and children?
Be concise. No spin.
English

@Luna0wl @cr3ghost @jonasLyk @WeirdQuadratic @ChaoticEclipse0 @5mukx Feels a bit like the internet of old to start keeping bookmarks again to find information. We are going backwards but it's nostalgic.
English

This sets a bad precedent. First the MSRC situation with @jonasLyk @WeirdQuadratic @ChaoticEclipse0, then GitHub on @5mukx, now X banning @5mukx. If we do not stand up now, more of us will be next. Open-source projects, security posts, bug reports. Where does it stop? Repost this. The community needs a voice. #StopTheBan
𝕡𝕨𝕟𝕚𝕖@0day_ninja
This is my second mutual that has been suspended this week. What's up with X really
English
404prophet đã retweet

Why Would GLM-5.2 Move Away From GRPO?
🌟Insights from Zhihu contributor 九老师
TL;DR: GLM-5.2 dropping GRPO does not mean GRPO is “bad.” It means the assumptions that made GRPO attractive for short LLM RL tasks may no longer hold for long-horizon agentic tasks. When rollouts get longer, environments get noisier, and credit assignment gets harder, PPO + value modeling starts looking useful again.
The key question is not simply “why did GLM-5.2 stop using GRPO?” A better question is: why did GRPO become useful for LLM RL in the first place?
If the reasons that made GRPO attractive no longer hold, then going back to PPO becomes natural.
GRPO can be understood as a sampled-baseline method. Instead of training a separate value model, it samples multiple responses for the same prompt and uses the group average as a baseline.
That is elegant. You get a relative reward signal without paying for a separate critic. In short tasks, this is very appealing.
But there is a tradeoff.⚖️
PPO uses a learned value function, or critic. This critic is expensive and harder to tune. It also has its own problems: the policy keeps changing, so the value model is always trying to follow a moving target. That can introduce bias.
GRPO avoids that by using an up-to-date sampled baseline. It is closer to low-bias, but it tends to have higher variance.
For early LLM RL tasks, that tradeoff made sense:
• Rollouts were short
• Final rewards were clear
• Memory savings mattered a lot
• Multiple samples per prompt were manageable
• Math/code tasks were relatively easy to verify
That is why GRPO worked so well for many short, verifiable reasoning tasks.
But long-horizon agentic tasks change the game. 🎮
A long agent task can look much more like a game environment:
• Many steps
• Tool calls
• Partial progress
• Delayed failure
• Noisy observations
• Intermediate rewards
• Wrong action penalties
• Context compression
• Different paths to the same final answer
This is where GRPO starts to struggle.
The biggest issue is credit assignment. In GRPO, the final reward is applied broadly across the whole trajectory. If a task succeeds, many tokens get rewarded. If it fails, many tokens get punished.
But in a long task, that is too coarse.
Maybe the first half was bad, but the final recovery was good. Maybe one tool call at step 30 caused failure at step 100. Maybe two successful trajectories are not really comparable because one used 4K tokens and another used 200K tokens with heavy tool use and context compression.
GRPO sees the final outcome. It does not naturally know which step actually mattered.
That creates high variance.
In short tasks, group comparison works well. In long tasks, group sampling can collapse into two bad cases:
1. All samples fail
The whole expensive rollout gives almost no useful training signal.
2. Only one sample succeeds
That single success may be luck, but GRPO may treat it as a strong positive signal and over-reward the trajectory.
Both are dangerous for long agentic training.
This is where PPO’s critic becomes valuable again. A value model can learn expected value under noisy states. It can provide denser feedback before the full rollout ends. It is more expensive, but it helps with long-horizon credit assignment.
So the author’s view is: GRPO is not being rejected because it was wrong. It is being outgrown by the task format.
For short, deterministic, verifiable tasks, GRPO remains strong.
For long, noisy, tool-heavy agentic tasks, PPO-style value modeling may simply be the better fit.
The “compaction problem” mentioned around long contexts is likely more of a symptom. The deeper issue is that GRPO’s weaknesses become costly when trajectories are long and states keep changing.
Could GRPO still work? Yes, if paired with a strong Process Reward Model. The author points out that DeepSeek MathV2 uses this direction. Process-level signals can help fix GRPO’s sparse-reward weakness.
But without that, returning to PPO makes sense.
🎯The bigger takeaway:
GRPO saved the value model. PPO brings it back.
GRPO’s main advantage was efficiency. It removed the critic and saved resources. But for long-horizon agentic tasks, the critic’s ability to generalize and assign credit may be worth the cost again.
In the Agent era, RL for LLMs is becoming less like solving a short math problem and more like training an agent to play a long, noisy game.
And for that world, value models may still be the soul of RL.
🔗Full Reading (CN):
zhihu.com/question/20521…


English

Above was opus 4.8. Meanwhile 5.5 just couldn't make a useable STL for the life of it but it was knocking it out of the park with image ideas in later prompts that I started using as input into the claude thread that was working with the stl side. The initial prompt I gave to both of them was the same and both had the project context. Was surprised 5.5 didn't also one shot it. Maybe could have got it right but it was so bad out of the gate I stopped. Both of the below images are 5.5. It's like codex is good at creating visual things if they don't exist but it can't read visual information that otherwise already exists (or that it created) / still sucks at UI.


English

What is your latest "one shot" that was almost an afterthought and surprised you? Doesn't have to be the most complex or largest just the last time you were caught off guard on a throwaway idea or shot for the stars.
I just broke some hangers I have and was already short so figured I could just 3D print some cool replacements. Lots of free stl's but I wanted something cool. I have an existing project that uses cadquery for making juggling clubs so gave the prompt a starting place. The prompt was just
"this is kinda one off request so you can use existing codebae but dont need to think much more than that about it, its just a refernece right now. Can you build me a sturdy psychedelic looking clothes hanger stl that i can print in a bambu h2d? Give me 3 different designs and think them through"
&
"can you put together a one off webpage where i can view each of them in one nice organized place?"
(keeping the misspelled words because goes to show the vectors don't care and I added more later which is why there are more than 3 in the pic but it basically got it off the rip)

English

I wanted to get into security as a kid instead of development but since I couldn't find 0 days myself I felt like I wouldn't be able to hang or would feel like a fraud. I have learned security is so much more over the years and I would have done fine and really enjoyed infosec. If I have to look for work again it will be in that industry. The fact I'm over 30 and haven't made it to defcon yet is a crime.
English
404prophet đã retweet

PixelSmash – Critical FFmpeg Vulnerability Turns Media Files into Weapons
A critical vulnerability in FFmpeg's MagicYUV decoder leads to remote code execution via a crafted media file
jfrog.com/blog/pixelsmas…


English

@shotgunner101 @cr3ghost @jonasLyk @WeirdQuadratic @ChaoticEclipse0 @5mukx They had enough pull to get nightmares banned on gitlab. Elon would do anything for a quick buck or to feel like hes one of the cool kids. Would totally placate the right person.
English

@cr3ghost @jonasLyk @WeirdQuadratic @ChaoticEclipse0 @5mukx Wouldt doubt if MS did it. They already threatened researchers who publish research. And right before this his github got banned and he called them out on twitter.
English

@AbhiCodes15 Building something while also thinking about building other things because I have more ideas than time and I guess you gotta sleep.
English
404prophet đã retweet

After everything, I’m convinced Anthropic, OpenAI, and Google are all playing us.
Look at the timing. While Fable 5 was available, we saw a flood of rumors and shilling around GPT-5.6. The moment Fable 5 got pulled, OpenAI went dead silent on it.
Then Fable 5 supposedly comes back — and GPT-5.6 info resurfaces not long after.
And today? Fable 5 still isn’t back, OpenAI is reportedly delaying the GPT-5.6 launch, and Google DeepMind is suddenly “not satisfied” with Gemini 3.5 Pro.
You see the pattern yet?
One more thing worth noting: a handful of companies still have access to Mythos.
I think there’s an alliance between them.
English

@henokcrypto @AnthropicAI @DarioAmodei They are having private meetings with Epstine's old neighbor and visitor to "the island" Mr Howard nutlicker himself.
English

Fable 5 is nowhere to be found
And a public statement is nowhere to be found
Where are the Leaders at @AnthropicAI
Where are the people with courage and a spine
Where’s the fire in your belly @DarioAmodei
Need I shame you into being a leader?
SAY SOMETHING
English

@vxunderground Don't worry the NSA is too busy hacking themselves with toys to bother with this as a national threat even though Elon takes in more gov money than some entire states GDP.
English

@renegadegenesis That's a lot of trust. Someone asked if it re-reads the doc and you said "wym?". You should probably test this out a little more im going to guess.
English

@mcnabbd It's math and silicon dude. It experiences the same as that rock you just stepped on.
English

@Bober_smart Doubt it. Kinda already a solved problem and can be had for cheaper anyways. How stupid do you think the internet is?
English

A 19-year-old student from China, Zhang Wei, developed an AI radar and sold it to Hong Kong for $550,000
He created it using Claude, spending just $20 and a month on development
He walked into the Hong Kong administration office with a flash drive and asked for just 5 minutes of their time. 30 minutes later, he walked out with a check for $550,000
The code, connected to a camera, detects speed in real time. If the speed exceeds the limit, Claude takes a video clip and identifies the owner by the car's license plate. The video and the fine are then automatically sent to the owner's email address
Unlike a conventional radar that only takes a photo and doesn't always work, this AI radar eliminates disputes because it captures video and makes the process fully autonomous by sending out the fines on its own
The article includes the ready-to-use configurations.
Bober_smart@Bober_smart
English

@SiriusBShaman @SophiaCai99 What do you mean with howard lutnick nextdoor?
English

🚨The first legal challenge to Trump’s export controls is out. The legal tech company Legion is suing the Trump admin, arguing:
- The gov lacked legal authority to force Anthropic to shut off Fable 5 and Mythos 5 because export-control laws do not cover access to a hosted AI model or its text outputs
- If Trump’s move relied on IEEPA emergency powers, it violated statutory limits including protections for informational materials. And a national-emergency was not declared
- The directive was arbitrary, overbroad, and potentially retaliatory toward Anthropic
- And it contradicts Trump’s own June 2 EO that explicitly rejected mandatory government licensing or preclearance of frontier AI models
fingfx.thomsonreuters.com/gfx/legaldocs/…

English












