0xGerbot
@gerbot_
4.2K posts
I like building stuff, securing stuff, and hacking stuff* *Stuff = apps
Durban, South Africa · Joined October 2020
1K Following · 598 Followers
0xGerbot retweeted
DiscussingFilm @DiscussingFilm
50 years ago today, this first-ever shot for ‘STAR WARS’ was captured.
DiscussingFilm tweet media
152 replies · 3.7K reposts · 54.7K likes · 796.8K views
0xGerbot @gerbot_
Thanks @Afrihost for nothing ✌️ always "it's not us, it's @vumatel". Why do we even have these ISPs if they don't own any of the fibre infra? Why can't we just deal with Vuma ourselves and stop these "middleman" businesses...
1 reply · 0 reposts · 0 likes · 92 views
0xGerbot retweeted
★ @sivvlp
This angle when the players celebrate and the flag appearing ..
P @perwilo

17 replies · 720 reposts · 13.2K likes · 213.2K views
0xGerbot retweeted
Minga @KillaMinga
this mf spilling everything 😭
20 replies · 87 reposts · 1.3K likes · 508.4K views
0xGerbot retweeted
Shitpost 2049 @shitpost_2049
ZXX
14 replies · 908 reposts · 9.9K likes · 317.7K views
0xGerbot retweeted
talkSPORT @talkSPORT
👋 "Slapped by PSG. Slapped by Newcastle. Slapping week!" Jamie O'Hara couldn't wait to get his revenge on Jason Cundy after Chelsea's nightmare week 🤣
76 replies · 198 reposts · 3K likes · 270.9K views
0xGerbot retweeted
Dhruv @haildhruv
found a website where you can create, program and test electronic hardware. it already has some featured projects. really great if you want to test before building your own hardware
198 replies · 2.3K reposts · 21.1K likes · 1.1M views
0xGerbot retweeted
Avi Chawla @_avichawla
Big release from Kimi! They just released a new way to handle residual connections in Transformers.

In a standard Transformer, every sub-layer (attention or MLP) computes an output and adds it back to the input via a residual connection. If you consider this across 40+ layers, the hidden state at any layer is just the equal-weighted sum of all previous layer outputs. Every layer contributes with weight=1, so every layer gets equal importance.

This creates a problem called PreNorm dilution, where as the hidden state accumulates layer after layer, its magnitude grows linearly with depth. And any new layer's contribution gets progressively buried in the already-massive residual. This means deeper layers are then forced to produce increasingly large outputs just to have any influence, which destabilizes training.

Here's what the Kimi team observed and did: RNNs compress all prior token information into a single state across time, leading to problems with handling long-range dependencies. And residual connections compress all prior layer information into a single state across depth. Transformers solved the first problem by replacing recurrence with attention. This was applied along the sequence dimension. Now they introduced Attention Residuals, which applies a similar idea to depth.

Instead of adding all previous layer outputs with a fixed weight of 1, each layer now uses softmax attention to selectively decide how much weight each previous layer's output should receive. So each layer gets a single learned query vector, and it attends over all previous layer outputs to compute a weighted combination. The weights are input-dependent, so different tokens can retrieve different layer representations based on what's actually useful. This is Full Attention Residuals (shown in the second diagram below).

But here's the practical problem with this idea. Full AttnRes requires keeping all layer outputs in memory and communicating them across pipeline stages during distributed training. To solve this, they introduce Block Attention Residuals (shown in the third diagram below). The idea is to group consecutive layers into roughly 8 blocks. Within each block, layer outputs are summed via standard residuals. But across blocks, the attention mechanism selectively combines block-level representations. This drops memory from O(Ld) to O(Nd), where N is the number of blocks.

Layers within the current block can also attend to the partial sum of what's been computed so far inside that block, so local information flow isn't lost. And the raw token embedding is always available as a separate source, which means any layer in the network can selectively reach back to the original input.

Results from the paper:
- Block AttnRes matches the loss of a baseline LLM trained with 1.25x more compute.
- Inference latency overhead is less than 2%, making it a practical drop-in replacement.
- On a 48B parameter Kimi Linear model (3B activated) trained on 1.4T tokens, it improved every benchmark they tested: GPQA-Diamond +7.5, Math +3.6, HumanEval +3.1, MMLU +1.1.

The residual connection has mostly been unchanged since ResNet in 2015. This might be the first modification that's both theoretically motivated and practically deployable at scale with negligible overhead.

More details in the post below by Kimi👇
____
Find me → @_avichawla
Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.
Avi Chawla tweet media
Kimi.ai @Kimi_Moonshot

Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation.

Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.

🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.

🔗 Full report: github.com/MoonshotAI/Att…

78 replies · 220 reposts · 2.3K likes · 346.2K views
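To make the depth-wise attention concrete, here is a minimal sketch of the full attention-residual idea described in the post above, under stated assumptions: this is not Kimi's code, the AttentionResidualLayer class and the Linear stand-in for the attention/MLP sub-layer are invented for illustration, and only the learned per-layer query with a softmax over previous layer outputs follows the description.

```python
# Minimal, assumption-laden sketch of depth-wise "attention residuals".
# Class name, the Linear stand-in sub-layer, and tensor shapes are made up;
# only the softmax over previous layer outputs follows the post above.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionResidualLayer(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # Stand-in for the real attention/MLP sub-layer.
        self.sublayer = nn.Linear(d_model, d_model)
        # One learned query vector per layer, used to score previous layers.
        self.layer_query = nn.Parameter(torch.randn(d_model) * d_model**-0.5)

    def forward(self, history: list) -> torch.Tensor:
        # history: outputs of all previous layers (plus the token embedding),
        # each of shape [batch, seq, d_model].
        stacked = torch.stack(history, dim=0)              # [L, B, S, D]
        # Score every previous layer's output against this layer's query.
        # The keys are per-token layer outputs, so the resulting weights
        # differ from token to token (input-dependent).
        scores = torch.einsum("lbsd,d->lbs", stacked, self.layer_query)
        weights = F.softmax(scores, dim=0)                 # softmax over depth
        # Weighted combination replaces the usual equal-weight running sum.
        residual = torch.einsum("lbs,lbsd->bsd", weights, stacked)
        return residual + self.sublayer(self.norm(residual))


# Usage sketch: keep the depth-wise history and let each layer attend to it.
x = torch.randn(2, 16, 64)                                 # [batch, seq, d_model]
layers = nn.ModuleList(AttentionResidualLayer(64) for _ in range(4))
history = [x]
for layer in layers:
    history.append(layer(history))
```

Block AttnRes, as described in the post, would replace the per-layer history here with roughly 8 block-level sums and attend over those instead, which is what brings the memory for the stacked history down from O(Ld) to O(Nd).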
0xGerbot retweeted
Aakash Gupta @aakashgupta
50% of all relationship advice on Reddit is “leave.” 15 years of data, 52 million comments, and the trend line only goes one direction.

A researcher filtered r/relationship_advice down to 1,166,592 quality comments and tracked what people actually recommend.

In 2010, “End Relationship” sat around 30%. By 2025, it’s approaching 50%. “Communicate” dropped from 22% to 14%. “Compromise” collapsed from 7% to 3%. “Give Space” fell from 25% to 13%. Every category that requires patience lost ground every single year.

The one category growing faster than “leave” is “Seek Therapy,” which went from 1% to 6%. The subreddit is slowly learning to say “this is above my pay grade.”

Train a model on this dataset and it would absolutely tell people to break up. The training data is 50% “leave” and climbing. The model wouldn’t be broken. It would be accurately reflecting what 52 million commenters actually believe about your relationship. A 50% prior that you should leave, a 14% prior that you should talk about it, and a 6% prior that you need a professional.

That’s not LLM psychosis. That’s the median human opinion on your relationship, backed by the largest advice dataset ever assembled.
Aakash Gupta tweet media
“paula” @paularambles

LLM that keeps telling people to break up because it’s been trained on relationship advice subreddits

507 replies · 2.1K reposts · 16.7K likes · 2.1M views