Neil Stoker ✨ reposted
Neil Stoker ✨
5.6K posts

Neil Stoker ✨
@nmstoker
Photographer 📷 Developer 👨💻 Pythonista 🐍 Data Wrangler 🤠 ML Enthusiast 🤖🧠🎉 Countryside Lover 🍂🍄🌳💚
London · Joined November 2014
545 Following · 227 Followers

Just been assaulted at the train station. Punched in the head and the face to the ground. @TfL staff did literally nothing, even when the assailant had left. Thank you to the fellow passengers who came to my aid. Sadiq Khan has lost control of the city. This is lawless London.
Neil Stoker ✨ reposted

SHE SAVED MILLIONS IN TAX. HE SET THE TAX RULES. HMRC SAW NO PROBLEM.
Rishi Sunak (@RishiSunak) was running the nation's finances. Raising taxes on working people. Telling the country there was no alternative.
His wife, Akshata Murty, was quietly using non-domiciled status to avoid paying UK tax on her overseas earnings, including roughly £11.6 million a year in dividends from her father's company, Infosys.
The estimated saving: around £2.1 million per year. Over several years, sources told @Independent that figure could have reached £20 million.
Non-dom status is legal. But when the man setting tax policy for 67 million people has a wife saving millions under that same policy, most organisations would want that conflict documented and scrutinised.
There is no evidence HMRC (@HMRCgovuk) treated it with any urgency.
Then someone inside Whitehall decided the public had a right to know. A source passed details to @Independent in April 2022, right in the middle of Partygate. The story blew up.
Sunak was forced to ask for a ministerial interests review. Murty announced she would voluntarily start paying UK tax on worldwide income.
What happened to the whistleblower?
A leak inquiry was launched. @Channel4 noted it could lead to criminal prosecution, because disclosing someone's personal tax information is illegal in the UK.
The source was never publicly identified. No prosecution ever came.
So the person who told the truth about a potential conflict of interest at the heart of the Treasury faced a criminal investigation.
The conflict of interest itself got a press release and a polite apology.
Source: @Independent, @guardian, @BBCNews, @thetimes

The Googlebook is real. This is essentially Google's new laptop running 'Aluminum OS', though they are not explicitly calling it that.
This is just a tease for right now, the first hardware will start shipping this Fall, possibly around the time of the Pixel 11 launch.
Google is working with a number of PC makers including Acer, ASUS, Dell, HP, and Lenovo. Which is why it's not a PixelBook.
Neil Stoker ✨ reposted

@sama More sophisticated modelling/control of when to ask for my guidance/preference/give me an early summary vs when to press on (a la "I'm feeling lucky")
Neil Stoker ✨ reposted

A tricky LLM interview question:
Your RAG system scores 90% retrieval accuracy on 5k company docs.
But scaling to 500k docs drops the accuracy to just 50%, with the same embedding model and retriever.
Why did this happen?
The simplest answer is that more documents mean more competition for the top-k retrieval slots. That is true, but it doesn't explain why accuracy drops this dramatically.
The answer comes down to how enterprise docs are distributed in the embedding space.
Today, a single product decision in a company generates meeting transcripts, Slack threads, Confluence docs, Jira tickets, and email threads.
They are related to the same event, so they all land in a similar region of the embedding space.
As the company operates over months, this pattern repeats for every project/customer/roadmap, and the embedding space fills up with clusters of closely related documents.
But not all related docs contain the same facts.
→ Slack thread covers the decision made
→ Jira has the implementation deadline
→ Confluence has the technical spec
→ Email thread has the customer request
When a query is about a specific fact (like a deadline), the answer lives in one of those docs.
At a 5K corpus size, there might be 3-5 docs touching that topic, and the correct one easily lands in the top-k results.
But at a 500K corpus size, there could be 40-60 docs touching that topic, and the one containing the actual answer can easily get pushed out of the top-k by other topically relevant docs, degrading retrieval.
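To make the crowding effect concrete, here is a toy calculation rather than anything taken from the paper. It assumes the retriever cannot tell the answer doc apart from its topical neighbors, so the answer's rank within the cluster is uniformly random. The names `hit_probability` and `simulate`, and the choice of k = 10, are hypothetical and only for illustration.

```python
import random


def hit_probability(cluster_size: int, k: int) -> float:
    """Chance that the single answer doc lands in the top-k, ASSUMING the
    retriever ranks all docs in the topical cluster in random order."""
    return min(1.0, k / cluster_size)


def simulate(cluster_size: int, k: int, trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo check of the same assumption: draw a random 0-based rank
    for the answer doc and count how often it falls inside the top-k."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        rank = rng.randrange(cluster_size)
        if rank < k:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    k = 10  # hypothetical top-k cutoff
    for size in (5, 50):
        print(size, hit_probability(size, k), simulate(size, k))
```

Under this simplified assumption, a 5-doc cluster always fits inside a top-10, while a 50-doc cluster leaves the answer doc only a 1-in-5 chance of being retrieved. A real retriever does better than random within a cluster, so treat this as a pessimistic bound on the effect, not a prediction.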
A recent research paper from Onyx documented this.
The researchers used their newly open-sourced EnterpriseRAG-Bench dataset.
It has 500k+ synthetic enterprise documents spread across Slack, Gmail, Jira, GitHub, Confluence, Google Drive, HubSpot, Fireflies, and Linear, with realistic noise like misfiled documents, near-duplicates, and conflicting versions.
They ran the same retrievers at five corpus sizes from 5K to 500K.
→ Vector search accuracy dropped from 90.7% at 5K documents to 50.6% at 500K docs.
→ BM25 degraded more gracefully, from 85.8% to 68.4%.
→ At every scale, higher neighborhood density in the embedding space monotonically correlated with lower recall.
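One way to put a number on neighborhood density is to count how many corpus embeddings sit close to the query. This is a sketch under assumptions: the paper's exact definition may differ, and `neighborhood_density` and the 0.8 cosine threshold are illustrative choices, not values from the source.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def neighborhood_density(
    query: Sequence[float],
    corpus: Sequence[Sequence[float]],
    threshold: float = 0.8,  # assumed cutoff, tune for your embedding model
) -> int:
    """Number of corpus vectors whose cosine similarity to the query
    is at or above the threshold."""
    return sum(1 for doc in corpus if cosine(query, doc) >= threshold)


if __name__ == "__main__":
    toy_corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
    print(neighborhood_density([1.0, 0.0], toy_corpus))
```

A brute-force scan like this is fine for spot checks on a sample of queries; at 500K docs you would more likely read the count off your vector index's range or top-k query.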
The practical implication here is that retrieval accuracy on a 5k test set tells you almost nothing about production-scale performance.
Always test at a realistic volume, and measure the neighborhood density in your embedding space to estimate how much headroom the retriever actually has.
The entire EnterpriseRAG-Bench dataset (500K docs with questions, and the whole evaluation harness) is open-source.
Run your retriever against it at 5K, then at 500K, and see where your own accuracy curve breaks.
I have shared the GitHub repo in the replies.
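The 5K-versus-500K comparison can be sketched as follows, assuming you already have ranked doc IDs per query from your own retriever at each corpus size. `hit_rate_at_k` and the toy IDs below are hypothetical and are not part of the EnterpriseRAG-Bench harness.

```python
from typing import Dict, List


def hit_rate_at_k(results: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of queries whose gold doc ID appears in the top-k ranked IDs.

    results: query ID -> ranked list of retrieved doc IDs (best first)
    gold:    query ID -> ID of the doc that contains the answer
    """
    if not results:
        return 0.0
    hits = sum(1 for qid, ranked in results.items() if gold[qid] in ranked[:k])
    return hits / len(results)


if __name__ == "__main__":
    gold = {"q1": "d1", "q2": "d3"}
    # Toy rankings standing in for runs at two corpus sizes (made-up data).
    runs = {
        "small corpus": {"q1": ["d1", "d2"], "q2": ["d7", "d3"]},
        "large corpus": {"q1": ["d9", "d8", "d1"], "q2": ["d3", "d7"]},
    }
    for label, results in runs.items():
        print(label, hit_rate_at_k(results, gold, k=2))
```

Keeping the question set and k fixed while only the corpus grows is what makes the two numbers comparable.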
Neil Stoker ✨ reposted

Apparently in Germany you can be fined for "Zweckentfremdung" if you use your garage for anything other than parking. Even for storage!
themunicheye.com/bavaria-garage…

It could actually be a significant problem that Europe doesn't have enough garages. This sounds like a joke, but I'm serious. Garages let you work on stuff that doesn't matter yet, which is how big things often start. The outliers of ideas need the outliers of space.
Jon Erlichman @JonErlichman
First offices of 6 companies worth a combined $21 trillion.

@Birdyword And now it fits in with the data centre they already built on the hill behind it! 🙂

@sainsburys Unless there is a written apology and retraction from that member of staff, I will not be back

@sainsburys I've been shopping in that store for well over ten years. I have returned literally one other thing that whole time and this clearly was not my fault

@sainsburys your meal deal isn't being correctly discounted by tills in Southfields - it's not convenient to go back now, can I get a refund with the receipt later in the week?

@sainsburys Great - thanks for clarifying that Ben! Have a good evening 🙂