Vusal
234 posts


Microsoft is making moves again.
A quiet little Python tool just shot to the top of GitHub’s trending charts.
100,000+ stars.
It’s called MarkItDown.
And it does something deceptively simple:
It turns almost any file into clean Markdown.
PDFs. Word docs. PowerPoints. Excel files. Images.
Drop a file in. Get structured Markdown out.
Sounds small.
It’s not.
Because one of the biggest bottlenecks in AI workflows — especially RAG systems — is getting messy, real-world documents into a format models can actually use.
And real-world documents are brutal.
PDFs are chaotic.
Word docs are full of hidden formatting junk.
PowerPoints are messy and often image-heavy.
Spreadsheets can be a nightmare to parse cleanly.
That’s where this gets interesting.
MarkItDown strips away the friction and gives you something LLM pipelines can actually work with.
In other words: less preprocessing, less pain, faster AI implementation.
Even better, this isn’t some random side project.
It’s an official Microsoft open-source tool.
Free. Commercially usable. Practical.
I tested it on a 200-page PDF.
A few seconds later, I had Markdown that was shockingly clean.
And that’s what big tech does at its best:
They take an annoying, universal problem that everyone has been duct-taping together…
and turn it into a simple standard.
That’s why this matters.
It’s not just a file conversion tool.
It’s infrastructure for the next wave of AI applications.
Get it here: github.com/microsoft/mark…
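For anyone who wants to try the "drop a file in, get Markdown out" step, here is a minimal sketch in Python. It assumes the package has been installed with `pip install 'markitdown[all]'`. The helper name `to_markdown` and the file name `report.pdf` are illustrative and not from the post.

```python
def to_markdown(path, converter=None):
    """Convert a file to Markdown text.

    `converter` is any object with a `convert(path)` method whose result
    exposes a `text_content` attribute. When omitted, a MarkItDown
    instance is created.
    """
    if converter is None:
        # Imported lazily so the helper itself does not require the
        # package unless the default converter is actually used.
        from markitdown import MarkItDown
        converter = MarkItDown()
    return converter.convert(path).text_content


# Hypothetical usage (requires markitdown and a real file on disk):
#   markdown = to_markdown("report.pdf")
#   print(markdown[:500])
```

The returned string can then be chunked and embedded like any other text in a RAG pipeline. Output quality will vary with the source file, so it is worth spot-checking the result on a few of your own documents.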
🚨 Want to learn how to build + ship AI and Data Science projects (that businesses actually want)?
On April 29th, I am hosting a free workshop to help you get started with AI + DS projects in Python.
Register here (500 seats): learn.business-science.io/registration-a…


Something absolutely unbelievable just happened moments ago.
I was in the library studying when, all of a sudden, I got a phone call from Pierre Poilievre (YES, YOU READ THAT RIGHT, THE PIERRE POILIEVRE).
He thanked me for all of the work I put in during the election and told me to keep doing what I'm doing.
We ended up talking for over 13 minutes and, let me tell you, this man is laser-focused on winning the upcoming by-election and re-energizing the Conservative movement.
I only got actively involved in politics when I started canvassing with my federal Conservative candidate over a year ago, and I never imagined my political idol would one day recognise the work I've done and give me a call like he just did.
Hard work and perseverance do eventually pay off, and I'm still in total shock and joy that I had a chance to speak with Pierre.
@PierrePoilievre, if by any chance you're reading this, thank you so much again for the call, and please know that I, along with many of my friends, will be working overtime to get you elected as the next PM of 🇨🇦 when the next election happens.

He's officially panicking. This is awful.
Poilievre calls for ‘severe limits’ on Canadian population growth - National | Globalnews.ca
globalnews.ca/news/11234497/…

@mcuban Thanks Mark. I’d also add improving healthcare operations (e.g., hospitals, clinics) and drug formulary design using AI, specifically operations research and machine learning techniques. I’ve been doing research on this front, and the administrative cost savings are quite substantial.


Sources in the pharmaceutical industry tell me @LockheedMartin signed an awful PBM contract. So bad that it doesn’t even reimburse the independent pharmacies their employees use at their full cost.
That’s just wrong. If you work at @lockheed you should let your CEO know that it’s time to change PBMs

Ontario expanding number of private clinics that can perform OHIP surgeries.
toronto.ctvnews.ca/ontario-expand…


@TheZachMueller @huggingface Will there be PEFT (e.g., prompt tuning) notebook examples as well?

I've created a small knowledge repository on @huggingface transformers here: github.com/muellerzr/mini…
Essentially, it contains all the `task` notebooks converted to scripts, showcasing end-to-end usage in under 150 lines of code (but still readable!)


New post about estimating the shifted beta geometric distribution for retention analysis.
It's quick and to the point.
dpananos.github.io/posts/2023-11-…

@marktenenholtz True. But it does a decent job of formulating LPs and MDPs.

@Aki__Singh @raydistributed @databricks Changes in the API: xgboost_ray vs XGBoostTrainer. The docs are not easy to navigate. I also find interoperability between Ray Datasets and Spark DataFrames buggy. Lastly, the community is slow to respond to questions.

Anyone using @databricks and @raydistributed in production?

@marktenenholtz CV is not always a viable option. Big dataset -> expensive training on folds.

@marktenenholtz I didn’t get an improvement in run times for XGBoost on GPU. Not sure why.

Here’s my general recommendation for using LightGBM vs. XGBoost vs. CatBoost.
Use:
• XGBoost when you have a GPU
• LightGBM when you’re only using a CPU
• CatBoost with a lot of categorical features
There’s a lot of wiggle room in there, though.
The best part: they all thrive with the same general kind of features, so it’s really easy to swap them in and out.
For time-series forecasting, start with rolling features, lag features, and go from there.
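As a rough illustration of the lag and rolling features mentioned above, here is a small sketch in plain Python. The function names, the sample series, and the choice of window are all hypothetical. In pandas the same ideas are usually expressed with `Series.shift(k)` and `Series.shift(1).rolling(window).mean()`.

```python
def lag_feature(values, lag):
    """Value observed `lag` steps earlier; None where there is no history."""
    return [values[i - lag] if i >= lag else None for i in range(len(values))]


def rolling_mean(values, window):
    """Mean of the `window` values strictly before each step.

    The current value is excluded so the feature does not leak the target.
    """
    out = []
    for i in range(len(values)):
        if i < window:
            out.append(None)  # not enough history yet
        else:
            out.append(sum(values[i - window:i]) / window)
    return out


def build_features(values, lags=(1, 7), window=3):
    """One feature row per time step, ready for a tabular GBDT model."""
    lagged = {k: lag_feature(values, k) for k in lags}
    rolled = rolling_mean(values, window)
    rows = []
    for i, y in enumerate(values):
        row = {"y": y, f"roll_mean_{window}": rolled[i]}
        for k in lags:
            row[f"lag_{k}"] = lagged[k][i]
        rows.append(row)
    return rows


daily_sales = [10, 12, 14, 16, 18]  # made-up example series
features = build_features(daily_sales, lags=(1, 2), window=3)
```

Rows with `None` values at the start of the series are typically dropped or left as missing, since XGBoost, LightGBM, and CatBoost can all handle missing values natively.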

Vusal retweeted

Operations Research Ranked #1 Highest Earning Field 📢 🚨
College majors have a big impact on income. Here are the highest- and lowest-earning fields [CBS News] cbsn.ws/3MBRGUj
#orms #operationsresearch #managementscience #datascience #ML #AI #optimization #informs

Hi @rasbt: What is the right way to define the distribution for the search space of the max_depth parameter in XGBoost or other tree-based algos: hp.choice() or hp.quniform()? I have seen people use both in Hyperopt trials and wanted to ask your take on this.






