Thomas J. Fan

256 posts

@thomasjpfan

Working on machine learning and open source, scikit-learn maintainer @[email protected]

New York · Joined April 2009

134 Following · 703 Followers

Thomas J. Fan@thomasjpfan·21 Dec

@bernhardsson I came across this paper when looking for interesting non-LLM transformer use cases: arxiv.org/abs/2508.12773 Paper states that it is in production and "strikes the best balance between latency and resource utilization"

Erik Bernhardsson@bernhardsson·20 Dec

Any good research on auto-scaling algorithms? I'm convinced this is an opportunity to squeeze out 20% more out of the world's compute capacity.

Thomas J. Fan@thomasjpfan·3 Nov

If you want to try it out, here is how to get started: huggingface.co/docs/smolagent…

Thomas J. Fan@thomasjpfan·3 Nov

With @huggingface's smolagent v1.22.0 release, you can now use @modal Sandboxes for secure code execution. Just set `executor_type="modal"`! ☺️

Thomas J. Fan@thomasjpfan·23 Mar

Quick comparison between PyTorch's TorchScript, FX Graph tracing, and torch.compile for handling data dependent control flow: thomasjpfan.com/2025/03/pytorc…

Thomas J. Fan@thomasjpfan·28 Dec

I developed rustimport_jupyter to compile Rust code in Jupyter and have the compiled code available in Python! In this post, I showcase a simple function, a @numpy_team function, and a @DataPolars expression plugin: thomasjpfan.com/2023/12/python…

Thomas J. Fan@thomasjpfan·14 Dec

@story645 I usually label those issues as “needs decision”. You can also try to give it a difficulty level of “medium” or “hard”. I find anything that requires consensus building is at least “medium” difficulty.

Hannah@story645·13 Dec

Is there a good way to denote "bad first issue?" - issues that are technically straightforward but need a lot of discussion and buy in?

Thomas J. Fan@thomasjpfan·16 Aug

I wrote a quick blog post about generating NumPy UFuncs with Cython 3.0. The feature is quite nice 😊! thomasjpfan.com/2023/08/quick-…

Thomas J. Fan@thomasjpfan·15 May

I wrote a quick post about accessing data from #Python's DataFrame Interchange Protocol: thomasjpfan.com/2023/05/access…

Thomas J. Fan@thomasjpfan·18 Oct

@fishnets88 @amuellerml @glemaitre58 I think it would be useful to make these Mixins public to help developers adopt the get_feature_names_out API, which enables the set_output API. I opened github.com/scikit-learn/s… to see if we can make this happen.

Vincent D. Warmerdam@fishnets88·18 Oct

@amuellerml @glemaitre58 @thomasjpfan We were already working on this. One bit of feedback; as a maintainer of a package, it's a bit scary to need to import a hidden mixin that's not part of the public API. It prompts the "when will it break?" feeling.

Thomas J. Fan retweeted

Andreas Mueller@amuellerml·14 Oct

Pandas DataFrame output is now available for all sklearn transformers (in dev)! scikit-learn.org/dev/auto_examp… This will make running pipelines on dataframes soo much easier, and provides better ways to track feature names! thanks to @thomasjpfan @glemaitre58 and Christian Lorentzen!

Thomas J. Fan@thomasjpfan·14 Oct

git bisect can also help find the commit that fixes a bug. (Then you can back-port the commit to a release branch.)
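
Conceptually, finding the fixing commit is a binary search over the history with the usual roles swapped: old commits are "broken" and new ones are "fixed" (git supports this directly through custom bisect terms). The sketch below only illustrates that search; it is not how git implements bisect. The function name, the commit labels, and the `is_fixed` predicate are hypothetical, and it assumes a linear history that flips from broken to fixed exactly once.

```python
from typing import Callable, Sequence


def first_fixed_commit(commits: Sequence[str], is_fixed: Callable[[str], bool]) -> str:
    """Binary-search a linear history for the first commit where the bug is fixed.

    Assumes commits[0] still has the bug, commits[-1] has the fix, and the
    history flips from broken to fixed exactly once.
    """
    lo, hi = 0, len(commits) - 1  # lo: known broken, hi: known fixed
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_fixed(commits[mid]):
            hi = mid  # the fix landed at mid or earlier
        else:
            lo = mid  # the fix landed after mid
    return commits[hi]


# Hypothetical history of eight commits where the fix lands in "c5".
history = [f"c{i}" for i in range(8)]
found = first_fixed_commit(history, lambda c: int(c[1:]) >= 5)
```

Once the commit is identified, it is the one to cherry-pick onto the release branch.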

Thomas J. Fan@thomasjpfan·7 Oct

@hug_nicolas Thanks for the write up! I wonder why they did not choose an existing format for storing chunked & compressed data such as Zarr: zarr.readthedocs.io/en/stable/

Nicolas Hug@hug_nicolas·6 Oct

For the data-loading nerds out there, I spent some time looking into FFCV's internals. There's a lot of smart engineering going on! If you'd like to learn more about how it works under the hood, I summarized my notes here: nicolas-hug.com/blog/ffcv

Thomas J. Fan@thomasjpfan·9 Sep

@betatim I've done something like this by querying the GitHub API with pypi.org/project/PyGith…

Tim Head, @betatim on the internet@betatim·9 Sep

Is there a way to filter GitHub pull requests based on which files they touch?

Thomas J. Fan retweeted

Lauren Oldja 🫡 @[email protected]@urbanplans·25 Aug

📣 #PyDataNYC CFP EXTENDED We’ve had an amazing level of interest and submissions, and want to make sure everyone has a chance to submit. Submit by EoD Aug 28 PyData.org/nyc2022/present

Lauren Oldja 🫡 @[email protected]@urbanplans

Two can't miss @PyData events coming up, and the CFPs are NOW OPEN! #PyDataNYC 2022 (Nov 9-11) returns in-person after two-year hiatus 🎉 CFP closes Aug 24 pydata.org/nyc2022/presen… Virtual-first #PyDataGlobal 2022 (Dec 1-3) is BACK 🎉 CFP closes Sept 12 pdg22.wpengine.com/present/

Thomas J. Fan@thomasjpfan·21 Jul

@rasbt My favorite one is "Pair Extraordinaire":

Sebastian Raschka@rasbt·20 Jul

If collaborating on GitHub alone is not already fun enough, GitHub also added some fun little Achievement badges. If you are a collector, what are your most fun & exotic ones? (So far, I probably have to go with the YOLO one -- merging w/o review 🙄)

Lightning AI ⚡️@LightningAI

How does it feel to receive a pull request on @github? 🥰 Find out in this week's ⚡️ Lightning Bits⚡️ episode, where @williamfalcon and @rasbt demo how to share your code on GitHub and collaborate with others on open-source projects: bit.ly/3uYSuKX #OSS #Engineering #ML

Thomas J. Fan@thomasjpfan·5 Apr

@rasbt Thank you for writing the post! In the cheatsheet at the end, should the object-oriented BCELoss look like the following?

Sebastian Raschka@rasbt·5 Apr

A useful tidbit is to look for log(proba) calls & replace them by logsigmoid(logits) when you can, to improve numerical stability. In two research projects, this was literally a difference from having a loss that's converging and a loss that turned into "inf" after many epochs

Sebastian Raschka@rasbt

Are the negative log-likelihood loss, binary cross-entropy, and logistic loss the same? A common & legit question. Also, if we implement a binary classifier in PyTorch, should we use BCELoss or BCEWithLogitsLoss? Answering this turned into a fun wknd proj: sebastianraschka.com/blog/2022/loss…
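
The log(proba) versus logsigmoid(logits) tip in this thread can be illustrated without PyTorch. The following is a minimal sketch using only Python's `math` module; the function names are hypothetical and the formulation shown is the common `min(x, 0) - log1p(exp(-|x|))` identity, not necessarily what any particular library uses internally. For a very negative logit the probability underflows to zero, so taking its log loses all information, while working on the logit directly stays finite.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow in exp for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def naive_log_proba(logit: float) -> float:
    """log(sigmoid(logit)) computed via the probability; breaks once it underflows."""
    p = sigmoid(logit)
    # math.log(0.0) raises in Python, so report -inf explicitly instead.
    return math.log(p) if p > 0.0 else float("-inf")


def log_sigmoid(logit: float) -> float:
    """Stable log(sigmoid(logit)), computed directly from the logit."""
    return min(logit, 0.0) - math.log1p(math.exp(-abs(logit)))


moderate = (naive_log_proba(0.0), log_sigmoid(0.0))      # both are log(0.5)
extreme = (naive_log_proba(-800.0), log_sigmoid(-800.0))  # naive collapses to -inf
```

For ordinary logits the two versions agree; they only diverge in the tails, which is where a training loss tends to blow up.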

Thomas J. Fan@thomasjpfan·29 Mar

@hugobowne Inference: Focus on modeling the data generating process. Prediction: Focus on the model's performance on new data.

Hugo Bowne-Anderson@hugobowne·29 Mar

what's the difference, to your mind, between inference and prediction?

Thomas J. Fan@thomasjpfan·22 Mar

What started out as a simple refactor led to a 15% runtime performance improvement for trees 😅 github.com/scikit-learn/s…

Thomas J. Fan@thomasjpfan·4 Mar

@adrinjalali @romanlutz13 @github For me, the GitHub version was off by default. If you use the refined-github plugin, try updating to the latest version. I think they used to have it on by default, but they removed the feature recently: github.com/refined-github…

Thomas J. Fan@thomasjpfan·3 Mar

Just enabled fixed-width font when editing Markdown in @github !! 🥳 github.blog/changelog/2021…

Thomas J. Fan@thomasjpfan·3 Mar

@nedbat I always mentally convert the "else" into "if not break" to reduce my cognitive overhead. (I also avoid using the syntax because of its cognitive overhead for others 😅)

Ned Batchelder@nedbat·3 Mar

Many people wish the syntax were different ("if not break:"), or don't like the construct at all. But maybe this comparison helps it make sense. (2/2)
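
For readers unfamiliar with the construct being discussed, here is a small sketch of Python's `for`/`else`, with the "if not break" reading written as a comment. The function and its inputs are made up for illustration.

```python
from typing import Iterable, Optional


def first_negative(values: Iterable[int]) -> Optional[int]:
    """Return the first negative value, or None if the loop never breaks."""
    for v in values:
        if v < 0:
            result = v
            break
    else:  # read as "if not break": runs only when the loop finished without break
        result = None
    return result


hit = first_negative([3, -1, 2])  # loop breaks, else is skipped
miss = first_negative([1, 2])     # loop runs to completion, else executes
empty = first_negative([])        # no iterations, so no break: else executes
```

The empty-input case is the one that most often surprises people: the `else` branch still runs because no `break` happened.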
