Toran Billups @toranb

8.4K posts

Decision Making With Feedforward Multilayer Perceptrons

Des Moines, IA · Joined September 2008
101 Following · 1.9K Followers
Toran Billups retweeted
Upen @upen946
Don’t build a product, build value. Don’t market your SaaS, market how it solves a problem. People care about results, not your product. If it’s valuable and you show them the benefits, they’ll happily buy it.
Toran Billups retweeted
Jo Kristian Bergum @jobergum
On AI in enterprises: models come and go; the true competitive advantage lies not in which frontier models you use but in how effectively you can connect those models to your organization's knowledge.
Toran Billups retweeted
Jeremy Howard @jeremyphoward
ModernBERT is available as a slot-in replacement for any BERT-like model, with both 139M param and 395M param sizes. It has an 8192-token sequence length, is extremely efficient, is uniquely great at analyzing code, and much more. Read this for details: huggingface.co/blog/modernbert
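A minimal sketch of that slot-in claim, assuming the answerdotai/ModernBERT-base checkpoint named in the linked post and a transformers release recent enough to support the architecture:

```python
# Minimal sketch: ModernBERT as a drop-in masked-LM via the standard
# transformers fill-mask pipeline. Assumes the answerdotai/ModernBERT-base
# checkpoint from the linked post and a recent transformers release.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="answerdotai/ModernBERT-base")

# Same masked-language-modeling interface as any BERT-style model.
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```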
Toran Billups @toranb
@jobergum what language is the talk in? I'm planning to translate it from mp3 so I can listen this weekend
Jo Kristian Bergum @jobergum
Great presentation on how Taboola uses Vespa at scale to power their real-time ad recommendation system. Very interesting use case as there are many filter constraints + complex ranking phases. m.youtube.com/watch?v=iJfVWo…
Toran Billups @toranb
This blog post from the team @bitcrowd is an outstanding resource for those who want to leverage SOTA embeddings with bumblebee. Easily the highest value resource I've seen on the subject yet. This post in particular covers the path from zero to Jina v2 bitcrowd.dev/how-to-run-jin…
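The linked post works in Elixir with Bumblebee; for comparison, a rough Python equivalent for the same Jina v2 model, sketched with sentence-transformers:

```python
# Rough Python counterpart to the Elixir/Bumblebee walkthrough linked above,
# sketched with sentence-transformers. trust_remote_code=True is needed
# because the Jina v2 checkpoints ship custom modeling code.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v2-base-en", trust_remote_code=True)
embeddings = model.encode(["How do I run Jina v2 embeddings?"])
print(embeddings.shape)  # (1, 768) for the base model
```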
Toran Billups retweeted
Gary Bernhardt @garybernhardt
Me at 25: Tests should be 5ish lines! One assert per test!
Me at 40: This test is 56 lines long with 11 asserts. If I broke it up, it would be 11 separate tests, ~5x as much code, multiple helper functions and `beforeEach`s to avoid duplication, and more difficult to read.
Toran Billups @toranb
@jobergum haha, just when I hoped you /would/ get started on e-comm search 😆
Philipp Schmid @_philschmid
How Do Large Language Models Acquire Factual Knowledge During Pretraining?
- LLMs learn facts by encountering them multiple times during training (different sources).
- LLMs forget faster with exact data repetitions; using deduplicated data helps retain knowledge.
- Adding more data doesn't significantly improve how well LLMs learn facts.
- Using larger batches of data during training helps LLMs remember facts better.
- Experiments on 1B and 7B models show that larger models remember and generalize facts better.
Toran Billups @toranb
@_philschmid This list is awesome! I recently did a talk on my adventures with synthetic data and I would add that for generating DPO datasets you can derive a synthetic prompt from a good response and then use that synthetic prompt to generate the rejected response youtube.com/watch?v=R0VJIW…
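A sketch of the trick described in that reply; generate() is a hypothetical stand-in for whatever inference call you use, and the model names are placeholders:

```python
# Sketch of the DPO data trick above: derive a synthetic prompt from a
# known-good response, then answer that prompt with a weaker model to get
# the "rejected" side. generate() is a hypothetical stand-in for your
# inference backend (llama.cpp, vLLM, a hosted API, etc.).
def generate(model: str, prompt: str) -> str:
    raise NotImplementedError("wire this up to your inference backend")

def build_dpo_row(good_response: str) -> dict:
    # 1. Reverse-engineer a plausible instruction from the good response.
    synthetic_prompt = generate(
        "strong-model",
        f"Write the instruction that this response answers:\n\n{good_response}",
    )
    # 2. Sample a lower-quality answer to the same prompt as the rejected side.
    rejected = generate("weak-model", synthetic_prompt)
    return {"prompt": synthetic_prompt, "chosen": good_response, "rejected": rejected}
```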
Philipp Schmid @_philschmid
Creating a Pipeline for Generating Synthetic Data for Fine-Tuning Custom Embedding Models. 👀
Step 1 Create a Knowledge Base: Start by preparing your domain-specific knowledge base, such as PDFs or other documents containing information. Convert the content of these documents into a plain text format.
Step 2 Chunk the Data: Divide your text data into manageable chunks of approximately 256 tokens each (the chunk size used in RAG later).
Step 3 Generate Questions Using an LLM: Use a language model to generate K questions for each chunk of text. The questions should be answerable based on the content within the chunk. Example prompt: "Generate five questions that can be answered using the following text: [insert chunk here]."
Step 4 Optionally Generate Hard Negative Examples: Create hard negatives by generating questions that are similar to the correct questions but have answers that are incorrect or misleading. Alternatively, use random other samples from the batch as negative examples during training (in-batch negatives).
Step 5 Deduplicate and Filter Pairs: Remove duplicate question-context pairs to ensure uniqueness. Use the LLM to judge and filter out lower-quality pairs by defining custom rubrics for quality assessment.
Step 6 Fine-Tune Embedding Models: Use the prepared data to fine-tune your embedding models with Sentence Transformers 3.0.
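Condensed into code, the pipeline might look roughly like this; ask_llm() is a hypothetical stand-in for the LLM call, the chunking is a word-count approximation of the 256-token chunks, and the base model id is only an example:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

def ask_llm(prompt: str) -> list[str]:
    # Hypothetical: return K generated questions for the given prompt.
    raise NotImplementedError("call your LLM of choice here")

def chunk(text: str, size: int = 256) -> list[str]:
    # Word-count approximation of the ~256-token chunks from step 2.
    words = text.split()
    return [" ".join(words[i : i + size]) for i in range(0, len(words), size)]

# Steps 2-3: chunk the knowledge base and generate questions per chunk.
pairs = []
for passage in chunk(open("knowledge_base.txt").read()):
    prompt = f"Generate five questions that can be answered using the following text: {passage}"
    pairs.extend({"anchor": q, "positive": passage} for q in ask_llm(prompt))

# Steps 4-6: MultipleNegativesRankingLoss treats other samples in the batch
# as negatives (in-batch negatives); dedup/filtering of `pairs` is elided.
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=Dataset.from_list(pairs),
    loss=losses.MultipleNegativesRankingLoss(model),
)
trainer.train()
```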
Toran Billups retweeted
Philipp Schmid @_philschmid
Data is all we need! 💎 @Alignment Lab AI just released Buzz, an instruction dataset with 3.13 million rows and a total of 85 million conversations in single- and multi-turns. 🤯 It comes in 3 configurations: Buzz (SFT), RLSTACK (RLHF), and Select Stack (filtered SFT).
TL;DR:
💥 Curated, deduplicated, extended, and regenerated from 435 datasets
🧠 Training Llama 3 on it with Buzz-8b-Large
🌍 85 million conversational turns, including new and augmented data
⚖️ RLSTACK contains 1 million samples of DPO preference pairs
🥇 Select Stack contains 1.5 million samples of the top-scoring response
🔄 The team intends to update and improve the dataset
🔓 Released under cc-by-4.0
🤗 Available on @huggingface
Kudos to the team at @alignment_lab and @HIVEDigitalTech for this release! I am looking forward to reading and learning more about the creation process! 🤗
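If you want to poke at it, a loading sketch; the repo id and split name here are assumptions, so check the Alignment Lab page on Hugging Face for the real ones:

```python
# Loading sketch only: the repo id and split name are assumptions; confirm
# them on the Alignment Lab Hugging Face page before relying on this.
from datasets import load_dataset

buzz = load_dataset("H-D-T/Buzz", split="train", streaming=True)
print(next(iter(buzz)))  # inspect one conversation row
```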
Toran Billups @toranb
@yevkurtov I showed at the end of the video that you can use the f16 or quantized model from the command line. Are you asking about a specific inference platform perhaps?
Toran Billups @toranb
I had trouble converting Mistral 8B Pro to GGUF format recently so I recorded a short how-to for llama-cpp n00bs like myself. Check it out! 👇
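The rough shape of that HF-to-GGUF flow, scripted with subprocess; script and binary names vary across llama.cpp versions, so treat the exact names and flags here as assumptions and check your checkout:

```python
# Rough HF -> GGUF outline. NOTE: script/binary names and flags vary across
# llama.cpp versions (convert-hf-to-gguf.py vs convert_hf_to_gguf.py,
# quantize vs llama-quantize), so verify against your checkout first.
import subprocess

# 1. Convert the Hugging Face checkpoint to an f16 GGUF file.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", "path/to/hf-model",
     "--outfile", "model-f16.gguf", "--outtype", "f16"],
    check=True,
)

# 2. Optionally quantize it down; q4_k_m is a common size/quality tradeoff.
subprocess.run(
    ["./llama-quantize", "model-f16.gguf", "model-q4_k_m.gguf", "q4_k_m"],
    check=True,
)
```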
Toran Billups @toranb
The next version of bumblebee is out and it's working great with Mistral 7B from HF using bf16 OOTB. It's great to see the platform moving forward with loads of improvements!
Toran Billups retweeted
Charlie Holtz @charlieholtz
Introducing YouTune — fine tune image models on YouTube videos.
> python tune.py <youtube-url>
• downloads video
• screenshots every 50 frames
• removes near duplicates
• fine tunes SDXL for you
github.com/cbh123/youtune
Toran Billups retweeted
José Valim @josevalim
Tomorrow marks 13 years since the first commit to the Elixir repo. And today we celebrate by announcing that Elixir is, officially, a gradually typed language: