Tao Feng

377 posts

@photoft45

TLM @ Databricks | ex Lyft Data Platform | Apache Airflow PMC | co-creator of Amundsen (LF AI) | Views are my own

Joined February 2011

1.6K Following · 576 Followers

Tao Feng retweeted

Andrej Karpathy@karpathy·2 Apr

LLM Knowledge Bases
Something I'm finding very useful recently: using LLMs to build personal knowledge bases for various topics of research interest. In this way, a large fraction of my recent token throughput is going less into manipulating code, and more into manipulating knowledge (stored as markdown and images). The latest LLMs are quite good at it. So:
Data ingest: I index source documents (articles, papers, repos, datasets, images, etc.) into a raw/ directory, then I use an LLM to incrementally "compile" a wiki, which is just a collection of .md files in a directory structure. The wiki includes summaries of all the data in raw/, backlinks, and then it categorizes data into concepts, writes articles for them, and links them all. To convert web articles into .md files I like to use the Obsidian Web Clipper extension, and then I also use a hotkey to download all the related images to local so that my LLM can easily reference them.
IDE: I use Obsidian as the IDE "frontend" where I can view the raw data, the compiled wiki, and the derived visualizations. Important to note that the LLM writes and maintains all of the data of the wiki, I rarely touch it directly. I've played with a few Obsidian plugins to render and view data in other ways (e.g. Marp for slides).
Q&A: Where things get interesting is that once your wiki is big enough (e.g. mine on some recent research is ~100 articles and ~400K words), you can ask your LLM agent all kinds of complex questions against the wiki, and it will go off, research the answers, etc. I thought I had to reach for fancy RAG, but the LLM has been pretty good about auto-maintaining index files and brief summaries of all the documents and it reads all the important related data fairly easily at this ~small scale.
Output: Instead of getting answers in text/terminal, I like to have it render markdown files for me, or slide shows (Marp format), or matplotlib images, all of which I then view again in Obsidian. You can imagine many other visual output formats depending on the query. Often, I end up "filing" the outputs back into the wiki to enhance it for further queries. So my own explorations and queries always "add up" in the knowledge base.
Linting: I've run some LLM "health checks" over the wiki to e.g. find inconsistent data, impute missing data (with web searches), find interesting connections for new article candidates, etc., to incrementally clean up the wiki and enhance its overall data integrity. The LLMs are quite good at suggesting further questions to ask and look into.
Extra tools: I find myself developing additional tools to process the data, e.g. I vibe coded a small and naive search engine over the wiki, which I both use directly (in a web ui), but more often I want to hand it off to an LLM via CLI as a tool for larger queries.
Further explorations: As the repo grows, the natural desire is to also think about synthetic data generation + finetuning to have your LLM "know" the data in its weights instead of just context windows.
TLDR: raw data from a given number of sources is collected, then compiled by an LLM into a .md wiki, then operated on by various CLIs by the LLM to do Q&A and to incrementally enhance the wiki, and all of it viewable in Obsidian. You rarely ever write or edit the wiki manually, it's the domain of the LLM.
I think there is room here for an incredible new product instead of a hacky collection of scripts.
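
A minimal sketch of the kind of "small and naive search engine over the wiki" the tweet mentions, assuming the wiki is a directory tree of .md files. The function names (`tokenize`, `build_index`, `search`) and the scoring rule (raw term counts) are hypothetical choices made for illustration, not the tool described in the tweet.

```python
import re
from collections import Counter
from pathlib import Path


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(wiki_dir: Path) -> dict[str, Counter]:
    """Map each .md file (path relative to wiki_dir) to its token counts."""
    index: dict[str, Counter] = {}
    for path in sorted(wiki_dir.rglob("*.md")):
        name = path.relative_to(wiki_dir).as_posix()
        index[name] = Counter(tokenize(path.read_text(encoding="utf-8")))
    return index


def search(index: dict[str, Counter], query: str, top_k: int = 5) -> list[tuple[str, int]]:
    """Rank files by how often the query terms occur in them (assumed scoring)."""
    terms = tokenize(query)
    scored = []
    for name, counts in index.items():
        score = sum(counts[term] for term in terms)
        if score > 0:
            scored.append((name, score))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]
```

Calling `search(build_index(Path("wiki")), "some query")` returns up to five `(file, score)` pairs, where `wiki` is a placeholder directory name. Wrapping that call in a small command-line script is one plausible way to hand it to an LLM agent as a tool.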

2.9K

7.1K

58.8K

21M

Tao Feng retweeted

Databricks@databricks·25 Feb

App development has completely transformed with agents, but the underlying databases have been largely unchanged since the 1980s. Meet: Lakebase. Lakebases are enabled by a fundamentally new design. The core breakthrough is separate storage and compute. Data sits directly in low-cost cloud storage in open formats, while the compute layer runs independently on top. And fully managed, serverless Postgres scales up instantly with demand and down when idle. Learn about this new era of databases: databricks.com/blog/what-is-a…

150

25.9K

Tao Feng retweeted

Cole Rotman@ColeRotman·29 Apr

Updated $5bn+ cos Series A list: docs.google.com/spreadsheets/d… Added Island (2022), Whatnot (2021), and Shield AI (2017). New count of lead VCs in these deals - a16z (9), Sequoia (7), Benchmark (6), Index (5), Accel (5), Matrix (4), LSVP (4), KV (3)

Cole Rotman@ColeRotman

There's a well-known expression in venture capital: "Only a handful of companies per year actually matter" I thought it would be interesting to go back and find the Series A deals that actually "mattered" each year with the benefit of hindsight: 👇

324

Tao Feng@photoft45·27 Jun

databricks.com/blog/accelerat…

593

Tao Feng retweeted

Ali Ghodsi@alighodsi·4 Jun

Databricks to acquire @tabulario, a data platform from the original creators of Apache Iceberg. Together, we will bring format compatibility to the lakehouse for @DeltaLakeOSS and @ApacheIceberg databricks.com/blog/databrick…

374

112.2K

Tao Feng retweeted

Ali Ghodsi@alighodsi·27 Mar

Today we released an open source model, DBRX, that beats all previous open source models on the standard benchmarks. The model itself is a Mixture of Experts (MoE), that's roughly twice the brains (132B) but half the cost (36B) of Llama2-70B. Making it both smart and cheap. Since only 36B expert parameters are used live, it's close to twice the speed (tokens/seconds) of Llama2-70B. We're excited to build custom versions of this for organizations that have proprietary data! Check it out! databricks.com/blog/announcin…

134

209

1.1K

216.3K

Tao Feng retweeted

Matei Zaharia@matei_zaharia·27 Mar

Probably the thing I’m most excited about with DBRX, it’s super fast! Easily 150 tokens/s for quality comparable to much slower closed models.

Nathan Lambert@natolambert

Okay @databricks what're you cooking behind this space its so fast lmao

166

20.7K

Tao Feng retweeted

Ali Ghodsi@alighodsi·19 Feb

I think this will mark an important milestone for Gen AI. The spotlight has been on the capabilities of LLMs (scaling laws, leaderboards, etc). But it's now clear that LLM performance alone will be meaningless. You will need a Compound AI system to get the best performance out of the models. Going forward, I think we'll see big breakthroughs in how people build full AI systems.

Matei Zaharia@matei_zaharia

Interesting trend in AI: the best results are increasingly obtained by compound systems, not monolithic models. AlphaCode, ChatGPT+, Gemini are examples. In this post, we discuss why this is and emerging research on designing & optimizing such systems. bair.berkeley.edu/blog/2024/02/1…

229

52.7K

Tao Feng retweeted

Matei Zaharia@matei_zaharia·19 Feb

256

1.1K

319.4K

Tao Feng retweeted

Erebus@IdemErebus·14 Dec

DONDA is the new FAANG: Deepmind, Open AI, Nvidia, Databricks, Anthropic

109

390

2.6K

351.8K

Tao Feng retweeted

Jeff Dean@JeffDean·6 Dec

I’m very excited to share our work on Gemini today! Gemini is a family of multimodal models that demonstrate really strong capabilities across the image, audio, video, and text domains. Our most-capable model, Gemini Ultra, advances the state of the art in 30 of 32 benchmarks, including 10 of 12 popular text and reasoning benchmarks, 9 of 9 image understanding benchmarks, 6 of 6 video understanding benchmarks, and 5 of 5 speech recognition and speech translation benchmarks. Gemini Ultra is the first model to achieve human-expert performance on MMLU across 57 subjects with a score above 90%. It also achieves a new state-of-the-art score of 62.4% on the new MMMU multimodal reasoning benchmark, outperforming the previous best model by more than 5 percentage points. Gemini was built by an awesome team of people from @GoogleDeepMind, @GoogleResearch, and elsewhere at @Google, and is one of the largest science and engineering efforts we’ve ever undertaken. As one of the two overall technical leads of the Gemini effort, along with my colleague @OriolVinyalsML, I am incredibly proud of the whole team, and we’re so excited to be sharing our work with you today! There’s quite a lot of different material about Gemini available, starting with: Main blog post: blog.google/technology/ai/… 60-page technical report authored by the Gemini Team: deepmind.google/gemini/gemini_… In this thread, I’ll walk you through some of the highlights.

241

2.4K

12.6K

3.9M

Tao Feng@photoft45·22 Nov

databricks.com/blog/creating-…

155

Tao Feng@photoft45·24 Oct

@criccomini Operation Catalog vs Biz Catalog?

Chris@criccomini·23 Oct

For your consideration. There are two different kinds of catalogs:
- Single system data catalogs (like Iceberg Catalog, HMS)
- Multi-system data catalogs (like Amundsen, Datahub, etc)
Schema registries fall somewhere in here, too. Not a full-fledged thought. Feedback welcome.

3.5K

Tao Feng@photoft45·22 Oct

databricks.com/blog/announcin…

102

Tao Feng retweeted

michelle huang@michellehuang42·28 Nov

i trained an ai chatbot on my childhood journal entries - so that i could engage in real-time dialogue with my "inner child" some reflections below:

558

6.3K

45.5K

Tao Feng retweeted

Ali Ghodsi@alighodsi·12 Apr

Free Dolly! Introducing the first *commercially viable*, open source, instruction-following LLM. Dolly 2.0 is available for commercial applications without having to pay for API access or sharing data with 3rd parties. bit.ly/43oXmsy

420

2.1K

901.7K

Tao Feng retweeted

Tom Loverro@tomloverro·31 Jan

PREDICTION: There's a mass extinction event coming for early & mid-stage companies. Late '23 & '24 will make the '08 financial crisis look quaint for startups. Below I explain when, why & how it will start & offer *detailed advice to founders* on surviving the looming die-off. /1

246

973

6.3K

3.4M

Tao Feng@photoft45·16 Dec

databricks.com/blog/2022/12/1…

576

Tao Feng@photoft45·16 Dec

Advancing Spark - Unity Column Level Lineage GA youtu.be/aTrz-orLM5A via @YouTube

505

Tao Feng retweeted

Joy Gao@joygao·10 Aug

For anyone interested in learning about Bazel, this blog series by @jayconrod is one of the best resources (that combines theory with practice) I have encountered on the internet: jayconrod.com/posts/106/writ…
