Andrew Hojel

18 posts


@AndrewHojel

research @openai

Joined February 2021
363 Following · 451 Followers
Gauri Gupta (@gauri__gupta)
We (@neosigmaai, @RitvikKapila) are building the future of self-improving AI systems! By closing the feedback loop between production data and system improvements, we help teams capture failures, convert them into structured evaluation signals, and use them to drive continuous improvements in agent behavior. We show how our system works on Tau3 bench across the retail, telecom, and airline domains. Agent performance on the validation set (with a fixed underlying model, GPT5.4) improves from 0.56 → 0.78, a ~39% relative jump in accuracy.
45 replies · 43 reposts · 251 likes · 87.8K views
Platon Mazarakis (@platonmazarakis)
Launching background agents and a mobile app for Claude Code. Code from anywhere! @PrismCoder Go to prism.engineer to join! For instant access, like and reply "Prism codes"
39 replies · 19 reposts · 98 likes · 15.9K views
Zengzhi Wang (@SinclairWang1)
Finally had a bit of time to jot down some thoughts on this solid, open data-engineering work from @essential_ai. This work brings Essential-Web, a 24T-token pre-training corpus, to the open-source community. I've always appreciated open-source research, as it can significantly promote AI democratisation.

Beyond the data release, this work also provides guidance on building a systematic Taxonomy of Categories for web documents to support data governance, with impressive levels of detail, including even scripts. This technical report, in my view, deserves multiple careful readings.

Notably, it also (finally!) acknowledges our contributions to curating math pre-training corpora, such as MegaMath. I sincerely appreciate that ☺️, especially given that several orgs have used data from our recent work, or referenced it, without extending the appropriate credit. Let's be honest: conducting research and doing real engineering work on data is far from trivial, yet it's often dismissed as lacking novelty. 😅

That said, I do have some respectful disagreements regarding the experiments on data-quality comparisons, particularly in the math domain (cc @AndrewHojel @ashVaswani):
- In our MegaMath paper, we showed that even the full MegaMath-Web corpus outperforms OpenWebMath in a 55B-token continual-pretraining setup (see Figure 2).
- I'd recommend clarifying how the top 10% of MegaMath-Web documents were selected.
- Several existing domain-specific datasets (e.g., code and medical) show performance comparable to the DCLM baselines reported in this paper; I believe this might raise similar concerns for others as well.

There's also a common misunderstanding, especially among folks who aren't hands-on in pretraining data engineering, about the types of pretraining corpora. In my view, there are two major types:
1. True pretraining corpora, meant to lay the foundational knowledge for LMs.
2. Mid-training corpora, used in later stages (e.g., during LR decay) with focused curation, smaller in scale, and tailored for specific capabilities or benchmarks.

For instance, FineWeb is a broad-coverage true pretraining corpus, while FineWeb-Edu is curated for high educational value, ideal for mid-training and great for benchmarks like MMLU. In the math domain:
- MegaMath-Web = a true pretraining corpus (about 100 Common Crawl dumps from 2014–2024).
- FineMath (3+, 4+) = a mid-training corpus, filtered via edu-style classifiers.

So what's the "educational" version of MegaMath-Web? That would be MegaMath-Web-Pro: we used the same edu classifier as FineMath to extract high-educational-value docs, followed by LLM-based refinement for noise reduction. But to be clear, simply filtering MegaMath-Web by math_score isn't equivalent to using the edu classifier; these are different metrics, and the distinction matters for fair comparisons. Given that EAI-TAXONOMY Math w/ FM and FineMath-3plus are reported in Table 3, I believe MegaMath-Web-Pro also deserves inclusion. (cc @youjiacheng, thanks for the kind mention today and recognition of our work!)

Another way to evaluate math-corpus quality? Check out our recent work, OctoThinker. We found that mid-training on MegaMath-Web-Pro (and soon, MegaMath-Web-Pro-Max) significantly boosts RL scaling, outperforming FineMath-4+. (See third figure attached.) Blog here: tinyurl.com/OctoThinker. The tech report and the MegaMath-Web-Pro-Max open release are coming late this week or early next. Still working hard on it. Stay tuned!

Finally, a shout-out to the amazing data-engineering work from @huggingface (cc @LoubnaBenAllal1). Their contributions (FineWeb, FineWeb-Edu, FineMath, Nanotron) are hugely appreciated. Their curation, technical depth, and open-source spirit inspired much of our own work, including ProX (arxiv.org/abs/2409.17115) and MegaMath. Thank you!

If you're building models and need high-quality corpora, feel free to explore ours:
- MathPile: huggingface.co/datasets/GAIR/…
- DCLM-Pro: huggingface.co/datasets/gair-…
- FineWeb-Pro: huggingface.co/datasets/gair-…
- MegaMath: huggingface.co/datasets/LLM36…

More is coming. Let's keep brainstorming & building. 🚀
[3 images attached]
Essential AI (@essential_ai)

[1/5] 🚀 Meet Essential-Web v1.0, a 24-trillion-token pre-training dataset with rich metadata built to effortlessly curate high-performing datasets across domains and use cases!

2 replies · 15 reposts · 82 likes · 10.6K views
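To make the edu-classifier vs. math_score distinction in the thread above concrete, here is a minimal sketch of FineMath-style threshold filtering (keep only documents whose educational-value score clears a cutoff). The document texts, the `edu_score` field, and the scores themselves are hypothetical stand-ins for real classifier output, not the actual FineMath pipeline.

```python
# Sketch of FineMath-style filtering: keep documents whose educational-value
# score (as assigned by an edu classifier) is at or above a cutoff.
# Scores below are hard-coded stand-ins for classifier output.
docs = [
    {"text": "Proof of the quadratic formula ...", "edu_score": 4.2},
    {"text": "Forum chatter about homework ...",   "edu_score": 1.1},
    {"text": "Worked integration examples ...",    "edu_score": 3.4},
]

def filter_by_score(docs, threshold):
    """Return only documents scoring at or above the threshold."""
    return [d for d in docs if d["edu_score"] >= threshold]

three_plus = filter_by_score(docs, 3.0)  # a "3+"-style cut
four_plus = filter_by_score(docs, 4.0)   # a stricter "4+"-style cut
print(len(three_plus), len(four_plus))   # prints "2 1"
```

The point of the thread stands out here: swapping `edu_score` for a different metric such as `math_score` changes which documents survive, so corpora filtered on different metrics are not directly comparable.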
Andrew Hojel reposted
niki parmar (@nikiparmar09)
Thrilled to announce our company, essential.ai 🚀 We are in an exciting era of human-computer collaboration, evolving the way we reason with, process, and generate information. At Essential AI, we are passionate about advancing capabilities in planning, reasoning, tool use, and continual learning that will be critical to bridging the knowledge and skill gap between humans and computers.
36 replies · 37 reposts · 579 likes · 162.9K views
Andrew Hojel reposted
Ashish Vaswani (@ashVaswani)
I'm thrilled to announce our company, @essential_ai. We believe that breakthroughs in AI will unlock the most profound tools for thought, advancing humanity's collective knowledge and capability. essential.ai
104 replies · 115 reposts · 1.7K likes · 411.6K views
Andrew Hojel reposted
Jerry Liu (@jerryjliu0)
LLMs can directly extract structured data (esp. w/ the function-calling API), but can be slow/expensive. 🤔 Instead: use LLMs to generate code, then run the code to extract data 💡 We now have code-based extraction in @llama_index: extract DataFrames from arbitrary text 🧑‍💻 gpt-index.readthedocs.io/en/latest/exam…
10 replies · 59 reposts · 291 likes · 54.9K views
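The pattern Jerry describes (pay the LLM once to write an extractor, then run cheap ordinary code over every document) can be sketched as below. The `generated_src` string is a hard-coded stand-in for code an LLM would return, and this is a pattern illustration, not the actual @llama_index API, which wraps results into pandas DataFrames.

```python
# Sketch of "LLM writes the extractor once, plain code runs many times".
# `generated_src` stands in for LLM output; in practice it would come from
# a single (expensive) LLM call, amortized over all documents.
generated_src = '''
def extract(text):
    """Parse "name: price" lines into row dicts."""
    rows = []
    for line in text.splitlines():
        if ":" in line:
            name, price = line.split(":", 1)
            rows.append({"name": name.strip(), "price": float(price)})
    return rows
'''

namespace = {}
exec(generated_src, namespace)   # one-time cost: materialize the extractor
extract = namespace["extract"]

# Cheap per-document pass: no LLM call per document.
docs = ["apple: 1.50\nbanana: 0.75", "cherry: 3.00"]
rows = [row for doc in docs for row in extract(doc)]
print(rows[0])  # {'name': 'apple', 'price': 1.5}
```

The win is that the per-document cost is ordinary Python rather than an LLM call, which is what makes this cheaper than direct LLM extraction at scale.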
Andrew Hojel reposted
Simran Arora (@simran_s_arora)
LMs can be expensive for document processing. E.g., inference over the 55M Wikipedia pages costs >$100K (at >$0.002/1k toks) 💰 We propose a strategy that reduces inference cost by 110x and can even improve quality vs. running inference over each doc directly! 💻 github.com/HazyResearch/e…
9 replies · 129 reposts · 754 likes · 128K views
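As a sanity check on the quoted ">$100K" figure, a back-of-envelope cost calculation. The ~1,000 tokens per page is my assumption for illustration; it is not stated in the tweet.

```python
# Back-of-envelope check of the ">$100K for 55M Wiki pages" figure.
pages = 55_000_000
tokens_per_page = 1_000          # assumption; not stated in the tweet
price_per_1k_tokens = 0.002      # ">$0.002/1k toks" from the tweet

direct_cost = pages * tokens_per_page / 1_000 * price_per_1k_tokens
reduced_cost = direct_cost / 110  # with the claimed 110x reduction
print(f"${direct_cost:,.0f} direct vs ${reduced_cost:,.0f} reduced")
```

Under that per-page assumption the direct cost lands at $110K, consistent with the ">$100K" claim, and a 110x reduction brings it to roughly $1K.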