pseudotensor (@pseudotensor)
258 posts · Stanford · Joined October 2008
15 Following · 35 Followers

pseudotensor (@pseudotensor):
@an_vo12 That acts like a feature, not a bug: the anomalous legs are properly discounted as not real enough for that particular animal. Try it on an unknown animal for which leg count is not a strongly held prior.
An Vo (@an_vo12):
🚨 Our latest work shows that SOTA VLMs (o3, o4-mini, Sonnet, Gemini Pro) fail at counting legs due to bias⁉️ See simple cases where VLMs get it wrong, no matter how you prompt them. 🧪 Think your VLM can do better? Try it yourself here: vlmsarebiased.github.io/#example-gallery-section 1/n #ICML2025
[image]
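For readers who want to reproduce the failure mode, here is a minimal probe sketch, assuming the OpenAI Python SDK; the model name and image URL are placeholders, not taken from the thread:

```python
# Minimal sketch: ask a VLM a counterfactual counting question.
# Assumes the OpenAI Python SDK (>=1.0); model and image URL are placeholders.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",  # placeholder; the thread evaluates o3, o4-mini, Sonnet, Gemini Pro
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "How many legs does the animal in this image have? "
                     "Answer with a number only."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/five-legged-dog.png"}},  # hypothetical
        ],
    }],
)
print(resp.choices[0].message.content)  # a biased VLM tends to answer 4
```

Per the reply above, running the same probe on an unfamiliar animal, where no strong leg-count prior exists, is the interesting control.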
Genspark (@genspark_ai):
Meet Genspark Super Agent - a fast & reliable general AI agent! Check it out: genspark.ai
pseudotensor (@pseudotensor):
@manusai You need to post results on the GAIA test set. Your agent may be finding the many copies of the validation set posted online, which would let it cheat its way to a high validation score.
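A crude way to act on this concern is to scan an agent's browsing trace for the benchmark's own hosted files; a sketch, where the marker strings are assumptions for illustration rather than verified file names:

```python
# Sketch: flag agent runs whose browsing trace touches GAIA's own hosting.
# Marker strings are assumptions for illustration, not verified file names.
LEAK_MARKERS = [
    "gaia-benchmark",   # the benchmark's dataset namespace
    "metadata.jsonl",   # hypothetical answer-file name
]

def looks_contaminated(visited_urls: list[str]) -> bool:
    """True if any visited URL matches a known leakage marker."""
    return any(marker in url for url in visited_urls for marker in LEAK_MARKERS)

trace = ["https://huggingface.co/datasets/gaia-benchmark/GAIA"]
print(looks_contaminated(trace))  # True -> the validation answers were reachable
```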
Manus (@ManusAI):
Introducing Manus: the first general AI agent. Try Manus today and see the future of human-machine collaboration: manus.im
Yann LeCun (@ylecun):
GAIA: A benchmark for general AI assistants, by a team from Meta-FAIR, Meta-GenAI, HuggingFace, and AutoGPT. Current Auto-Regressive LLMs don't do very well.
Quoting AK (@_akhaliq):

GAIA: a benchmark for General AI Assistants. Paper page: huggingface.co/papers/2311.12… We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and general tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in, e.g., law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks, which suggests targeting tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system exhibiting robustness similar to the average human's on such questions. Using GAIA's methodology, we devise 466 questions and their answers.

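For context, loading the benchmark looks roughly like the sketch below; the dataset id, config name, and field name are my best guesses from the paper page, and GAIA's public split is gated, so authentication is required:

```python
# Sketch: load GAIA's validation split via Hugging Face datasets.
# Dataset id, config, and field names are assumptions; the dataset is gated,
# so accept its terms on the Hub and `huggingface-cli login` first.
from datasets import load_dataset

gaia = load_dataset("gaia-benchmark/GAIA", "2023_all", split="validation")
print(gaia[0]["Question"])  # conceptually simple for humans, hard for LLMs
```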
Mehrdad Farajtabar (@MFarajtabar):
1/ Can Large Language Models (LLMs) truly reason? Or are they just sophisticated pattern matchers? In our latest preprint, we explore this key question through a large-scale study of both open-source models (Llama, Phi, Gemma, Mistral) and leading closed models, including the recent OpenAI GPT-4o and o1 series. arxiv.org/pdf/2410.05229 Work done with @i_mirzadeh, @KeivanAlizadeh2, Hooman Shahrokhi, Samy Bengio, @OncelTuzel. #LLM #Reasoning #Mathematics #AGI #Research #Apple
[image]
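The core perturbation in the paper is easy to sketch: insert a numerically loaded but irrelevant clause into a word problem and see whether the model's answer moves. The text below is paraphrased for illustration, not copied from the benchmark:

```python
# Sketch: a "no-op" perturbation in the spirit of the paper's GSM-NoOp set.
base = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
        "How many kiwis does Oliver have?")
noop = "Five of the kiwis were a bit smaller than average. "

# Insert the irrelevant clause just before the question.
perturbed = base.replace("How many", noop + "How many")
print(perturbed)
# The correct answer is unchanged: 44 + 58 = 102. The paper's finding is that
# models often subtract the irrelevant 5, suggesting pattern matching.
```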
pseudotensor (@pseudotensor):
@MFarajtabar Adding irrelevant items and seeing performance drop isn't new. I remember the AI Explained channel talking about this a year or more ago; it's also relevant to his Simple Bench.
Shishir Patil (@shishirpatil_):
Introducing the Agent Arena by 🦍 Gorilla X LMSYS Chatbot Arena 🎯

How do different agents stack up in tasks like search, finance, RAG, and beyond? Which model is the most effective for agentic tasks? What tools do users prefer? Explore these questions and more!

✏️ Blog: gorilla.cs.berkeley.edu/blogs/14_agent…
🏟️ Arena: agent-arena.com
📊 Leaderboard: agent-arena.com/leaderboard
⚱️ 2k pair-wise battles dataset: github.com/ShishirPatil/g…

❓ What model, framework, or tools do users prefer? Which agents excel at financial and numerical analysis? What are the best agents for finding the needle in the haystack in massive corpuses of data? Which agents are best integrated with online platforms (Gmail, Yelp, etc.)?

Agents = LLMs + Tools + Frameworks. With Agent Arena, you can compare combinations of large language models, tools (like code interpreters and APIs), and frameworks (including LangChain, LlamaIndex, CrewAI) to find the best agentic mix for your needs.

With a novel ranking system, we evaluate agents based on their performance in real-time head-to-head tasks, tracking the strengths of individual components and of combinations. This provides deeper insights into specific use cases and lets users see which agent performs best for their needs.

In a world of crowd-sourced evaluations, who evaluates the crowd? With Prompt-Hub, users can publish, upvote, and explore prompts used for agent evaluations, creating a collaborative space for the community.

From the Agent Arena team of @NithikYekollu, @arth_bohra, Kai Wen, Sai Kolasani, @infwinston, @ml_angelopoulos, @profjoeyg, Ion Stoica, @shishirpatil_

Come see which agents and models rise to the top! 🚀
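The tweet says rankings come from pairwise battles under a novel system; as a generic point of comparison (not the arena's actual method), a minimal Elo-style update over such battles looks like this:

```python
# Generic Elo-style update from one head-to-head agent battle.
# Illustration only; Agent Arena describes its own, different ranking system.
def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))  # expected score for A
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Two agents start at 1000; agent A wins one battle.
print(elo_update(1000.0, 1000.0, True))  # -> (1016.0, 984.0)
```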
Matthew Berman (@MatthewBerman):
The second question blows my mind. It didn't count spaces or non-letter characters. 🍓
[image]
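For reference, the counting rule the model missed is a one-liner; a sketch of counting only letters, ignoring spaces and punctuation:

```python
# Count only alphabetic characters, skipping spaces and punctuation.
def letter_count(s: str) -> int:
    return sum(c.isalpha() for c in s)

print(letter_count("How many letters are in this sentence?"))  # 31
```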
Matt Shumer (@mattshumer_):
I'm excited to announce Reflection 70B, the world’s top open-source model. Trained using Reflection-Tuning, a technique developed to enable LLMs to fix their own mistakes. 405B coming next week - we expect it to be the best model in the world. Built w/ @GlaiveAI. Read on ⬇️:
[image]
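Reflection-Tuning as described is a training technique; what can be sketched is a rough inference-time analogue, a draft / self-critique / revise loop. This is a generic pattern, not the tweet's method, and the model name is a placeholder:

```python
# Generic draft -> critique -> revise loop; an inference-time analogue only,
# not the Reflection-Tuning training technique the tweet announces.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    r = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content

draft = ask("How many r's are in 'strawberry'? Think step by step.")
critique = ask(f"Find any mistakes in this answer:\n{draft}")
final = ask(f"Revise the answer below using the critique.\n\n"
            f"Answer:\n{draft}\n\nCritique:\n{critique}")
print(final)
```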
Clémentine Fourrier 🍊 is off till Dec 2026 hiking:
Did you know that, if you evaluate the same model, with the same prompt formatting & the same fixed few-shot examples, only changing ♻️the order in which the few shot examples are added to the prompt ♻️ you get a difference of up to 3 points in evaluation score?
[image]
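The experiment is cheap to reproduce: hold the model, prompt format, and examples fixed and permute only the example order. A sketch, where the scoring call is left as a hypothetical harness hook:

```python
# Sketch: generate prompts that differ only in few-shot example order.
from itertools import permutations

few_shot = [
    "Q: 2+2?\nA: 4",
    "Q: 3*3?\nA: 9",
    "Q: 10-7?\nA: 3",
]

def build_prompt(examples, question):
    # Same examples and format every time; only their order changes.
    return "\n\n".join(examples) + "\n\nQ: " + question + "\nA:"

for order in permutations(few_shot):
    prompt = build_prompt(order, "7+5?")
    # score = run_benchmark(prompt)  # hypothetical evaluation harness
    print(prompt, end="\n---\n")
# The tweet's finding: scores across such orderings can differ by ~3 points.
```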
pseudotensor (@pseudotensor):
Massive thanks to @ykilcher and the Open Assistant team for open-sourcing their data. We released fully Apache-2.0-licensed models and projects, some using their amazing data, including fully open 20B models. See: github.com/h2oai/h2ogpt.
pseudotensor (@pseudotensor):
@epic4kids @andreakhaid That's not what was asked. They asked how to prevent a child from accessing it at all, like the "learning videos" option, but also for "read to me", since it prevents learning to read. I canceled my subscription over this. Same on Amazon Kids, where you can't even turn off videos.
Epic for Kids (@epic4kids):
@andreakhaid Hi there! You can turn off the Read-to-Me feature by pressing the green pause button at the lower left-hand corner of the book!
[image]
Epic for Kids (@epic4kids):
You asked...We listened! Read-to-Me books now offer Follow-Along Word Highlighting to help improve reading skills >> bit.ly/read-to-me-boo…
[GIF]
pseudotensor (@pseudotensor):
Sleepy on playground
[image]