pseudotensor (@pseudotensor)
258 posts · Stanford · Joined October 2008
15 Following · 35 Followers

pseudotensor (@pseudotensor):
@an_vo12 That acts like a feature, not a bug: the anomalous legs are properly discounted as not real enough for that particular animal. Try it on an unknown animal for which leg count is not a strongly held prior.
An Vo (@an_vo12):
🚨 Our latest work shows that SOTA VLMs (o3, o4-mini, Sonnet, Gemini Pro) fail at counting legs due to bias⁉️ See simple cases where VLMs get it wrong, no matter how you prompt them. 🧪 Think your VLM can do better? Try it yourself here: vlmsarebiased.github.io/#example-gallery-section 1/n #ICML2025
[image]
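For readers who want to reproduce the failure mode, here is a minimal probe sketch, assuming the OpenAI Python SDK; the model name and image URL are placeholders, not taken from the thread:

```python
# Minimal sketch: ask a VLM a counterfactual counting question.
# Assumes the OpenAI Python SDK (>=1.0); model and image URL are placeholders.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",  # placeholder; the thread evaluates o3, o4-mini, Sonnet, Gemini Pro
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "How many legs does the animal in this image have? "
                     "Answer with a number only."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/five-legged-dog.png"}},  # hypothetical
        ],
    }],
)
print(resp.choices[0].message.content)  # a biased VLM tends to answer 4
```

Per the reply above, running the same probe on an unfamiliar animal, where no strong leg-count prior exists, is the interesting control.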
Genspark (@genspark_ai):
Meet Genspark Super Agent - a fast & reliable general AI agent! Check it out: genspark.ai
pseudotensor (@pseudotensor):
@manusai You need to post results on the GAIA test set. Your agent may be finding the many copies of the validation set posted online, which would let it cheat its way to a high validation score.
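A crude way to act on this concern is to scan an agent's browsing trace for the benchmark's own hosted files; a sketch, where the marker strings are assumptions for illustration rather than verified file names:

```python
# Sketch: flag agent runs whose browsing trace touches GAIA's own hosting.
# Marker strings are assumptions for illustration, not verified file names.
LEAK_MARKERS = [
    "gaia-benchmark",   # the benchmark's dataset namespace
    "metadata.jsonl",   # hypothetical answer-file name
]

def looks_contaminated(visited_urls: list[str]) -> bool:
    """True if any visited URL matches a known leakage marker."""
    return any(marker in url for url in visited_urls for marker in LEAK_MARKERS)

trace = ["https://huggingface.co/datasets/gaia-benchmark/GAIA"]
print(looks_contaminated(trace))  # True -> the validation answers were reachable
```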
Manus (@ManusAI):
Introducing Manus: the first general AI agent. Try Manus today and see the future of human-machine collaboration: manus.im
Yann LeCun (@ylecun):
GAIA: A benchmark for general AI assistants, by a team from Meta-FAIR, Meta-GenAI, HuggingFace, and AutoGPT. Current Auto-Regressive LLMs don't do very well.
Quoting AK (@_akhaliq):

GAIA: a benchmark for General AI Assistants. Paper page: huggingface.co/papers/2311.12… We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and general tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in, e.g., law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks, which suggests targeting tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system exhibiting robustness similar to the average human's on such questions. Using GAIA's methodology, we devise 466 questions and their answers.

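For context, loading the benchmark looks roughly like the sketch below; the dataset id, config name, and field name are my best guesses from the paper page, and GAIA's public split is gated, so authentication is required:

```python
# Sketch: load GAIA's validation split via Hugging Face datasets.
# Dataset id, config, and field names are assumptions; the dataset is gated,
# so accept its terms on the Hub and `huggingface-cli login` first.
from datasets import load_dataset

gaia = load_dataset("gaia-benchmark/GAIA", "2023_all", split="validation")
print(gaia[0]["Question"])  # conceptually simple for humans, hard for LLMs
```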
Mehrdad Farajtabar (@MFarajtabar):
1/ Can Large Language Models (LLMs) truly reason? Or are they just sophisticated pattern matchers? In our latest preprint, we explore this key question through a large-scale study of both open-source models (Llama, Phi, Gemma, Mistral) and leading closed models, including the recent OpenAI GPT-4o and o1 series. arxiv.org/pdf/2410.05229 Work done with @i_mirzadeh, @KeivanAlizadeh2, Hooman Shahrokhi, Samy Bengio, @OncelTuzel. #LLM #Reasoning #Mathematics #AGI #Research #Apple
[image]
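The core perturbation in the paper is easy to sketch: insert a numerically loaded but irrelevant clause into a word problem and see whether the model's answer moves. The text below is paraphrased for illustration, not copied from the benchmark:

```python
# Sketch: a "no-op" perturbation in the spirit of the paper's GSM-NoOp set.
base = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
        "How many kiwis does Oliver have?")
noop = "Five of the kiwis were a bit smaller than average. "

# Insert the irrelevant clause just before the question.
perturbed = base.replace("How many", noop + "How many")
print(perturbed)
# The correct answer is unchanged: 44 + 58 = 102. The paper's finding is that
# models often subtract the irrelevant 5, suggesting pattern matching.
```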
pseudotensor (@pseudotensor):
@MFarajtabar Adding irrelevant items and seeing performance drop isn't new. I remember the AI Explained channel talking about this a year or more ago; it's also relevant to his Simple Bench.
Shishir Patil (@shishirpatil_):
Introducing the Agent Arena by 🦍 Gorilla X LMSYS Chatbot Arena 🎯

How do different agents stack up in tasks like search, finance, RAG, and beyond? Which model is the most effective for agentic tasks? What tools do users prefer? Explore these questions and more!

✏️ Blog: gorilla.cs.berkeley.edu/blogs/14_agent…
🏟️ Arena: agent-arena.com
📊 Leaderboard: agent-arena.com/leaderboard
⚱️ 2k pair-wise battles dataset: github.com/ShishirPatil/g…

❓ What model, framework, or tools do users prefer? Which agents excel at financial and numerical analysis? What are the best agents for finding the needle in the haystack in massive corpuses of data? Which agents are best integrated with online platforms (Gmail, Yelp, etc.)?

Agents = LLMs + Tools + Frameworks. With Agent Arena, you can compare combinations of large language models, tools (like code interpreters and APIs), and frameworks (including LangChain, LlamaIndex, CrewAI) to find the best agentic mix for your needs.

With a novel ranking system, we evaluate agents based on their performance in real-time head-to-head tasks, tracking the strengths of individual components and of combinations. This provides deeper insights into specific use cases and lets users see which agent performs best for their needs.

In a world of crowd-sourced evaluations, who evaluates the crowd? With Prompt-Hub, users can publish, upvote, and explore prompts used for agent evaluations, creating a collaborative space for the community.

From the Agent Arena team of @NithikYekollu, @arth_bohra, Kai Wen, Sai Kolasani, @infwinston, @ml_angelopoulos, @profjoeyg, Ion Stoica, @shishirpatil_

Come see which agents and models rise to the top! 🚀
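The tweet says rankings come from pairwise battles under a novel system; as a generic point of comparison (not the arena's actual method), a minimal Elo-style update over such battles looks like this:

```python
# Generic Elo-style update from one head-to-head agent battle.
# Illustration only; Agent Arena describes its own, different ranking system.
def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))  # expected score for A
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Two agents start at 1000; agent A wins one battle.
print(elo_update(1000.0, 1000.0, True))  # -> (1016.0, 984.0)
```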
Matthew Berman (@MatthewBerman):
The second question blows my mind. It didn't count spaces or non-letter characters. 🍓
[image]
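For reference, the counting rule the model missed is a one-liner; a sketch of counting only letters, ignoring spaces and punctuation:

```python
# Count only alphabetic characters, skipping spaces and punctuation.
def letter_count(s: str) -> int:
    return sum(c.isalpha() for c in s)

print(letter_count("How many letters are in this sentence?"))  # 31
```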
Matt Shumer (@mattshumer_):
I'm excited to announce Reflection 70B, the world’s top open-source model. Trained using Reflection-Tuning, a technique developed to enable LLMs to fix their own mistakes. 405B coming next week - we expect it to be the best model in the world. Built w/ @GlaiveAI. Read on ⬇️:
[image]
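Reflection-Tuning as described is a training technique; what can be sketched is a rough inference-time analogue, a draft / self-critique / revise loop. This is a generic pattern, not the tweet's method, and the model name is a placeholder:

```python
# Generic draft -> critique -> revise loop; an inference-time analogue only,
# not the Reflection-Tuning training technique the tweet announces.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    r = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content

draft = ask("How many r's are in 'strawberry'? Think step by step.")
critique = ask(f"Find any mistakes in this answer:\n{draft}")
final = ask(f"Revise the answer below using the critique.\n\n"
            f"Answer:\n{draft}\n\nCritique:\n{critique}")
print(final)
```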
Clémentine Fourrier 🍊 is off till Dec 2026 hiking:
Did you know that, if you evaluate the same model, with the same prompt formatting & the same fixed few-shot examples, only changing ♻️the order in which the few shot examples are added to the prompt ♻️ you get a difference of up to 3 points in evaluation score?
[image]
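The experiment is cheap to reproduce: hold the model, prompt format, and examples fixed and permute only the example order. A sketch, where the scoring call is left as a hypothetical harness hook:

```python
# Sketch: generate prompts that differ only in few-shot example order.
from itertools import permutations

few_shot = [
    "Q: 2+2?\nA: 4",
    "Q: 3*3?\nA: 9",
    "Q: 10-7?\nA: 3",
]

def build_prompt(examples, question):
    # Same examples and format every time; only their order changes.
    return "\n\n".join(examples) + "\n\nQ: " + question + "\nA:"

for order in permutations(few_shot):
    prompt = build_prompt(order, "7+5?")
    # score = run_benchmark(prompt)  # hypothetical evaluation harness
    print(prompt, end="\n---\n")
# The tweet's finding: scores across such orderings can differ by ~3 points.
```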
pseudotensor (@pseudotensor):
Massive thanks to @ykilcher and the Open Assistant team for open-sourcing their data. We released fully Apache-2.0-licensed models and projects, some using their amazing data, including fully open 20B models. See: github.com/h2oai/h2ogpt.
pseudotensor (@pseudotensor):
@epic4kids @andreakhaid That's not what was asked. They asked how to prevent a child from accessing it at all, like the "learning videos" option, but also for "read to me", since it prevents learning to read. I canceled my subscription over this. Same on Amazon Kids, where you can't even turn off videos.
Epic for Kids (@epic4kids):
@andreakhaid Hi there! You can turn off the Read-to-Me feature by pressing the green pause button at the lower left-hand corner of the book!
[image]
Epic for Kids (@epic4kids):
You asked...We listened! Read-to-Me books now offer Follow-Along Word Highlighting to help improve reading skills >> bit.ly/read-to-me-boo…
[GIF]
pseudotensor (@pseudotensor):
Sleepy on playground
[image]