
As LLMs continue to improve, their ability to reason causally, especially across languages, is still inconsistent. @WeloData evaluated 20+ models using complex prompts in 6 languages & uncovered key gaps in performance.
Check out the multilingual framework + full results here ↓
English






















