
Excited to announce our first project from the HuggingLegal community: the Gemini-3-Benchmarkathon!
Gemini-3 achieved the top spot on most major benchmarks last week, but how well does it know the law? Unfortunately, most model providers don't evaluate on law-specific benchmarks. So while we have a good idea of how good new models are at coding, we are pretty much in the dark about their lawyering abilities.
This is why we ran a vibe-check on six diverse datasets, ranging from Greek bar exams and Indian law questions to Swiss university law exams, among others. So what do the vibes say?
AA-Omniscience: 6/10 – high competence but unreliable
LegalBench: 9.5/10 – almost perfect answers
GreekBarBench: 9/10 – impressive long-context reasoning without hallucinations
IndianLawQA: 8.5/10 – precise, reliable answers to high-precision statutory queries (including the new BNSS 2023 codes), a domain where most models typically hallucinate
WilfulMisconduct: 8/10 – strong logic, missed binding precedent
LEXam: 7/10 – competent but overly confident
Gemini-3 is extremely strong at many legal tasks and often performs at or above the level of very good human lawyers. However, it still makes serious mistakes: it often answers confidently even when it does not know the right legal rule or fact, instead of saying “I don’t know.” Because of this overconfidence and some failures on complex reasoning and precedents, it cannot safely replace human lawyers and still needs expert oversight.
This is a great community effort. Thanks for the collaboration, Robert Scholz, @5vsas, Ernest Beta, @odychlapanis, @adhipba, Matteo Bürgler, Sophie Franco, Chu Fei Luo, @samdahan06!
Find the link to the article below.

