ellamind

47 posts


@ellamindAI

Building elluminate. AI Evaluations, simplified. Also: Data Sovereignty, Privacy, Performance & some research. Come & work with us.

Bremen, Germany · Joined January 2024
12 Following · 167 Followers
Pinned Tweet
ellamind @ellamindAI
AI evaluations are broken. Generic benchmarks tell you nothing. Manual QA doesn't scale. And existing tools are either too academic or simplify in the wrong places. That's why we built elluminate - evals that actually work for real product teams.
Replies 1 · Reposts 5 · Likes 11 · Views 4.1K
ellamind retweeted
Max Idahl @maxidahl
@fujikanaeda In case you are interested in speedrunning a German version in collab with @ellamindAI, hit me up. We can take care of the locale work and also have some B200 compute to spare.
Replies 0 · Reposts 1 · Likes 1 · Views 32
ellamind retweeted
OpenEuroLLM @OpenEuroLLM
Experimenting with model-based annotation for better data selection? A candidate to consider is propella-1, a multi-property annotator partially funded by #OpenEuroLLM which is fully open-source. 🔓Code, annotations and paper available! arxiv.org/pdf/2602.12414
ellamind @ellamindAI

We released propella-1, a small model for advanced pre-training data annotation 🙃. Work led by @maxidahl within the @OpenEuroLLM project. Link to model + annotations for important pre-training datasets below 👇

Replies 0 · Reposts 1 · Likes 4 · Views 160
ellamind retweeted
Jan P. Harries @jphme
M 2.5 by @MiniMaxAI_ is currently the most popular open-weights model on @OpenRouter, but it is also heavily censored. Inspecting the CoTs reveals deliberate lying, which can also be problematic in other areas, as @AnthropicAI's research has shown. Some examples attached 👇
[3 images attached]
Replies 2 · Reposts 1 · Likes 1 · Views 106
ellamind @ellamindAI
Our @TheBitFlipper built an in-house benchmark for coding agents, based on real PRs from our codebase. As expected from our vibes (and other benchmarks), Opus takes the crown 🥇 - GPT-5.2 results still outstanding though 👀
Damian Barabonkov @iamdamianb

Public benchmarks are easy to game. I built swellubench to validate real features and bug fixes from a production platform at @ellamindAI. It evaluates models on private, real-world coding tasks to measure true performance and cut through benchmark maxing noise. Methodology in 🧵

Replies 1 · Reposts 0 · Likes 1 · Views 85
ellamind @ellamindAI
Machine-translated data beats native-language data? 🤔 As part of @OpenEuroLLM, we produced >5 trillion tokens of multilingual pretraining data for low-resource languages at >3M tps on LEONARDO (CINECA). Findings presented at @BSC_CNS. Work led by @maxidahl; release coming soon 🙂.
[3 images attached]
Barcelona, Spain 🇪🇸
Replies 0 · Reposts 2 · Likes 4 · Views 206
Wolfram Ravenwolf @WolframRvnwlf
The wolf steps off the boat after six intense months at @ellamindAI, where we took our eval platform elluminate from pilot to GA. Thanks to the entire team for the great teamwork and community. I'm leaving the company, but we're parting on the best of terms and will stay in touch. What's next for me? Stay tuned! 🐺✨
Replies 2 · Reposts 0 · Likes 10 · Views 552
ellamind retweeted
Jan P. Harries @jphme
This is just a small vibe check (more is currently not possible due to rate limits) - but in the German geo eval I built on stage yesterday evening, @Alibaba_Qwen 3-Max doesn't look competitive with other top models and also falls far behind e.g. R1 or GLM 4.5. 😕 @ellamindAI
[image attached]
Replies 1 · Reposts 2 · Likes 6 · Views 1.9K
ellamind @ellamindAI
Building AI products? You need real evaluations. Let's talk. elluminate.de
Replies 1 · Reposts 1 · Likes 4 · Views 136
ellamind @ellamindAI
The result? Teams ship faster with confidence. Product managers can actually trust their metrics. And developers spend time building, not firefighting. Whether you're a developer tired of vibe-checking, a PM who needs reliable metrics, or a domain expert who knows what "good" looks like, elluminate speaks your language.
Replies 1 · Reposts 1 · Likes 2 · Views 149
ellamind @ellamindAI
Our co-founder's project #LeoLM was highlighted by @bmftr_bund. Today, we're continuing what started as a student's side project with @OpenEuroLLM (and more to come). If you want to work on open-source AI, multilingual applications, and AI evaluations as well - we're hiring! 🙂
Björn Plüster @bjoern_pl

Nearly two years after release my project LeoLM is being used as a strong justification for the expansion of federal compute funding in Germany. Goes to show how much impact open-source projects can have. Hell yeah @bmftr_bund - thanks for making projects like this possible! 🚀

Replies 1 · Reposts 2 · Likes 2 · Views 536
ellamind retweeted
Jan P. Harries @jphme
GPT-5 is worse than GPT-4o 😳 ... at least for some writing tasks in German (and probably in other languages too...) 👇
[image attached]
Replies 5 · Reposts 10 · Likes 48 · Views 6.9K