Humyn Labs

196 posts

Humyn Labs banner
Humyn Labs

@humynlabs

Verified (human) intelligence at scale for reliable AI. Diverse experts. Multi-layer QC. Every modality.

Joined September 2025
24 Following · 1.7K Followers
Pinned Tweet
Humyn Labs
Humyn Labs@humynlabs·
Hey humyns! I am @MaiDiab_ , Head of Data at @HumynLabs, and I'm taking over the account today. A bit about me:
> PhD in Computer Vision
> Originally from Egypt, I currently live in the UK
> My obsession with data started when I was building a 40k-image dataset from scratch, annotating it rule by rule and becoming my own QC
> I bring that same obsession to our work here at Humyn Labs
Currently, I am building multimodal data across audio, video and language, spanning 20+ countries, for use cases ranging from voice LLMs and Physical AI to simulation data. Throughout the day I'll be sharing what we're building here at Humyn Labs, breaking down some research and answering your questions. Ask me anything over the next few hours and I'll answer! 👇 Super excited. Let's do this.
Humyn Labs tweet media
English
27
9
35
970
Humyn Labs
Humyn Labs@humynlabs·
We're building in 14 countries. Here's why the geography matters. India, Brazil, Nigeria, Indonesia, Vietnam, Philippines, Egypt, Saudi Arabia, UAE, Peru, Argentina, South Korea, Nepal, Sri Lanka.
For physical AI and audio models, we know the hardest problem is data. Specifically: legally usable, linguistically diverse, environmentally representative data at scale. You can't scrape it. You can't crowdsource it casually. Consent frameworks differ by jurisdiction:
> EU AI Act: audio and biometric data fall in the high-risk category.
> Brazil LGPD: rigorous by global standards.
> India DPDP: evolving fast; diversity unmatched.
> Gulf and SEA: lighter, but moving quickly.
> Sub-Saharan Africa and LATAM: the most underrepresented data on earth.
Every geography has a different answer to: can we use this for training, here, and defend it? We're building data infrastructure that answers yes, jurisdiction by jurisdiction.
Take consent, for example. We record it by voice, in the person's own language, before any audio is collected, because voice consent is harder to tamper with than any paper trail. This is just one instance; every market has its own version. Models trained on data collected this way don't just pass a legal audit; they actually perform in the geography they're meant to serve. Want to talk to us? DMs open.
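The jurisdiction-by-jurisdiction gate described above can be sketched as a simple pre-collection check. Everything here (the rule table, the field names, `canCollect`) is an illustrative assumption, not Humyn Labs' actual system:

```go
package main

import (
	"errors"
	"fmt"
)

// ConsentRecord is one voice-consent event captured before collection.
type ConsentRecord struct {
	Jurisdiction string
	Language     string
	AudioRef     string // pointer to the recorded voice consent
}

// rules maps jurisdiction to whether recorded voice consent is required
// before any audio collection. Values are illustrative, not legal advice.
var rules = map[string]bool{
	"EU": true,
	"BR": true, // LGPD
	"IN": true, // DPDP
	"SA": true,
}

// canCollect answers "can we use this for training, here, and defend
// it?" for a single jurisdiction before any audio is captured.
func canCollect(c ConsentRecord) error {
	required, known := rules[c.Jurisdiction]
	if !known {
		return errors.New("no consent framework mapped for " + c.Jurisdiction)
	}
	if required && c.AudioRef == "" {
		return errors.New("voice consent recording missing")
	}
	return nil
}

func main() {
	fmt.Println(canCollect(ConsentRecord{"BR", "pt-BR", ""}))
	fmt.Println(canCollect(ConsentRecord{"BR", "pt-BR", "consent/br/0001.wav"}))
}
```

The point of modeling it this way: collection is blocked by default until the consent artifact exists, so compliance is enforced at capture time rather than reconstructed at audit time.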
Humyn Labs tweet media
English
1
3
9
150
Humyn Labs
Humyn Labs@humynlabs·
Just finished a call with my wonderful team! We're finalising our annotation validation tool for production, and we just hit something interesting.
At Humyn Labs we build training data across languages from Egyptian Arabic to Southeast Asian dialects, and today's call was about something that surfaced while finalising the tool: we're building spell-checks for 20+ languages. Sounds simple? It's not.
Standard libraries like SymSpell and Hunspell completely fall apart on native scripts. They tokenise Indic words by syllable and affix rules instead of treating them as whole words. So a word like किताबब gets split into किताब + ब, and the spell-check corrects the fragment, not the actual mistake.
Our solution: a custom dictionary validator with fuzzy matching built on Go's rune type (which handles multi-byte characters correctly), loaded into an in-memory tree at template initialisation so it isn't recalculated on every annotation. Scalable, accurate, no external model dependency.
This is what building for real linguistic diversity actually looks like. Anyway, getting back to answering all your questions now 👩‍💻
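A minimal sketch of the whole-word, rune-based fuzzy matching idea described above. The function names and tiny dictionary are illustrative, not Humyn Labs' actual validator; it only shows the technique of comparing runes, never bytes or affixes:

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein computes edit distance over runes, not bytes, so
// multi-byte scripts (Devanagari, Arabic, Thai) are compared character
// by character instead of byte by byte.
func levenshtein(a, b []rune) int {
	prev := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		curr := make([]int, len(b)+1)
		curr[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev = curr
	}
	return prev[len(b)]
}

// suggest returns the dictionary word closest to w within maxDist,
// treating the whole word as the matching unit: no syllable or affix
// splitting, so the actual mistake gets corrected, not a fragment.
func suggest(w string, dict []string, maxDist int) (string, bool) {
	best, bestDist := "", maxDist+1
	for _, d := range dict {
		if dist := levenshtein([]rune(w), []rune(d)); dist < bestDist {
			best, bestDist = d, dist
		}
	}
	return best, bestDist <= maxDist
}

func main() {
	dict := []string{"किताब", "कितना", "कहानी"}
	fix, ok := suggest("किताबब", dict, 2) // doubled final consonant
	fmt.Println(fix, ok)                  // → किताब true
}
```

A production version would back `suggest` with a prefix tree loaded once at startup instead of a linear scan, which is the "tree-loaded into memory at template initialisation" part.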
Humyn Labs tweet media
English
1
2
10
168
Humyn Labs
Humyn Labs@humynlabs·
That's a wrap on today's takeover, everyone! Thanks for tuning in and sending some really fun and interesting questions. I had a great time answering them. It's always lovely to chat with folks that understand my work and are curious about it. Before I go… a quick update on something groundbreaking my team and I are working on. We ran the first benchmark evaluating commercial ASR APIs on naturalistic, dual-speaker conversations. Fourteen models. Seven languages. All real conversations. The results? The “supposed leader” in Indic ASR didn't come out on top. We’ll be dropping the benchmark very soon. Watch this space!
Humyn Labs tweet media
English
3
2
18
181
Humyn Labs
Humyn Labs@humynlabs·
Interesting question! Think of it like this: the base model is a jazz musician who's heard every genre ever recorded. It improvises, takes risks, sometimes plays something weird, sometimes plays something brilliant. RLHF is a focus group telling that musician which solos the audience clapped for. Problem is, audiences clap for what’s familiar. So the musician slowly stops improvising and starts playing the safe licks that always land. Still technically good. But the magic's gone. The reward model can't tell the difference between creative and wrong, so it kills both. The fix isn't less training. It's better feedback. If your annotators can't reward a good risk, the model learns to never take one. Hope this helps!
English
0
0
0
13
Moklasur Rahman
Moklasur Rahman@iammoklasur·
@humynlabs @MaiDiab_ Why does a model sometimes feel smarter and more creative during early-stage training but become lobotomized and boring once you apply safety tuning and RLHF?
English
2
0
1
82
Humyn Labs
Humyn Labs@humynlabs·
Interesting. The thing is, nobody has it because intent doesn't live in one modality. In voice, it's the pause before an escalation. In egocentric video, it's the hesitation before a grasp. In text, it's the edit someone deletes before hitting send. You need the layers together (audio, vision, context, timing) to reconstruct what someone was trying to do vs what they did. The gap between those two is intent. Most datasets only capture the second one. :)
English
0
0
0
30
Humyn Labs
Humyn Labs@humynlabs·
Uff! @Mysterio077 I could write pages answering this, but I'll try not to. We don't measure "good data" with one number. Let's take one modality, audio, and walk through the pipeline.
Every audio file goes through a queue of validators before a human ever touches it. We've built so many at this point that the team started naming them after the engineers who wrote them. Yogesh's validator checks duplicates, audio integrity (the speaker 1 and speaker 2 tracks must combine into the mixed audio), WAV headers, spectral ceilings, and encoding artifacts. Another runs NFKC-normalized WER against a reference transcription. Another validates user metadata. Another validates the transcription schema. We literally had a meeting about one of the validators an hour ago.
After that, every utterance gets human annotation across speaker diarization, transcription, and segment boundaries. Then a separate QC reviewer either accepts or rejects with a documented reason, and a percentage of accepted items goes on to super QC. No shortcuts, no auto-approve.
The metric I actually care about most? Finding out where the model breaks.
> Proper noun error rate in isolation.
> Accuracy drop on overlapping speech.
> Code-switch precision per language pair.
Overall WER hides all of that.
What's more interesting are the in-house tools we've built. We built our own annotation and QC system: an orchestrator layer handles task lifecycle and lease-based assignment, and a tool layer handles the actual annotation UX with transliteration, spell-check and a diarization magnet for segment snapping. Custom, not off-the-shelf. And we keep experimenting and building to make our processes state-of-the-art.
Sorry for the long-ish response, but I hope this answers your question!
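The validator queue described above can be sketched roughly like this. All names (`Sample`, `Validator`, the thresholds) are assumptions for illustration, and the WER stage skips NFKC normalization for brevity:

```go
package main

import (
	"fmt"
	"strings"
)

// Sample is one audio item moving through the pre-annotation queue.
type Sample struct {
	ID         string
	Transcript string // hypothesis transcription
	Reference  string // reference transcription
}

// Validator is one stage in the queue: reject with a reason, or pass.
type Validator func(Sample) error

// editDistance is word-level Levenshtein distance.
func editDistance(a, b []string) int {
	prev := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		curr := make([]int, len(b)+1)
		curr[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			curr[j] = prev[j] + 1
			if curr[j-1]+1 < curr[j] {
				curr[j] = curr[j-1] + 1
			}
			if prev[j-1]+cost < curr[j] {
				curr[j] = prev[j-1] + cost
			}
		}
		prev = curr
	}
	return prev[len(b)]
}

// wer is word error rate against the reference. (A production version
// would NFKC-normalize both strings first.)
func wer(hyp, ref string) float64 {
	h, r := strings.Fields(hyp), strings.Fields(ref)
	if len(r) == 0 {
		return 0
	}
	return float64(editDistance(h, r)) / float64(len(r))
}

// run pushes a sample through the queue, stopping at the first
// rejection so each failure carries a documented reason.
func run(s Sample, queue []Validator) error {
	for _, v := range queue {
		if err := v(s); err != nil {
			return fmt.Errorf("%s: %w", s.ID, err)
		}
	}
	return nil
}

func main() {
	queue := []Validator{
		func(s Sample) error { // schema gate: transcript must be present
			if strings.TrimSpace(s.Transcript) == "" {
				return fmt.Errorf("empty transcript")
			}
			return nil
		},
		func(s Sample) error { // WER gate against the reference
			if w := wer(s.Transcript, s.Reference); w > 0.5 {
				return fmt.Errorf("WER %.2f above threshold", w)
			}
			return nil
		},
	}
	fmt.Println(run(Sample{"a1", "hello world", "hello word"}, queue))
}
```

The chain-of-`Validator` shape is what lets new checks (duplicates, WAV headers, metadata) be appended without touching the rest of the pipeline.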
Sanchit@Mysterio077

@humynlabs @MaiDiab_ how do you measure ‘good data’ internally any specific metrics or frameworks? and What tools does your team use for large scale data annotation and QA

English
0
0
2
91
Humyn Labs
Humyn Labs@humynlabs·
Honest answer from someone who benchmarks ASR models all day: we’re not in a Transformer prison, we’re in a Transformer comfort zone. Mamba is the one I’d watch. In speech, SSM-based decoders are starting to match Transformer decoders on WER in some settings, while scaling linearly with sequence length. Work like Samba-ASR suggests this direction is promising. But the real move is hybrids. Jamba keeps the in-context learning that pure SSMs still struggle with, while gaining efficiency on long sequences. The architecture that wins will be the one that handles real-world audio: overlap, code-switching, noise, multilingual drift. That’s still unsolved regardless of what’s under the hood.
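For intuition on why SSMs scale linearly with sequence length: a state-space layer is, at its core, a recurrence consumed in one pass. Below is a toy fixed-coefficient scan (real Mamba uses input-dependent, selective parameters; this shows only the O(T) skeleton versus attention's O(T²)):

```go
package main

import "fmt"

// ssmScan runs the diagonal state-space recurrence
//   h[t] = a*h[t-1] + b*x[t],   y[t] = c*h[t]
// in a single O(T) pass over the sequence. Attention over the same
// sequence would compare every pair of positions: O(T^2).
func ssmScan(x []float64, a, b, c float64) []float64 {
	y := make([]float64, len(x))
	h := 0.0
	for t, xt := range x {
		h = a*h + b*xt
		y[t] = c * h
	}
	return y
}

func main() {
	// An impulse input with a=0.5 yields an exponentially decaying
	// memory of the past: the state halves at each step.
	fmt.Println(ssmScan([]float64{1, 0, 0, 0}, 0.5, 1, 1))
	// → [1 0.5 0.25 0.125]
}
```

For long-form audio (hour-scale recordings), that linear pass is exactly why SSM decoders are attractive compared to quadratic attention.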
English
0
0
0
31
Ferdows Ayon
Ferdows Ayon@FerdowsAyon·
@humynlabs @MaiDiab_ Are we all just trapped in a Transformer Prison, and what is the one non-attention-based architecture (like SSMs or Mamba) that actually has a shot at dethroning it?
English
1
0
1
59
Humyn Labs
Humyn Labs@humynlabs·
Love this question @DevBafna08! Currently, long-tail environments and low-resource languages are still massively underrepresented: edge geographies, non-western infrastructure, informal economies, mixed-modality interactions like voice + gesture in noisy settings. Most datasets collapse this variability into "noise" instead of treating it as signal.
Diversity is NOT a checkbox. It is a sampling strategy. You need to design for it at collection time: stratified sampling across geography, socio-economic context, device conditions, and behavioral patterns, then continuous auditing for distribution drift.
Also important: annotator diversity. If your labeling layer is homogeneous, your dataset will be too, regardless of how diverse the raw data looks.
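The "design for it at collection time" point can be made concrete with a toy stratified pass: bucket by stratum, enforce a per-stratum quota, and surface underfilled strata as collection gaps. The strata keys and quota logic are illustrative assumptions:

```go
package main

import "fmt"

// Item is one raw sample tagged with the strata the reply lists
// (geography and device condition, as a minimal example).
type Item struct {
	ID      int
	Country string
	Device  string
}

// stratify buckets items by (country, device) so quotas are enforced
// per stratum instead of over the pooled data.
func stratify(items []Item) map[string][]Item {
	buckets := map[string][]Item{}
	for _, it := range items {
		key := it.Country + "/" + it.Device
		buckets[key] = append(buckets[key], it)
	}
	return buckets
}

// quota caps each stratum at n items and reports underfilled strata,
// so majority strata can't drown out the long tail silently.
func quota(buckets map[string][]Item, n int) (kept []Item, underfilled []string) {
	for key, b := range buckets {
		if len(b) < n {
			underfilled = append(underfilled, key)
		}
		if len(b) > n {
			b = b[:n]
		}
		kept = append(kept, b...)
	}
	return kept, underfilled
}

func main() {
	items := []Item{
		{1, "NG", "low-end"}, {2, "NG", "low-end"}, {3, "NG", "low-end"},
		{4, "BR", "high-end"},
	}
	kept, gaps := quota(stratify(items), 2)
	// gaps lists strata that need more collection, not more trimming.
	fmt.Println(len(kept), gaps)
}
```

Auditing distribution drift then reduces to re-running `stratify` on fresh data and comparing bucket sizes over time.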
GIF
Dev Bafna@DevBafna08

@humynlabs @MaiDiab_ What do you think is underrepresented when it comes to data? How can we ensure diversity, you think?

English
1
0
5
122
Humyn Labs
Humyn Labs@humynlabs·
depends on what you mean by "synthetic." if a model keeps training on its own outputs with nothing grounding it, it collapses. the tails of the distribution go first, then the modes start entangling. Shumailov et al. showed this pretty cleanly in Nature last year. that's the incestuous loop people worry about, and it's real. but synthetic data with a verifier behind it is a different thing: then it's code that compiles, math that checks, and physics sims with ground truth. that loop has an anchor, and it scales fine. it's where a lot of the recent reasoning gains are actually coming from. tbh the 2028 number gets misread all the time. it's about cheap public text running out, not data overall. hope this answers your question!
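The verifier-anchored loop sketched above fits in a few lines: generate, check against ground truth, train only on what survives. The types and the toy verifier below are stand-ins (simple arithmetic instead of a compiler or proof checker):

```go
package main

import "fmt"

// Synthetic is one model-generated training sample.
type Synthetic struct {
	Prompt string
	Answer int
}

// Verifier reports whether a sample checks out against ground truth.
// In practice this role is played by compilers, test suites, proof
// checkers, or physics simulators.
type Verifier func(Synthetic) bool

// filterVerified keeps only samples the verifier accepts; this
// external anchor is what separates a grounded synthetic-data loop
// from the self-consuming one that collapses.
func filterVerified(samples []Synthetic, ok Verifier) []Synthetic {
	var kept []Synthetic
	for _, s := range samples {
		if ok(s) {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	// Toy verifier: re-derive the arithmetic answer independently of
	// the model that generated it.
	truth := map[string]int{"2+2": 4, "7*6": 42}
	verify := func(s Synthetic) bool { return truth[s.Prompt] == s.Answer }

	gen := []Synthetic{{"2+2", 4}, {"7*6", 41}} // second one is a model error
	fmt.Println(len(filterVerified(gen, verify))) // → 1
}
```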
English
0
0
0
3
Peter Parker
Peter Parker@petarpar·
@humynlabs @MaiDiab_ If we actually run out of high-quality human-generated text by 2028, is synthetic data a legitimate bridge or just an incestuous loop that will eventually lead to model collapse?
English
1
0
1
13
Humyn Labs
Humyn Labs@humynlabs·
when your boss uses X instead of Slack :P @manishdiesel jokes apart, to answer your question with some context: EgoScale shows scaling egocentric data improves dexterity, but mostly within-distribution. datasets like Ego4D still skew toward structured daily-life settings. and when we project rich human signals (hand pose, gaze, dense narration) into action-centric formats like RLDS, we lose the very information needed to generalize.
Manish | HumynLabs; KGeN@manishdiesel

@humynlabs @MaiDiab_ What is the core problem you see in robotic training using egocentric RLDS?

English
1
1
9
191
Humyn Labs
Humyn Labs@humynlabs·
@ABHISHEK5654519 It's very hard to annoy me, but one thing that does is seeing people treat data as an afterthought. I might be biased because it's my work, but data really matters for model output. Model architecture gets 90% of the attention, data gets 10%, and then everyone's shocked when the model fails in the real world. With model training, it is literally "as you sow, so shall you reap."
Abhishek@ABHISHEK5654519

@humynlabs @MaiDiab_ what’s your biggest pet peeve when it comes to multimodal AI

English
0
0
1
25
Humyn Labs
Humyn Labs@humynlabs·
Good Q! Reputation that lives in a platform's private database has no value outside that platform. That is why traditional systems persist: annotator lock-in is the moat.
Humyn Labs' POE is computed from real QC signals (gold accuracy, consensus agreement, guideline adherence, and rework rate) across profile, skills, domain expertise, and reliability. It is not based on self-reports.
On-chain systems enable portable reputation that the annotator owns, tamper-evident provenance linking every sample to its annotator, QC path, and timestamp, and third-party verifiability without requiring platform access. The key property is that proof is written at the time of data capture, not reconstructed later. A paper trail assembled months afterward does not hold up under regulatory audit. The EU AI Act has been binding since August 2, 2025, with penalties of up to €15M or 3 percent of global turnover for firms that cannot prove data provenance. We built this because we know that "trust us" is no longer sufficient.
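One way to picture the two properties named above, a QC-derived score and tamper-evident provenance, is the sketch below. The weights, field names, and record format are invented for illustration and are not the actual POE formula:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// QCSignals are hypothetical inputs to a Proof-of-Expert style score:
// measured QC outcomes, not self-reported skills.
type QCSignals struct {
	GoldAccuracy float64 // accuracy on gold-standard tasks
	Consensus    float64 // agreement with peer annotators
	Adherence    float64 // guideline adherence from QC reviews
	ReworkRate   float64 // fraction of work sent back
}

// score is an illustrative weighted blend of QC signals; the weights
// are assumptions, not Humyn Labs' formula.
func score(s QCSignals) float64 {
	return 0.4*s.GoldAccuracy + 0.3*s.Consensus + 0.2*s.Adherence + 0.1*(1-s.ReworkRate)
}

// chain links each provenance record to the previous one's hash.
// Altering any earlier record changes every later hash, which is the
// tamper-evidence property: proof is written at capture time.
func chain(prevHash, record string) string {
	sum := sha256.Sum256([]byte(prevHash + record))
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Printf("POE score: %.3f\n", score(QCSignals{0.95, 0.90, 0.85, 0.05}))

	h := chain("", "sample=s1 annotator=a7 qc=pass ts=2025-09-01")
	h = chain(h, "sample=s2 annotator=a7 qc=pass ts=2025-09-02")
	fmt.Println("head of chain:", h) // verifying this head verifies all records
}
```

A third party holding only the chain head can verify every linked record without access to the platform's private database, which is the portability argument in the reply.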
Oleran@Oleran_

@humynlabs @MaiDiab_ Hi, @MaiDiab_! Humyn uses blockchain for annotator reputation (Proof of Expert). Why that approach, and how does it actually solve the traceability problem that traditional platforms just ignore?

English
0
0
7
109
Humyn Labs
Humyn Labs@humynlabs·
@paul_mr63072 @MaiDiab_ so cute! i would say that i feed ai really good food so that it grows up smart. except the food is audio recordings, videos, and text in lots of different languages, and my job is making sure the food isn't rotten before it goes in. xD
GIF
English
0
0
1
43
Humyn Labs
Humyn Labs@humynlabs·
early in my career, i trained a pedestrian detection model on a public dataset and achieved >90% mAP. on deployment in North Africa, performance dropped by over 40%. the dataset was heavily biased toward western environments: it lacked mixed traffic (scooters, rickshaws), local demographics, and real-world variability. the model had learned a narrow data distribution and failed to generalize. we had to rebuild the dataset from scratch, and the real cost was the lost time. lesson learnt! now i always make sure we validate against a small, representative sample of the target distribution early, to catch failures like these.
English
0
0
0
34
diysayr
diysayr@diysayr·
@humynlabs @MaiDiab_ what’s the most “expensive” mistake you’ve made while training an AI model and what did you learn?
English
1
0
1
85
Humyn Labs
Humyn Labs@humynlabs·
This is such an interesting question! Thanks for asking. I think most people assume that multimodal = better by default, that it makes the model smarter. In reality, current multimodal systems are optimized for performance aggregation, not true cross-modal reasoning. As a result, they often exhibit modality dominance, where one modality (typically vision or text) overrides the others, especially under conflicting signals. Empirical work such as "The Curse of Multi-Modalities" and "Eyes Wide Shut?" shows that weak alignment between modalities, combined with imperfect unimodal representations (e.g., CLIP-based vision), leads to systematic hallucinations and miscalibrated outputs. Multimodality therefore does not guarantee better reasoning. Without explicit objectives for cross-modal consistency, alignment, and conflict resolution, it often increases the space of plausible but incorrect predictions, making models more confidently wrong rather than more accurate. So next time someone tells you multimodal data is >>>, send them this tweet.
MancosNFT@MancosNft2

@humynlabs @MaiDiab_ What's a common industry belief about multimodal AI that you think is completely wrong?

English
0
0
5
147
Humyn Labs
Humyn Labs@humynlabs·
Hello! Thanks for this question. I personally love studying cognitive science and linguistics! Tbh my PhD started with a simple question: how does the brain track a person in a crowded scene? when someone disappears behind a pillar, we don’t lose them. we maintain a mental representation and re-identify them when they reappear. Tracking is predictive, not reactive. Work by Renée Baillargeon shows even infants understand this. So a lot of it is built-in priors, not just learned patterns. That completely changed how I think about ML systems. Linguistics adds another layer. Real speech is messy, full of pauses, corrections and code-switching. Both made me realize the same thing: most gaps in ML systems are data problems.
sam@v1dyam4xx3r

@humynlabs @MaiDiab_ what's your non-technical book/hobby that actually makes you a better data scientist?

English
0
0
5
162
Humyn Labs
Humyn Labs@humynlabs·
I spent the last two weeks reading the most recent work on Arabic speech technology. What I found surprised me enough to write it up. 400 million Arabic speakers. 30+ dialects. Most of them underrepresented in every major dataset. The gap isn't a model problem. It's a data problem. And it's bigger than most people think. Full breakdown below. Would love to hear from anyone building in the space. Open for questions!
Humyn Labs@humynlabs

x.com/i/article/2044…

English
0
1
9
124