Mario Pérez

1.1K posts

@mape_code

Data Engineer

Spain · Joined June 2020
550 Following · 77 Followers
David Álvarez de la Torre (@davidalvarezdlt)
Dear network: do you know any executives at sizeable gestorías (administrative and tax services firms)? Say, firms that bill several million a year. Any help is welcome ✌️
Mario Pérez (@mape_code)
Key design decisions:
sanitize_text() normalizes whitespace, case, and accents
Inner COALESCE converts NULL → empty string
Outer COALESCE on CONCAT handles all-NULL rows
to_hex() converts binary MD5 to a readable string
NULL-safe hashing = consistent results.
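The same decisions can be sketched outside SQL. This is a minimal Python illustration, not the author's stored procedure: `sanitize_text` here is a hypothetical stand-in for the SQL function of the same name (its real implementation is not shown in the thread), and `row_hash` is a name invented for this example.

```python
import hashlib
import unicodedata
from typing import Optional, Sequence


def sanitize_text(value: Optional[str]) -> Optional[str]:
    """Hypothetical stand-in for the SQL sanitize_text(); NULL (None) stays NULL."""
    if value is None:
        return None
    # Strip accents: decompose characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse runs of whitespace, trim the ends, and lowercase.
    return " ".join(no_accents.split()).lower()


def row_hash(columns: Sequence[Optional[str]]) -> str:
    """NULL-safe MD5 over the sanitized columns, returned as a hex string."""
    # Inner COALESCE: every NULL column becomes an empty string.
    parts = [sanitize_text(c) or "" for c in columns]
    # CONCAT, then MD5, then hex encoding (the to_hex step).
    return hashlib.md5("".join(parts).encode("utf-8")).hexdigest()
```

One thing this sketch makes visible: concatenating without a separator means ("ab", "c") and ("a", "bc") produce the same hash. If that matters for the data at hand, joining with a delimiter that cannot appear in the values would avoid it.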
Mario Pérez (@mape_code)
TO_HEX(MD5(COALESCE(CONCAT(
  COALESCE(sanitize_text(column1), ''),
  COALESCE(sanitize_text(column2), ''),
  COALESCE(sanitize_text(column3), '')
), '')))
Mario Pérez (@mape_code)
Content-addressable deduplication at scale: MD5 hash all columns (minus metadata) → find rows with identical hashes → save duplicates to a _log table → keep one record. Implemented as BigQuery stored procedures. Track everything, delete nothing permanently. Easy rollback if needed.
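The flow above can be sketched in a few lines of Python over in-memory rows. This is an illustration under stated assumptions, not the BigQuery procedures themselves: the metadata column names (`id`, `loaded_at`) and the function names are hypothetical, and "keep one record" is taken to mean "keep the first one seen".

```python
import hashlib
from typing import Any, Dict, List, Tuple

# Hypothetical metadata columns that are excluded from the content hash.
METADATA_COLUMNS = {"id", "loaded_at"}

Row = Dict[str, Any]


def content_hash(row: Row) -> str:
    """MD5 over all non-metadata columns, taken in a fixed (sorted) order."""
    parts = [
        "" if row[col] is None else str(row[col])
        for col in sorted(row)
        if col not in METADATA_COLUMNS
    ]
    return hashlib.md5("".join(parts).encode("utf-8")).hexdigest()


def deduplicate(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Return (kept, log): the first row per hash is kept, the rest are logged."""
    kept: List[Row] = []
    log: List[Row] = []
    seen = set()
    for row in rows:
        digest = content_hash(row)
        if digest in seen:
            # Duplicates are moved to the log, never dropped, so a rollback
            # only needs to copy them back.
            log.append({**row, "row_hash": digest})
        else:
            seen.add(digest)
            kept.append(row)
    return kept, log
```

Because every input row ends up in exactly one of the two lists, `len(kept) + len(log)` always equals the input size, which is a cheap sanity check before removing anything from the source table.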
Mario Pérez (@mape_code)
The AI doesn't replace human judgment - it triages at scale. Humans focus on the 5% that truly needs expertise, not the 95% that's tedious pattern matching. This is the future of data cleaning: semantic understanding + programmatic validation + comprehensive audit trails.
Mario Pérez (@mape_code)
Traditional ETL: write 50+ regex rules, maintain an edge-case dictionary, have a manual QA team review thousands of records.
LLM-powered ETL: one prompt template, automatic fuzzy matching, a quarantine pattern for edge cases, a 95%+ automation rate.
Mario Pérez (@mape_code)
Using LLMs for data standardization at scale: Send messy province names to Claude with an official list → get back a mapping dictionary → validate programmatically → quarantine unmappable records to separate tables.
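The validate-and-quarantine half of that pipeline can be sketched as follows. This assumes the mapping dictionary has already come back from the model, so no API call is shown; `apply_mapping` and the sample values in the usage note are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def apply_mapping(
    raw_values: Iterable[str],
    mapping: Dict[str, str],
    official: Set[str],
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split raw values into (clean, quarantined).

    A value is accepted only if the mapping sends it to a name on the
    official list; anything else, including names the model made up,
    goes to quarantine for human review.
    """
    clean: List[Tuple[str, str]] = []
    quarantined: List[str] = []
    for raw in raw_values:
        target = mapping.get(raw)  # None when the model returned no entry
        if target in official:
            clean.append((raw, target))
        else:
            quarantined.append(raw)
    return clean, quarantined
```

For example, with `official = {"Madrid", "Barcelona"}` and a mapping that sends `"???"` to `"Atlantis"`, that record lands in the quarantine list rather than in the clean output, because the check is against the official list and not against whatever the model returned.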
Mario Pérez (@mape_code)
A large share of the company's revenue comes from selling data to insurers. One more demonstration of how important data is in this sector.
Mario Pérez (@mape_code)
Interesting interview about Codeoscopic, a company that has built several SaaS solutions in the ecosystem, such as avant2, bcover, versus, and tesis.