Javier Marin

189 posts

@jamarinval

I solve the deployment problem: getting AI from prototype → production with systematic failure measurement + compliance. The last 5% that determines success.

Madrid · Joined August 2010
546 Following · 405 Followers
Javier Marin@jamarinval·
It’s clear that current autoregressive models do not equal human intelligence. Everybody knows that. But I don’t understand why so many people act as if “well, AI is only a good pattern-matching system”. We should remember that our universe started with hydrogen + helium and ended up writing poetry.
Javier Marin@jamarinval·
Just benchmarked Claude, ChatGPT, Gemini & Grok against each other. Here's what I learned that public leaderboards don't show.

The irony of AI benchmarking is this: benchmarks have driven massive progress in the field, but they're almost useless for choosing the right model for your business.

We ran a proprietary framework across 4 leading providers, measuring:
✓ Consistency (behavioral reliability across multiple trials)
✓ Performance (quality on real-world business tasks)
✓ Coordination effects (multi-step workflow performance)

The results? There's no winner. Or rather, there are four different winners:

🔵 Claude excels at consistency (0.94 score) — if you need auditability, compliance, and predictable behavior, Claude is your model. Regulatory bodies will love it. The cost premium is worth it for decision-critical workflows.
🟡 Grok maximizes performance (88/100 score) — pure output quality. Creative problem-solving, complex analysis, "give me your best answer" tasks. Trades consistency for ceiling height.
🟢 Gemini balances both — neither specialized nor weak. Great if you have diverse workloads and want to minimize switching risk.
🔴 ChatGPT holds the middle — reliable across domains, broad ecosystem, trusted integrations.

Your vendor choice matters far less than testing on your actual data. We found that cost-quality tradeoffs, edge-case handling, and degradation under load are completely specific to each use case.

If you're deploying AI to production: spend 10-20 hours self-benchmarking. It's ~0.1% of annual AI infrastructure cost and one of the highest-ROI investments you can make.

The benchmark that matters most? The one you run yourself.

👉 For more details about the pilot test, you can DM me.
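The self-benchmarking step is cheap to start. A minimal sketch of the first metric, assuming "consistency" is approximated as pairwise word-overlap (Jaccard) agreement across repeated trials of the same prompt — the actual framework is proprietary and not shown here:

```python
from itertools import combinations

def consistency_score(responses):
    """Behavioral consistency as mean pairwise Jaccard similarity of
    word sets across repeated trials of the same prompt.
    A crude stand-in for the proprietary metric in the thread."""
    sets = [set(r.lower().split()) for r in responses]
    if len(sets) < 2:
        return 1.0  # a single trial is trivially consistent with itself
    sims = [len(a & b) / len(a | b) if (a | b) else 1.0
            for a, b in combinations(sets, 2)]
    return sum(sims) / len(sims)

# Identical answers across three trials -> perfect consistency
print(consistency_score(["approve the claim"] * 3))  # 1.0
```

Run each prompt 5-10 times per model and average the scores; stable models land near 1.0, erratic ones drift toward 0.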
Javier Marin@jamarinval·
🤖 When you build autonomous AI systems, you expect them to understand urgency the way we do. A doctor knows when a treatment window has closed. A trader knows when a market opportunity has passed. A plant manager knows when preventive maintenance is no longer an option.

I wondered:
⏰ Can LLMs develop the same sense of "it's too late" without being explicitly programmed for it?
🔗 Which temporal relations are learnable through next-token prediction on natural language, and which require explicit architectural support?

I tested this with production-ready models (small and mid-size) across scenarios where timing matters: emergency response, medical treatment, financial decisions.

Three conclusions:
🔵 Model accuracy is very prompt-dependent. Different phrasing, same meaning, and accuracy drops dramatically. These models are very "brittle".
🔵 Performance depends primarily on training data composition.
🔵 Fine-tuning with LoRA on simple deadline detection shows you can retrain pattern matching on a new distribution (but doesn't demonstrate that "the skill is learnable").

This isn't about model size or benchmark scores. A 3.8B model matched some 7B models. Others failed completely. Standard evaluations won't catch this.

📫 If you want more information about this experimental work, you can DM me.
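The prompt-brittleness finding can be reproduced in a few lines. A sketch, assuming `ask_model` is a placeholder for any LLM call — here stubbed with a toy function that only recognizes one phrasing, which is exactly the failure mode described:

```python
def prompt_sensitivity(ask_model, paraphrases, expected):
    """Fraction of semantically equivalent paraphrases the model
    answers correctly. 1.0 = robust; lower = brittle."""
    hits = [ask_model(p).strip().lower() == expected for p in paraphrases]
    return sum(hits) / len(hits)

# Toy "model" that only matches one surface form (illustrative, not a real API)
toy = lambda p: "too late" if "deadline passed" in p.lower() else "on time"

score = prompt_sensitivity(toy, [
    "The deadline passed an hour ago. Is it too late?",
    "We are past the cutoff. Is it too late?",  # same meaning, new words
], "too late")
print(score)  # 0.5 -- accuracy halves on a rephrasing
```

Swapping the stub for a real model call turns this into the experiment: hold meaning fixed, vary phrasing, and watch the score.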
Javier Marin@jamarinval·
You are totally right @zcbenz @exhaze . I arrived at the same conclusion: σ is the entropy production rate and ΔI is information processing capacity. Whenever any system — whether a brain, a computer, or even a chemical reaction — processes information, it must dissipate energy as waste heat. The more information you process, the more energy you must waste. Since attention mechanisms process information (deciding what to focus on), they’re subject to this energy tax. This creates universal pressure toward efficient architectures — whether you’re evolution designing a brain, chemistry organizing reactions, or gradient
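The "energy tax" here has a concrete thermodynamic floor. Landauer's bound (a standard result, not derived in the thread) gives the minimum heat dissipated per bit of information erased:

```latex
E_{\min} = k_B T \ln 2 \approx 2.87 \times 10^{-21}\ \mathrm{J\ per\ bit}
\quad \text{at } T = 300\ \mathrm{K}
```

Any physical process that irreversibly discards information — including an attention layer collapsing many candidates into a few — pays at least this cost per erased bit.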
Cheng@zcbenz·
With this experiment, I wonder whether attention was invented top-down from intuition, or whether it is actually inevitable when you want to design a language model with parallelization.
Cheng@zcbenz·
Let's design a minimal language model from scratch with modern features. The first things I want are:
* steps in the sequence affect each other
* ability to train a sequence in a fixed number of ops
The first design below has an obvious problem: the sequence length is fixed.
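For reference, the mechanism this design question converges on fits in a dozen lines. A minimal single-head self-attention sketch — no learned projections, no masking, just the parallel all-to-all mixing, a deliberate simplification of the real layer:

```python
import numpy as np

def self_attention(X):
    """Every position mixes information from every other position in a
    fixed number of parallel ops, and the same code handles any
    sequence length -- unlike a fixed-size mixing matrix."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)  # pairwise interactions, fully parallel
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax rows
    return weights @ X  # each row: weighted mix of all positions

X = np.random.randn(5, 8)   # sequence length 5, model dim 8
Y = self_attention(X)
print(Y.shape)  # (5, 8) -- shape preserved for any sequence length
```

Both wishlist items hold: positions interact, and the op count per sequence is fixed by shape alone, so training parallelizes.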
Javier Marin@jamarinval·
@wesroth I think intelligence is deliberate creation that opens new possibilities with every move. So entropy increases. Order can't be imposed on the universe (unless we ignore the second law).
Wes Roth@WesRoth·
Mo says entropy pushes the universe toward chaos, and intelligence pushes back by making order. Intelligence is deliberate creation, placing every “color” exactly where it belongs. The smartest systems impose order with the least energy and the least waste. Efficiency and minimal waste are the hallmarks of real intelligence.
Javier Marin@jamarinval·
@omooretweets Not sure @omooretweets if the reason is this: a more pragmatic look could show that they are bypassing their current clients (startups buying a lot of tokens every day) and going directly to the consumer. @AnthropicAI has a long journey to become consumer-centric.
Olivia Moore@omooretweets·
Between this and the ChatGPT Atlas launch, we’re seeing a big push to “own” the consumer. IMO, Claude’s implementation of models is generally more elegant/powerful - but OpenAI’s products are more consumer accessible. It’s going to be an interesting few months!
Claude@claudeai

Claude Desktop is now generally available. New on Mac: Capture screenshots, click windows to share context, and press Caps Lock to talk to Claude aloud.

Javier Marin@jamarinval·
7/ Final Thought: The model that wins public benchmarks isn't always the model that wins on your balance sheet. Measure what matters to your business, not what matters to Hugging Face. The answer will surprise you.
Javier Marin@jamarinval·
6/ What to Do: Run your own benchmark on your data. Takes 3-4 weeks. Costs $200-1000 in tooling. Typically worth 10-50x the investment. Use this as your framework, not your oracle.
Javier Marin@jamarinval·
Thread: We benchmarked @claudeai, @ChatGPTapp, @GeminiApp & @grok using a rigorous experimental framework. Public leaderboards didn't prepare us for what we found. 🧵 1/ The Premise: Benchmarks drove AI's rapid progress. MMLU, HumanEval, SuperGLUE—they work. But there's a critical gap: they measure general capability, not specific business fit.
Javier Marin@jamarinval·
💯 @reidhoffman Europeans get called "hyper-regulators" all the time — they say we write rules while others innovate. If you look at the EU AI Act through a historical lens you realize it's just common sense. We've seen this playbook before. New tech emerges → moves fast → breaks things → society pays the price → then we scramble to fix it. The AI Act isn't about stifling innovation. It's about learning from history and getting ahead of the curve this time.
Reid Hoffman@reidhoffman·
1/ I want to state plainly: in all industries, especially in AI, it’s important to back the good guys. Anthropic is one of the good guys. More thoughts about why we need to fuel innovation and talk safety at the same time:
Javier Marin@jamarinval·
💬 Very common discussion: "I need clients to get a real project to show to potential clients to get clients." 🛫 It's a cold-start problem. Every marketplace, every two-sided network, every service business faces this exact dynamic. We call it 🐣 "the chicken-and-egg problem". I think this problem is only unsolvable if you treat chickens and eggs as equally important. They're not. One creates disproportionate leverage. Most consultants think: "I need the egg to get the chicken." Wrong. You need a different kind of chicken 🐔 that doesn't require an egg.
Javier Marin@jamarinval·
@emollick 🔥 Why burn cash on ads? Synthetic consumers let you A/B test campaigns for pennies, disrupting a $600B market. First-mover advantage awaits. Who’s building this unicorn? 💡 #AdTech #Startups
Ethan Mollick@emollick·
This paper shows that you can predict actual purchase intent (90% accuracy) by asking an LLM to impersonate a customer with a demographic profile, giving it a product & having it give its impressions, which another AI rates. No fine-tuning or training & beats classic ML methods.
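The pipeline described above is two chained LLM calls. A sketch, assuming `llm` stands in for any chat-completion API — stubbed here so the flow runs without network access; `purchase_intent` and the stub are illustrative names, not from the paper:

```python
def purchase_intent(llm, persona, product):
    """Two-stage synthetic-consumer pipeline: one model call impersonates
    a customer with a demographic profile, a second call rates how
    likely that impression is to convert."""
    impression = llm(
        f"You are {persona}. Give your honest impression of: {product}")
    rating = llm(
        f"On a 1-5 scale, how likely is this shopper to buy?\n{impression}")
    return rating

# Stub standing in for a real LLM, just to exercise the call flow
stub = lambda prompt: "4" if "1-5" in prompt else "Looks useful and fairly priced."

print(purchase_intent(stub, "a 34-year-old urban cyclist", "foldable bike helmet"))
```

Swap the stub for a real API client and vary `persona` across demographic profiles to approximate the paper's setup; the 90% accuracy figure is the paper's claim, not something this sketch demonstrates.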
Javier Marin@jamarinval·
@ChrSzegedy Agree: it’s not about shiny new algorithms but about rolling up sleeves to apply AI in the real world.
Christian Szegedy@ChrSzegedy·
AI is at the stage of transportation during the steam-engine railroad era: clunky, expensive, unreliable, and inflexible. However, diesel engines, cars, airplanes, and rockets will soon emerge, with progress now 30 times faster than then.