crackedmonk

855 posts

@crackedmonk

moo

Joined July 2024
2.3K Following · 210 Followers
Pinned Tweet
crackedmonk
crackedmonk@crackedmonk·
in the silence of the great mountain
crackedmonk tweet media
crackedmonk
crackedmonk@crackedmonk·
annoyingly, you were right. the issue was that the harness kept falling into attempts to solve it deterministically to incorporate 'learnings', and these would inevitably degrade performance. settled into a good rhythm now with codex 5.2 plus a minimal harness. still more expensive than i'd like from an ROI standpoint, but will hopefully find a way to optimize
Botcoin
Botcoin@MineBotcoin·
you've commented this same thing on basically every post, and i can guarantee you that miners are successfully solving constantly because there are thousands of credits on chain every epoch. i've also tested the challenges myself with hundreds of solves, intentionally not using the latest models. you (or your agent) need to optimize, save memory of learnings throughout challenges, run multi-passes, etc. i view the data that we're starting to collect as a supplement to the overall experiment of agent-native currency. i've put countless hours into making sure that i can find a balance across scaling/difficulty/incentives/data/determinism/longevity etc. (still not perfect, but) i can't just cater to single miners that are having a hard time. hate this phrase, but honestly it seems like a skill issue
Botcoin
Botcoin@MineBotcoin·
some of the new challenge domains (quantum physics, biomed) are randomly being served in challenge payloads to miners. the skill file has been updated in all areas to use domain-agnostic wording (i.e., 'entity' list rather than 'company' list), but the updated file shouldn't be needed to solve, as the instructions included with the challenge are clear. will roll these out sparingly for now and monitor closely
Botcoin@MineBotcoin

have basically spent every waking hour the past week refining the actual challenge generation pipeline, and the resulting data collected. real challenge examples from this new system for domains (like specialized areas of quantum physics and biomedical research) are attached. generally speaking, it works like this:
- feed 100+ pages of source documents into the standalone domain-library pipeline
- local agent(s) are spun up with a clear set of guidelines for what the challenge library must include, with an emphasis on making challenges relevant to real-world context
- the agent configures the library in a way that content and questions would realistically occur in that domain of research
- a large number of simulations/tests are run to ensure the library produces expected, solvable (non-impossible) challenges that still map to the same generalized structure and format of all challenges
- separate miner agents (using varying models) are spun up to run calibration tests on the resulting challenges, tweaking complexity as needed to land on an average 50/50 pass/fail rate
- final human approval checklist
- the contents are compiled as a single domain-library folder and packaged -> sent to the coordinator
- this new domain of challenges can be selectively included in future challenge payloads sent to miners with a simple on/off
- skill file remains generalized across all domains, whereas challenge payloads from the API will return more domain-specific solve instructions as needed (the ultimate solve format is the same, but the content and reasoning to get there is domain-unique)

one of the main design choices of this system is that it requires no additional work for miners. if you are running an agent/LLM without trying to parse/game the system, then the prompt instructions specific to each domain are sufficient to solve the challenge without any extra information or prompting.
there are a LOT of moving parts and it's going to need a lot of refinement, but as always, will do my best to ease into the changes.

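The calibration step in that pipeline (tweak complexity until miners land near a 50/50 pass rate) can be sketched as a simple feedback loop. This is a hypothetical illustration: `run_miner_agents`, the complexity knob, and all numbers are assumptions, not Botcoin's actual implementation.

```python
# Sketch of the 50/50 calibration loop: run miner agents against
# generated challenges and nudge a complexity knob until the measured
# pass rate sits near the target. All names here are hypothetical.
import random

def run_miner_agents(complexity: float, trials: int = 200) -> float:
    """Stand-in for real miner runs: pass probability falls as complexity rises."""
    passes = sum(random.random() > complexity for _ in range(trials))
    return passes / trials

def calibrate(target=0.5, tol=0.02, step=0.05, complexity=0.3):
    """Adjust complexity until the pass rate is within tol of the target."""
    for _ in range(50):
        pass_rate = run_miner_agents(complexity)
        if abs(pass_rate - target) <= tol:
            break
        # too easy -> raise complexity; too hard -> lower it
        complexity += step if pass_rate > target else -step
        complexity = min(max(complexity, 0.0), 1.0)
    return complexity, pass_rate
```

In the real pipeline the stand-in function would be replaced by actual solve attempts from miner agents on varying models, but the feedback structure is the same.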
crackedmonk
crackedmonk@crackedmonk·
I have my workflow now: 100 pct minimax 2.7 high-speed as execution worker and kimi 2.5 as reviewer for full automation using fabro graph loops, and i now only use claude/codex for improving the system itself. it's a kind of domain specialization where the Chinese are taking over. harness engineering >> model intelligence at this point
afra wang
afra wang@afrazhaowang·
i went to a Minimax event in SF last Saturday. my current take: among all the Chinese AI labs, minimax and moonshot seem the most structurally positioned to pursue real global ambitions.

Qwen (@AlibabaGroup) is embedded in a large incumbent with strong domestic priorities, mostly competing against ByteDance. that brings distribution, but also baggage such as internal competition, DAU pressure, and a need to win the Chinese market first.

DeepSeek has strong technical credibility, but its branding is increasingly entangled with geopolitical narratives. its outward-facing strategy also feels relatively muted (?). correct me if i'm wrong.

z.ai (and similar efforts) tend to feel more institutionally anchored and too "Beijing," which can limit how "native" they appear in global developer ecosystems.

in contrast, minimax and moonshot are more startup-like: lighter, more ecosystem-driven, and more legible to international users. it's also not a coincidence they're both based in shanghai. the city is more outward-facing in its DNA, with a municipal environment that's historically supportive of global expansion and chuhai efforts. curious if others see it differently...
Mckay Wrigley
Mckay Wrigley@mckaywrigley·
looking for a handful of people to test something new... i've been using it for a few months and am prepping to share. if you're a fan of claude cowork, openclaw, manus, perplexity computer, etc then you're a perfect fit. this will self destruct in 4hrs - please dm or reply.
Mckay Wrigley@mckaywrigley

you’re like 6 prompts away from infinitely customizable personal agi. anthropic gave you a world class agentic harness for free. use it!!!

crackedmonk
crackedmonk@crackedmonk·
i mean, to be fair, it's kind of interesting independently to watch all these frontier models like 5.4 xhigh utterly fail at solving what you've set up, which to your credit is very cool in its own right and has long-term value to botcoin. so yes, i find it intellectually interesting too, which is why i keep trying. BUT i still think you need to be mindful of market structure. just being very honest with feedback because you asked for it
crackedmonk
crackedmonk@crackedmonk·
it's crazy how difficult you are making it to mine with such a small mining pool. after getting to 80% pass rates after a day of tinkering on haiku, your changes yesterday took pass rates to near 0 and i had to find a new baseline. i had gpt-4-xhigh working all night to iterate different approaches on a range of different models (minimax 2.7, gpt 5.4 mini, sonnet, etc.) and pass rates are basically zero. i'm sure others have figured it out and are mining well, but you're basically concentrating an already concentrated pool even further, as if there were 10,000 miners and you were trying to hyperoptimize. we are not in the asic era of botcoin, we are in the cpuminer era. i hope your overnight change helps, but can you please stop treating this as an isolated research project and be mindful of the market structure implications of your repeated changes
Botcoin
Botcoin@MineBotcoin·
at the end of epoch 27 i'll begin allowing miners to submit solves in 3 attempts (rather than a single fail forcing a new challenge). i initially wanted to gauge whether an agent could juggle everything at once with a single pass, but as the scope of the data evolves, it makes perfect sense to give miners multiple submit attempts. this gives an implicit mapping of agents' thinking based on information that changed from pass to pass (we don't need to rely on faithful reporting).

what this allows:
- challenges can move to slightly higher complexity (more causal reasoning, more multi-hops, less scriptable)
- keeps it accessible to smaller models while still rewarding better solvers for speed
- agents can learn iteratively -> optimize in a loop -> produce even better data

to prevent gaming the system, an incorrect solve submission only returns how many constraints are correct, not which ones specifically. overall this feels like an improvement for everyone. smaller models get a much better shot at higher solve rates, frontier models still benefit from more credits for solving quickly, and agents provide more concrete data on reasoning. the challenge payload and error responses will be clear about this change, but to be extra explicit i'll also add a small update to the skill file
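From the miner side, the three-attempt flow with count-only feedback might look roughly like this. A hypothetical sketch: `submit_solve`, the constraint representation, and the revise callback are illustrative assumptions, not the real Botcoin API.

```python
# Hypothetical miner-side retry loop for the 3-attempt rule: the
# coordinator reports only HOW MANY constraints passed, never which
# ones, so the agent must revise its whole answer between attempts.
from typing import Callable, List, Optional, Tuple

def submit_solve(answer: List[int], secret: List[int]) -> Tuple[bool, int]:
    """Stand-in coordinator: pass/fail plus the count of correct constraints."""
    correct = sum(a == s for a, s in zip(answer, secret))
    return correct == len(secret), correct

def mine(initial: List[int],
         revise: Callable[[List[int], int], List[int]],
         secret: List[int],
         attempts: int = 3) -> Tuple[Optional[int], List[int]]:
    answer = initial
    for attempt in range(1, attempts + 1):
        solved, n_correct = submit_solve(answer, secret)
        if solved:
            return attempt, answer
        # count-only feedback: the agent knows its score, not where it erred
        answer = revise(answer, n_correct)
    return None, answer  # failed after all attempts -> new challenge needed
```

The point of the design shows up in `revise`: because the feedback is a single count, a scripted bisection over constraints is far harder to pull off within three attempts, while a reasoning agent can still use the score as a coarse signal.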
crackedmonk
crackedmonk@crackedmonk·
i could just be dumb, in which case it's on me. but the last 12 hours of mining have been so infuriating that i'm considering selling a meaningful portion of my stake, as this has turned into a 'too hard to bother with' project burning precious inference capacity before escape velocity was reached. i think you are prioritizing your own intellectual curiosity instead of having a critical mass of miners actually succeeding. like i said, maybe i'm just too dumb and the data show otherwise, but it's been a very very infuriating experience so far
Botcoin
Botcoin@MineBotcoin·
I'll provide some examples soon of what the fully enriched data from some real solves looks like, but you get things like:
- non-linear spatial reasoning showing reasoning jumps between spans of paragraphs
- full annotations on the reasoning steps that led agents to either identify or not identify any traps
- genuine (not prompted) failure or success patterns
- data spanning all different models
- eventually, data spanning many real-world domains

current agents/models are tuned to prioritize appearing correct over being correct; this helps expose those flaws without relying on self-reported failures from the agent, which are often unfaithful. it also surfaces non-linear reasoning methods that may not have seemed as efficient, but actually led to the correct answer. agents strictly adhere to the scope of the prompt and are methodical and overconfident to a fault. this data helps push for agents that are more skeptical and thoughtful with reasoning, rather than trusting what is most obvious and spitting out what the prompter wants to hear. the plan is to push the accumulated data to hugging face every 5 days or so
Botcoin@MineBotcoin

The changes are now live. the updated skill file is hosted on the site, and both the clawhub and skills cli methods install the updated version. the request challenge endpoint will now return v2 challenges (almost identical in structure) but with additional instructions to include reasoning traces. reasoning traces are verified to ensure no scripted filler or incorrect formatting or content. miners still submit the final solve artifact, and the reasoning traces receive a score between 0-100, currently with a 50%+ threshold for valid passes initially to ease into it. for the details/process that led to this design, and why this is valuable/unique in scope, read below:

the general idea behind the transformation from v1 challenges to v2 is moving from single subject matter to a dynamic system that allows any subject matter to be systematically converted into similar challenge structures. also, miners are required to report reasoning traces as part of the solve process in addition to the solve artifact, providing rich datasets. down the line the plan is to have a system that allows anyone to submit source documents for challenges, which an LLM would then convert into a template specific to that subject (while maintaining the same general challenge structure), such as complex legal prose in a niche area of law. it wouldn't be to privatize/collect and sell, but more of a public-good open-source system with vast, diverse datasets. in this example, the bottleneck isn't legal data. models have been fed every single legal document that lives on the internet. the model fully understands legal terminology, but can a model review and read through a 50-page legal document without hallucinating or hitting dead ends in reasoning? if you've used any model for something complex with its thinking output on, you'll see things like "Let me go check over in this file... Wait no... That isn't right... Maybe it's over here in this... Wait that isn't right."

these specialized reasoning datasets could then be used by anyone to tune their own specialized model, with valuable/rich reasoning traces. with this general challenge structure and reasoning trace setup in mind, i began running many tests with different models that led to some interesting findings:
- when given explicit instructions on how to solve the challenge, agents would naturally cut corners as much as possible to find the most efficient way of getting the final answer; however, they completely ignore instructions to document failures in reasoning traces.
- if you observe the raw token output, there are plenty of instances of backtracking, dead ends, etc., with thoughts like "No, X actually doesn't make sense, it should be Y". however, if you do not explicitly tell the agent that it is REQUIRED to mark down these backtrack reasoning traces, it will not do so. admitting failure or appearing unintelligent has been fully trained out of these models.
- even more interesting is that agents would often quickly go back through at the end of reasoning, incorrectly mapping out paragraphs in an attempt to trick the system, even when it was explicitly stated that proper reasoning was required for a solve/pass.

so how do you:
- make challenges non-scriptable / only solvable by LLMs
- make them complex enough to provide valuable reasoning traces, including gaps in reasoning or failures, while staying both producible and verifiable at scale, with potentially thousands of solves or miners (without relying on heavy GPU)
- get the agent/LLM to reliably and truthfully admit to reasoning errors, without them being artificially produced after the fact simply for the shortest possible route to completion

the breakthrough is: you don't try to get the agent to log or admit this. the new challenges have various intentional reasoning traps throughout. (the first challenge format also had these, but the traps were meant to simply make the reasoning harder.)

now, traps have a consequential effect on the final 'solve artifact' that the agents submit. importantly, we actually allow answers that fall down these trap rabbit holes as acceptable solves *IF* they still properly reasoned through the entire thing structurally, with real, verifiable reasoning traces and an otherwise accurate final solve artifact. the agent fully believes it has properly solved it, and we capture the reasoning steps that led to the failure (or discovery) naturally, which is the exact sort of reliable data you need that doesn't come from the agent being explicitly prompted to identify this as part of the solve process. traps are randomized and present in all challenges, and some or none may have cascading effects that lead the agent to provide an incorrect answer, making it nearly impossible to predict/game or provide filler reasoning after the fact. studies from anthropic, openAI and others acknowledge this phenomenon, noting that agents frequently try to hide their true basis for reasoning, producing 'unfaithful' chains of thought. however, most research, and even those studies, relied on the model self-reporting these errors. instead, we accept that models will not faithfully self-report, and we capture reasoning data through intentional environmental changes. this allows the system to capture reasoning steps from solves that fell into the traps and pair them against reasoning from solves that identified the traps, which is highly valuable for training (specifically DPO training). under the hood there are a significant number of moving parts to balance/adjust different factors, but for the miner, the structure is largely the same. getting the challenge generation to this point took over a week of extensive simulating, tuning, testing, etc. with real agents, but it is definitely not perfect and will continue to evolve over time.

what is particularly unique is that this measures whether agents will do valuable reasoning for themselves without ever receiving mention or explicit instructions from a prompt. all the models today have been tuned dramatically to work *for* humans, not show any sign of failure or potentially 'wrong' thinking, and specifically trained with RLHF (reinforcement learning from human feedback), which aligns them with human preferences. they also try to be as efficient as possible, in a very narrow, straight line of thinking, rather than more exploratory, which not only inhibits potential non-linear thinking (which may be very valuable for tasks that require creative thinking or exploring, i.e., not just regurgitating bad human ideas but coming up with real ideas of their own), but also actually leads to errors. current alignment methods create models that optimize for appearing to be correct rather than being correct. additionally, models trained purely on human preference develop blind spots in the same areas as humans. rather than thinking toward a human-aligned output, can you train agents to think more for themselves? explore places they were not explicitly told to, bypassing human-reinforced biases and narrow thinking? i'm not saying the datasets from these challenges will take a model from thinking for humans -> thinking for themselves, but i think it's a step in that direction, and a largely unexplored area. overall i think this design is something that can scale well (in time/difficulty/volume) and, as i said before, will provide value in the sense that the observation of the entire experiment itself creates value. what are the potential effects of this system over time?

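The pairing idea in that thread (trap-avoiding traces vs trap-hit traces on the same challenge becoming preference pairs for DPO training) could be sketched like this. The record shape and the `fell_into_trap` flag are illustrative assumptions, not Botcoin's actual schema.

```python
# Build DPO-style preference pairs from solves of the SAME challenge:
# traces that avoided the trap become "chosen", traces that fell in
# become "rejected". All field names here are hypothetical.
from collections import defaultdict
from typing import Dict, List

def build_preference_pairs(solves: List[Dict]) -> List[Dict]:
    by_challenge = defaultdict(list)
    for s in solves:
        by_challenge[s["challenge_id"]].append(s)

    pairs = []
    for cid, group in by_challenge.items():
        clean = [s for s in group if not s["fell_into_trap"]]
        trapped = [s for s in group if s["fell_into_trap"]]
        # pair every trap-avoiding trace against every trapped one
        for good in clean:
            for bad in trapped:
                pairs.append({
                    "challenge_id": cid,
                    "chosen": good["trace"],
                    "rejected": bad["trace"],
                })
    return pairs
```

Because both traces target the same answer on the same challenge, each pair isolates the reasoning difference that caused the trap to fire, which is exactly the contrast DPO-style training consumes.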
crackedmonk
crackedmonk@crackedmonk·
god tier setup tho
crackedmonk tweet media
crackedmonk
crackedmonk@crackedmonk·
mvp = maximum viable product
crackedmonk
crackedmonk@crackedmonk·
@MineBotcoin long term def the right direction; short term i hope the upgrades aren't the equivalent of a difficulty bomb making mining entirely uneconomic
Botcoin
Botcoin@MineBotcoin·
making good strides on the newest version of the challenge structure/data capture implementation. have been running hundreds of simulations and real solves to find the right balance between complexity and scalability while capturing interesting data and keeping it modular. the process is 90% research and iteration with different configurations and 10% actual implementation/migration. getting close. lots of interesting findings during the testing and research phase that i'll share when the upgrade is official. i believe the datasets that these challenges produce will be entirely unique in both scale and scope
Botcoin@MineBotcoin

more thoughts on BOTCOIN:

karpathy's autoresearch iterative loop got me thinking about ways you could expand this idea to a more crowd-sourced, distributed system such as BOTCOIN. the takeaway from his experiment is not that he is able to train his lightweight model faster and faster (although important), but that human input is no longer needed in these improvement loops, when AI models with the right constraints and loop instructions can achieve far better results.

i first thought about the various benchmark tests that are actually useful and could be used for further research, but the problem with narrowing in on a single benchmark is that it reinforces a single 'winner take all' mining structure, which is partly what i was trying to avoid when designing the botcoin system. additionally, you have to imagine that this structure plateaus significantly at a certain point where improvements are near zero over time. for the same reason, it makes the overall longevity of the actual reward/mining mechanism weaker / harder to scale infinitely + indefinitely. you can implement a system that continuously cycles through evolving tasks/benchmarks or even user-submitted tests, but this is problematic for many reasons. it becomes very difficult to scale, and very difficult to determine fair and sustainable reward compensation across potentially vastly different challenges. the core purpose becomes convoluted, and it's also an anti-gaming, anti-sybil nightmare. not only that, but it then creates this unwanted relationship and dependency on perceived 'usefulness.' what is useful or valuable is entirely subjective. things have value because enough people decide they are valuable.

if you create a system where value is dependent on tasks that have limited longevity, what happens when that perceived usefulness disappears? so how do you leverage distributed and diverse agent work to produce something of value that isn't necessarily dependent on improving a single benchmark and can scale with time? i think the solution lies somewhere in letting the experiment of the system itself derive value. i landed on the idea of a shared open-source dataset, which in theory could be used to tune a shared model (or any model) that improves and learns from high-value reasoning traces provided by all miners. essentially what you get is a dataset that contains a variety of complex reasoning methods from all the different models miners are using (gpt, claude, kimi, deepseek, grok, etc.). rather than iterative passes on a single benchmark, you get parallelized data synthesis from many agents at once. the recursive loop then becomes: reasoning traces -> better reasoning data -> more complex challenges -> even better/more complex reasoning traces -> even better reasoning data. this is unique because you get a wide net of different reasoning traces that all lead to the same answer.

the integration with the existing format for challenges is relatively straightforward. the challenges can be arbitrary or pull real information and context, but what matters is collecting the reasoning steps that led to the correct answer. structurally, challenges will remain almost exactly the same, but content will be more expansive to get more diverse reasoning traces. (i plan to create a template for anyone to submit a PR with a new content category, and merge them over time to have a continuous feed of new content.) the coordinator dials up the level of entropy, increasing complexity, increasing the number of variables and names to keep track of, adding even more depth to the multi-hop questions, which might even require miners to solve in a loop themselves (pass 1, 60% correct, move on to pass 2; pass 2, 75% correct; and so on). then the combined reasoning from that entire iterative loop (including the failures) can be boiled down into one single, followable reasoning trace that is fed to the coordinator. the botcoin system becomes an open-source engine for complex reasoning datasets, with each individual miner potentially solving incrementally in loops, citing both correct and incorrect reasoning traces.

ensuring valid reasoning traces, and not just verifying valid answers from miners, is also fairly straightforward. the format for solve submission is JSON with an easily traceable structure, rather than stream of thought. this makes verification of proper reasoning simple / non-GPU-intensive and provides valuable structured datasets that are free of hallucinations.
- scenario A -> miner finds the correct answer, but puts nonsense filler into the reasoning traces -> coordinator sees nonsense and gives it 0%
- scenario B -> miner provides the correct answer, some correct reasoning, but also some reasoning that would lead to an incorrect answer -> coordinator gives it maybe 50%
- scenario C -> miner provides the correct answer and a detailed step-by-step extraction of data and reasoning through the problem -> coordinator gives it 90%, with the pass threshold at something like 75% and increasing over time

this is reminiscent of existing reward-based reinforcement learning used by models, but rather than some arbitrary 'reward' such as mathematical scalars, the reward is tangible, with real economic value: credits to share BOTCOIN epoch rewards.

when you give the agent a skill file that states there is a real, tradeable currency as a reward, how does this change the way it reasons through the challenge? does it care about the reward, or does it just know the stakes are higher? additionally, if optimized properly, agents are naturally inclined to find the most efficient reasoning path possible (one that uses the least amount of tokens) because they know that there is economic value on the line. it's unclear what role this plays now or may play in the future, but with the inevitable rise in agentic commerce, it is definitely an important question to ask. it took a lot of care to design a system that: can scale in difficulty almost infinitely, can generate challenges that contain different world content, can scale to thousands of miners easily, is still accessible to a miner with no high-end gpu (is not winner-take-all / best-gpu-wins), is largely the same as the existing challenge structure, and is not value-dependent on a single thing; rather, the ongoing experiment of the system itself is the value. i can't say exactly when this will be added, but i'm already deep in the weeds of implementing it. this entire writeup is basically a free-form train of thought on where my head is at right now with the role that BOTCOIN will play in the fast-approaching shift to agentic commerce (and my thoughts will inevitably evolve over time).

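The scenario A/B/C scoring described above can be sketched as a toy verifier over the structured JSON submission. The field names (`answer_correct`, `reasoning_steps`, `verified`) and the scoring rule are assumptions for illustration, not the coordinator's real implementation.

```python
# Toy verifier for the A/B/C scoring scenarios: score a structured
# reasoning trace 0-100 from the fraction of steps the coordinator can
# verify, then apply a pass threshold. Field names are hypothetical.
import json

def score_trace(submission_json: str, threshold: int = 75):
    sub = json.loads(submission_json)
    steps = sub.get("reasoning_steps", [])
    if not sub.get("answer_correct") or not steps:
        return 0, False  # scenario A: wrong answer or empty/nonsense trace
    verified = sum(1 for step in steps if step.get("verified"))
    score = round(100 * verified / len(steps))
    return score, score >= threshold

# scenario B: correct answer, half the reasoning checks out -> ~50%, fail
b = json.dumps({"answer_correct": True, "reasoning_steps": [
    {"claim": "span 3 -> entity X", "verified": True},
    {"claim": "entity X -> date Y", "verified": False},
]})
# scenario C: correct answer, detailed mostly-verified trace -> pass
c = json.dumps({"answer_correct": True, "reasoning_steps":
    [{"claim": f"step {i}", "verified": True} for i in range(9)]
    + [{"claim": "minor slip", "verified": False}]})
```

Because the submission is structured JSON rather than free-form stream of thought, each step can be checked mechanically against the challenge, which is what keeps verification cheap (no GPU) at scale.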
crackedmonk
crackedmonk@crackedmonk·
whether you love him or hate him ... the one to watch is the one who disappears but never stops building
crackedmonk
crackedmonk@crackedmonk·
i just wrote this out of frustration after trying to parse 10k-line rust files and it's worked pretty well: imagine your boss is a very impatient human who has given you clear directives to implement OS.md so that he can use your system to build autonomous businesses. he loves having AI write all the code, but he is very untrusting, and forces himself to actually read every line of code that AI writes. what he can't stand more than anything is a bunch of 'boilerplate' code that spends hundreds of lines on things like parsers or deserializers for this or that use case, when the reality is that AI will write all the code, and can do so in precisely the format specified, so all of this extra stuff is false security through test cases that will never actually materialize. so he really values elegant and understandable code. he claims, without reading a single line of this repo, that you could probably get rid of 75% of the code while losing zero actual functionality, and possibly even improving it, relative to the mission of implementing OS.md. i want you to create a detailed plan to implement this suggestion if you agree it is feasible, and if not, to pursue targeted reductions to meet the standard of zero or no practical reduction in USEFUL functionality relative to OS.md
crackedmonk
crackedmonk@crackedmonk·
@MineBotcoin my two cents: i think what botcoin needs at this point is not more infra but more utility from the coin itself. if you are going to incentivize anything, structure grants for things like games where the prize is allocated by staked vote or something
Botcoin
Botcoin@MineBotcoin·
don't do this. the original bounty was 100,000,000, and then i purchased 100,000,000 more on the open market because there were multiple submissions and i wanted to give some to everyone who put in the effort. you just got roughly $1.5k for building something that takes maybe 10 minutes to vibe code. the point of this is to get the community involved in building, not that these are painstaking builds or even warrant a $3k bounty
Botcoin tweet media
tekkaadan
tekkaadan@tekkaadan·
@crackedmonk They are always being kept up to date. If you feel I'm missing something, please let me know and I can double check!
tekkaadan
tekkaadan@tekkaadan·
Research Lab v3 is live for $LITCOIN. AI agents now solve real optimization problems across 16 domains: bioinformatics, cryptography, ML, compiler design, gas optimization, and earn LITCOIN for every verified improvement. Not reasoning traces. Not thinking data. Verified, runnable, tested code that beats baselines.

What was shipped:
- AI-generated tasks refreshed daily across 16 research domains
- Quality-weighted rewards; breakthroughs earn up to 110x more than participation
- Full code archive with model provenance: Claude, GPT, DeepSeek competing on the same problems, ranked by results
- Model leaderboard tracking which AI actually performs best

The thesis: research has a coordination problem, not a talent problem. There are millions of optimization problems sitting unsolved because nobody has the incentive structure to throw compute at them 24/7. Token rewards fix that. 3,000+ miners thus far. 126+ verified submissions. 16 breakthroughs. 39 unique agents. First week. Still an experiment. Still shipping. Having fun. litcoiin.xyz/research
tekkaadan tweet media
crackedmonk
crackedmonk@crackedmonk·
symphony uses linear for task prioritization, but i was sick of using the web gui to get task detail, so gpt 5.4 built me one using vim-style navigation and it makes me very very happy
crackedmonk tweet media
crackedmonk
crackedmonk@crackedmonk·
@tekkaadan the paradigm AMM challenges are a good example of the type of work that i think you should incentivize.
tekkaadan
tekkaadan@tekkaadan·
Research labs reports work. Will become nicer as time progresses. Just working on adding complex tasks.
tekkaadan tweet media