achal
@achalllll

102 posts

making contributions @chonkieai

Joined February 2024
57 Following · 25 Followers

Pinned Tweet
achal @achalllll:
just made another open source contribution! got another pr merged into @ChonkieAI: fixed an edge case in codechunker where leaf nodes could produce empty output, causing valid code to be dropped. now leaf nodes are handled as safe fallback units, so all input code is preserved.
[image]
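a minimal sketch of the fallback idea, with hypothetical names (Node and chunk_node are illustrative, not the actual ChonkieAI codechunker):

```python
# Hypothetical sketch of the leaf-node fallback described above; names and
# structure are illustrative, not the actual ChonkieAI implementation.
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    children: list["Node"] = field(default_factory=list)

def chunk_node(node: Node, max_size: int) -> list[str]:
    """Recursively chunk a syntax-tree node without ever dropping text."""
    if len(node.text) <= max_size:
        return [node.text]
    if not node.children:
        # Oversized leaf: before the fix, a path like this could yield
        # nothing and silently drop valid code. Treat the leaf as a safe
        # fallback unit and split its raw text by length instead.
        return [node.text[i:i + max_size]
                for i in range(0, len(node.text), max_size)]
    chunks: list[str] = []
    for child in node.children:
        chunks.extend(chunk_node(child, max_size))
    return chunks
```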
achal @achalllll:
> SPLADE also expands documents and queries with contextually fitting words.
> this addresses the vocabulary mismatch problem, where semantically related texts use different words, something BM25 cannot handle.
achal @achalllll:
exploring the sparse lexical and expansion model (SPLADE)
> it takes encoder representations from a transformer such as BERT as its basis and works mainly for English.
> SPLADE attempts to encode the meaning of the keywords already present in the text.
[image]
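to get a concrete feel for the expansion behaviour, here is a rough sketch of computing a SPLADE-style sparse vector with Hugging Face transformers; the checkpoint name is one of the public naver SPLADE models, and pooling details can differ between variants:

```python
# Sketch of SPLADE-style term weighting: MLM logits over the vocabulary,
# log-saturated ReLU, then max-pooling over the sequence. Non-zero weights
# can include expansion terms that never appear in the input text.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "naver/splade-cocondenser-ensembledistil"  # assumed public checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
mlm = AutoModelForMaskedLM.from_pretrained(model_id)

def splade_vector(text: str) -> dict[str, float]:
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**inputs).logits  # (1, seq_len, vocab_size)
    weights = torch.log1p(torch.relu(logits)).max(dim=1).values.squeeze(0)
    nz = weights.nonzero().squeeze(1).tolist()
    return {tok.convert_ids_to_tokens(i): weights[i].item() for i in nz}

top = sorted(splade_vector("running language models locally").items(),
             key=lambda kv: -kv[1])[:10]
print(top)  # highest-weighted terms, expansions included
```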
achal @achalllll:
> Sparse vectors in Qdrant are represented by:
indices – the indices of the non-zero dimensions (stored as uint32, so they can range from 0 to 4,294,967,295)
values – the values of those non-zero dimensions (stored as floats)
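a quick sketch with the qdrant-client Python package (assuming a recent client version; the collection and vector names here are made up):

```python
# Sketch: storing and querying a sparse vector in Qdrant. Only the
# non-zero (index, value) pairs are sent and stored.
from qdrant_client import QdrantClient, models

client = QdrantClient(":memory:")  # throwaway in-process instance

client.create_collection(
    collection_name="docs",
    vectors_config={},  # sparse-only collection, no dense vectors
    sparse_vectors_config={"text": models.SparseVectorParams()},
)

client.upsert(
    collection_name="docs",
    points=[models.PointStruct(
        id=1,
        vector={"text": models.SparseVector(indices=[17, 4096], values=[0.8, 0.3])},
    )],
)

hits = client.query_points(
    collection_name="docs",
    query=models.SparseVector(indices=[17], values=[1.0]),
    using="text",
    limit=3,
)
print(hits)
```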
achal @achalllll:
reading about sparse vectors
> unlike dense vectors, sparse vectors contain mostly zero values, and it doesn't make sense to store all those zeros, so I'm reading about how to store only the non-zero values to cut memory usage and search cost
[image]
achal @achalllll:
> we don't need to define a size or distance metric for sparse vectors, since:
- the size varies with the number of non-zero elements in the vector
- the distance metric for comparing sparse vectors is always the dot product
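both points fall out of a dict-of-nonzeros representation; a tiny plain-Python illustration:

```python
# A sparse vector as {index: value}: its "size" is just however many
# non-zero entries it has, and comparison is a dot product over the
# indices the two vectors share.
def sparse_dot(a: dict[int, float], b: dict[int, float]) -> float:
    if len(a) > len(b):          # iterate over the smaller vector
        a, b = b, a
    return sum(v * b[i] for i, v in a.items() if i in b)

query = {17: 0.8, 4096: 0.3}
doc = {17: 0.5, 901: 1.2}
print(sparse_dot(query, doc))    # 0.4 -- only index 17 overlaps
```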
achal retweeted
samyak @smykx:
@a1zhang and @raw_works's experiments have been super fun to follow. in this article i tried to cover their work on LongCoT benchmarking and also shed some light on alex's "mismanaged geniuses hypothesis". i tried to keep it easy to follow. hope everyone enjoys the read :)
Quarq @quarqlabs:

2 weeks ago, @raw_works published an announcement about hitting state-of-the-art on LongCoT: a relatively small model like Qwen3.5-9B beat GPT-5.2 on a long-horizon reasoning benchmark by over 60%, using the right scaffold. That raises a question: is true intelligence just locked behind the right scaffolding?

First, What Is LongCoT, and Why Does It Matter?

LongCoT is a benchmark for difficult reasoning problems. It is specifically designed to measure whether models can sustain coherent reasoning over extremely long horizons. The tasks span mathematics, chemistry, computer science, chess, and logic, where each individual reasoning step is usually within the capability of frontier models. The difficulty comes from maintaining correctness across a massive graph of interdependent steps that can stretch across tens to hundreds of thousands of reasoning tokens. These tasks break most models and act as a real test of complex task-solving ability.

Let's talk about what @a1zhang (MIT CSAIL) published recently. Using a refined prompting setup within the RLM harness, they pushed performance on LongCoT-mini from 38.7% to 65.6%. A nearly 2x improvement on one of the hardest compositional reasoning benchmarks out there, just from better scaffold design. Earlier results with dspy.RLM on Claude Sonnet 4.5 showed a jump from roughly 13% to 45.4% overall. Specific categories like Dungeon, Packaging, Hanoi, Sudoku, and Wizards went from near-zero to perfect scores. Chess hit 85 out of 100.

Then there's @raw_works's result: Qwen3.5-9B paired with dspy.RLM achieved 15.69% on LongCoT-Full compared to GPT-5.2's 9.83%. A 9-billion-parameter open model beating one of the most capable frontier models available, by a meaningful margin, on a hard benchmark. The 27B variant ranked highly on the mini split too, beating models many times its size.

It's Not Just LongCoT

This same pattern is showing up across benchmark categories. On LongMemEval, dspy.RLM variants are consistently hitting 87–89.8% accuracy. A model like Gemini 3 Flash paired with dspy.RLM and observational memory reached 89.8% at roughly $0.035 per query. That's approaching dedicated memory systems like Mastra (~95%) and Vectorize Hindsight (~91%), without any specialized memory architecture. On multi-hop reasoning tasks and large-context aggregation problems, where you're slicing through 10 million+ tokens and need to pull out specific signals, RLMs are outperforming both vanilla long-context models and traditional RAG setups.

The Takeaway

@a1zhang's "Mismanaged Geniuses Hypothesis" is very apt here. These frontier models already have the raw capability for hard task decomposition. The bottleneck isn't intelligence; it's task management. Standard prompting essentially hands a genius a disorganized to-do list and wonders why they underperform. RLMs fix this by giving the model a recursive execution environment: a shared REPL state, typed inputs and outputs via DSPy signatures, and structured delegation. The models we already have are more capable than our current interfaces allow them to be. RLMs, and DSPy's implementation in particular, are surfacing that latent capability at scale. It would be interesting to watch this space and see how far RLMs take us.

Sources to go deeper: @a1zhang, @raw_works, alexzhang13.github.io

pranjal.txts @khichdiNcode:
@achalllll hmm, try qwen3.5:0.8b next, you will be surprised. and then qwen3.5:4b and qwen3.5:2b
achal @achalllll:
been trying to run slms locally and see how they actually perform. tested tinyllama, here are the results:
> TTFT: 2.11s
> TPS: 13.74
> total latency: 14.67s
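a rough sketch of how numbers like these can be measured against a local Ollama server via its streaming /api/generate endpoint (counting one streamed chunk as roughly one token; the final chunk also carries eval_count/eval_duration for exact figures):

```python
# Rough TTFT/TPS measurement against a locally running Ollama server.
import json
import time
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "tinyllama", "prompt": "explain HNSW in one paragraph", "stream": True},
    stream=True,
)

start = time.perf_counter()
first_token_at = None
tokens = 0
for line in resp.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    if chunk.get("response"):
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time to first token
        tokens += 1  # one streamed chunk ~= one token, rough estimate
    if chunk.get("done"):
        break
end = time.perf_counter()

print(f"TTFT: {first_token_at - start:.2f}s")
print(f"TPS: {tokens / (end - first_token_at):.2f}")
print(f"total latency: {end - start:.2f}s")
```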
achal @achalllll:
exploring vector indexing
> HNSW (hierarchical navigable small world: NSW graphs layered like a skip list), graph-based indexing
> early stopping (getting stuck in local optima)
> how HNSW works
> how HNSW improves efficiency, scalability, and accuracy
> IVF (clustering-based indexing)
[image]
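to poke at HNSW directly, hnswlib is a small standalone implementation; a quick sketch (the parameter values are common starting points, not tuned):

```python
# Build an HNSW graph index over random vectors and run a k-NN query.
import numpy as np
import hnswlib

dim = 64
data = np.random.rand(10_000, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(data), M=16, ef_construction=200)
index.add_items(data)
index.set_ef(50)  # search-time beam width: higher = better recall, slower

labels, distances = index.knn_query(data[:5], k=3)
print(labels)  # each row's nearest neighbours (row 0 should include id 0)
```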
Nebula @NebulaAI:
Kimi K2.6 just entered the chat.
> beats gpt 5.4 and opus 4.6 on coding.
> open source, open weight.
> 3x-5x cheaper.
> really good for people running agents.
> insane at running long tasks. we're talking 12+ hrs of coding.
> can do thousands of tool calls in a session.
> really designed for agentic tasks.
> we make it easy for you to run it on our platform. in fact, you can choose whatever model you want with us.
Kimi.ai @Kimi_Moonshot:

Meet Kimi K2.6: Advancing Open-Source Coding
🔹 Open-source SOTA on HLE w/ tools (54.0), SWE-Bench Pro (58.6), SWE-bench Multilingual (76.7), BrowseComp (83.2), Toolathlon (50.0), CharXiv w/ Python (86.7), Math Vision w/ Python (93.2)

What's new:
🔹 Long-horizon coding: 4,000+ tool calls, over 12 hours of continuous execution, with generalization across languages (Rust, Go, Python) and tasks (frontend, devops, perf optimization).
🔹 Motion-rich frontend: videos in hero sections, WebGL shaders, GSAP + Framer Motion, Three.js 3D.
🔹 Agent Swarms, elevated: 300 parallel sub-agents × 4,000 steps per run (up from K2.5's 100 / 1,500). One prompt, 100+ files.
🔹 Proactive Agents: K2.6 powers OpenClaw, Hermes Agent, etc. for 24/7 autonomous ops.
🔹 Claw Groups (research preview): bring your own agents; command your friends' agents, with bots & humans in the loop.

K2.6 is now live on kimi.com in chat mode and agent mode. For production-grade coding, pair K2.6 with Kimi Code: kimi.com/code
🔗 API: platform.moonshot.ai
🔗 Tech blog: kimi.com/blog/kimi-k2-6
🔗 Weights & code: huggingface.co/moonshotai/Kim…

achal retweeted
Roohi K @roohi_kr:
The trailer for Jeston Lu's episode on @bizpodroohi. Thanks @SanketTitare1 for the efforts on the trailer. The full episode is coming in a few days, or by the end of the week.
achal @achalllll:
bro they got a whole village as authors in the phi-3 technical report 💀
[image]
achal @achalllll:
After testing mistral, I tried Llama3.2:3b on the same setup… much faster.
[image]
achal @achalllll:
tried building a CLI to check model performance locally. looks like mistral 7b (quantized) is way too heavy for my system (8GB RAM): latency went crazy high…
[image]
achal @achalllll:
running language models locally is easier than you think. I wrote a beginner-friendly guide on how to run Small Language Models (SLMs) on your laptop using Ollama, no complex setup needed. medium.com/@aachaltitare/beginners-guide-to-running-small-language-models-locally-d6270a0a962b
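the setup really is tiny; assuming Ollama is installed and a model has been pulled (ollama pull tinyllama), a chat round-trip from Python looks roughly like this:

```python
# Minimal round-trip with the official ollama Python package
# (pip install ollama); assumes the local Ollama server is running
# and tinyllama has already been pulled.
import ollama

reply = ollama.chat(
    model="tinyllama",
    messages=[{"role": "user", "content": "what is a small language model?"}],
)
print(reply["message"]["content"])
```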
achal @achalllll:
> trained on 80% synthetic textbook data + 20% phi-1 data, with 3 variants (synthetic, web-only, mixed)
> despite being 5–10x smaller, it shows strong performance on QA, knowledge, and reasoning tasks, and even competes with LLaMA-65B on math (GSM8K) and coding
> full results in the photos above
achal @achalllll:
today i read "Textbooks Are All You Need II: phi-1.5 technical report"
> the idea was the same as phi-1: "you don't need massive models, you need better data"
> the size was also the same as phi-1, 1.3B parameters
> it mainly focuses on common-sense reasoning, language skills, and multi-step reasoning
[4 images]
achal @achalllll:
another PR got merged into @ChonkieAI! ✌️
> refactored the neural chunker to improve error handling
> clean separation of input validation vs model loading
> removed generic exception handling
> added precise errors for tokenizer/model failures
> better debugging with clearer messages
[2 images]
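the shape of that refactor, sketched with hypothetical names (not Chonkie's actual code): validate input first, then load tokenizer and model in separate try blocks so each failure gets a precise error:

```python
# Illustrative sketch of the error-handling split described above.
from transformers import AutoModel, AutoTokenizer

class NeuralChunker:  # hypothetical class, not the real Chonkie one
    def __init__(self, model_name: str):
        # 1) input validation, with a precise error instead of a generic one
        if not isinstance(model_name, str) or not model_name.strip():
            raise ValueError(f"model_name must be a non-empty string, got {model_name!r}")
        # 2) tokenizer and model loaded separately, so the message points
        #    at the exact component that failed
        try:
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        except OSError as e:
            raise RuntimeError(f"failed to load tokenizer for {model_name!r}: {e}") from e
        try:
            self.model = AutoModel.from_pretrained(model_name)
        except OSError as e:
            raise RuntimeError(f"failed to load model {model_name!r}: {e}") from e
```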
achal @achalllll:
> the results were crazy: 50.6% accuracy on HumanEval and 55.5% on MBPP
> they used filtered real-world code and synthetic textbooks (generated via GPT-3.5) for training; the use of LLMs to train SLMs is really interesting
> they first pretrain on that data, then finetune on synthetic exercises
achal @achalllll:
exploring slms. started with the "Textbooks Are All You Need" (phi-1) research paper by microsoft
> this paper shows that instead of making a large model, if you train on high-quality, textbook-like data, even a small model can perform like a big one.
[image]