Ujjwal
741 posts

@UjjwalCodes
Lead Associate at Genpact | ⚽️ Barça | Gen-AI Engineering




The `neural-txt` library is now open. It runs a 0.1B model locally and handles a variety of useful NLP extraction tasks.

Repo: github.com/avbiswas/neura…

Runs at ~300 tok/s with 0.5GB peak memory. Supports markdown output + structured outputs.

Tasks:
- bullet point extraction
- knowledge graph triplets
- list of questions extraction
- list of question-answer pairs extraction
- retrieval reranking
- rephrasing/elaboration
- continuation of text
- check the readme for more

Limitations:
- Currently it only supports academic/technical passages. It expects a passage of about 150-200 words.
- The model cannot tell when you are feeding it nonsense. If you give it trash, it will return trash.
- It is a 0.1B model designed for superfast local inference, not a generic knowledge model. Read the readme to understand the API.
- The model is WIP. I am doing this for a 4-part YouTube course on post-training. The SFT video is next week. This model will still go through DPO and RL finetuning in May.

Sub on YT for the upcoming tutorial: youtube.com/@avb_fj
All dataset and model links are attached below
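
Since the model expects passages of about 150-200 words, longer documents need to be split before they are passed in. Here is a minimal sketch of that preprocessing step. The helper name `split_into_passages` and the 200-word cap are assumptions for illustration and are not part of the `neural-txt` API; check the readme for how the library actually takes input.

```python
from typing import List


def split_into_passages(text: str, max_words: int = 200) -> List[str]:
    """Split text into consecutive passages of at most `max_words` words.

    Hypothetical helper, not part of neural-txt. It splits on whitespace
    only, so the final passage may be shorter than the rest.
    """
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


if __name__ == "__main__":
    long_text = " ".join(["token"] * 450)
    for passage in split_into_passages(long_text):
        print(len(passage.split()), "words")
```

Splitting on raw word counts ignores sentence boundaries, so a real pipeline would probably want to break at sentence or paragraph ends instead.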




Open-sourcing my repo for generating instruction tuning datasets with local models 🚀

I'm calling it text-albumentations: a local-first data-gen library built on top of outlines.

It contains universal task recipes for generating SFT data:
- QA pairs
- passage to questions
- passage + questions -> answers
- retrieval tasks
- summarization
- bullet point generation
- rephrasing/elaboration
- comparing two passages
- continuation and filling blanks
- knowledge graph triplets
- more to come...

How does it generate good data with local models?
- It uses outlines, a constrained decoding library that forces generation to follow your expected format. This structured data then gets exploded into a multi-row Alpaca-format dataset with variations and augmentations.

Create your own custom pipeline:
- That's easy: define a pydantic BaseModel schema, then define how your output gets converted into an Alpaca-format instruction tuning dataset. I will be preparing documentation on how to do this.

Currently it supports:
- mlx
- transformers
- openai and openai-compatible APIs

Upcoming work:
- batch processing
- more data formats
- more task primitives
- better docs and more examples

github.com/avbiswas/text-…
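
To make the "schema, then explode into Alpaca rows" idea concrete, here is a rough sketch. It is not the text-albumentations API: the class names, the `to_alpaca_rows` function, and the instruction strings are all made up for illustration. Stdlib dataclasses stand in for the pydantic BaseModel schema so the snippet runs without dependencies. The only fixed convention is the Alpaca row layout of `instruction`, `input`, and `output`.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QAPair:
    question: str
    answer: str


@dataclass
class QAPairs:
    # Stand-in for the structured output a constrained decoder would return.
    pairs: List[QAPair]


def to_alpaca_rows(passage: str, result: QAPairs) -> List[Dict[str, str]]:
    """Explode one structured result into several Alpaca-format rows."""
    rows = []
    # One row per QA pair: answer a single question from the passage.
    for pair in result.pairs:
        rows.append({
            "instruction": "Answer the question using the passage.",
            "input": f"Passage: {passage}\nQuestion: {pair.question}",
            "output": pair.answer,
        })
    # One extra row that reuses the same result for a different task.
    if result.pairs:
        rows.append({
            "instruction": "List questions that this passage answers.",
            "input": passage,
            "output": "\n".join(f"- {p.question}" for p in result.pairs),
        })
    return rows


if __name__ == "__main__":
    result = QAPairs(pairs=[
        QAPair("What does SFT stand for?", "Supervised fine-tuning."),
        QAPair("What comes after SFT here?", "DPO and RL finetuning."),
    ])
    for row in to_alpaca_rows("Example passage text.", result):
        print(row)
```

One structured generation yields three training rows here, which is the multi-row explosion the post describes. The variations and augmentations would add more rows on top of this.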













