Unstructured

1.5K posts

@UnstructuredIO

Stop dilly-dallying. Get your data. 👉🏼 Get Started: https://t.co/7Phj5PbxNU

San Francisco, CA · Joined August 2022
153 Following · 6.3K Followers
Unstructured@UnstructuredIO·
When @pinecone built Nexus—their new knowledge engine for agents—they turned to @UnstructuredIO to deliver the world’s best machine-readable enterprise content. There’s a reason for that. Extracting clean, fidelity-preserved signal from PDFs, contracts, and presentations at enterprise scale is a hard, solved problem for exactly one company. Better content, smarter agents. That’s the combination. Congrats to the Pinecone team on the launch. →
Pinecone@pinecone

What happens when the primary users of AI aren't humans, but agents? This week: A new language, new engine, new marketplace, new economics, and new reach. The agentic era needs its own infrastructure category. Our CEO @ashashutosh and Founder @EdoLiberty on what it is and what comes next. With perspectives from @Box, @llama_index, @LangChain, @Teradata, @ThoughtFocusTec, and @UnstructuredIO. pinecone.io/blog/knowledge…

Unstructured@UnstructuredIO·
What happens if we look inside a model? Recently we combined two high-quality document datasets hoping to train a better model. Something was off but the usual metrics weren't telling us much. So we looked inside. We visualized how the model was organizing document elements in its feature space. The clusters were messy. Categories that should be cleanly separated were bleeding into each other. The model had learned two conflicting definitions of the same things and had no way to resolve them. Once we fixed the underlying issue, the picture changed completely. Tighter clusters, cleaner separation, and a layout model that actually knew what it was looking at. Check out our latest work on Agentic Harmonization → unstructured.io/blog/how-we-ta… Paper: arxiv.org/pdf/2604.11042 #MachineLearning #AI #DocumentAI #Finetuning #TrainingData #DataQuality #AnnotationConsistency #DocumentParsing #Unstructured
Unstructured@UnstructuredIO·
The theme of @IBM Think 2026 is the agentic leap. The move from AI that answers questions to AI that takes action — coordinating workflows, making decisions, operating across the enterprise. What rarely gets talked about is what that leap actually requires underneath. Agents are only as capable as the data they can reach. And most of the knowledge enterprises actually run on still lives in formats agents can't reason over: scanned PDFs, complex tables, contracts, emails, presentations. That's the layer we work on. We're heading to Boston this week for IBM Think 2026, where we'll be showing how @UnstructuredIO turns the full enterprise data estate into AI-ready context for agentic systems and RAG. If you're building production AI and wrestling with how to actually get your unstructured data into it, come talk to us. 📍 Booth #710 #AI #GenAI #AgenticAI #RAG #EnterpriseAI #UnstructuredData #IBMThink #IBMwatsonx #Unstructured #IBM
Unstructured@UnstructuredIO·
We added more data to our training pipeline and it made our models worse. Detection, table structure, and reading order all slipped. Both datasets were high quality. The labels looked clean. The categories overlapped. There was no obvious reason for it. The issue was that the two datasets had completely different ideas about what the same label actually meant. One drew a tight box around a paragraph while the other also included some whitespace. The model absorbed both as ground truth and learned to be inconsistent. The more it trained, the more confused it got. That's annotation inconsistency. And it doesn't just show up in predictions. It shows up in how the model internally represents layout too. Categories that should be clearly separated start bleeding into each other. More data made things worse because the data was disagreeing with itself. Check out our latest work on Agentic Harmonization → unstructured.io/blog/how-we-ta… Paper: arxiv.org/pdf/2604.11042 #MachineLearning #AI #DocumentAI #Finetuning #TrainingData #DataQuality #AnnotationConsistency #DocumentParsing #Unstructured
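To make the "tight box vs. box with whitespace" disagreement concrete, here is a minimal sketch that measures how far two labels for the same paragraph diverge, using intersection over union (IoU). The coordinates and the `Box` type are hypothetical and chosen for illustration; they are not taken from the paper or the datasets.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates (x0, y0) to (x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, box.x1 - box.x0) * max(0.0, box.y1 - box.y0)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes, in the range [0, 1]."""
    inter = Box(max(a.x0, b.x0), max(a.y0, b.y0), min(a.x1, b.x1), min(a.y1, b.y1))
    inter_area = area(inter)
    union = area(a) + area(b) - inter_area
    return inter_area / union if union > 0 else 0.0


# Hypothetical labels for the same paragraph under two annotation conventions.
tight = Box(100, 200, 500, 260)   # hugs the text
padded = Box(90, 185, 510, 275)   # also includes surrounding whitespace

print(f"IoU between the two conventions: {iou(tight, padded):.3f}")
```

With these made-up numbers the two "correct" labels overlap at an IoU of about 0.63, so a model trained on both is being told that two noticeably different boxes are each ground truth for the same region.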
Unstructured@UnstructuredIO·
Headed to NLIT Summit next week? Make sure to swing by Booth #433 to learn how @UnstructuredIO transforms messy, multimodal content into clean, structured, AI-ready data. And be sure to catch @mollzzzzzz_ on Wed 5/6 at 1 PM in Room 2206. She'll be speaking on Multimodal Data Preparation for Agentic AI. 📆 When: Mon 5/4 - Wed 5/6 📍 Where: Kansas City, MO 🔗 Book time with our team: unstructured.io/government?mod… #AI #GenAI #GovTech #DataEngineering #UnstructuredData #RAG #DocumentAI #Unstructured #TheGenAIDataCompany
Unstructured@UnstructuredIO·
Enterprise-ready AI starts with getting your data right. @UnstructuredIO transforms your company's raw data and messy files into clean, structured AI-ready formats with high accuracy, relevance, context, and throughput. We also adhere to industry-standard security patterns and best practices, handling all of the back-end complexities for you in a hands-free way. 🙌 Learn more: youtube.com/watch?v=bZisZD…
Unstructured@UnstructuredIO·
Roughly 90% of enterprise data is unstructured. PDFs, emails, scanned contracts, support transcripts, slide decks. None of it shows up in your warehouse. Most of it never makes it into your AI workflows. And as agentic systems start making real decisions on top of that data, the gap between "data you have" and "data your AI can actually use" becomes the thing that separates production from prototype. That's the work we're doing with @IBM. Next week, we're heading to IBM Think 2026 in Boston to talk about exactly that: how the partnership between @UnstructuredIO and @IBMwatsonx is helping enterprises turn their full data estate into AI-ready fuel for RAG, agentic systems, and everything in between. If you're at Think and thinking about how to operationalize unstructured data for production AI, come find us. 📍 Booth #710 🗓️ May 4–7, Thomas M. Menino Convention & Exhibition Center, Boston #AI #GenAI #AgenticAI #RAG #EnterpriseAI #UnstructuredData #IBMThink #IBMwatsonx #Unstructured #IBM
Unstructured@UnstructuredIO·
Let's say you now have an Unstructured workflow that uses your source documents as input and delivers Unstructured's AI-ready JSON outputs based on those documents into an Amazon S3 bucket. But how do you unlock all of that output's rich metadata, context, relevance, and meaning to make quicker and more confident decisions and to complete related tasks faster and more easily? Well, we've just published instructions showing you how to connect Claude Desktop to your Amazon S3 buckets. Claude Desktop lets you use natural language to chat with Unstructured's AI-ready JSON outputs or feed those outputs into your agentic AI workflows, with no code or programming required! Try it today: docs.unstructured.io/examplecode/to…
Unstructured@UnstructuredIO·
If you haven't tried Unstructured yet, now's a good time. 15,000 free pages. No credit card. No expiration. Full access to every feature. Drop in a PDF, a PowerPoint, a CSV, whatever you've got, and see what comes out the other side. 👉 unstructured.io/letsgo
Unstructured@UnstructuredIO·
Let's say you now have an Unstructured workflow that uses your source documents as input and delivers Unstructured's AI-ready outputs based on those documents into a DataStax Astra DB collection. But how do you unlock all of that output's rich metadata, context, relevance, and meaning to make quicker and more confident decisions and to complete related tasks faster and more easily? Well, we've just published instructions showing you how to connect Claude Desktop to your Astra DB collection. Claude Desktop lets you use natural language to chat with Unstructured's AI-ready outputs or feed those outputs into your agentic AI workflows, with no code or programming required! Try it today: docs.unstructured.io/examplecode/to…
Unstructured@UnstructuredIO·
A walkthrough of Unstructured's structured data extractor in the user interface is now available. This demo extracts outputs from a tax form with cleanly labeled fields like taxpayer name, dependents, and line-item tax values based on a JSON schema you define. Flip on "schema only output" to get just the extracted records: smaller payloads, intuitive column names, and output that is ready to drop into a database. This walkthrough is a great example of quickly and easily turning a form-based document into clean, query-ready data. Watch the 8-minute demo at youtube.com/watch?v=FZy6zX…
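For readers who have not worked with schema-driven extraction, the sketch below shows the general idea: a schema names the fields you expect, and an extracted record can be checked against it before it goes into a database. The schema is written in generic JSON Schema style, and every field name here (`taxpayer_name`, `dependents`, `total_tax`) is an assumption for illustration. It is not Unstructured's actual schema format or output, and the check only covers top-level fields.

```python
# Hypothetical schema in JSON Schema style; field names are illustrative only.
TAX_FORM_SCHEMA = {
    "type": "object",
    "properties": {
        "taxpayer_name": {"type": "string"},
        "dependents": {"type": "array", "items": {"type": "string"}},
        "total_tax": {"type": "number"},
    },
    "required": ["taxpayer_name", "total_tax"],
}

# Map JSON Schema type names to the Python types used for the check.
_PY_TYPES = {"string": str, "number": (int, float), "array": list, "object": dict}


def check_record(record: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the top-level fields fit."""
    problems = []
    for name in schema.get("required", []):
        if name not in record:
            problems.append(f"missing required field: {name}")
    for name, spec in schema.get("properties", {}).items():
        if name in record and not isinstance(record[name], _PY_TYPES[spec["type"]]):
            problems.append(f"wrong type for field: {name}")
    return problems


# A made-up extracted record, as it might look before loading into a table.
record = {"taxpayer_name": "Jane Doe", "dependents": ["Sam"], "total_tax": 1234.5}
print(check_record(record, TAX_FORM_SCHEMA))
```

A full validator would also recurse into nested objects and array items; a dedicated JSON Schema library is the usual choice for that.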
Unstructured@UnstructuredIO·
A new walkthrough of using the Unstructured API (via Postman) to create, list, and get both workspace-scoped and workflow-scoped webhooks has just been posted. This demo creates two notification channels and then runs a workflow to send two "job completed" payloads to the receiver. You can extend this demo to send event notifications to email, Slack, AWS Lambda, Azure Functions, or any receiver of your choice. Watch the 6-minute demo at youtube.com/watch?v=rc3aKH…
Unstructured@UnstructuredIO·
Webhooks are live in Unstructured. When your jobs run, complete, or fail, a signal fires automatically to any endpoint you control. Slack, Lambda, or anything that accepts a POST request. Your downstream systems react the moment it happens. Docs: docs.unstructured.io/ui/webhooks
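As a rough illustration of "anything that accepts a POST request", here is a minimal receiver built only on the Python standard library. It makes no assumption about the webhook payload schema: it decodes the JSON body and logs whichever keys arrive. The names `parse_event`, `WebhookHandler`, and `run_server`, and the port 8000, are all hypothetical choices for this sketch.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_event(body: bytes) -> dict:
    """Decode a JSON webhook body; return an empty dict if it is not a JSON object."""
    try:
        payload = json.loads(body.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return {}
    return payload if isinstance(payload, dict) else {}


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        event = parse_event(self.rfile.read(length))
        # The payload schema is not assumed here; log whatever keys arrive.
        print("received event with keys:", sorted(event))
        self.send_response(204)  # acknowledge quickly, with no response body
        self.end_headers()


def run_server(port: int = 8000) -> None:
    """Serve until interrupted. Port 8000 is an arbitrary choice."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Calling `run_server()` starts the listener. A production receiver would also sit behind HTTPS and verify that requests really come from the sender, which this sketch leaves out.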
Unstructured@UnstructuredIO·
When you combine two datasets to train a model, what could go wrong? Turns out, a lot. We've been working on something we think is genuinely new in document AI. It's called agentic label harmonization. And it came out of a problem we kept running into: even when you have great data, combining datasets from different sources can quietly break your model. It's not obvious or dramatic. The loss converges. The metrics look reasonable. But the model somehow learns to become confused. Our new paper, Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization, digs into exactly why this happens and how we built a solution using an AI agent to reconcile annotations before training even begins. The result: better document detection, cleaner table extraction, and a model that actually understands layout instead of memorizing noise. Check out our latest paper → arxiv.org/pdf/2604.11042 Read the full blog → unstructured.io/blog/how-we-ta…
Unstructured@UnstructuredIO·
ICYMI: we put together a deep dive on advanced RAG techniques last year and it's still one of the most useful things we've published. Most teams get RAG working at a basic level pretty quickly. Vector search, chunking, a retrieval step. But then it plateaus. Responses feel off, retrieval misses things, and it's not obvious what to fix. That's what the guide gets into. Re-ranking, hybrid retrieval, query rewriting, multimodal enrichments, agentic workflows. The techniques that move you from "RAG is kind of working" to something that actually holds up in production for all of your use cases. Check it out 👉 unstructured.io/blog/rag-white… #RAG #AI #GenAI #DataEngineering #UnstructuredData #Unstructured #LLMs #AgenticAI #VectorDB
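To give a feel for what hybrid retrieval can look like in practice, the sketch below merges a keyword ranking and a vector ranking with reciprocal rank fusion (RRF). RRF is one common way to combine rankings and is swapped in here as an assumption; the post does not say which fusion method the guide uses. The document ids and the constant `k = 60` are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); higher ranks add more.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a keyword search and a vector search.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # → ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that appear near the top of both lists rise to the front, while documents found by only one retriever still make the fused list.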
Unstructured@UnstructuredIO·
We've just posted a quick walkthrough of setting up Unstructured webhooks via the user interface. You'll see how to specify a webhook URL, pick the "job completed" event, save it, and immediately see POST payloads land when a workflow run finishes. From there, you can wire it up to downstream actions like email, SMS, or Slack, or swap in receivers like AWS Lambda or Azure Functions. It's a great way to turn pipeline events into real-time notifications or trigger the next step in your stack. Check out the 3-minute demo at youtube.com/watch?v=XubmPl…