Hyper-Extract: From Unstructured Text to Knowledge Graphs, Hypergraphs, and Beyond
There's a quiet revolution happening in how we structure knowledge from text. For years, knowledge graphs have been the gold standard: triples of (subject, predicate, object) neatly capturing who did what to whom. But real-world knowledge isn't binary. A drug interaction involves a drug, a patient condition, a dosage, a contraindication, and a temporal window. A legal precedent connects a court, multiple parties, a statute, a jurisdiction, and a date. Force these into triples and you're left picking which context to drop.
Enter hypergraphs, and a new generation of tools that make building them from raw text as simple as running a CLI command.
The Problem with Triples
Traditional knowledge graphs model facts as binary edges: (Aspirin) --treats--> (Headache). Clean. Simple. And incomplete.
What about the dosage? The patient population? The contraindication with blood thinners? The temporal constraint that it should be taken after meals? In a standard knowledge graph, you'd need to decompose this single medical fact into a constellation of auxiliary nodes and edges (a process known as reification), creating an explosion of synthetic structure that obscures the original semantics.
Hypergraphs solve this by allowing a single hyperedge to connect n entities simultaneously. That one drug interaction fact becomes a single hyperedge linking {Aspirin, Headache, 500mg, Post-meal, Contraindicated-with-Warfarin}. No information loss. No structural gymnastics.
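To make the contrast concrete, here is a minimal sketch of a hypergraph as a data structure. This is illustrative only, not Hyper-Extract's actual API: the `Hyperedge` and `Hypergraph` names and fields are assumptions for the example.

```python
from dataclasses import dataclass, field

# Illustrative sketch (not Hyper-Extract's API): a hyperedge connects
# any number of entities under a single relational fact.
@dataclass(frozen=True)
class Hyperedge:
    relation: str
    entities: frozenset[str]

@dataclass
class Hypergraph:
    edges: list = field(default_factory=list)

    def add_fact(self, relation: str, *entities: str) -> None:
        self.edges.append(Hyperedge(relation, frozenset(entities)))

    def incident(self, entity: str) -> list:
        """All hyperedges that touch a given entity."""
        return [e for e in self.edges if entity in e.entities]

hg = Hypergraph()
# One n-ary fact, one hyperedge -- no reified helper nodes needed.
hg.add_fact("drug_interaction",
            "Aspirin", "Headache", "500mg", "Post-meal",
            "Contraindicated-with-Warfarin")

print(len(hg.incident("Aspirin")))  # 1
```

The same fact as triples would need an invented intermediate node plus five edges; here the full context travels together in one edge.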
This isn't just a theoretical nicety. The HyperGraphRAG paper (NeurIPS 2025) demonstrated that hypergraph-structured knowledge representation outperforms both standard RAG and previous graph-based RAG methods in answer accuracy, retrieval efficiency, and generation quality, tested across medicine, agriculture, computer science, and law.
Hyper-Extract: One Command, Eight Output Types
Hyper-Extract by Yifan Feng is an open-source framework (Apache 2.0, 350+ stars) that makes this entire pipeline accessible. It transforms unstructured documents into structured knowledge using LLMs, with support for eight distinct output types:
| Auto-Type | What It Produces |
|---|---|
| AutoModel | Pydantic-typed structured objects |
| AutoList | Ordered collections |
| AutoSet | Deduplicated collections |
| AutoGraph | Standard knowledge graphs |
| AutoHypergraph | Hypergraphs with n-ary relations |
| AutoTemporalGraph | Time-aware knowledge graphs |
| AutoSpatialGraph | Location-aware knowledge graphs |
| AutoSpatioTemporalGraph | Combined space-time graphs |
The spatio-temporal types are particularly interesting. A biographical text about a historical figure doesn't just contain who knew whom; it also records when and where those interactions happened. Hyper-Extract captures all of this natively.
Getting Started
Installation and extraction take three lines:
```bash
uv tool install hyperextract
he config init -k YOUR_OPENAI_API_KEY
he parse document.md -t general/biography_graph -o ./output/ -l en
```
That `-t general/biography_graph` flag points to one of 80+ declarative YAML templates spanning six domains: Finance, Legal, Medical, Traditional Chinese Medicine, Industry, and General. Each template defines the entity types, relation types, extraction guidelines, and display format; no code required.
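A declarative template of this kind might look roughly like the sketch below. Every field name here is an assumption for illustration; it is not Hyper-Extract's actual template schema, which you should check in the project's template directory.

```yaml
# Hypothetical template sketch -- field names are illustrative,
# not Hyper-Extract's actual schema.
name: biography_graph
domain: general
entity_types:
  - Person
  - Organization
  - Location
  - Event
relation_types:
  - born_in
  - worked_at
  - participated_in
guidelines: |
  Extract every person mentioned by full name.
  Attach dates and places to events when the text states them.
display:
  format: graph
```

The point of the declarative form is that adding a new domain means writing a file like this, not writing extraction code.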
Incremental Evolution
One of Hyper-Extract's most practical features is incremental knowledge expansion. Extract from an initial document, then feed new documents to expand the graph:
```bash
he parse initial_report.md -t medical/diagnosis_graph -o ./patient_kg/
he feed ./patient_kg/ followup_notes.md
he feed ./patient_kg/ lab_results.md
he show ./patient_kg/
```
The knowledge graph grows over time, merging entities, resolving duplicates, and expanding relationships. This is how knowledge actually accumulates in practice: incrementally, from multiple sources, over time.
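The merge-and-deduplicate step can be sketched as follows. This is a conceptual illustration of incremental expansion, not the internals of `he feed`: the key-normalization strategy and the store layout are assumptions.

```python
# Conceptual sketch of incremental expansion: new extractions merge
# into an existing store, deduplicating facts by normalized entity keys.

def normalize(name: str) -> str:
    """Collapse case and whitespace so 'Patient A' == 'patient  a'."""
    return " ".join(name.lower().split())

def merge(store: dict, new_facts: list) -> dict:
    """store maps relation -> set of frozen entity sets;
    exact duplicates (after normalization) collapse automatically."""
    for relation, entities in new_facts:
        key_set = frozenset(normalize(e) for e in entities)
        store.setdefault(relation, set()).add(key_set)
    return store

kg = {}
merge(kg, [("diagnosed_with", {"Patient A", "Type 2 Diabetes"})])
merge(kg, [("diagnosed_with", {"patient a", "type 2 diabetes"}),  # duplicate
           ("prescribed", {"Patient A", "Metformin", "1000mg"})])

assert len(kg["diagnosed_with"]) == 1  # duplicate collapsed into one fact
```

Real systems use fuzzier entity resolution than string normalization (embedding similarity, alias tables), but the accumulate-then-merge shape is the same.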
Under the Hood: 10+ Extraction Engines
Hyper-Extract isn't tied to a single extraction algorithm. It ships with integrations for:
- GraphRAG: Microsoft's community-summarization approach
- LightRAG: lightweight graph-based retrieval
- Hyper-RAG: hypergraph-driven retrieval (from the same author)
- HyperGraphRAG: n-ary fact extraction via hyperedges
- KG-Gen: direct knowledge graph generation
- iText2KG: iterative text-to-graph extraction
- Cog-RAG: cognitive retrieval-augmented generation
You choose the method that fits your domain and quality requirements. The templates abstract over the engine selection, so switching between them is a configuration change, not a code rewrite.
The Research Ecosystem Behind It
Hyper-Extract doesn't exist in isolation. It's the practical toolkit sitting atop a rapidly maturing research ecosystem. Here are the key papers driving this space:
Hyper-RAG: Cutting Hallucinations with Hypergraphs
Hyper-RAG (also by Yifan Feng) tackles one of the biggest problems in production LLM systems: hallucinations. By structuring retrieved knowledge as hypergraphs that capture both pairwise and beyond-pairwise correlations, Hyper-RAG improved accuracy by an average of 12.3% over direct LLM use on a neurology dataset, and outperformed GraphRAG and LightRAG by 6.3% and 6.0% respectively.
Crucially, Hyper-RAG maintained stable performance as query complexity increased, while existing methods degraded. Its lightweight variant, Hyper-RAG-Lite, achieved 2× retrieval speed with a 3.3% performance boost over LightRAG.
HyperGraphRAG: N-ary Facts for Real-World Knowledge (NeurIPS 2025)
HyperGraphRAG formalizes the intuition that binary edges are insufficient. Each hyperedge in their system encodes an n-ary relational fact: a single atomic statement that can involve two, three, or more entities. The paper demonstrated consistent improvements over both standard RAG and graph-based RAG across four domains: medicine, agriculture, computer science, and law.
The NeurIPS 2025 acceptance signals that the community is taking hypergraph representations seriously as the next step beyond GraphRAG.
Hyper-KGGen: Learning to Extract Better
Hyper-KGGen (February 2026) addresses a subtler problem: generic extractors struggle with domain-specific jargon and conventions. Their solution is a skill-driven framework where the extractor actively learns domain expertise through an adaptive skill acquisition module.
A "coarse-to-fine" mechanism decomposes documents systematically, ensuring coverage from simple binary links to complex hyperedges. A stability-based feedback loop identifies where extraction is unstable and distills corrections into a Global Skill Library. The result: significantly better extraction quality than static few-shot approaches, especially across diverse domains.
The GraphRAG Reality Check
It's worth noting that the graph-based RAG landscape isn't uniformly positive. The GraphRAG-Bench paper (ICLR 2026) conducted a comprehensive analysis and found that GraphRAG frequently underperforms vanilla RAG on many real-world tasks. The overhead of graph construction doesn't always pay off, particularly for straightforward factoid queries where simple chunk retrieval is sufficient.
This finding makes Hyper-Extract's multi-engine approach especially valuable. Rather than betting everything on graph-based extraction, you can choose the right tool for the job: standard graphs where relationships matter, hypergraphs where n-ary facts dominate, spatio-temporal graphs where context is critical, or simple structured extraction where that's all you need.
Why Healthcare Needs Hypergraphs
The medical domain is where hypergraphs shine brightest, and it's no coincidence that several of these papers use clinical data for evaluation.
Consider a typical clinical knowledge fragment:
Patient presents with Type 2 diabetes (diagnosed 2019), currently on Metformin 1000mg BID and Lisinopril 10mg daily. Recent HbA1c of 8.2% suggests inadequate glycemic control. Consider adding Empagliflozin 10mg, noting cardiovascular benefit in patients with established atherosclerotic disease.
A standard knowledge graph would need to create dozens of triples to represent this. A hypergraph captures it naturally:
- Hyperedge 1: `{Patient, T2D, 2019, Diagnosis}`, the diagnostic fact with temporal context
- Hyperedge 2: `{Metformin, 1000mg, BID, Current}`, the medication with dosage and frequency
- Hyperedge 3: `{HbA1c, 8.2%, Inadequate-control, Current}`, the lab result with interpretation
- Hyperedge 4: `{Empagliflozin, 10mg, CV-benefit, ASCVD, Consideration}`, the recommendation with its qualifying condition
Each hyperedge is an atomic fact. No information is lost to reification. And when you query "What medications should be considered for a diabetic patient with cardiovascular disease?", the retrieval system can walk the hyperedge structure to find exactly the right context, including the conditions under which the recommendation applies.
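That retrieval walk can be sketched as simple set overlap between a query's anchor entities and each hyperedge. This is an illustrative toy, not Hyper-RAG's actual retrieval algorithm, and the anchor-extraction step (mapping the question to `{T2D, ASCVD, ...}`) is assumed to have already happened.

```python
# The four clinical hyperedges above, as plain entity sets.
edges = [
    {"Patient", "T2D", "2019", "Diagnosis"},
    {"Metformin", "1000mg", "BID", "Current"},
    {"HbA1c", "8.2%", "Inadequate-control", "Current"},
    {"Empagliflozin", "10mg", "CV-benefit", "ASCVD", "Consideration"},
]

def retrieve(anchors: set, edges: list) -> list:
    """Return hyperedges overlapping the query's anchor entities,
    ranked by overlap size (largest first)."""
    hits = [(len(e & anchors), e) for e in edges if e & anchors]
    return [e for _, e in sorted(hits, key=lambda h: -h[0])]

# "What medications should be considered for a diabetic patient
#  with cardiovascular disease?"
anchors = {"T2D", "ASCVD", "CV-benefit"}
for edge in retrieve(anchors, edges):
    print(edge)
# The Empagliflozin hyperedge ranks first, and its qualifying
# condition (ASCVD) arrives attached, not as a separate triple.
```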
This is why Hyper-RAG showed its strongest results on medical datasets, and why HyperGraphRAG's medicine evaluation was particularly compelling.
Where This Is Heading
The trajectory is clear: we're moving from "extract triples from text" to "extract the full semantic complexity from text." Hyper-Extract sits at the practical end of this spectrum, a tool you can install today and use to build knowledge structures that were research prototypes a year ago.
A few things to watch:
- Template ecosystems: Hyper-Extract's 80+ YAML templates hint at a future where domain-specific extraction is truly plug-and-play. Expect community-contributed templates to grow rapidly.
- Incremental knowledge bases: the `he feed` pattern of continuously expanding knowledge from new documents is closer to how humans actually build understanding. This will become standard.
- Engine selection: as more benchmark results emerge (like the GraphRAG-Bench findings), automatic engine selection based on query type and domain will become important. Hyper-Extract's multi-engine architecture is well-positioned for this.
- Spatio-temporal reasoning: the AutoSpatialGraph and AutoSpatioTemporalGraph types are ahead of most extraction tools. As LLMs get better at temporal and spatial reasoning, these will unlock use cases in logistics, historical analysis, and epidemiology.
If you're building RAG systems, knowledge bases, or any application that needs structured understanding of unstructured text, Hyper-Extract is worth a serious look. The combination of declarative templates, multiple extraction engines, hypergraph support, and incremental evolution makes it one of the most complete knowledge extraction frameworks available today.
```bash
uv tool install hyperextract
he parse your_documents/ -t medical/diagnosis_hypergraph -o ./knowledge/ -l en
he show ./knowledge/
```
Three commands. Full knowledge hypergraph. No excuses.