Hybrid Retrieval-Augmented Generation Agent for Trustworthy Legal Question Answering in Judicial Forensics

AI hallucinations in legal contexts can lead to serious consequences, making accuracy and traceability essential. This study presents a hybrid system that combines retrieval and multiple AI models to deliver verifiable, citation-backed answers that adapt as laws change. It marks a step toward safer, more sustainable legal AI capable of powering trustworthy client consultations and judicial tools.

Details

Author(s)
Yueqing Xi, Yifan Bai, Huasen Luo, Weiliang Wen, Hui Liu, Haoliang Li
Date
November 3, 2025
Summary
Researchers from City University of Hong Kong and Tianjin University developed a hybrid legal question-answering system designed to reduce AI hallucinations in judicial forensics applications. The system combines retrieval-augmented generation (RAG) with multi-model ensembling to deliver trustworthy legal guidance. When a user query matches existing entries in the knowledge base (similarity threshold of 0.6), the system generates answers through RAG using verified legal sources. When no match is found, three different language models (ChatGPT-4o, Qwen3-235B-A22B, and DeepSeek-v3.1) generate candidate answers, which are then scored by a specialized selector model across five dimensions: correctness, legality, completeness, clarity, and faithfulness.
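The routing logic described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the stub generator and selector callables, and the equal weighting of the five scoring dimensions are all assumptions; only the 0.6 similarity threshold and the dimension names come from the paper.

```python
# Sketch of the hybrid routing: retrieval first, ensemble fallback.
# Generators and the selector are stand-in callables, not real model calls.

SIM_THRESHOLD = 0.6  # retrieval cutoff reported in the paper
DIMENSIONS = ["correctness", "legality", "completeness", "clarity", "faithfulness"]

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def answer(query_vec, knowledge_base, generators, selector):
    """Route a query: cite a verified KB entry if it clears the
    threshold, otherwise ensemble multiple models and pick the best."""
    # 1. Retrieval first: find the best-matching verified KB entry.
    best = max(knowledge_base, key=lambda e: cosine(query_vec, e["vec"]), default=None)
    if best and cosine(query_vec, best["vec"]) >= SIM_THRESHOLD:
        return {"source": "rag", "answer": best["answer"], "citation": best["citation"]}
    # 2. Fallback: each model proposes a candidate; the selector scores it
    #    on the five dimensions and the highest mean score wins.
    candidates = [generate() for generate in generators]
    scored = [(sum(selector(c).values()) / len(DIMENSIONS), c) for c in candidates]
    _, chosen = max(scored, key=lambda t: t[0])
    return {"source": "ensemble", "answer": chosen, "citation": None}
```

In practice the embeddings would come from a sentence encoder and the selector would be the specialized scoring model; the sketch only shows the control flow between the two paths.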

Testing on the LawQA dataset containing 16,182 Chinese legal question-answer pairs showed the hybrid approach significantly outperformed both single-model baselines and standard RAG implementations. The hybrid model using DeepSeek-v3.1 achieved the highest scores with F1 of 0.3612, ROUGE-L of 0.2588, and LLM-as-a-Judge rating of 0.954. A critical feature is the dynamic knowledge base updating mechanism where high-quality answers undergo human review before being written back to the repository, allowing the system to adapt to frequently changing statutes and case law. The research demonstrates practical value for judicial scenarios requiring high accuracy, traceability, and continuous adaptation to evolving legal frameworks.
Key Takeaways
1. Hybrid approach significantly reduces AI hallucinations in legal contexts. By combining retrieval-augmented generation with multi-model ensembling, the system achieved measurably better performance than single models or standard RAG implementations. The best configuration improved F1 scores by approximately 7% and LLM-as-a-Judge ratings by 2.1% compared to baseline models.

2. Retrieval prioritization over generation improves legal accuracy and traceability. The system first searches a verified knowledge base before generating answers. When retrieval succeeds (similarity above 0.6 threshold), answers are grounded in validated legal sources with explicit statutory citations. This approach directly addresses the authenticity and compliance requirements critical in judicial settings.

3. Dynamic knowledge base updating solves the stale information problem. High-quality generated answers undergo human review before being written back to the repository, allowing the system to adapt to frequently changing statutes and case law. This mechanism addresses a fundamental limitation of static legal AI systems that struggle to keep pace with evolving legal frameworks.
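The review-then-write-back loop from takeaway 3 can be sketched as a two-stage gate. Everything here is a hypothetical shape for the workflow: the queue structure, the function names, and the 0.9 quality cutoff are assumptions; the paper specifies only that high-quality answers pass human review before entering the repository.

```python
# Hypothetical sketch of the dynamic knowledge-base update workflow:
# an automatic quality gate queues candidates, a human reviewer gates writes.

REVIEW_QUEUE = []    # candidates awaiting human review
KNOWLEDGE_BASE = []  # verified entries the RAG path retrieves from

def propose_update(question, answer, judge_score, min_score=0.9):
    """Queue an answer for review only if it clears the quality gate.
    The 0.9 default is an illustrative choice, not from the paper."""
    if judge_score >= min_score:
        REVIEW_QUEUE.append({"q": question, "a": answer})
        return True
    return False

def approve(entry, reviewer_ok):
    """A human reviewer's approval is required before any write-back."""
    if reviewer_ok and entry in REVIEW_QUEUE:
        REVIEW_QUEUE.remove(entry)
        KNOWLEDGE_BASE.append(entry)
        return True
    return False
```

The point of the two gates is that the knowledge base grows from real usage while every entry remains human-verified, which is what keeps the retrieval path citable as laws change.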
Why it matters?
AI hallucinations in legal contexts can lead to serious judicial misjudgments and unfair outcomes, making this a high-stakes application where accuracy is non-negotiable. This research bridges the gap between powerful generative AI capabilities and the stringent reliability, traceability, and compliance requirements demanded by judicial systems. The dynamic knowledge base updating mechanism removes a critical obstacle to widespread AI deployment in legal settings, where laws and precedents change frequently and static systems quickly become outdated and potentially dangerous.
Practical Implications
1. Law firms can deploy AI for client consultations with significantly lower liability risk. The hybrid system’s measurable accuracy gains (around a 7.8% F1 improvement and a 2.1% Judge rating increase), combined with reduced hallucination risk and explicit statutory citations, make AI-generated legal advice more defensible and trustworthy, opening opportunities for scalable legal consultation services that were previously too risky.

2. The human-review-then-update workflow creates a sustainable model for maintaining legal AI systems. Rather than requiring constant manual retraining or costly data updates, the system learns from real usage as reviewers approve high-quality answers that are written back to the knowledge base. This provides a practical path for keeping legal AI current with evolving laws without excessive ongoing expense.

3. Multi-model ensembling offers a blueprint for critical legal AI deployments. Instead of relying on a single AI model, the study shows that running multiple models with a selector mechanism delivers consistent improvements of roughly 1–8% in F1 scores and up to 3% in Judge ratings depending on the model backbone. This approach is particularly valuable for law firms and legal tech companies where accuracy cannot be compromised, justifying the added computational cost.
Citation
Xi, Y., Bai, Y., Luo, H., Wen, W., Liu, H., & Li, H. (2025). Hybrid retrieval-augmented generation agent for trustworthy legal question answering in judicial forensics. arXiv preprint arXiv:2511.01668v1.
Publication
arXiv