The paper surveys methodologies for extracting text from research papers with LLMs. It covers not only how to extract meaningful data, but also how to assess the results and choose the right models.

1. Introduction

In healthcare research, extracting meaningful text from papers is still done manually. Valuable information may exist across many papers, but getting to it depends entirely on labor-intensive manual review.

Traditional methods such as narrowing the text scope and keyword matching had difficulty generalizing data into a structured, usable form until LLMs emerged. (LLMs are now a well-known technology and need no further explanation.)

2. Methods

The paper is a narrative review.
Sources: PubMed/MEDLINE and arXiv
- keywords: LLM, named entity recognition, information extraction, RAG, etc.

3. Rule-Based and Statistical learning

This section traces how data extraction from studies has evolved, from rule-based methods to statistical learning.

An LLM can read context across words and sentences within a large context window, which is a clear advantage in tasks like text extraction.

4. LLM Extraction Mechanisms

The paper lists several mechanisms we could leverage for extraction with LLMs:

Prompting
RAG
Schema Constraints
Parameter-efficient fine-tuning (PEFT)
Automatic prompt optimization (APO)

The pipeline first runs programmatically; when it fails to extract a value, human adjudication is required.
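The "programmatic first, human second" flow can be sketched roughly as below. This is a minimal sketch, not the paper's implementation: the schema `REQUIRED_FIELDS`, its field names, and the helpers `validate` and `route` are all hypothetical, and the raw strings stand in for LLM outputs.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
REQUIRED_FIELDS = {"diagnosis": str, "sample_size": int}


def validate(raw_output):
    """Return the parsed record if it satisfies the schema, otherwise None."""
    try:
        record = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(record.get(field), expected_type):
            return None
    return record


def route(raw_outputs):
    """Split raw model outputs into accepted records and a human review queue."""
    accepted, needs_review = [], []
    for raw in raw_outputs:
        record = validate(raw)
        if record is None:
            needs_review.append(raw)  # extraction failed: send to a human
        else:
            accepted.append(record)
    return accepted, needs_review
```

Only outputs that fail the schema check reach a reviewer, so human effort scales with the failure rate rather than with the number of papers.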

5. Evaluation

Evaluation is multi-dimensional and does not rely on accuracy alone. The dimensions are:

task accuracy
structured-output quality
human-in-the-loop
stability/reliability
regulatory compliance

Some of these dimensions can be assessed numerically. Named entity recognition (NER) and format extraction can be evaluated with precision, recall, and F1 score.
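As a quick illustration of these three metrics, here is a minimal sketch that scores predicted entities against gold annotations by exact match. The function name `precision_recall_f1` and the example entities are made up for illustration; real NER evaluation often uses span-level or partial matching instead.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall, and F1 between two sets of entities."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical (text, label) entities for one document.
gold = {("anemia", "DIAGNOSIS"), ("120", "SAMPLE_SIZE"), ("2019", "YEAR")}
predicted = {("anemia", "DIAGNOSIS"), ("2019", "YEAR"), ("Boston", "LOCATION")}

precision, recall, f1 = precision_recall_f1(predicted, gold)
```

Here two of three predictions are correct and two of three gold entities are found, so precision, recall, and F1 all come out at 2/3.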

The paper presents all of the multi-dimensional evaluation methodologies in a table, along with the pros and cons of each method. From the table, I think we (in our project SOALES) will cover the majority of the methods, as outlined below.

5-1. Structured Output Quality

Parsing success rate (PSR): the share of outputs that parse into the expected structure
Field completeness rate (FCR): the share of required fields that are not empty
Semantic consistency
Normalization accuracy
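The first two metrics are easy to compute. The sketch below reads PSR as "the fraction of raw outputs that parse as a JSON object" and FCR as "the fraction of required fields that are filled"; both readings are my assumptions about the definitions, and the field names in `REQUIRED` are hypothetical.

```python
import json

REQUIRED = ("name", "year")  # hypothetical required fields


def parsing_success_rate(raw_outputs):
    """Fraction of raw outputs that parse as a JSON object."""
    if not raw_outputs:
        return 0.0
    parsed = 0
    for raw in raw_outputs:
        try:
            if isinstance(json.loads(raw), dict):
                parsed += 1
        except json.JSONDecodeError:
            pass
    return parsed / len(raw_outputs)


def field_completeness_rate(records, required=REQUIRED):
    """Fraction of required fields that are present and non-empty."""
    total = len(records) * len(required)
    if total == 0:
        return 0.0
    filled = sum(
        1
        for record in records
        for field in required
        if record.get(field) not in (None, "", [])
    )
    return filled / total
```

For example, if two of three outputs parse, PSR is 2/3; if those two records fill three of their four required fields, FCR is 0.75.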

5-2. Practical Utility and Cost

Human–machine collaboration costs break down as follows:

Human
- Manual review time
- Rework rate etc.
GPU
- GPU token cost calculation
- Virtual Machine on-demand cost calculation
- Model iteration time, extraction time etc.
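A rough way to put numbers on these items is sketched below. Every rate here (token prices, VM hourly rate, reviewer wage) is a placeholder to be replaced with real figures, and treating rework as a simple multiplier on review time is my own simplifying assumption.

```python
def token_cost(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Cost of one run, given per-1k-token prices for input and output."""
    return (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k


def vm_cost(hours, hourly_rate):
    """On-demand virtual machine cost for a given number of hours."""
    return hours * hourly_rate


def human_review_cost(records, minutes_per_record, hourly_wage, rework_rate):
    """Manual review cost; rework is modelled as extra review time (an assumption)."""
    review_hours = records * minutes_per_record / 60
    return review_hours * hourly_wage * (1 + rework_rate)


# Placeholder values only, not real prices.
total = (
    token_cost(2000, 500, price_in_per_1k=0.5, price_out_per_1k=1.5)
    + vm_cost(hours=3, hourly_rate=2.0)
    + human_review_cost(records=30, minutes_per_record=2, hourly_wage=40, rework_rate=0.1)
)
```

Tracking the three terms separately makes it easier to see whether a pipeline change saves money overall or only shifts cost from GPU time to reviewers.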

5-3. Stability and Reliability

For this evaluation, the Jaccard score and a confidence-and-abstention policy can be used to express stability and reliability numerically.

Jaccard Score

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

A = {John, 1999}
B = {John, 1999, San Francisco}

# Jaccard
2/3 = 0.6667
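The same calculation in code, as a minimal sketch. Treating A and B as the values extracted in two runs over the same document is my reading of how the score measures stability, and returning 1.0 for two empty sets is a convention I chose.

```python
def jaccard(a, b):
    """Jaccard score |A ∩ B| / |A ∪ B| between two sets of extracted values."""
    union = a | b
    if not union:
        return 1.0  # convention: two empty extractions count as identical
    return len(a & b) / len(union)


run_1 = {"John", "1999"}
run_2 = {"John", "1999", "San Francisco"}

print(round(jaccard(run_1, run_2), 4))  # 0.6667
```

A score of 1.0 means both runs extracted exactly the same values; lower scores mean the output is less stable.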

Confidence and abstention policy
- The LLM declines to output a value whose confidence is below a threshold (e.g. 0.5), and instead leaves a note that human review is required

// without abstention
{
  "name": "John Smith",
  "confidence": 0.32
}

// abstention
{
  "name": null,
  "confidence": 0.32,
  "remark": "Human review required"
}
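The policy itself is only a threshold check. Here is a minimal sketch, assuming a confidence score is already available for each extracted value (where it comes from, e.g. the model or a separate calibration step, is left open); the function name `apply_abstention` is hypothetical.

```python
def apply_abstention(field, value, confidence, threshold=0.5):
    """Keep the value if confidence reaches the threshold, otherwise abstain."""
    if confidence < threshold:
        return {
            field: None,
            "confidence": confidence,
            "remark": "Human review required",
        }
    return {field: value, "confidence": confidence}


low = apply_abstention("name", "John Smith", 0.32)
high = apply_abstention("name", "John Smith", 0.91)
```

With the default threshold of 0.5, `low` matches the abstention record above, while `high` keeps the extracted name.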

5-4. Evaluation and Reporting Standards

The following reporting standards help show that our research and experiments are conducted fairly:

Internal and external validation
Inter-annotator agreement: Cohen's kappa (κ)
Data leakage safeguards
Bootstrap 95% confidence interval

F1 = 0.89
95% CI [0.87, 0.91]

Random seeds etc.
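Two of these items, Cohen's kappa and the bootstrap confidence interval, can be computed with the standard library alone. The sketch below makes two assumptions of mine: F1 is micro-averaged from per-document (tp, fp, fn) counts, and the interval uses the percentile bootstrap with documents as the resampling unit. The fixed `seed` is what makes the reported interval reproducible.

```python
import random
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: sum over labels of p_a(label) * p_b(label).
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def micro_f1(counts):
    """Micro-averaged F1 from a list of per-document (tp, fp, fn) tuples."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def bootstrap_ci(counts, n_resamples=1000, seed=0):
    """Percentile bootstrap 95% CI for micro F1, resampling whole documents."""
    rng = random.Random(seed)  # fixed seed so the interval can be reproduced
    scores = sorted(
        micro_f1([rng.choice(counts) for _ in counts])
        for _ in range(n_resamples)
    )
    lower = scores[n_resamples * 25 // 1000]
    upper = scores[n_resamples * 975 // 1000 - 1]
    return lower, upper
```

Reporting the seed and the number of resamples alongside the interval lets a reader rerun the calculation and get the same bounds.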

6. Applied AI

Identifying diagnoses can be automated
- An LLM can automatically recognize diagnostic terminology

7. Challenges

Even with a systematic LLM pipeline, residual system-level risks still exist.

Algorithmic Dimension: Semantic Ambiguity, Knowledge Staleness, and Hallucinations can cause errors
- distorted reasoning throughout pipelines
- hallucination
- outdated knowledge
Data Dimension: Terminological Synonymy, Format Heterogeneity, and Annotation Scarcity
- Differences in terminology and schema across institutions and facilities may confuse the extraction system

RAG and schema-constrained generation could mitigate such risks, but the algorithm is still sensitive to out-of-distribution inputs and to changes in domain-specific terminology.

Reflection

Automating text and data extraction with LLMs can reduce wasted cost and time in diverse research areas.

However, the error rate does not drop to near zero, so uncertainty remains. Recent studies show that the accuracy of LLM outputs ranges from 0.7 to 0.9, which means there is still a substantial chance that the AI makes mistakes and humans are required to correct them.

To mitigate errors and adapt to domain shifts with a single pipeline, it is necessary to automate the tuning/training of LLMs for a specific domain. Fully tuning an open-source LLM may require a colossal amount of time and cost; QLoRA tuning with multiple adapters on a single model could be an efficient and quick alternative. Challenges remain here too, since this approach also requires evaluation and many heuristic experiments.

Reference

[1] Chen, L., He, R., Lu, P., Jin, Y., Zhou, L., Li, N., Wu, P., & Hu, B. (2026). Operationalizing large language models for clinical research data extraction: Methods, quality control, and governance. Journal of Medical Systems, 50(1), 25. https://doi.org/10.1007/s10916-026-02353-w

[Paper Review] Operationalizing Large Language Models for Clinical Research Data Extraction: Methods, Quality Control, and Governance
