5 Dossier-Graph: Model Fine-Tuning for IQWiG Domain Knowledge

5.1 Overview

This guide provides comprehensive instructions for implementing the Dossier-Graph system, which creates a specialized AI assistant for Health Technology Assessment (HTA) tasks. The system fine-tunes large language models to understand and generate content following IQWiG (Institute for Quality and Efficiency in Health Care) standards.

5.1.1 System Requirements

  • Operating System: Ubuntu 24.04 LTS (recommended)
  • Memory: Minimum 16GB RAM
  • GPU: NVIDIA GPU with 8-12GB VRAM
  • Storage: 50GB available space
  • Python: Version 3.8 or higher

5.1.2 Core Capabilities

The fine-tuned model assists with the following tasks:

  • Generate HTA summaries following IQWiG methodology standards
  • Evaluate evidence quality using IQWiG terminology
  • Draft benefit assessments with appropriate regulatory language
  • Analyze cost-effectiveness in accessible terms
  • Provide methodological guidance based on IQWiG precedents
  • Process both German and English HTA content

5.2 Implementation Architecture

5.2.1 Technical Stack

  • Base Model: Mistral-7B-v0.1 (7 billion parameters)
  • Fine-tuning Method: LoRA (Low-Rank Adaptation)
  • Quantization: 4-bit precision for memory efficiency
  • Framework: Hugging Face Transformers with PEFT

5.2.2 LoRA Configuration

target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",  # Attention layers
    "gate_proj", "up_proj", "down_proj",     # MLP layers  
    "lm_head"                                 # Output projection
]

lora_config = {
    "r": 64,               # Rank sized for medical terminology complexity
    "lora_alpha": 128,     # 2x rank ratio for training stability
    "lora_dropout": 0.05,  # Lower dropout for a specialized domain
    "bias": "none"         # Standard setting for causal language models
}

5.3 Data Preparation Pipeline

5.3.1 Step 1: Document Processing

Convert IQWiG reports into structured training data:

from pathlib import Path

def process_hta_report(report_path):
    """Extract and structure content from IQWiG reports."""

    # Extract text from the PDF (extract_text_from_pdf is a project-specific helper)
    text_content = extract_text_from_pdf(report_path)

    # Derive the report identifier (e.g. "A23-42") from the file name
    report_id = Path(report_path).stem

    # Create an instruction-based training example; the output field holds
    # the expert-written reference summary for this report
    training_example = {
        "instruction": "Summarize the benefit assessment findings",
        "input": f"IQWiG Report {report_id}: {text_content}",
        "output": "The assessment concluded..."
    }

    return format_as_jsonl(training_example)

5.3.2 Step 2: Training Data Structure

Each training example follows this JSONL format:

{
  "instruction": "Evaluate the evidence quality for pembrolizumab in melanoma",
  "input": "Single RCT, n=834 patients, 22-month follow-up, primary endpoint: PFS",
  "output": "The evidence quality is rated as moderate (GRADE). While the single RCT provides robust data with adequate sample size and follow-up duration, the limitation to a single study reduces confidence in the generalizability of findings."
}
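
As a small illustrative sketch (file paths and the write_jsonl helper are assumptions, not part of the pipeline above), the structured examples can be written to a .jsonl file and loaded back with the Hugging Face datasets library, which is already listed among the dependencies:

import json
from datasets import load_dataset

def write_jsonl(examples, path="data/hta_train.jsonl"):
    """Write one JSON object per line, the format expected by load_dataset('json')."""
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")

# Load the file back as a Hugging Face Dataset for use with the SFTTrainer
train_dataset = load_dataset("json", data_files="data/hta_train.jsonl", split="train")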

5.3.3 Data Requirements

Task Type                       Minimum Examples    Recommended
Evidence Assessment             200                 500+
Benefit Rating                  200                 400+
Cost-Effectiveness Analysis     150                 300+
Methodology Evaluation          100                 250+
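
A quick sanity check along these lines can flag task types that fall short before training starts. The task_type field and the exact keys below are illustrative assumptions, not a prescribed schema:

from collections import Counter

MINIMUM_EXAMPLES = {
    "evidence_assessment": 200,
    "benefit_rating": 200,
    "cost_effectiveness": 150,
    "methodology_evaluation": 100,
}

def check_coverage(examples):
    """Warn about task types that do not reach the minimum example count."""
    counts = Counter(example["task_type"] for example in examples)
    for task, minimum in MINIMUM_EXAMPLES.items():
        if counts.get(task, 0) < minimum:
            print(f"Warning: {task} has {counts.get(task, 0)} examples, minimum is {minimum}")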

5.4 Model Training Process

5.4.1 Step 1: Environment Setup

# Install required packages
pip install torch transformers peft trl datasets accelerate bitsandbytes

# Load and compress base model
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=quantization_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

5.4.2 Step 2: LoRA Application

The LoRA method adds small, trainable matrices to specific model layers:

from peft import get_peft_model, LoraConfig

peft_config = LoraConfig(
    task_type="CAUSAL_LM",
    target_modules=target_modules,
    r=64,
    lora_alpha=128,
    lora_dropout=0.05
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts (a small fraction of the full model)

5.4.3 Step 3: Training Loop

from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="./models/hta_mistral",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_steps=100,
    logging_steps=25,
    save_strategy="epoch",
    evaluation_strategy="steps",
    eval_steps=100
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=training_args,
    max_seq_length=2048
)

trainer.train()

5.5 Model Validation

5.5.1 Evaluation Metrics

5.5.1.1 1. Perplexity

What it measures: How “surprised” the model is by unseen HTA text - lower scores mean better prediction.

Rationale: Perplexity quantifies whether the model has genuinely learned HTA language patterns. If a model achieves low perplexity on held-out IQWiG reports, it demonstrates it can predict the next word in HTA contexts accurately. For example, after “The evidence quality is rated as”, a well-trained model should predict “moderate” or “high” (low perplexity), not random medical terms (high perplexity).

Why essential: It’s an objective, automated metric that catches overfitting - if perplexity is low on training data but high on validation data, the model memorized rather than learned.
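
A minimal sketch of the check, assuming the fine-tuned model and tokenizer from Section 5.4 are in scope and that held-out report texts fit within the 2,048-token context:

import math
import torch

def compute_perplexity(model, tokenizer, texts, max_length=2048):
    """Exponentiated mean loss over held-out texts (simple per-example average)."""
    model.eval()
    total_loss = 0.0
    with torch.no_grad():
        for text in texts:
            encoded = tokenizer(text, return_tensors="pt",
                                truncation=True, max_length=max_length).to(model.device)
            # The model returns the cross-entropy loss when labels are supplied
            output = model(**encoded, labels=encoded["input_ids"])
            total_loss += output.loss.item()
    return math.exp(total_loss / len(texts))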

5.5.1.2 2. ROUGE Scores

What it measures: Overlap between generated summaries and expert-written reference summaries (word/phrase matching).

Rationale: HTA reports require precise summarization of complex evidence. ROUGE scores (Recall-Oriented Understudy for Gisting Evaluation) compare the model’s summaries against gold-standard IQWiG summaries, measuring whether key phrases like “considerable additional benefit” or “hint of lesser benefit” appear correctly. ROUGE-L specifically captures longest common sequences, ensuring the model maintains IQWiG’s logical flow.

Why essential: Validates that the model can distill lengthy clinical evidence into concise, accurate assessments matching IQWiG’s standardized summary style.
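
A sketch using the rouge_score package (not in the pip install line above; the reference and generated summaries below are placeholders):

from rouge_score import rouge_scorer

# Compare a generated summary against the expert-written reference wording
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "There is an indication of considerable additional benefit for the subpopulation."
generated = "The assessment finds an indication of considerable additional benefit in this subpopulation."

scores = scorer.score(reference, generated)
print(scores["rougeL"].fmeasure)  # longest-common-subsequence F1 score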

5.5.1.3 3. Domain-Specific Accuracy

What it measures: Correct usage of IQWiG-specific terminology and methodological concepts.

Rationale: IQWiG uses precise regulatory language with specific meanings. The model must distinguish between “proof” vs “indication” vs “hint” of benefit (Beleg/Hinweis/Anhaltspunkt in German), use GRADE terminology correctly, and apply the right benefit categories. This metric tests terminology on a curated test set - for instance, given specific evidence, does the model correctly classify it as “major” vs “considerable” additional benefit?

Why essential: Generic medical language isn’t sufficient for regulatory submissions. Misusing terms like “significant benefit” instead of IQWiG’s exact categories could invalidate an assessment.
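
One way to operationalize this metric is sketched below, using the generate_hta_response helper defined later in Section 5.6.1. The test cases, expected labels, and keyword-matching logic are illustrative assumptions, not a shipped test set:

# Curated (evidence description, expected IQWiG category) pairs; illustrative only
test_cases = [
    ("Two RCTs, large and consistent effect on overall survival", "major"),
    ("Single RCT, moderate PFS improvement, no overall survival data", "considerable"),
]

BENEFIT_CATEGORIES = ["major", "considerable", "minor", "non-quantifiable", "none"]

def classify_benefit(evidence):
    """Ask the model for a category and map the free-text answer onto IQWiG terms."""
    answer = generate_hta_response(
        "Classify the extent of additional benefit using IQWiG categories", evidence
    ).lower()
    return next((c for c in BENEFIT_CATEGORIES if c in answer), "unclassified")

correct = sum(classify_benefit(evidence) == expected for evidence, expected in test_cases)
print(f"Domain-specific accuracy: {correct / len(test_cases):.0%}")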

5.5.1.4 4. Human Expert Review

What it measures: Qualitative assessment of readability, logical coherence, and regulatory appropriateness.

Rationale: Automated metrics miss critical nuances that HTA professionals immediately recognize. Experts evaluate whether the model’s outputs would be acceptable in actual IQWiG proceedings - checking for logical argumentation, appropriate hedging language, correct interpretation of statistical significance vs clinical relevance, and adherence to IQWiG’s methodological standards that aren’t captured by word matching.

Why essential: The ultimate test is whether HTA professionals would trust and use the output. Expert review catches dangerous errors (like misinterpreting non-inferiority trials) that automated metrics might miss.

5.5.1.5 Combined Rationale

These four metrics create a comprehensive evaluation framework:

  • Perplexity ensures fundamental language modeling capability
  • ROUGE validates content accuracy and completeness
  • Domain Accuracy confirms regulatory compliance
  • Expert Review provides the final quality gate

Together, they prevent deploying a model that seems statistically competent but fails in real HTA contexts - a critical safety requirement for regulatory healthcare applications.

5.5.2 Validation Strategy

import math

# Split data: 80% training, 20% validation
split_index = int(0.8 * len(reports))
train_reports = reports[:split_index]
val_reports = reports[split_index:]

# Test on unseen reports: trainer.evaluate() returns the validation loss,
# and perplexity is its exponential
eval_metrics = trainer.evaluate()
perplexity = math.exp(eval_metrics["eval_loss"])

if perplexity > PERPLEXITY_THRESHOLD:  # project-specific acceptance threshold
    # Consider increasing training data or adjusting hyperparameters
    optimize_training_parameters()

5.6 Deployment and Inference

5.6.1 Model Usage

def generate_hta_response(question, context):
    """Generate HTA-compliant responses."""

    # Format input following the training schema
    prompt = f"[INST] {question}\n\nContext: {context} [/INST]"

    # Tokenize, move to the model's device, and generate
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        top_p=0.95
    )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

5.6.2 Example Applications

# Evidence quality assessment
response = generate_hta_response(
    question="Assess the evidence quality for this intervention",
    context="Two RCTs (n=1,245 total), 12-month follow-up, consistent findings"
)

# Benefit summary generation
response = generate_hta_response(
    question="Summarize the additional benefit assessment",
    context="IQWiG Report A23-42: Primary endpoint met, QoL improved..."
)

5.7 Critical Success Factors

5.7.1 Data Quality

  • Consistency: Ensure uniform terminology across training examples
  • Completeness: Include all relevant IQWiG assessment types
  • Accuracy: Verify expert annotations before training

5.7.2 Technical Considerations

  1. Memory Management: 4-bit quantization cuts the memory needed to hold the 7B model from roughly 28 GB (fp32) or 14 GB (fp16) to about 4 GB, leaving headroom for LoRA training on the 8-12 GB GPUs listed in the requirements
  2. Training Stability: Gradient accumulation enables larger effective batch sizes
  3. Overfitting Prevention: Regular validation on held-out reports

5.7.3 Performance Optimization

Parameter          Impact               Recommended Value
Learning Rate      Training stability   2e-4 to 5e-4
LoRA Rank (r)      Model capacity       32-64 for medical domain
Batch Size         Memory usage         4-8 with gradient accumulation
Training Epochs    Model performance    3-5 (monitor validation loss)

5.8 Mathematical Foundation

5.8.1 LoRA Mechanism

Instead of updating all model parameters, LoRA introduces low-rank decomposition:

\[W_{new} = W_{original} + BA\]

Where:

  • \(W_{original} \in \mathbb{R}^{d \times k}\) (frozen pre-trained weights)
  • \(B \in \mathbb{R}^{d \times r}\) and \(A \in \mathbb{R}^{r \times k}\) (trainable matrices)
  • \(r \ll \min(d, k)\) (rank constraint)

In practice the update \(BA\) is scaled by \(\alpha / r\) (here \(128 / 64 = 2\)), which is why lora_alpha is set to twice the rank. The decomposition reduces the trainable parameters of each adapted matrix from \(O(dk)\) to \(O(r(d+k))\), typically a reduction of well over 95%.
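
A quick back-of-the-envelope check for one attention projection of Mistral-7B (hidden size 4096) with the rank used here:

d, k, r = 4096, 4096, 64   # q_proj dimensions in Mistral-7B and the chosen LoRA rank
full = d * k               # 16,777,216 parameters if the matrix were updated directly
lora = r * (d + k)         # 524,288 trainable parameters with the LoRA decomposition
print(f"{lora / full:.1%} of the original matrix is trainable")  # ~3.1%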

5.8.2 Training Objective

The model optimizes cross-entropy loss over HTA-specific tokens:

\[L = -\sum_{i=1}^{N} \log P(y_i | x_{<i}, \theta_{base} + \Delta\theta_{LoRA})\]

Where \(\Delta\theta_{LoRA}\) represents the small parameter updates learned from IQWiG data.

5.9 Installation Guide

5.9.1 Quick Start

Do not run: work in progress!

# Clone repository
git clone https://github.com/your-org/dossier-graph.git
cd dossier-graph

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run training pipeline
python train_mistral_lora_hta.py \
    --data_path ./data/iqwig_reports \
    --output_dir ./models/hta_mistral \
    --num_epochs 3

5.9.2 Production Deployment

For production environments, consider:

  1. API Service: Deploy the model behind a REST API for scalability (see the sketch after this list)
  2. Batch Processing: Implement queue system for high-volume requests
  3. Monitoring: Track inference latency and model performance metrics
  4. Version Control: Maintain model versioning for reproducibility
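
A minimal sketch of option 1, assuming FastAPI and uvicorn are available (they are not in the dependency list above) and that the inference helper from Section 5.6.1 has been packaged into an importable module:

from fastapi import FastAPI
from pydantic import BaseModel

from inference import generate_hta_response  # hypothetical module wrapping Section 5.6.1

app = FastAPI(title="Dossier-Graph HTA Assistant")

class HTARequest(BaseModel):
    question: str
    context: str

@app.post("/generate")
def generate(request: HTARequest):
    # Delegate to the inference helper defined in Section 5.6.1
    return {"response": generate_hta_response(request.question, request.context)}

# Run with: uvicorn api:app --host 0.0.0.0 --port 8000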

5.10 Troubleshooting

5.10.1 Common Issues

Issue             Cause                  Solution
Out of Memory     Large batch size       Reduce batch size or increase gradient accumulation
Poor Performance  Insufficient data      Increase training examples or use data augmentation
Overfitting       Too many epochs        Implement early stopping based on validation loss
Slow Training     Inefficient settings   Enable mixed precision training (fp16)
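
For the last two rows, a hedged sketch of how mixed precision and early stopping can be wired into the training setup from Section 5.4.3 (the step counts and patience value are illustrative):

from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="./models/hta_mistral",
    num_train_epochs=5,
    fp16=True,                      # mixed precision addresses slow training
    evaluation_strategy="steps",
    eval_steps=100,
    save_strategy="steps",          # must match the evaluation strategy
    save_steps=100,
    load_best_model_at_end=True,    # required for early stopping on validation loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

# Pass the callback when constructing the SFTTrainer from Section 5.4.3:
#   SFTTrainer(..., callbacks=[EarlyStoppingCallback(early_stopping_patience=3)])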

5.11 Conclusion

The Dossier-Graph system enables organizations to create specialized AI assistants that understand and generate HTA content following IQWiG standards. By leveraging LoRA fine-tuning on carefully curated datasets, the system achieves domain expertise while maintaining computational efficiency.

5.11.1 Next Steps

  1. Collect and prepare IQWiG reports for training
  2. Set up development environment following this guide
  3. Train initial model and evaluate performance
  4. Iterate based on domain expert feedback
  5. Deploy for production use

5.12 References

  • Hu, E. J., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685
  • IQWiG Methods Papers and Guidelines: www.iqwig.de
  • Hugging Face PEFT Documentation: https://huggingface.co/docs/peft
