Evaluation plan Hold out 200 examples per adapter Metrics: JSON format compliance rate, answer accuracy, BLEU score for explanations Run base model vs fine-tuned model on same test set Document ...