Two working papers on whether inference-time self-refinement can improve citation faithfulness in retrieval-augmented generation when the backbone model is small ($\leq 4$B parameters). Phase 1 studies a cheap lexical heuristic; the extended paper adds token-level mechanistic attribution and a cross-backbone replication on Llama-3.2-3B. Evaluated on ALCE-ASQA and GaRAGe with Qwen3-4B. Both papers are shareable pre-prints; the extended version is a work in progress.
Compares six refinement conditions on Qwen3-4B over ALCE-ASQA (5 retrieved passages, mostly relevant) and GaRAGe (15 retrieved passages, mixed relevance). Headline findings: (1) Self-Refine degrades citation precision on clean retrieval by $9.37$ NLI points ($p<0.01$); (2) a twenty-line passage-overlap heuristic beats Self-Refine on both benchmarks and beats the no-refinement baseline on GaRAGe by $+7.09$ RFCP points; (3) STR-EM detects none of these effects. Includes a deployment-oriented decision tree and an AML worked example.
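The paper's actual twenty-line heuristic is not reproduced here, but the core idea of a passage-overlap check is simple enough to sketch. A hypothetical minimal version (function names, stopword list, and threshold are illustrative assumptions, not taken from the paper): keep a citation only when the cited sentence's content tokens sufficiently overlap the retrieved passage.

```python
# Hypothetical sketch of a lexical passage-overlap citation check.
# Not the paper's implementation; names and the 0.5 threshold are illustrative.

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "that", "on"}

def overlap_score(sentence: str, passage: str) -> float:
    """Fraction of the sentence's content tokens that also appear in the passage."""
    sent_tokens = {t for t in sentence.lower().split() if t not in STOPWORDS}
    passage_tokens = set(passage.lower().split())
    if not sent_tokens:
        return 0.0
    return len(sent_tokens & passage_tokens) / len(sent_tokens)

def keep_citation(sentence: str, passage: str, threshold: float = 0.5) -> bool:
    """Retain the citation only if lexical overlap clears the threshold."""
    return overlap_score(sentence, passage) >= threshold
```

A check like this costs one set intersection per (sentence, passage) pair, which is why it can be viable where an extra LLM critique pass is not.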
Extends Phase 1 with three additions. (a) An attempt to replace the lexical heuristic with token-level DynamicLRP attribution, its failure on Qwen3-4B, and a pivot to plain input$\times$gradient. (b) A critique-prompt ablation identifying the "binary-trigger insight": attribution's value is in \emph{when} to refine, not \emph{what} to criticize. (c) A cross-backbone replication on Llama-3.2-3B using LXT v2.1's AttnLRP, which reproduces the Self-Refine advantage directionally but not the beats-baseline advantage (Qwen3-specific). Honest negative results documented throughout.
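The input$\times$gradient fallback mentioned in (a) is the standard attribution rule $a_i = x_i \cdot \partial f / \partial x_i$. A minimal illustration on a linear scorer $f(x) = w \cdot x + b$, where the gradient is just $w$ (the function name is illustrative; the paper applies the rule to transformer token embeddings, not a toy linear model):

```python
# Input×gradient attribution for a linear scorer f(x) = w·x + b.
# For a linear model, df/dx_i = w_i, so the attribution of feature i is x_i * w_i.
# Toy illustration only; in the paper the rule is applied to token embeddings.

def input_times_gradient(x: list[float], w: list[float]) -> list[float]:
    """Per-feature attributions a_i = x_i * w_i for f(x) = w·x + b."""
    return [xi * wi for xi, wi in zip(x, w)]
```

For linear models the attributions sum exactly to $f(x) - b$ (completeness), which makes this toy case a convenient sanity check before moving to the nonlinear transformer setting.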