I'm an AML model developer with about five years at Scotiabank, where I've led end-to-end development and productionization of transaction monitoring models — everything from insider threat detection to cartel-linked financial flows in Mexico. My work sits at the intersection of applied ML, regulatory compliance, and stakeholder management: I've worked directly with FIU investigators, gone through full MVA cycles, and navigated Privacy, Legal, and IT to get models into production.
What pulls me toward consulting at this point is the breadth problem. At a bank you go deep in one institution's data and one regulator's framework. I've been researching cross-jurisdictional AML patterns — LATAM, Canada, Mexico — and I want to work across the industry rather than optimize one firm's detection stack. Deloitte's FE&M group is where I can bring everything I've built and also keep growing across client types and problem spaces.
There are two distinct things I'm after. First, problem diversity — consulting means exposure to clients at different maturity levels, different TM platforms, different typology focuses. That kind of breadth in a short time is hard to replicate in-house. Second, Deloitte specifically sits at the intersection of regulatory advisory and technical delivery, which is exactly where AML is heading. The firms doing the most interesting work on model governance, agentic architectures for compliance, and cross-border detection are the Big 4, not individual banks.
From a career standpoint, I've reached a point at Scotiabank where I'm defining strategy as much as executing it. Consulting at Senior Consultant level is a natural next step where I can start advising clients on the same decisions I've been making internally — but across many institutions simultaneously.
I want to be careful to frame this honestly — I'm not leaving because anything is wrong. Scotiabank gave me a tremendous platform. But five years in, the growth question I'm asking myself is: am I deepening one institution's capabilities, or broadening how the industry approaches the problem? I've been doing research in LLM-based AML detection, agentic architectures, graph-based entity surfacing — and I want to see those applied across the industry, not just in one bank's risk stack.
Consulting lets me pursue that without sacrificing domain depth, which I think is rare. Most AML consultants come from either compliance or data science — I've done both in parallel, and that's where I can add the most value at Deloitte.
In three to five years I'd like to be a recognized expert at the intersection of AML and applied AI — advising clients not just on model mechanics but on governance, regulatory defensibility, and how to architect AI-enabled compliance programs. At Deloitte that maps to Manager or Senior Manager level, where I'm leading engagements and mentoring the next generation of AML analysts.
Longer term, I'm genuinely interested in contributing to how regulators think about AI in financial crime — whether that's through published research, working with FINTRAC, or contributing to Wolfsberg working groups. Deloitte's advisory relationships make it one of the best platforms for that kind of influence.
Situation: At Scotiabank I was simultaneously leading the Dominican Republic model development as part of the LATAM initiative, maintaining the Student and Internal Monitoring models already in production, and supporting a FINTRAC audit readiness exercise — all with a team of two analysts and three co-ops.
Task: I needed to keep production models running without incident, deliver the DR model on a regulatory deadline, and produce audit documentation in parallel.
Action: I created a project allocation framework using my Git analytics dashboard — which I'd built to track commits by project and contributor — to give me a real-time view of where effort was going. I triaged ruthlessly: audit documentation got prioritized because it was regulatory-driven and immovable. For the DR model, I front-loaded data exploration so my co-ops could run scenarios independently while I focused on documentation. I also had a weekly sync with the Director to surface blockers early rather than absorbing them.
Result: We hit all three deadlines. The audit documentation passed MVA review without major findings, the DR model delivered its first detection run on schedule, and neither production model missed a batch submission.
Situation: The Internal Monitoring model — our insider threat detection — used generative AI and LLM-based features on employee transaction and HR data. I had to present it to our Legal and Privacy teams for PIA approval, and they had no ML background.
Task: Get sign-off from stakeholders who were skeptical of AI and sensitive to employee data use.
Action: I stripped out all ML jargon and reframed the model around regulatory intent — what suspicious behaviors it was designed to catch, mapped directly to FINTRAC typologies. I walked through three real anonymized case examples showing inputs and outputs without ever using terms like "embedding" or "logit." For the LRP-based explainability layer, I showed them a feature attribution heatmap and said "this is why the model flagged this employee — these six transactions, these HR fields" rather than explaining how LRP works mathematically.
Result: PIA approved on first submission. Privacy lead later told me it was the clearest AI model documentation she'd reviewed.
Situation: Early in my ABAC model development, I underestimated the data quality issues in the third-party supplier payment data — specifically that vendor names were inconsistently formatted across source systems, which broke my NLP name-matching logic.
Task: Get the model through MVA review on its original timeline — which instead slipped by about six weeks while we fixed upstream data issues.
Action: I was too optimistic in my initial data profiling and hadn't stress-tested edge cases. After that experience, I built a formal data readiness checklist into my model development process — completeness, consistency, entity resolution validation — before committing to any timeline. I also started scheduling a "data spike" phase explicitly in project plans so stakeholders understood it as a real workstream, not just prep.
Result: Every model I've led since has shipped without major data quality rework. The ABAC model did eventually go fully live and passed MVA.
Situation: Our IT deployment team had chosen an architecture for the LATAM model that I believed would create batch latency issues in production — but IT owned the technical design decisions.
Task: Change the approach without the authority to mandate it.
Action: I built a quick simulation showing estimated batch times under the proposed architecture versus an alternative partitioning approach — using actual transaction volumes from the Mexico jurisdiction. I framed it around regulatory risk: if batch runs exceed FINTRAC's STR reporting window, that's a compliance gap, not just a performance issue. That reframing made it an IT risk conversation rather than a design preference disagreement. I presented it jointly to the IT lead and the Director rather than going over anyone's head.
Result: The architecture was revised before deployment. The IT lead later said it was the first time a model team had brought them a production risk argument supported by numbers.
Situation: When I took on the Student Monitoring Model, the FIU investigators who would use the output were skeptical of AI-generated alerts — they'd had bad experiences with high false positive rates from legacy rule-based systems.
Task: Build enough trust that investigators would actually act on model-generated alerts rather than dismissing them.
Action: I ran a shadow period — for the first two months, I sent investigators the same cases my model flagged alongside their existing caseload, asked for feedback on every alert, and showed them the feature attribution explaining why each case was flagged. I didn't ask them to change their workflow; I asked them to tell me when I was wrong. Then I iterated on the model based on their feedback, shared the updated metrics, and let them see the improvement firsthand.
Result: By month three, investigators were proactively requesting model output ahead of their review cycles. The model now runs at a 70–85% STR conversion rate. They trust it because they helped shape it.
Situation: When the Mexico jurisdiction drug trafficking model was scoped, the ask was essentially "build an AI/ML capability where only rules exist today" — no defined features, no labeled typology data, and limited local institutional knowledge about cartel financial patterns.
Task: Turn that open brief into a production-ready model.
Action: I started with a typology research phase — working with external domain experts who had investigative backgrounds in the Mexico market, reviewing FATF and FinCEN guidance on cartel-linked transactions, and running exploratory analysis on cross-border wire and cash patterns. That let me formulate a detection hypothesis before touching any model design. I then ran the design through the Director and compliance SMEs before committing to feature engineering, so we weren't building in the wrong direction. The entire scoping phase took four weeks before a single model was trained.
Result: First ML detection capability for the Mexico jurisdiction, deployed into production as part of the LATAM initiative.
Situation: After my first MVA submission — early in the ABAC model cycle — the validation team came back with a finding that my model documentation lacked sufficient discussion of model risk and limitations. It was accurate feedback; I'd focused the documentation heavily on methodology and performance.
Task: Address the finding and rebuild my documentation standard going forward.
Action: I sat down with the lead validator to understand exactly what they expected — not just for this model, but what a gold-standard submission looks like. I rebuilt the risk and limitations section with explicit coverage of: data quality risks, distributional shift risks, adversarial gaming scenarios, and regulatory interpretation uncertainty. I also created a documentation template with that structure baked in, which I've used for every model since.
Result: The revised submission passed. More importantly, the template has shortened our MVA cycles significantly — subsequent models have had fewer resubmissions.
I'll walk through the Internal Monitoring Model since it's the most technically rich. The business problem was insider threat — employees potentially using access and position to facilitate or ignore money laundering.
Data: We pulled transaction data, KYC event data, and HR data — performance records, access logs, role history. Linking these without exposing unnecessary PII required careful data governance scoping, which I worked through with Privacy.
Feature engineering: I designed behavioral features at multiple temporal granularities — daily, weekly, rolling 90-day. Key signals included unusual manual overrides, transaction velocity changes relative to peer groups, and access pattern anomalies. I also used LLM-based features: prompt engineering on transaction narratives and HR notes to surface semantic risk signals that rule-based logic would miss.
Model: Anomaly scoring using an ensemble approach, with LRP-based attribution to explain why each alert was generated — which was a regulatory requirement for FIU investigators.
Validation and production: Full MVA cycle, fairness validation, PIA approval. Deployed through Oracle FCC Studio. Monthly batch, with continuous feedback loops from SIU investigators that feed into feature updates.
Result: 70–85% STR conversion rate, in active production use.
AML is one of the most extreme imbalance problems in applied ML — true positives can be 1-in-10,000 or worse, and the labeling problem is compounded by the fact that "clean" doesn't mean innocent. So I think about this in layers.
First, I prefer reframing as anomaly detection where possible, which sidesteps needing positive labels entirely. For supervised approaches, I use a combination of: cost-sensitive learning (asymmetric misclassification penalties calibrated to investigation capacity), careful threshold selection optimized on precision-recall rather than accuracy, and SMOTE or other synthetic oversampling only after I understand whether the minority class is well-defined enough for interpolation to be meaningful.
More importantly, I'm skeptical of standard imbalance fixes in AML because the minority class is heterogeneous — one model's typology space can look like several distinct distributions. I often build typology-specific sub-models rather than one general model, which reduces within-class variance and makes the imbalance problem more tractable.
Evaluation metric: I never report accuracy. Precision at fixed alert volumes, recall at a given FPR, and lift curves are the right measures for an investigation workflow context.
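To make the evaluation point concrete, here's a minimal sketch of the measures I'd report instead of accuracy — precision, recall, and lift at a fixed alert volume. This is illustrative code with made-up arrays, not production tooling:

```python
import numpy as np

def precision_at_k(y_true, scores, k):
    """Precision among the k highest-scoring alerts."""
    top_k = np.argsort(scores)[::-1][:k]
    return y_true[top_k].mean()

def recall_at_k(y_true, scores, k):
    """Share of all true positives captured in the top k alerts."""
    top_k = np.argsort(scores)[::-1][:k]
    return y_true[top_k].sum() / y_true.sum()

def lift_at_k(y_true, scores, k):
    """Precision in the top k relative to the population base rate."""
    return precision_at_k(y_true, scores, k) / y_true.mean()
```

In an investigation workflow, k is the review capacity per batch, so "precision at k" answers the only question investigators care about: of the alerts we can actually work, how many are real?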
The naive answer is "false negatives are worse because we miss money laundering." The honest answer is more nuanced.
False negatives carry regulatory risk — missing a case that FINTRAC later flags, or a pattern that an enforcement action reveals. False positives carry operational risk — analyst burnout, review backlogs, and ultimately lower detection quality because overwhelmed investigators start dismissing alerts less carefully.
In practice I calibrate thresholds against investigation capacity first. If the FIU team can review 500 alerts per batch cycle, I optimize for the best recall achievable at that alert volume, not the theoretical recall maximum. That's why I work so closely with FIU teams — they tell me where their review quality starts degrading, and that sets my operating point.
For high-stakes typologies — like cartel financing or insider threat — I'll accept more false positives. For lower-risk, high-volume typologies, I'll optimize for precision. It's a deliberate, typology-specific calibration, not a single model-wide threshold.
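A hedged sketch of what that typology-specific, capacity-driven calibration looks like mechanically — the budget numbers and typology names here are hypothetical, but the mechanic is just "pick the score cutoff that fills each typology's review budget":

```python
import numpy as np

def capacity_threshold(scores, capacity):
    """Score cutoff that fills, but does not exceed, the review budget
    (ties at the cutoff could push volume slightly over; ignored here)."""
    if capacity >= len(scores):
        return -np.inf          # budget covers everything: alert on all
    return np.sort(scores)[::-1][capacity - 1]

# Hypothetical per-typology review budgets for one batch cycle:
# high-stakes typologies get a larger share of investigator capacity.
budgets = {"cartel_financing": 300, "insider_threat": 120, "structuring": 80}

def route_alerts(scored, budgets):
    """scored: {typology: array of alert scores}. Returns per-typology
    thresholds so the total queue matches investigation capacity."""
    return {t: capacity_threshold(scored[t], budgets[t]) for t in budgets}
```

The point of structuring it this way is that the thresholds fall out of a capacity conversation with the FIU, not out of a model-side optimization.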
I've prepared documentation for multiple independent MVA reviews at Scotiabank, and I treat it as a core deliverable from day one of model development, not an afterthought.
A strong MVA package needs: complete model conceptual soundness documentation — why this approach, why these features, what assumptions were made; data quality and lineage documentation; in-sample and out-of-sample performance metrics; sensitivity analysis; model risk and limitation analysis; fairness validation results if the model touches protected attributes; and an ongoing monitoring plan with defined thresholds for model refresh triggers.
The biggest trap developers fall into is under-documenting limitations — validators will find weaknesses and if you haven't disclosed them, it looks like you didn't know or were hiding them. I include an explicit "known limitations and mitigations" section in every submission, which tends to accelerate review because validators can see I've already thought through the failure modes.
At Deloitte in a consulting context, I'd expect to support clients through both sides: helping them build MVA-ready documentation and helping validation teams structure their review frameworks.
Rules are interpretable, auditable, and directly mappable to regulatory typologies — a regulator can read a rule and understand what it's catching. They're also fragile: known-pattern detection only, gameable by sophisticated actors, and they generate a lot of false positives at the thresholds needed for meaningful recall.
ML models can detect novel or complex patterns, adapt to behavioral shifts over time, and significantly reduce false positives at equivalent recall. But they carry model risk, require ongoing validation, and have explainability challenges that regulators increasingly scrutinize.
My recommendation is always a hybrid architecture. Rules as a foundational layer for typology coverage that regulators expect to see — they're your documented audit trail. ML as a supplementary detection layer that catches behavioral anomalies the rules miss, with explainability tooling so investigators can see why a case was flagged. In a client engagement I'd assess their maturity, regulatory posture, and investigation capacity before recommending how aggressively to shift the balance toward ML.
My graph work spans a few different contexts. At RBC as a co-op, I built a graph-based anomaly detection model on the EMT network, using community detection to surface money laundering clusters — that was my first exposure to transaction graph analysis.
At Scotiabank I work on graph-based entity surfacing for the FIU — using NetworkX and PySpark to build entity resolution graphs across client accounts, counterparties, and beneficial ownership structures. The goal is surfacing shell company structures and layering patterns that are invisible at the transaction level.
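As a simplified illustration of the entity-surfacing idea — the production version runs on PySpark at scale, but this toy NetworkX sketch with invented entities shows the mechanic: connected components to cluster linked entities, with cycles inside a cluster as a classic layering signal:

```python
import networkx as nx

# Hypothetical resolved edges: (entity, counterparty) pairs drawn from
# account, wire, and beneficial-ownership records.
edges = [
    ("AcmeHoldings", "ShellCo_A"), ("ShellCo_A", "ShellCo_B"),
    ("ShellCo_B", "AcmeHoldings"),            # circular flow back to origin
    ("RetailClient1", "Merchant1"),           # ordinary, unconnected activity
]

G = nx.DiGraph()
G.add_edges_from(edges)

# Weakly connected components group entities that transact with each
# other at all; a directed cycle within a component suggests layering.
for comp in nx.weakly_connected_components(G):
    if list(nx.simple_cycles(G.subgraph(comp))):
        print("possible layering cluster:", sorted(comp))
```

None of this is visible at the single-transaction level — each hop looks like an ordinary payment until you assemble the graph.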
My current research — which I'm writing up as AMLReason — combines GNNs with LLM reasoning over transaction subgraphs. The idea is that a GNN learns structural embeddings of the local transaction neighborhood, and an LLM then performs chain-of-thought reasoning over that context to generate an explainable risk narrative. It's a post-screening step that supports FIU triage rather than replacing the primary model.
The core technical challenge in AML graph work is label sparsity and the clean-label assumption problem. Most graph ML benchmarks assume clean binary labels, which doesn't hold in real bank data where suspicious and clean transactions are often co-mingled.
Explainability isn't one thing — it means different things to different audiences, and you need to design for each. For investigators, I care about local explanations: why was this specific transaction flagged, which features drove the score? For validators and regulators, I care about global model behavior: what patterns does the model use, does it exhibit bias, is it stable across population segments? For Legal and compliance, I care about conceptual soundness: does the model's logic map to known typologies?
Technically, I've used LRP-based feature attribution, SHAP, and attention weight analysis for LLM-based features. For regulatory defensibility specifically, I design features that are interpretable from the start — behavioral ratios, peer-group comparisons, time-series deltas — rather than opaque embeddings where I'd have to work backwards to explain them.

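A small illustration of the "interpretable from the start" point — behavioral ratios, peer-group comparisons, and time-series deltas are each one line of pandas, and the column names are readable by an investigator without any attribution tooling (hypothetical data and column names):

```python
import pandas as pd

# Hypothetical monthly activity table: one row per client per month.
df = pd.DataFrame({
    "client":     ["a", "a", "b", "b"],
    "peer_group": ["retail", "retail", "retail", "retail"],
    "month":      [1, 2, 1, 2],
    "cash_in":    [1000.0, 5000.0, 900.0, 950.0],
    "wires_out":  [200.0, 4800.0, 100.0, 120.0],
})

# Behavioral ratio: how much of incoming cash leaves again as wires.
df["wire_out_ratio"] = df["wires_out"] / df["cash_in"]

# Peer-group comparison: deviation from the peer-group mean that month.
grp = df.groupby(["peer_group", "month"])["cash_in"]
df["cash_vs_peers"] = df["cash_in"] - grp.transform("mean")

# Time-series delta: month-over-month change per client.
df["cash_delta"] = df.groupby("client")["cash_in"].diff()
```

When a case is flagged, "wire-out ratio jumped from 0.2 to 0.96 and cash inflow ran $2,000 above peers" is an explanation an investigator can put straight into an STR narrative.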
My current research on LRP-based attribution for anomaly detection is specifically motivated by this — building explainability into the architecture rather than bolting it on post-hoc, which is what OSFI's E-23 guidance is pushing toward for model risk management.
OSFI E-23 is the Office of the Superintendent of Financial Institutions' guideline on model risk management — it establishes expectations for model development, validation, and ongoing performance monitoring. The core framework is a tiered model risk classification — models are rated by risk level, which determines the depth of validation and governance required.
In practice it shaped my work in two ways. First, every model I build at Scotiabank has to be classified under E-23 before validation scope is determined — our ABAC and internal monitoring models are both high-risk-tier, which means full independent MVA with challenger model testing. Second, E-23's guidance on model limitations documentation directly influenced the template I built — the guideline specifically calls out the need to document known limitations, assumptions, and compensating controls.
The emerging guidance on AI and ML models under E-23 is also directly relevant to my research — specifically the expectation around explainability and ongoing monitoring for distributional shift, which is one of the design requirements I've built into my anomaly detection framework.
FINTRAC requires Suspicious Transaction Reports to be filed within 30 days of a reporting entity forming reasonable grounds to suspect a transaction is related to money laundering or terrorist financing — or within three days for cases involving terrorist property.
The "reasonable grounds to suspect" standard is lower than "reasonable grounds to believe" — which is intentional. It means institutions should file when there's a reasonable basis for suspicion, not certainty. That matters for model design: our alert threshold isn't "is this definitely laundering?" but "does this warrant an investigation that could form the basis of an STR?"
Constraints on model design include: batch timing — our monthly batch cycles must leave investigators enough time to complete their review and file within the 30-day window, which drove the IT architecture discussion I mentioned earlier. Record retention requirements mean we log every model input and output for auditability. And the narrative quality of STRs — investigators need clear, articulable reasons to include in the report, which is exactly why LLM-based narrative generation is a priority for our FIU team.
The Wolfsberg Group is a consortium of major global banks that publishes guidance on financial crime risk management — their Principles are industry best practice, not regulatory mandates, but they carry significant weight with regulators and auditors because they represent collective bank expertise.
For AML model design, the most relevant Wolfsberg guidance covers: the risk-based approach — that transaction monitoring should be calibrated to the institution's specific risk profile, not a generic rule set; typology coverage — models should be grounded in known financial crime methods; and effectiveness measurement — institutions should be able to demonstrate that their monitoring program is actually detecting suspicious activity, not just generating alerts.
That last point directly shaped how I report model performance. I report not just alert volume but STR conversion rate and, where possible, STR quality metrics from investigators. "We generated 500 alerts" is a compliance statement. "70–85% of alerts resulted in an STR" is an effectiveness statement — that's what Wolfsberg is pushing for, and that's what Deloitte's clients increasingly need to demonstrate to their regulators.
ABAC sits at the intersection of AML and sanctions compliance — the typologies overlap but the data sources are different. Where standard AML TM focuses on customer transaction behavior, ABAC analysis focuses on third-party payments: supplier invoices, vendor disbursements, consulting fees, and other business-to-business flows that can disguise bribery of government officials or business partners.
The red flags I designed the Scotiabank ABAC model to detect included: invoice amounts that are round numbers or just below approval thresholds (structuring), payments to vendors with no verifiable business presence, payments to shell companies in high-risk jurisdictions, unusual timing patterns relative to procurement cycles, and keyword patterns in transaction narratives — terms like "facilitation," "agent fees," or "consulting" in high-risk contexts.
The NLP component — name matching and keyword search in SparkSQL and Python — was particularly important for surfacing Politically Exposed Person connections in vendor relationships, where a supplier might be a shell owned by a PEP relative.
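The production matching ran in SparkSQL and Python; here's a stripped-down, standard-library sketch of the normalize-then-fuzzy-match idea. The suffix list is illustrative, not the production rule set:

```python
import re
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase, strip punctuation, drop common corporate suffixes."""
    name = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    name = re.sub(r"\b(inc|ltd|llc|s a|sa|corp|co)\b", "", name)
    return " ".join(name.split())

def name_similarity(a: str, b: str) -> float:
    """Fuzzy similarity between normalized vendor names, in [0, 1]."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

Normalization does most of the work: "Acme Consulting S.A." and "ACME CONSULTING LTD." collapse to the same string, which is exactly the pattern that breaks naive exact-match joins across source systems.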
It's a compliance-side model, so the stakeholders are different from standard AML — Legal and Compliance own it rather than FIU, which changed how I documented and reported results.
SR 11-7 is the Federal Reserve's guidance on model risk management — the US equivalent of E-23, and in many ways its template. It established the foundational three-pillar framework: robust model development, implementation, and use; independent model validation; and governance, policies, and controls, including ongoing monitoring. It's become the de facto global standard and heavily influenced how OSFI structured E-23.
Key differences: SR 11-7 is US bank regulation, so it applies to Fed-supervised institutions. E-23 is OSFI guidance for Canadian federally regulated financial institutions. In practice, the conceptual frameworks are closely aligned, but E-23 incorporates more recent thinking on AI and machine learning governance — which makes sense given that E-23 was updated more recently.
My direct experience is with E-23, but the principles I apply — model tiering, independent validation, limitations documentation, ongoing monitoring — are fully compatible with SR 11-7. For Deloitte clients that are cross-border or US-based, I'd be able to map between the frameworks without issue.
First, I try to understand what's actually driving the pushback. Clients push back for different reasons: they think the recommendation is wrong on the merits; they're worried about internal political implications; they have a budget or resource constraint they haven't disclosed; or they feel the recommendation doesn't account for their specific operating context.
My instinct is to ask questions rather than defend immediately. "Help me understand what's giving you pause" usually surfaces the real concern. If it's a technical disagreement, I want to hear their reasoning — they know their data and context better than I do, and they might be right. If the concern is implementation-side, I can usually propose a phased approach or a pilot that lowers the risk of commitment.
I've had this dynamic internally with stakeholders at Scotiabank — particularly with IT teams and Compliance — and the pattern that works is separating the recommendation from the relationship. You can disagree firmly on the recommendation while making it clear you're on the same side of the table. If I believe the recommendation is right and they still decline, I document it clearly — for both the client's protection and ours.
In the first week, I want to understand four things: the regulatory context they're operating in, the state of their data infrastructure, who the real decision-makers are, and what the previous state of their AML program looks like — including any prior findings or enforcement actions.
I do that by reading everything first — prior engagement materials, regulatory correspondence, any model documentation that exists — before asking questions. When I do ask, I ask investigators and analysts more than I ask management, because they know where the bodies are buried. An FIU investigator will tell you in 20 minutes which alerts are actually useful and which are noise; that takes management months to acknowledge.
I also do a fast data profile on their transaction and customer data in the first week. Volume, completeness, missingness patterns, entity resolution quality. That tells me a lot about program maturity before anyone says a word.
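That first-week profile doesn't need heavy tooling — a sketch of the one-screen summary I mean, in pandas for illustration (at client scale this would be PySpark, and I'd add entity-resolution checks like duplicate client IDs and name variants):

```python
import pandas as pd

def quick_profile(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: type, completeness, and cardinality."""
    return pd.DataFrame({
        "dtype":        df.dtypes.astype(str),
        "non_null_pct": (df.notna().mean() * 100).round(1),
        "n_unique":     df.nunique(),
    })
```

Ten minutes with this on a client's transaction extract tells you whether the detection conversation or the data-remediation conversation comes first.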
Having built models across three jurisdictions at Scotiabank — Canada, Mexico, and the Dominican Republic — with cross-border complexity on top, I've had to learn new regulatory environments quickly. The underlying framework is the same; the jurisdiction-specific parameters are learnable in days.
I've dealt with this constantly — "build us an ML model for this jurisdiction" without a defined typology, feature set, or evaluation framework. The instinct to just start building is wrong.
My approach: force scope clarity early by proposing a concrete deliverable and asking the client to react to it, rather than asking open-ended questions. A straw-man deliverable surfaces real requirements faster than a requirements workshop. "Here's what I think a Phase 1 detection capability looks like — does this align with what you're expecting?" immediately shows you where the gap is.
I also separate what's variable from what's fixed. Regulatory requirements are fixed. Data availability is fixed. Business priorities and model scope are variable. Anchoring on the fixed constraints first limits how much the scope can drift.
On shifting requirements specifically: I document the change, understand why it changed, and assess the impact on timeline and budget before agreeing to it. Not to be obstructionist, but because clients often don't know what a scope change costs until they see it in writing.
A good internal analyst optimizes depth — they get better and better at one institution's specific data, systems, and processes. That's hugely valuable but it's a narrow lens.
A good consultant has to add three things an internal analyst usually doesn't: breadth — pattern recognition across many institutions and contexts; externality — the ability to see things a client's internal team can't because they've normalized them; and communication velocity — the ability to get to the point, make a recommendation, and stand behind it under time pressure.
The thing I'm most deliberately developing for this transition is communication velocity. In-house, you have time to be thorough. In consulting, the client is paying for judgment, not just analysis. I've been practicing that in how I present to executives at Scotiabank — leading with the recommendation and the key implication, then supporting it, rather than building through all the analysis to a conclusion.
That's a program in trouble — 2% STR conversion on 10,000 alerts means 9,800 wasted reviews per month, and it means investigators have probably stopped taking alerts seriously, which creates real regulatory risk.
I'd start with a diagnostic before prescribing anything. I'd want to know: what's generating the alert volume — rules or ML? What's the breakdown of alert types — are certain scenarios responsible for most of the false positives? What does the alert age distribution look like — are investigators working through the backlog or is it growing? And critically: of the 2% that converted, what did those cases have in common?
The likely findings are either overly broad rules with thresholds set too low, or a mismatch between the detection model and the actual risk profile of the client population.
The fix usually involves a combination of: threshold recalibration with investigator feedback informing what "useful" looks like; alert prioritization scoring — not all 10,000 alerts are equal, so stratifying them lets investigators focus on the highest-quality cases; and potentially a supervised alert scoring model trained on historical dispositions. I'd also want to understand if there are typologies genuinely absent from the ruleset, because sometimes a low STR rate means you're not detecting the right things at all, not just generating false positives.
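For the alert-scoring idea specifically, a hedged sketch of the mechanic — train on historical investigator dispositions, score the open queue, and route only the top of it. The data and features here are synthetic and purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic historical alerts: features plus the investigator's
# disposition (1 = escalated to STR, 0 = closed as false positive).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))   # e.g. amount z-score, velocity change, ...
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 1).astype(int)

# Score each alert by estimated probability of a useful disposition,
# then send only the highest-priority slice to investigators.
model = LogisticRegression().fit(X, y)
priority = model.predict_proba(X)[:, 1]
top_alerts = np.argsort(priority)[::-1][:100]   # best 100 of 1,000
```

The caveat I'd flag to a client: dispositions encode past investigator behavior, so if the old program was dismissing a real typology, the scoring model learns to dismiss it too — which is why I'd pair this with the typology-coverage review.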