When we started building the ingestion layer for Vendrsec, we assumed the hardest part would be the prioritization logic. It wasn't. The hardest part was answering a deceptively simple question: when Qualys says a host has CVE-2024-21413 and Tenable says the same host has CVE-2024-21413, are those the same finding?
Sounds trivial. It isn't. Different scanners use different asset identifiers, different plugin logic, different severity mappings, and different notions of what "remediated" means. Before you can prioritize anything, you need a normalized findings schema that speaks the same language regardless of where the finding originated. Most security teams either skip this step (and drown in duplicates) or spend weeks building fragile normalization scripts that break with every scanner update.
This post covers what we learned building multi-scanner correlation from scratch: where duplicates actually come from, what a workable normalized schema looks like, and the deduplication logic that doesn't introduce more problems than it solves.
Why Duplicates Happen (And Why They're Not Your Scanner's Fault)
The reflexive answer is "the scanners overlap." That's partially true, but it misses the real sources of duplication. In practice, duplicates come from three distinct places.
Asset identity divergence. Qualys keys hosts primarily by IP address. Tenable uses DNS hostname as a primary identifier where available. Wiz operates on cloud resource ARNs. (QIDs and Nessus plugin IDs identify the vulnerability check, not the host, so they can't bridge this gap.) When you have a host at 10.0.1.42 that's also `api-gateway-prod.internal` and also `arn:aws:ec2:us-east-1:123456789:instance/i-0abc123`, you have three different asset keys pointing at one underlying asset. Every finding on that host will appear three times unless you resolve the asset identity layer first.
CVE representation differences. One scanner may flag a finding as CVE-2024-1234 directly. Another may flag it as a compound finding covering CVE-2024-1234 and CVE-2024-1235 under a single plugin. A third may reference the CVE but map it to a vendor advisory (RHSA-2024:1234) rather than the CVE ID itself. These aren't strict duplicates — they're the same underlying vulnerability expressed through different reference frames.
Temporal scan overlap. If Qualys runs a full scan at 2:00 AM and Tenable runs at 6:00 AM and you process findings from both at 9:00 AM, you're looking at two snapshots of a moving target. A finding remediated at 4:00 AM will appear in Qualys but not in Tenable, and both are technically correct.
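The CVE representation differences described above get easier to reconcile once every finding carries a plain list of CVE IDs. Here is a minimal Python sketch, assuming the scanner's reference data arrives as free-text strings. The function name `extract_cve_ids` is hypothetical, not part of any scanner API.

```python
import re
from typing import List, Optional

# A CVE ID is "CVE-", a four-digit year, and a sequence number of four or more digits.
CVE_PATTERN = re.compile(r"\bCVE-(\d{4})-(\d{4,})\b", re.IGNORECASE)


def extract_cve_ids(*reference_fields: Optional[str]) -> List[str]:
    """Return the distinct CVE IDs found in the given text fields, upper-cased and sorted."""
    found = set()
    for text in reference_fields:
        # Missing fields (None) are treated as empty strings.
        for year, sequence in CVE_PATTERN.findall(text or ""):
            found.add(f"CVE-{year}-{sequence}")
    return sorted(found)


# A compound plugin title and a separate reference field collapse to one list.
print(extract_cve_ids("cve-2024-1234, CVE-2024-1235", "CVE-2024-1234"))
```

Vendor advisory IDs such as RHSA-2024:1234 contain no CVE ID, so this sketch ignores them. Mapping them back to CVEs would need a separate advisory lookup, which is not attempted here.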
The Normalized Findings Schema
Before writing any deduplication logic, you need a schema that every scanner's output gets translated into. The schema has to be opinionated enough to enable comparison, but flexible enough to preserve scanner-specific metadata that you'll need later.
The fields we settled on for Vendrsec's normalized findings schema:
- canonical_asset_id: your internal stable identifier for the host (not the scanner's key — see below)
- cve_ids[]: normalized CVE identifiers, always in CVE-YYYY-NNNN format (the sequence number has four or more digits)
- source_scanner: qualys | tenable | wiz | crowdstrike | rapid7
- source_finding_id: the scanner's native ID (QID, plugin ID, etc.) — preserved, not discarded
- cvss_base_score: CVSS 3.1 base score, normalized from whichever version the scanner reported
- cvss_vector: full CVSS vector string
- scanner_severity: the scanner's own severity label (Critical, High, etc.) — preserved separately from CVSS
- first_seen_ts: when this scanner first reported this finding on this asset
- last_seen_ts: the most recent scan in which this finding appeared
- remediation_state: open | fixed | accepted | reopened
- raw_finding_ref: pointer to the original scanner record, for audit purposes
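One way to express the field list above is a Python dataclass. This is a sketch rather than Vendrsec's actual definition: which fields are optional, the default state, and the validation checks in `__post_init__` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

SCANNERS = {"qualys", "tenable", "wiz", "crowdstrike", "rapid7"}
REMEDIATION_STATES = {"open", "fixed", "accepted", "reopened"}


@dataclass
class NormalizedFinding:
    canonical_asset_id: str      # internal stable ID, not the scanner's key
    cve_ids: List[str]
    source_scanner: str
    source_finding_id: str       # QID, plugin ID, etc., kept as reported
    first_seen_ts: datetime
    last_seen_ts: datetime
    remediation_state: str = "open"           # assumed default
    cvss_base_score: Optional[float] = None   # assumed optional
    cvss_vector: Optional[str] = None
    scanner_severity: Optional[str] = None    # scanner's own label, kept apart from CVSS
    raw_finding_ref: Optional[str] = None

    def __post_init__(self) -> None:
        # Illustrative sanity checks; adjust to your own ingestion rules.
        if self.source_scanner not in SCANNERS:
            raise ValueError(f"unknown scanner: {self.source_scanner!r}")
        if self.remediation_state not in REMEDIATION_STATES:
            raise ValueError(f"unknown state: {self.remediation_state!r}")
        if self.cvss_base_score is not None and not 0.0 <= self.cvss_base_score <= 10.0:
            raise ValueError("cvss_base_score must be between 0.0 and 10.0")
        if self.last_seen_ts < self.first_seen_ts:
            raise ValueError("last_seen_ts precedes first_seen_ts")


example = NormalizedFinding(
    canonical_asset_id="asset-0001",  # hypothetical inventory ID
    cve_ids=["CVE-2024-21413"],
    source_scanner="tenable",
    source_finding_id="192461",
    first_seen_ts=datetime(2024, 3, 1, 6, 0, tzinfo=timezone.utc),
    last_seen_ts=datetime(2024, 3, 8, 6, 0, tzinfo=timezone.utc),
    scanner_severity="Critical",
)
```

Rejecting unknown scanners and states at construction time keeps bad records out of the deduplication step instead of surfacing as odd merges later.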
Two design decisions are worth explaining: why we preserve source_finding_id, and why we keep scanner_severity separate from the normalized CVSS fields.
Source finding IDs matter for ticket round-trips. When Vendrsec generates a remediation ticket and pushes it to Jira, the engineer who closes it needs to know which scanner finding to verify against. "Fix CVE-2024-21413" isn't enough — they need "fix the Tenable plugin 192461 finding on api-gateway-prod." The scanner's native ID is what gets validated in the scanner's UI when they confirm remediation.
Scanner severity labels matter because they drift from CVSS. Qualys has historically scored some findings higher than NVD CVSS would suggest because their QID logic incorporates exploitability data that wasn't in the original NVD entry. When Qualys says Critical and NVD says 7.2, that gap is signal worth preserving rather than flattening to CVSS alone.
Asset Identity Resolution: The Hard Part
The normalized schema above has a canonical_asset_id field. Generating that field is the actual hard problem. You need an asset identity layer that maps scanner-specific identifiers to your internal asset inventory.
The matching strategy works in layers, applied in order:
Layer 1: Exact match on cloud resource identifier. If a finding comes from Wiz with an AWS ARN, and your asset inventory has that ARN, you have a deterministic match. Cloud-native scanners like Wiz and CrowdStrike Falcon Spotlight are generally reliable here because they operate on resource IDs rather than IP addresses.
Layer 2: IP + hostname pair match. For on-prem scanners, match on (IP address, hostname) pair. IP alone is unstable — cloud environments reassign IPs constantly. Hostname alone is unreliable in environments without consistent FQDN hygiene. The pair is more stable than either alone.
Layer 3: MAC address match. If your scanner exports MAC addresses and your inventory tracks them, MAC matching handles cases where IP and hostname both drift. Less common for cloud workloads, but useful for physical infrastructure.
Layer 4: Fuzzy hostname match. When hostname normalization diverges between scanners (one uses FQDN, another uses short name), a normalized hostname comparison after stripping domain suffixes catches most cases. This is the riskiest layer — flag these matches for human review rather than silently accepting them.
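The four layers can be sketched as a single resolver that tries each strategy in order and records which one matched. The types `AssetKeys` and `Resolution` and the function `resolve_asset` are hypothetical names. Treating an ambiguous fuzzy match (more than one candidate) as unresolved is an assumption of this sketch, not something the layering itself dictates.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class AssetKeys:
    cloud_resource_id: Optional[str] = None
    ip: Optional[str] = None
    hostname: Optional[str] = None
    mac: Optional[str] = None


@dataclass(frozen=True)
class Resolution:
    canonical_asset_id: str
    layer: int          # which layer produced the match
    needs_review: bool  # True only for fuzzy matches


def _short(hostname: str) -> str:
    """Lower-case a hostname and strip its domain suffix."""
    return hostname.lower().rstrip(".").split(".")[0]


def resolve_asset(seen: AssetKeys, inventory: Dict[str, AssetKeys]) -> Optional[Resolution]:
    # Layer 1: exact cloud resource identifier.
    if seen.cloud_resource_id:
        for asset_id, known in inventory.items():
            if known.cloud_resource_id == seen.cloud_resource_id:
                return Resolution(asset_id, 1, False)
    # Layer 2: (IP, hostname) pair, hostname compared case-insensitively.
    if seen.ip and seen.hostname:
        for asset_id, known in inventory.items():
            if (known.ip == seen.ip and known.hostname
                    and known.hostname.lower() == seen.hostname.lower()):
                return Resolution(asset_id, 2, False)
    # Layer 3: MAC address.
    if seen.mac:
        for asset_id, known in inventory.items():
            if known.mac and known.mac.lower() == seen.mac.lower():
                return Resolution(asset_id, 3, False)
    # Layer 4: short-name comparison, accepted only when unambiguous and flagged for review.
    if seen.hostname:
        hits = [asset_id for asset_id, known in inventory.items()
                if known.hostname and _short(known.hostname) == _short(seen.hostname)]
        if len(hits) == 1:
            return Resolution(hits[0], 4, True)
    return None


inventory = {
    "asset-1": AssetKeys(ip="10.0.1.42", hostname="api-gateway-prod.internal"),
}
match = resolve_asset(AssetKeys(ip="10.0.9.9", hostname="api-gateway-prod"), inventory)
print(match)  # a layer-4 match with needs_review=True
```

Returning the layer number alongside the ID makes it cheap to audit how many assets rely on the riskier layers.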
Consider a scenario: a team managing roughly 3,000 assets across a primary datacenter and two cloud regions. Qualys ran on-prem, Wiz on cloud, Tenable across both. Before building an explicit asset identity layer, their de facto merge strategy was "group by IP." This worked fine until they migrated a batch of workloads to new VPCs — the IP addresses changed, and suddenly the same logical services appeared as brand-new assets in their merged findings view. Six weeks of posture history effectively disappeared for those assets. A canonical asset ID tied to the cloud resource ARN (which persists through IP changes) would have preserved the continuity.
Deduplication: CVE-Level Correlation
Once asset identity is resolved, CVE-level deduplication is straightforward in principle: if two scanner findings share the same canonical_asset_id and the same CVE ID, they're candidates for merging into a single logical finding.
We're not saying you should discard one of the raw scanner findings — the raw records from each scanner are valuable and should be preserved. What you're building is a logical finding that aggregates across scanner sources, with references back to each contributing raw finding.
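Finding the merge candidates amounts to grouping raw findings by the pair (canonical_asset_id, CVE ID). Below is a minimal sketch that assumes findings are plain dicts carrying the normalized field names; `group_merge_candidates` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_merge_candidates(findings: Iterable[dict]) -> Dict[Tuple[str, str], List[dict]]:
    """Group raw findings by (canonical_asset_id, cve_id) without discarding any of them."""
    groups = defaultdict(list)
    for finding in findings:
        # dict.fromkeys drops repeated CVE IDs within one finding while keeping order.
        for cve_id in dict.fromkeys(finding["cve_ids"]):
            groups[(finding["canonical_asset_id"], cve_id)].append(finding)
    return dict(groups)


raw = [
    {"canonical_asset_id": "asset-1", "cve_ids": ["CVE-2024-1234"], "source_scanner": "qualys"},
    {"canonical_asset_id": "asset-1", "cve_ids": ["CVE-2024-1234", "CVE-2024-1235"],
     "source_scanner": "tenable"},
]
for key, members in group_merge_candidates(raw).items():
    print(key, [m["source_scanner"] for m in members])
```

A compound finding that covers two CVEs lands in two groups, so it contributes to both logical findings while the raw record itself stays intact.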
The merged logical finding gets:
- cvss_max: the highest CVSS score reported across all scanners (conservative — if one scanner scored it higher, that's worth knowing)
- sources[]: list of contributing scanner findings
- first_seen_ts: earliest first_seen across all contributing findings
- last_seen_ts: latest last_seen across all contributing findings
- remediation_state: only mark as fixed if ALL contributing scanner findings show it fixed — one scanner still seeing it open means it's not actually resolved
That last point on remediation state matters more than it looks. If Qualys marks a finding fixed but Tenable still sees it open, the most common explanation is scan timing (Tenable hasn't run since the patch). The second most common explanation is that the patch was applied to one interface but not another. The third is that the remediation verification logic between scanners differs. In all three cases, "fixed if all sources agree" is the conservative choice.
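The merge rules above can be sketched as one function over a group of findings that share an asset and a CVE. It assumes findings are plain dicts with the normalized field names, and `merge_findings` is a hypothetical name. The only rule taken from the text is "fixed only if all sources say fixed". Keeping a state that every source agrees on, and reporting any disagreement as open, is an assumption of this sketch.

```python
from datetime import datetime
from typing import List


def merge_findings(findings: List[dict]) -> dict:
    """Build one logical finding from raw findings on the same asset and CVE."""
    if not findings:
        raise ValueError("cannot merge an empty group")
    scores = [f["cvss_base_score"] for f in findings
              if f.get("cvss_base_score") is not None]
    states = {f["remediation_state"] for f in findings}
    # Unanimous states (including "fixed") are kept; any disagreement counts as open.
    state = next(iter(states)) if len(states) == 1 else "open"
    return {
        "cvss_max": max(scores) if scores else None,
        "sources": [(f["source_scanner"], f["source_finding_id"]) for f in findings],
        "first_seen_ts": min(f["first_seen_ts"] for f in findings),
        "last_seen_ts": max(f["last_seen_ts"] for f in findings),
        "remediation_state": state,
    }


qualys = {
    "source_scanner": "qualys", "source_finding_id": "qid-example",  # placeholder ID
    "cvss_base_score": 9.8, "remediation_state": "fixed",
    "first_seen_ts": datetime(2024, 3, 1, 2, 0), "last_seen_ts": datetime(2024, 3, 8, 2, 0),
}
tenable = {
    "source_scanner": "tenable", "source_finding_id": "192461",
    "cvss_base_score": 7.2, "remediation_state": "open",
    "first_seen_ts": datetime(2024, 3, 2, 6, 0), "last_seen_ts": datetime(2024, 3, 7, 6, 0),
}
merged = merge_findings([qualys, tenable])
print(merged["remediation_state"], merged["cvss_max"])
```

Here Qualys reports the finding fixed while Tenable still sees it open, so the merged finding stays open until both agree.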
What Multi-Scanner Correlation Doesn't Fix
It's worth being direct about the limits here. Multi-scanner deduplication solves the counting problem — you stop seeing 3x the findings because you have 3 scanners. But it doesn't solve the prioritization problem.
After deduplication, you still have a merged list that's probably in the thousands, ranked by CVSS, with no business context. The highest-CVSS finding on an isolated dev box is still ranked above a medium-CVSS finding on your payment processing cluster. The correlation layer is necessary infrastructure, but it's not where the prioritization work happens.
We built Vendrsec's ingestion layer to normalize and deduplicate as the first step, then pass the merged findings to the risk scoring layer that factors in asset criticality, network reachability, and exploit intelligence. The deduplication work is invisible to users — it runs before anything they see. But without it, every downstream calculation would be inflated by phantom duplicates, and posture drift tracking over time would be meaningless noise.
If you're managing a multi-scanner environment and building this infrastructure yourself, start with the asset identity layer. Everything else — CVE correlation, deduplication logic, severity normalization — is tractable once you have stable canonical asset IDs. Asset identity is where the real complexity lives, and it's the place most teams shortcut with "just use IP" until the first major infrastructure change breaks everything downstream.