AI detection
in three stages.
Not one model. A pipeline.
Reliably detecting personal data sounds simple – but it isn't. Names can also be adjectives, addresses appear in running text, case numbers follow no universal pattern. anymize solves this with a three-stage pipeline: algorithmic pre-detection, our own fine-tuned small model, and a larger model for post-verification. Together, over 95 % detection rate on German-language documents – with full transparency about what was detected where.
The three stages
From 60 % to 95 %.
In three passes.
A single model would either produce too many false positives (and your contract becomes a desert of placeholders) or miss too much (and sensitive data lands with external models). The craft: three specialized layers, each verifying and correcting the previous one.
Algorithmic detection
Fast. Deterministic. Cost-neutral.
Regular expressions, named-entity dictionaries, format validators (IBAN checksums, ID-card structures, phone-number patterns). Catches around 60 to 70 percent of personal data in typical documents – everything that is clearly structured.
Very fast, cost-neutral, a hundred percent reproducible.
Names and context-dependent entities slip through, because regex has no semantics.
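As a rough illustration of what a stage-1 format validator can look like, here is a minimal Python sketch. The pattern and the names `IBAN_CANDIDATE`, `iban_checksum_ok` and `find_ibans` are assumptions made for this example, not anymize's actual rules. It finds IBAN-shaped strings and keeps only those that pass the standard mod-97 checksum.

```python
import re

# Illustrative candidate pattern: country code, two check digits, then
# 11 to 30 alphanumeric characters, each optionally preceded by a space.
IBAN_CANDIDATE = re.compile(r"\b[A-Z]{2}\d{2}(?:\s?[A-Z0-9]){11,30}\b")


def iban_checksum_ok(iban: str) -> bool:
    """Standard mod-97 check: move the first four characters to the end,
    map letters to numbers (A=10 ... Z=35) and test remainder == 1."""
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def find_ibans(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, normalized IBAN) for every checksum-valid hit."""
    hits = []
    for match in IBAN_CANDIDATE.finditer(text):
        normalized = re.sub(r"\s", "", match.group())
        if iban_checksum_ok(normalized):
            hits.append((match.start(), match.end(), normalized))
    return hits
```

The checksum step is what keeps this stage deterministic and low on false positives: a string that merely looks like an IBAN is dropped. A name such as "Weber" has no comparable format to validate, which is the gap the next stage has to cover.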
Our own fine-tuned model
Semantic. Iterative. German professional domains.
A small language model specialized on PII detection, post-trained on German-language expert texts (legal, medical, commercial). It runs multiple iterations over the document and identifies everything stage 1 missed: context-dependent names (“Dr. Weber decided …”), case numbers with atypical form, embedded diagnostic codes, organizations.
Understands semantics and context, learns from German professional domains.
Not perfect – a few rare cases slip through.
Prompt-based post-verification
Reasoning layer. Sees the full picture.
A larger model takes the third and final pass: it receives the document plus the markings from stages 1 and 2 and checks via a structured prompt whether anything is missing or mis-marked. Catches cases that escaped the earlier stages – and cleans up false positives before they disrupt the flow of text.
Sees the full picture, can decide based on reasoning.
More compute-intensive – which is why it's the last stage and not the only step.
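How the three passes could fit together is sketched below in Python. This is an assumption-laden illustration, not anymize's internal data model: the `Span` type, the `merge_stages` function and the idea of passing stage-3 corrections as "added" and "rejected" spans are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    category: str
    stage: int  # 1 = algorithmic, 2 = fine-tuned model, 3 = post-verification


def overlaps(a: Span, b: Span) -> bool:
    return a.start < b.end and b.start < a.end


def merge_stages(stage1, stage2, stage3_added, stage3_rejected):
    """Later stages add what earlier ones missed; stage 3 may also
    reject spans it considers false positives."""
    merged = list(stage1)
    for span in list(stage2) + list(stage3_added):
        if not any(overlaps(span, kept) for kept in merged):
            merged.append(span)
    merged = [s for s in merged
              if not any(overlaps(s, r) for r in stage3_rejected)]
    return sorted(merged, key=lambda s: s.start)


def span_of(text: str, needle: str, category: str, stage: int) -> Span:
    """Helper for the example: locate the first occurrence of `needle`."""
    start = text.index(needle)
    return Span(start, start + len(needle), category, stage)
```

Because every span keeps the number of the stage that produced it, the merged result can still answer which stage found which entity.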
The result
Over 95 % detection rate
In practice, the combination of the three stages delivers a detection rate of over 95 % on German-language documents – significantly more than any stage on its own. And at the same time fewer false positives, because each layer validates the previous one.
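To see why stacking stages raises the overall rate, consider a simplified model in which each stage catches a fixed share of whatever is still undetected. The Python sketch below uses 0.65 for stage 1 (the midpoint of the 60 to 70 percent stated above); the values 0.80 and 0.30 for stages 2 and 3 are purely hypothetical numbers chosen for illustration, not measurements. The model also ignores the false-positive cleanup that stage 3 performs.

```python
def combined_recall(stage_recalls):
    """Share of entities found after all stages, assuming each stage
    catches `r` of what the previous stages left undetected."""
    still_missed = 1.0
    for r in stage_recalls:
        still_missed *= (1.0 - r)
    return 1.0 - still_missed


# 0.65 is taken from the text; 0.80 and 0.30 are hypothetical.
example = combined_recall([0.65, 0.80, 0.30])
```

Under these assumptions the share still missed shrinks from 35 % after stage 1 to 7 % after stage 2 and 4.9 % after stage 3.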
Why not one single, large model?
Speed. Stage 1 handles the bulk in milliseconds – the large model only runs on the remaining open cases.
Explainability. We can show in which stage each result emerged. That matters for audits.
More than 40 categories
What we
detect.
Category coverage grows continuously. Today, anymize detects more than 40 classes of personal and business-sensitive data, grouped into five families.
Identifiers
- Names (first name, last name, title)
- Email addresses
- Phone numbers
- Addresses (street, ZIP, city)
- Organizations
- Dates of birth
Government and contract IDs
- Tax IDs
- Social and pension insurance numbers
- ID, passport, driver's license numbers
- License plates
- Case numbers, contract IDs
Financial data
- IBANs (with checksum validation)
- BICs
- Credit card numbers
- Account numbers
- Tax numbers
Industry-specific identifiers
- Mandate and insurance numbers
- Claim numbers
- Patient IDs
- ICD diagnosis codes (in preparation)
- Patent registrations (in preparation)
Contextual data
- Illnesses and medical terminology
- Industry-specific vocabulary (when marked sensitive)
- Geo references in combination
Why not regex?
Classic approaches
in the reality test.
Many PII tools on the market are purely rule-based – using regular expressions and static dictionaries. That works for clearly structured data (IBAN, phone numbers), but fails on what makes up the bulk of sensitive content: free text with context.
“Mrs. Weber signs on Monday.”
Catches “Weber” only if in the dictionary – otherwise: miss.
Recognizes the context “Mrs. + last name” and marks reliably.
“The client, Mr. Schmidt from Mainz, …”
Might catch “Schmidt”, but not the connection with “client”.
Recognizes the client relationship and marks completely.
“Anton” (as a first name) vs. “Hotel Anton”
Cannot distinguish – either anonymize both (false positive) or neither (miss).
Makes a context-aware decision.
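The limitation can be reproduced in a few lines of Python. The sketch below is a deliberately naive dictionary matcher written for this comparison (the `FIRST_NAMES` set and `dictionary_hits` function are illustrative, not taken from any real tool).

```python
import re

# Illustrative static dictionary of first names.
FIRST_NAMES = {"Anton", "Maria"}


def dictionary_hits(text: str) -> list[tuple[int, str]]:
    """Flag every word that appears in the dictionary, regardless of context."""
    return [(m.start(), m.group())
            for m in re.finditer(r"\b\w+\b", text)
            if m.group() in FIRST_NAMES]
```

For "Anton checked in at Hotel Anton." the matcher flags both occurrences, because it only sees the word, not its role in the sentence. For "Mrs. Weber signs on Monday." it returns nothing, because "Weber" is not in the dictionary.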
The consequence
Regex systems rarely exceed 70–80 % detection – and produce either many false positives (the anonymized document is unreadable) or too many misses (sensitive data ends up at the external model anyway). Both are unacceptable in a compliance context.
anymize uses regex as the first stage – because it is fast and deterministic – and supplements it with two AI layers that step in exactly where regex fails. That is the reason for the over 95 % detection rate.
Languages & context
Five languages,
many domains.
German has the highest detection quality because our fine-tuned model is trained explicitly on German expert texts. For the other languages the rate typically ranges from 88 to 93 % – depending on domain and document structure.
- German (DE): target > 95 %, primary training focus
- English (EN): 88–93 %
- French (FR): 88–93 %
- Spanish (ES): 88–93 %
- Italian (IT): 88–93 %
Domain coverage
The fine-tuning dataset covers three expert domains with demanding requirements:
Legal
Pleadings, contracts, court decisions, case-law databases.
Medical
Treatment guidelines, findings, expert publications, therapy documentation.
Commercial
Annual reports, contracts, tax literature, business vocabulary.
For other domains (e.g. engineering, architecture, specialized sciences) the system still reaches the advertised rate – because the base models are generalists – but shows less domain-specific finesse. For highly specialized domains we offer individually fine-tuned models in the Enterprise plan.
Transparency
Four-eyes control
built in.
A detection rate of “over 95 %” means: in up to five out of a hundred cases, something may slip through. For compliance-critical processes that is too much – which is why anymize builds transparency firmly into the workflow.
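A short back-of-the-envelope calculation in Python shows what that means per document. It assumes, as a simplification, that every entity is missed independently with the same probability; real documents will deviate from that.

```python
def expected_misses(n_entities: int, detection_rate: float = 0.95) -> float:
    """Average number of entities that slip through."""
    return n_entities * (1.0 - detection_rate)


def chance_of_any_miss(n_entities: int, detection_rate: float = 0.95) -> float:
    """Probability that at least one entity slips through,
    assuming independent misses."""
    return 1.0 - detection_rate ** n_entities
```

For a document with 40 sensitive entities and a rate of exactly 95 %, this gives about two expected misses and roughly an 87 % chance that at least one entity is missed. That is the case for a human review step.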
Before sending: the review view
Before every request to an external model, the interface shows you what was detected and what wasn't. Highlights in the original text, categories per find, counts per category. If something important is missing, mark it manually. If something was over-marked, correct it – and the AI remembers that for your workspace.
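The counts shown in such a review view could be derived as in the following Python sketch. The `(text, category)` tuple format and the function name `review_summary` are assumptions for this example; the workspace-level learning from corrections is not modelled here.

```python
from collections import Counter


def review_summary(detections, manual_additions=(), manual_removals=()):
    """Apply the reviewer's corrections and count finds per category.

    Each item is a (text, category) tuple.
    """
    removed = set(manual_removals)
    final = [d for d in detections if d not in removed] + list(manual_additions)
    return Counter(category for _text, category in final)
```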
The 12-second countdown (enforceable)
Admins can enforce a review countdown before every send: the interface shows the anonymized version, counts down 12 seconds, and only then sends the request. During that window the user can review and cancel.
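The countdown logic can be pictured with a minimal Python sketch built on `threading.Event` from the standard library. The function name `send_after_review` and its parameters are assumptions for this example, not part of the product's interface.

```python
import threading


def send_after_review(send, cancel_event: threading.Event,
                      seconds: float = 12.0) -> bool:
    """Wait up to `seconds`; call `send()` unless the user cancelled.

    Returns True if the request was sent, False if it was cancelled.
    """
    cancelled = cancel_event.wait(timeout=seconds)
    if cancelled:
        return False
    send()
    return True
```

A cancel button in the interface would simply call `cancel_event.set()` from another thread while the countdown is running.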
Audit log
Every detection (what, when, which model, which stage) is recorded in the audit log. For compliance evidence and internal quality assurance.
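What such a record might contain is sketched below in Python. The field names and the JSON-lines format are assumptions for illustration, as is the choice to log the category rather than the raw detected value.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    category: str     # what: the detected class (assumption: not the raw value)
    detected_at: str  # when: ISO 8601 timestamp in UTC
    model: str        # which model or rule set produced the find
    stage: int        # which stage: 1, 2 or 3


def new_entry(category: str, model: str, stage: int) -> AuditEntry:
    now = datetime.now(timezone.utc).isoformat()
    return AuditEntry(category, now, model, stage)


def to_json_line(entry: AuditEntry) -> str:
    """Serialize one entry as a single line for an append-only log."""
    return json.dumps(asdict(entry), sort_keys=True)
```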
Roadmap
What we're currently
working on.
Detection of personal data is not a solved problem – three current development tracks show where the journey is going.
Indirectly personal data
A sentence like “The mayor of city X decided …” contains no name, but a person is clearly identifiable. The GDPR treats such statements as personal data (recital 26). We're developing a combination analysis that catches such identifying contexts – role + location, function + organization, unique attributes.
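To make the idea of "role + location" concrete, here is a toy Python sketch. It is not the analysis under development, only an illustration of the principle: the `ROLE_TERMS` list, the `known_places` parameter and the simple sentence splitting are all assumptions.

```python
import re

# Illustrative role vocabulary; a real system would need far more than this.
ROLE_TERMS = {"mayor", "headmaster", "managing director"}


def indirect_identifier_candidates(text: str, known_places: set[str]) -> list[str]:
    """Return sentences in which a role term and a known place co-occur."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    flagged = []
    for sentence in sentences:
        lower = sentence.lower()
        has_role = any(role in lower for role in ROLE_TERMS)
        has_place = any(place.lower() in lower for place in known_places)
        if has_role and has_place:
            flagged.append(sentence)
    return flagged
```

Neither the role nor the place is sensitive on its own; the sketch only flags the combination, which is the point of the analysis described above.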
Trade secrets, patents, formulas
Personal data isn't the only thing worth protecting. Companies have the same interest in ensuring patent ideas, chemical formulas, product prototypes and internal processes don't reach an external model unintentionally. We're extending detection with categories for these contents – as an optional layer on top of PII detection.
Individually fine-tuned models
Every company has its own terms, abbreviations and product codes that should count as sensitive. In the Enterprise plan we offer individual fine-tuning on your trade secrets – the detection model learns your company specifics and marks them in addition to the standard categories. If you are interested, reach out to us directly.
For whom
Who benefits most
from precise detection.
For all these contexts: regex is not enough. Human post-editing takes hours. AI-based detection at the level of a three-stage system is the only practical answer.
- Client name in running text, case numbers with atypical form, indirect hints in pleadings.
- Patient name in findings text, medical terminology with personal references, diagnosis combinations.
- Claim reports with mixed formats, descriptions with indirect identifiers.
- Applications with narrative structure (no forms), employment references.
- Interview transcripts, research notes, free-form due-diligence reports.
- Citizen data in prose notices, social data with indirect hints.
What you should know about detection.
Frequently asked questions
Three reasons: (1) Speed – stage 1 (regex) catches the bulk in milliseconds, the large model only runs on the remaining open cases. (2) Cost – pure prompt-based detection on a large model would be many times more expensive per document. (3) Explainability – for audits we can show in which stage each entity was detected, with which reasoning.
We stand behind anymize. And we know – when an AI tool touches client, patient or employee data, a demo video isn't enough. That's why we give you 14 days of full access – all models, all features, no credit card. Enough time to be certain, before you trust us.
Your AI workplace awaits.