How did you arrive at “over 95 %” detection rate?

Validated internally on curated German-language expert-text corpora with manual ground-truth annotation. The value is a target, not a guarantee: depending on document type and language, the rate may be higher or lower. For compliance-critical processes, we additionally recommend four-eyes review before sending.
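As a rough illustration of how a detection rate can be measured against manual annotations, the following sketch computes recall (the share of annotated entities that were found) and precision over character spans. The exact-match criterion, the span format and the function name are assumptions made for this example, not a description of anymize's internal evaluation.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, category)

def detection_metrics(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float]:
    """Return (recall, precision) using exact span matching (an assumed criterion)."""
    hits = len(gold & predicted)
    recall = hits / len(gold) if gold else 1.0
    precision = hits / len(predicted) if predicted else 1.0
    return recall, precision

# Hypothetical annotations for one document.
gold = {(0, 9, "NAME"), (25, 47, "IBAN"), (60, 72, "CASE_NUMBER"), (80, 90, "PHONE")}
predicted = {(0, 9, "NAME"), (25, 47, "IBAN"), (60, 72, "CASE_NUMBER"), (100, 105, "NAME")}

recall, precision = detection_metrics(gold, predicted)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In this toy case one annotated entity is missed and one marking is spurious, so both values come out at 0.75. A real evaluation would typically also decide how to score partial overlaps.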

Which languages are supported?

German (highest quality, primary training focus), English, French, Spanish, Italian. Mixed-language documents are recognized without manual switching.

Is indirectly identifying data also detected?

Currently with limitations. Pure name and ID detection works reliably, but indirect identification (e.g. “the mayor of city X”) is a research area we're working on. We're extending detection with a combination analysis that catches such contextual identifiers. In the Enterprise plan we can grant early access on request.

Can trade secrets be detected as well?

In progress. We're extending the detection system with optionally enabled layers for patents, formulas, product codes and internal process descriptions. For Enterprise customers we also offer individual fine-tuning – the model learns your company-specific sensitive terms.

What happens with false positives (over-marking)?

In the review view before sending, you can undo markings – the corresponding entry is then marked as a false positive, and the feedback loop improves the model for your workspace. This is in addition to the global training loop.

What happens with misses (under-marking)?

In the review view you can manually add markings. These corrections also feed into quality assurance. The optional manual release before sending (enforceable at the admin level) prevents a miss from slipping through unnoticed.

How does this compare to AWS Comprehend, Azure PII Detection, Google DLP?

The cloud-native PII tools of the US hyperscalers are optimized for standard English-language formats – credit cards, SSNs, US addresses. For European professional texts – case numbers, IBANs, attorney-confidentiality contexts (e.g. § 43e BRAO in Germany), medical terminology – their detection rate is noticeably below that of specialized systems like anymize. In addition, the tools themselves are cloud services run by US providers: you merely shift the privacy problem instead of solving it.

Can I add my own categories?

In the Enterprise plan, yes. Alongside standard categories, you can define company-specific markers (e.g. internal project codes, customer keys, product variants). These are treated like PII and anonymized before the document goes to an external model.
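To make the idea of company-specific markers concrete, here is a minimal sketch in which two custom categories are defined as patterns and replaced by placeholders before a text leaves the system. The marker formats (`PRJ-1234`, `CKDE004711`), the placeholder syntax and all names are hypothetical and only serve the illustration.

```python
import re

# Hypothetical company-specific markers; real formats would come from the customer.
CUSTOM_MARKERS = {
    "PROJECT_CODE": re.compile(r"\bPRJ-\d{4}\b"),
    "CUSTOMER_KEY": re.compile(r"\bCK[A-Z]{2}\d{6}\b"),
}

def anonymize_custom(text):
    """Replace custom markers with placeholders; return text and placeholder map."""
    placeholders = {}  # original value -> placeholder
    counters = {}      # category -> number of distinct values seen
    for category, pattern in CUSTOM_MARKERS.items():
        def repl(match, category=category):
            value = match.group()
            if value not in placeholders:
                counters[category] = counters.get(category, 0) + 1
                placeholders[value] = f"[{category}_{counters[category]}]"
            # The same value always maps to the same placeholder.
            return placeholders[value]
        text = pattern.sub(repl, text)
    return text, {ph: value for value, ph in placeholders.items()}

masked, mapping = anonymize_custom("PRJ-2041 is billed to CKDE004711; PRJ-2041 ends in May.")
print(masked)
```

Keeping the placeholder map on the caller's side is one way to restore the original values in a model's answer later; whether and how that is done here is not stated in the text above.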

AI detection in three stages.

Not one model. A pipeline.

Reliably detecting personal data sounds simple, but it isn't. Names can also be adjectives, addresses appear in running text, case numbers follow no universal pattern. anymize solves this with a three-stage pipeline: algorithmic pre-detection, our own fine-tuned small model, and a larger model for post-verification. Together they reach a detection rate of over 95 % on German-language documents – with full transparency about what was detected where.

The three stages

From 60 % to 95 %. In three passes.

A single model would either produce too many false positives (and your contract becomes a desert of placeholders) or miss too much (and sensitive data lands with external models). The craft: three specialized layers, each verifying and correcting the previous one.

Stage 01

Algorithmic detection

Fast. Deterministic. Cost-neutral.

Regular expressions, named-entity dictionaries, format validators (IBAN checksums, ID-card structures, phone-number patterns). Catches around 60 to 70 percent of personal data in typical documents – everything that is clearly structured.

Coverage after this stage: 60 – 70 %

Strengths

Very fast, cost-neutral, fully reproducible.

Limits

Names and context-dependent entities slip through, because regex has no semantics.
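A format validator of the kind mentioned for this stage can be sketched with the publicly documented IBAN check (ISO 13616, mod 97). The candidate pattern and function names below are assumptions for illustration; they are not anymize's actual rules.

```python
import re

# Loose candidate pattern: country code, two check digits, then 11-30
# alphanumerics, optionally grouped with single spaces. It can over-capture
# when an uppercase word directly follows the IBAN; fine for a sketch.
IBAN_CANDIDATE = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]){11,30}\b")

def iban_checksum_ok(candidate: str) -> bool:
    """Mod-97 check: move the first four characters to the end, map A=10..Z=35."""
    iban = candidate.replace(" ", "").upper()
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1

def find_ibans(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, value) for candidates that pass the checksum."""
    return [
        (m.start(), m.end(), m.group())
        for m in IBAN_CANDIDATE.finditer(text)
        if iban_checksum_ok(m.group())
    ]

text = "Transfer to DE89 3704 0044 0532 0130 00 today, not DE89 3704 0044 0532 0130 01."
print(find_ibans(text))
```

The checksum step is what keeps a purely pattern-based stage from marking every IBAN-shaped string: the second candidate in the example looks right but fails validation and is not reported.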

Stage 02

Our own fine-tuned model

Semantic. Iterative. German professional domains.

A small language model specialized on PII detection, post-trained on German-language expert texts (legal, medical, commercial). It runs multiple iterations over the document and identifies everything stage 1 missed: context-dependent names (“Dr. Weber decided …”), case numbers with atypical form, embedded diagnostic codes, organizations.

Coverage after this stage: + 10 – 15 %

Strengths

Understands semantics and context, learns from German professional domains.

Limits

Not perfect – a few rare cases slip through.

Stage 03

Prompt-based post-verification

Reasoning layer. Sees the full picture.

A larger model takes the third and final pass: it receives the document plus the markings from stages 1 and 2 and checks via a structured prompt whether anything is missing or mis-marked. Catches cases that escaped the finer-grained stages – and cleans up false positives before they disrupt the flow of text.

Coverage after this stage: + 13 – 30 %

Strengths

Sees the full picture, can decide based on reasoning.

Limits

More compute-intensive – which is why it's the last stage and not the only step.

The result

> 95 %

detection rate

In practice, the combination of the three stages delivers a detection rate of over 95 % on German-language documents – significantly more than any stage on its own. And at the same time fewer false positives, because each layer validates the previous one.

Why not one single, large model?

Speed. Stage 1 handles the bulk in milliseconds – the large model only runs on the remaining open cases.
Explainability. We can show in which stage each result emerged. That matters for audits.
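The stage-by-stage attribution can be pictured with a small sketch in which every finding carries the number of the stage that produced it. The two model stages are replaced by simple stand-ins here; the function names, the `Finding` structure and the way stage 3 returns additions and rejections are all assumptions, not the actual implementation.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    start: int
    end: int
    category: str
    stage: int  # which stage produced the finding

def stage1_rules(text):
    # Deterministic patterns only; a single e-mail regex stands in for the rule set.
    return [Finding(m.start(), m.end(), "EMAIL", 1)
            for m in re.finditer(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)]

def stage2_small_model(text, known):
    # Stand-in for the fine-tuned model: a title followed by a capitalized word.
    # `known` (earlier findings) is unused in this stub.
    return [Finding(m.start(1), m.end(1), "NAME", 2)
            for m in re.finditer(r"\b(?:Dr\.|Mr\.|Mrs\.) ([A-Z][a-z]+)", text)]

def stage3_verify(text, known):
    # Stand-in for the larger model: would return (additions, rejections).
    return [], []

def detect(text):
    findings = stage1_rules(text)
    findings += stage2_small_model(text, findings)
    added, rejected = stage3_verify(text, findings)
    findings = [f for f in findings if f not in rejected] + added
    return sorted(findings, key=lambda f: f.start)

text = "Dr. Weber wrote to office@example.com yesterday."
for f in detect(text):
    print(text[f.start:f.end], f.category, "stage", f.stage)
```

Because the stage number travels with each finding, an audit view can later answer "which layer marked this?" without re-running the pipeline.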

More than 40 categories

What we detect.

Category coverage grows continuously. Today, anymize detects more than 40 classes of personal and business-sensitive data, grouped into five families.

Identifiers

Names (first name, last name, title)
Email addresses
Phone numbers
Addresses (street, ZIP, city)
Organizations
Dates of birth

Government and contract IDs

Tax IDs
Social and pension insurance numbers
ID, passport, driver's license numbers
License plates
Case numbers, contract IDs

Financial data

IBANs (with checksum validation)
BICs
Credit card numbers
Account numbers
Tax numbers

Industry-specific identifiers

Mandate and insurance numbers
Claim numbers
Patient IDs
ICD diagnosis codes (in preparation)
Patent registrations (in preparation)

Contextual data

Illnesses and medical terminology
Industry-specific vocabulary (when marked sensitive)
Geo references in combination

Why not regex?

Classic approaches in the reality test.

Many PII tools on the market are purely rule-based – using regular expressions and static dictionaries. That works for clearly structured data (IBAN, phone numbers), but fails on what makes up the bulk of sensitive content: free text with context.

Example	Regex system	AI detection
“Mrs. Weber signs on Monday.”	Catches “Weber” only if in the dictionary – otherwise: miss.	Recognizes the context “Mrs. + last name” and marks reliably.
“The client, Mr. Schmidt from Mainz, …”	Might catch “Schmidt”, but not the connection with “client”.	Recognizes the client relationship and marks completely.
“Anton” (as a first name) vs. “Hotel Anton”	Cannot distinguish – either anonymize both (false positive) or neither (miss).	Makes a context-aware decision.
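The "Anton" row can be reproduced with a few lines of code. The sketch below is a deliberately naive dictionary-based matcher (the name list and function name are made up for this example): it has no notion of context, so it masks the hotel name along with the person and misses any surname that is not in its list.

```python
import re

FIRST_NAMES = {"Anton", "Maria"}  # illustrative dictionary
NAME_PATTERN = re.compile(r"\b(" + "|".join(sorted(FIRST_NAMES)) + r")\b")

def mask_names(text):
    """Replace every dictionary hit, regardless of context."""
    return NAME_PATTERN.sub("[NAME]", text)

print(mask_names("Anton booked a room at Hotel Anton."))
# → [NAME] booked a room at Hotel [NAME].

print(mask_names("Mrs. Weber signs on Monday."))
# → Mrs. Weber signs on Monday.   ("Weber" is not in the dictionary)
```

Removing "Anton" from the dictionary would fix the hotel but lose the person, which is exactly the either/or trade-off described in the table.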

The consequence

Regex systems rarely exceed 70–80 % detection – and produce either many false positives (the anonymized document is unreadable) or too many misses (sensitive data ends up at the external model anyway). Both are unacceptable in a compliance context.

anymize uses regex as the first stage – because it is fast and deterministic – and supplements it with two AI layers that step in exactly where regex fails. That is what makes the detection rate of over 95 % possible.

Languages & context

Five languages, many domains.

German has the highest detection quality because our fine-tuned model is trained explicitly on German expert texts. For the other languages the rate typically ranges from 88 to 93 %, depending on domain and document structure.

Supported languages

Language	Detection rate
German (primary training focus)	Target > 95 %
English	88–93 %
French	88–93 %
Spanish	88–93 %
Italian	88–93 %

Domain coverage

The fine-tuning dataset covers three expert domains with demanding requirements:

Legal

Pleadings, contracts, court decisions, case-law databases.

Medical

Treatment guidelines, findings, expert publications, therapy documentation.

Commercial

Annual reports, contracts, tax literature, business vocabulary.

For other domains (e.g. engineering, architecture, specialized sciences) the system still reaches the advertised rate – because the base models are generalists – but shows less domain-specific finesse. For highly specialized domains we offer individually fine-tuned models in the Enterprise plan.

Transparency

Four-eyes control built in.

A detection rate of “over 95 %” means that in up to five out of a hundred cases, something may slip through. For compliance-critical processes that is too much, which is why anymize builds transparency firmly into the workflow.
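A short calculation shows why a few percent matter at document level. The sketch assumes, as a simplification, that each entity is detected independently with the same probability; real documents will deviate from this, so the numbers are indicative only.

```python
def expected_misses(n_entities: int, detection_rate: float) -> float:
    """Average number of undetected entities in a document."""
    return n_entities * (1.0 - detection_rate)

def prob_at_least_one_miss(n_entities: int, detection_rate: float) -> float:
    """Chance that at least one entity slips through (independence assumed)."""
    return 1.0 - detection_rate ** n_entities

n, rate = 40, 0.95  # hypothetical document with 40 sensitive entities
print(f"expected misses: {expected_misses(n, rate):.1f}")
print(f"P(at least one miss): {prob_at_least_one_miss(n, rate):.2f}")
```

For a document with 40 entities at a 95 % rate, this gives about two expected misses and roughly an 87 % chance that at least one entity is missed, which is the motivation for a review step before sending.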

Before sending: the review view

Before every request to an external model, the interface shows you what was detected and what wasn't: highlights in the original text, categories per finding, counts per category. If something important is missing, mark it manually. If something was over-marked, correct it, and the AI remembers that for your workspace.

Manual release before sending (enforceable)

Admins can enforce an active release before every send: the interface shows the anonymized version, and it only goes out once you've reviewed and deliberately confirmed it. No automatic forwarding. For use cases that demand full oversight.

Audit log

Every detection (what, when, which model, which stage) is recorded in the audit log. For compliance evidence and internal quality assurance.
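What such a log entry might contain can be sketched as a small record with the four fields named above, serialized as one JSON line. The field names, the `document_id` field and the model label are hypothetical. Logging the category rather than the detected value itself is a design choice of this sketch so that the log does not become a second copy of the sensitive data.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class AuditEntry:
    document_id: str  # hypothetical reference to the processed document
    category: str     # what was detected (category only, not the raw value)
    detected_at: str  # when, as an ISO 8601 UTC timestamp
    model: str        # which model or rule set
    stage: int        # which pipeline stage (1, 2 or 3)

def make_entry(document_id, category, model, stage, now=None):
    ts = (now or datetime.now(timezone.utc)).isoformat()
    return AuditEntry(document_id, category, ts, model, stage)

entry = make_entry(
    "doc-001", "IBAN", "rules-v1", 1,
    now=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
)
line = json.dumps(asdict(entry), sort_keys=True)
print(line)
```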

What we're currently working on.

Detection of personal data is not a solved problem – three current development tracks show where the journey is going.

01 In development

Indirectly personal data

A sentence like “The mayor of city X decided …” contains no name, but a person is clearly identifiable. The GDPR treats such statements as personal data (Art. 4(1) and recital 26). We're developing a combination analysis that catches such identifying contexts – role + location, function + organization, unique attributes.

02 Under way

Trade secrets, patents, formulas

Personal data isn't the only thing worth protecting. Companies have the same interest in ensuring patent ideas, chemical formulas, product prototypes and internal processes don't reach an external model unintentionally. We're extending detection with categories for these contents – as an optional layer on top of PII detection.

03 Enterprise

Individually fine-tuned models

Every company has its own terms, abbreviations and product codes that should count as sensitive. In the Enterprise plan we offer individual fine-tuning on your trade secrets – the detection model learns your company specifics and marks them in addition to the standard categories. If you're interested, please contact us directly.

For whom

Who benefits most from precise detection.

For all these contexts: regex is not enough. Human post-editing takes hours. AI-based detection at the level of a three-stage system is the only practical answer.

Profession	Why AI-based instead of regex
Lawyers and attorneys	Client name in running text, case numbers with atypical form, indirect hints in pleadings.
Doctors and physicians	Patient name in findings text, medical terminology with personal references, diagnosis combinations.
Insurance companies	Claim reports with mixed formats, descriptions with indirect identifiers.
HR departments	Applications with narrative structure (no forms), employment references.
Consultancies	Interview transcripts, research notes, free-form due-diligence reports.
Public administration	Citizen data in prose notices, social data with indirect hints.

What you should know about detection.

Frequently asked questions

Why not one single, large model?

Three reasons: (1) Speed – stage 1 (regex) catches the bulk in milliseconds, and the large model only runs on the remaining open cases. (2) Cost – pure prompt-based detection on a large model would be many times more expensive per document. (3) Explainability – for audits we can show in which stage each entity was detected, and with which reasoning.

Start now. 14-day free trial.

All models. All features. No credit card.

We stand behind anymize. And we know – when an AI tool touches client, patient or employee data, a demo video isn't enough. That's why we give you 14 days of full access – all models, all features, no credit card. Enough time to be certain, before you trust us.

Your AI workplace awaits.
