AI detection
in three stages.

Not one model. A pipeline.

Reliably detecting personal data sounds simple – but it isn't. Names can also be adjectives, addresses appear in running text, case numbers follow no universal pattern. anymize solves this with a three-stage pipeline: algorithmic pre-detection, our own fine-tuned small model, and a larger model for post-verification. Together, the three stages reach a detection rate of over 95 % on German-language documents – with full transparency about what was detected where.

The three stages

From 60 % to 95 %.
In three passes.

A single model would either produce too many false positives (and your contract becomes a desert of placeholders) or miss too much (and sensitive data lands with external models). The craft: three specialized layers, each verifying and correcting the previous one.

Stage 01

Algorithmic detection

Fast. Deterministic. Cost-neutral.

Regular expressions, named-entity dictionaries, format validators (IBAN checksums, ID-card structures, phone-number patterns). Catches around 60 to 70 percent of personal data in typical documents – everything that is clearly structured.

Coverage bar (0 % to 100 %): 60 – 70 %
Strengths

Very fast, cost-neutral, a hundred percent reproducible.

Limits

Names and context-dependent entities slip through, because regex has no semantics.
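The format validators mentioned for this stage can be illustrated with an IBAN check. The following is a minimal sketch, not anymize's actual implementation: the candidate pattern `IBAN_CANDIDATE` and the function names are assumptions chosen for illustration, while the checksum itself is the standard ISO 13616 mod-97 test.

```python
import re

# Hypothetical candidate pattern: country code, two check digits, then
# 11 to 30 alphanumerics, optionally separated by single spaces.
IBAN_CANDIDATE = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]){11,30}\b")


def iban_checksum_ok(candidate: str) -> bool:
    """Return True if the candidate passes the ISO 13616 mod-97 check."""
    iban = candidate.replace(" ", "").upper()
    if not re.fullmatch(r"[A-Z0-9]{15,34}", iban):
        return False
    # Move the first four characters to the end, then map A..Z to 10..35.
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def find_ibans(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, match) for every candidate that passes the checksum."""
    return [
        (m.start(), m.end(), m.group())
        for m in IBAN_CANDIDATE.finditer(text)
        if iban_checksum_ok(m.group())
    ]
```

The regex alone would also flag random strings that merely look like an IBAN; the checksum filters most of those out, which is why a validator of this kind is deterministic and cheap.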

Stage 02

Our own fine-tuned model

Semantic. Iterative. German professional domains.

A small language model specialized on PII detection, post-trained on German-language expert texts (legal, medical, commercial). It runs multiple iterations over the document and identifies everything stage 1 missed: context-dependent names (“Dr. Weber decided …”), case numbers with atypical form, embedded diagnostic codes, organizations.

Coverage bar (0 % to 100 %): + 10 – 15 %
Strengths

Understands semantics and context, learns from German professional domains.

Limits

Not perfect – a few rare cases slip through.

Stage 03

Prompt-based post-verification

Reasoning layer. Sees the full picture.

A larger model takes the third and final pass: it receives the document plus the markings from stages 1 and 2 and checks via a structured prompt whether anything is missing or mis-marked. Catches cases that escaped the finer-grained stages – and cleans up false positives before they disrupt the flow of text.

Coverage bar (0 % to 100 %): + 13 – 30 %
Strengths

Sees the full picture, can decide based on reasoning.

Limits

More compute-intensive – which is why it's the last stage and not the only step.
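How the three stages hand over to each other can be sketched roughly as follows. This is an illustration under stated assumptions, not the production pipeline: `stage2_stub` and `stage3_stub` are toy stand-ins for the fine-tuned model and the larger verification model, and `Span`, `KNOWN_NAMES` and `run_pipeline` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    category: str
    stage: int  # which stage produced the marking


Detector = Callable[[str, list[Span]], list[Span]]
Verifier = Callable[[str, list[Span]], tuple[list[Span], list[Span]]]

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
KNOWN_NAMES = {"Weber", "Anton"}  # toy data standing in for a real model


def stage1_rules(text: str, _known: list[Span]) -> list[Span]:
    """Deterministic pass; only an email pattern here, as a placeholder."""
    return [Span(m.start(), m.end(), "EMAIL", 1) for m in EMAIL.finditer(text)]


def stage2_stub(text: str, _known: list[Span]) -> list[Span]:
    """Stand-in for the fine-tuned model: marks tokens from a toy name set."""
    return [
        Span(m.start(), m.end(), "NAME", 2)
        for m in re.finditer(r"\b[A-Z][a-z]+\b", text)
        if m.group() in KNOWN_NAMES
    ]


def stage3_stub(text: str, known: list[Span]) -> tuple[list[Span], list[Span]]:
    """Stand-in for the verifier: adds nothing, removes markings after 'Hotel '."""
    removals = [s for s in known if text[max(0, s.start - 6):s.start] == "Hotel "]
    return [], removals


def overlaps(a: Span, b: Span) -> bool:
    return a.start < b.end and b.start < a.end


def run_pipeline(text: str, detectors: list[Detector], verifier: Verifier) -> list[Span]:
    spans: list[Span] = []
    # Stages 1 and 2: each later stage only adds what earlier stages missed.
    for detect in detectors:
        for candidate in detect(text, spans):
            if not any(overlaps(candidate, s) for s in spans):
                spans.append(candidate)
    # Stage 3: sees the text plus all markings, may add and remove.
    additions, removals = verifier(text, spans)
    spans = [s for s in spans if s not in removals]
    spans += [a for a in additions if not any(overlaps(a, s) for s in spans)]
    return sorted(spans, key=lambda s: s.start)
```

Because every `Span` carries its `stage`, the result can later be traced back to the layer that produced it.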

The result

> 95 %

detection rate

In practice, the combination of the three stages delivers a detection rate of over 95 % on German-language documents – significantly more than any stage on its own. And at the same time fewer false positives, because each layer validates the previous one.

Why not one single, large model?

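  • Speed. Stage 1 handles the bulk in milliseconds – the large model only runs on the remaining open cases.

  • Cost. Pure prompt-based detection on a large model would be many times more expensive per document.

  • Explainability. We can show in which stage each result emerged. That matters for audits.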

More than 40 categories

What we
detect.

Category coverage grows continuously. Today, anymize detects more than 40 classes of personal and business-sensitive data, grouped into five families.

01

Identifiers

  • Names (first name, last name, title)
  • Email addresses
  • Phone numbers
  • Addresses (street, ZIP, city)
  • Organizations
  • Dates of birth
02

Government and contract IDs

  • Tax IDs
  • Social and pension insurance numbers
  • ID, passport, driver's license numbers
  • License plates
  • Case numbers, contract IDs
03

Financial data

  • IBANs (with checksum validation)
  • BICs
  • Credit card numbers
  • Account numbers
  • Tax numbers
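The list above only states checksum validation for IBANs. A comparable check exists for card numbers, the Luhn algorithm; whether anymize applies it is an assumption here, so treat the following as an illustrative sketch with a hypothetical function name.

```python
def luhn_ok(number: str) -> bool:
    """Return True if the digits pass the Luhn checksum used by payment cards."""
    cleaned = number.replace(" ", "").replace("-", "")
    if not (cleaned.isascii() and cleaned.isdigit()):
        return False
    if not 12 <= len(cleaned) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit and double every second one.
    for index, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0
```

A check like this lets a rule-based stage reject digit runs that merely have the right length, which keeps false positives down.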
04

Industry-specific identifiers

  • Mandate and insurance numbers
  • Claim numbers
  • Patient IDs
  • ICD diagnosis codes (in preparation)
  • Patent registrations (in preparation)
05

Contextual data

  • Illnesses and medical terminology
  • Industry-specific vocabulary (when marked sensitive)
  • Geo references in combination

Why not regex?

Classic approaches
in the reality test.

Many PII tools on the market are purely rule-based – using regular expressions and static dictionaries. That works for clearly structured data (IBAN, phone numbers), but fails on what makes up the bulk of sensitive content: free text with context.

“Mrs. Weber signs on Monday.”

Regex

Catches “Weber” only if in the dictionary – otherwise: miss.

anymize

Recognizes the context “Mrs. + last name” and marks the name reliably.

“The client, Mr. Schmidt from Mainz, …”

Regex

Might catch “Schmidt”, but not the connection with “client”.

anymize

Recognizes the client relationship and marks the full reference.

“Anton” (as a first name) vs. “Hotel Anton”

Regex

Cannot distinguish – it either anonymizes both (false positive) or neither (miss).

anymize

Makes a context-aware decision.
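The behavior of a purely rule-based tool on these three examples can be reproduced in a few lines. This is a simplified sketch: `NAME_DICTIONARY` and `dictionary_hits` are hypothetical and stand for any static dictionary lookup, not for a specific product.

```python
import re

NAME_DICTIONARY = {"Schmidt", "Anton"}  # hypothetical static dictionary


def dictionary_hits(text: str) -> list[str]:
    """Return every capitalized token that appears in the static dictionary."""
    tokens = re.findall(r"\b[A-Z][a-zäöüß]+\b", text)
    return [token for token in tokens if token in NAME_DICTIONARY]


# "Weber" is not in the dictionary, so the name is missed entirely.
print(dictionary_hits("Mrs. Weber signs on Monday."))            # -> []
# The name is found, but nothing links it to "client" or "Mainz".
print(dictionary_hits("The client, Mr. Schmidt from Mainz, …"))  # -> ['Schmidt']
# Person and hotel are indistinguishable: both are marked.
print(dictionary_hits("We booked Hotel Anton for Anton."))       # -> ['Anton', 'Anton']
```

Extending the dictionary fixes the first case but makes the third one worse, which is the trade-off described above.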

The consequence

Regex systems rarely exceed 70–80 % detection – and produce either many false positives (the anonymized document is unreadable) or too many misses (sensitive data ends up at the external model anyway). Both are unacceptable in a compliance context.

anymize uses regex as the first stage – because it is fast and deterministic – and supplements it with two AI layers that step in exactly where regex fails. That is the reason for the over 95 % detection rate.

Languages & context

Five languages,
many domains.

German has the highest detection quality because our fine-tuned model is trained explicitly on German expert texts. For the other languages the rate typically ranges from 88 to 93 % – depending on domain and document structure.

Supported languages

  • DE, German: target > 95 % (primary training focus)
  • EN, English: 88–93 %
  • FR, French: 88–93 %
  • ES, Spanish: 88–93 %
  • IT, Italian: 88–93 %

Domain coverage

The fine-tuning dataset covers three expert domains with demanding requirements:

Legal

Pleadings, contracts, court decisions, case-law databases.

Medical

Treatment guidelines, findings, expert publications, therapy documentation.

Commercial

Annual reports, contracts, tax literature, business vocabulary.

For other domains (e.g. engineering, architecture, specialized sciences) the system still reaches the advertised rate – because the base models are generalists – but shows less domain-specific finesse. For highly specialized domains we offer individually fine-tuned models in the Enterprise plan.

Transparency

Four-eyes control
built in.

A detection rate of “over 95 %” means: in up to five out of a hundred cases, something may slip through. For compliance-critical processes that is too much – which is why anymize builds transparency firmly into the workflow.

A

Before sending: the review view

Before every request to an external model, the interface shows you what was detected and what wasn't: highlights in the original text, the category of each finding, counts per category. If something important is missing, mark it manually. If something was over-marked, correct it – and the AI remembers that for your workspace.
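What such a review view computes can be sketched in a few lines. The data shape (`Finding` as a `(start, end, category)` tuple) and both function names are assumptions for illustration, and the sketch expects non-overlapping findings.

```python
from collections import Counter

Finding = tuple[int, int, str]  # (start, end, category), hypothetical shape


def review_summary(text: str, findings: list[Finding]) -> tuple[str, Counter]:
    """Return the text with findings highlighted and the count per category."""
    counts = Counter(category for _, _, category in findings)
    highlighted = text
    # Insert markers from right to left so earlier offsets stay valid.
    for start, end, category in sorted(findings, reverse=True):
        marked = f"[{category}: {highlighted[start:end]}]"
        highlighted = highlighted[:start] + marked + highlighted[end:]
    return highlighted, counts


def apply_corrections(
    findings: list[Finding], added: list[Finding], removed: list[Finding]
) -> list[Finding]:
    """Apply manual corrections: drop over-marked findings, add missed ones."""
    kept = [f for f in findings if f not in removed]
    return sorted(set(kept) | set(added))
```

For example, the findings `(4, 9, "NAME")` and `(11, 28, "EMAIL")` on the text `Dr. Weber, weber@example.com` yield the highlighted string `Dr. [NAME: Weber], [EMAIL: weber@example.com]` and one count per category.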

B

The 12-second countdown (enforceable)

Admins can enforce a review countdown before every send: the interface shows the anonymized version, counts down 12 seconds, and only then sends the request. During that window the user can review and cancel. Intended for teams that want a deliberate check before every send.

C

Audit log

Every detection (what, when, which model, which stage) is recorded in the audit log. For compliance evidence and internal quality assurance.
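One possible shape for such a log record is sketched below. The field names, the JSON-lines format and the choice to store offsets instead of the detected text are assumptions made for this example, not a description of the actual anymize log schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AuditEntry:
    category: str     # what was detected
    start: int        # character offsets instead of the raw text,
    end: int          # so the log itself holds no personal data
    detected_at: str  # when, as an ISO 8601 timestamp in UTC
    model: str        # which model or rule set
    stage: int        # which stage: 1, 2 or 3


def make_entry(
    category: str,
    start: int,
    end: int,
    model: str,
    stage: int,
    now: Optional[datetime] = None,
) -> AuditEntry:
    """Build one audit record; `now` can be injected for reproducible tests."""
    if stage not in (1, 2, 3):
        raise ValueError("stage must be 1, 2 or 3")
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return AuditEntry(category, start, end, timestamp, model, stage)


def to_json_line(entry: AuditEntry) -> str:
    """Serialize one entry as a single JSON line for an append-only log."""
    return json.dumps(asdict(entry), sort_keys=True)
```

An append-only file of such lines can be filtered by `stage` or `model` when an auditor asks where a particular marking came from.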

Roadmap

What we're currently
working on.

Detection of personal data is not a solved problem – three current development tracks show where the journey is going.

01 In development

Indirectly personal data

A sentence like “The mayor of city X decided …” contains no name, but a person is clearly identifiable. The GDPR treats such statements as personal data (recital 26). We're developing a combination analysis that catches such identifying contexts – role + location, function + organization, unique attributes.

02 Under way

Trade secrets, patents, formulas

Personal data isn't the only thing worth protecting. Companies have the same interest in ensuring patent ideas, chemical formulas, product prototypes and internal processes don't reach an external model unintentionally. We're extending detection with categories for these contents – as an optional layer on top of PII detection.

03 Enterprise

Individually fine-tuned models

Every company has its own terms, abbreviations and product codes that should count as sensitive. In the Enterprise plan we offer individual fine-tuning on your trade secrets – the detection model learns your company specifics and marks them in addition to the standard categories. If you are interested, reach out to us directly.

For whom

Who benefits most
from precise detection.

For all these contexts: regex is not enough. Human post-editing takes hours. AI-based detection at the level of a three-stage system is the only practical answer.

Lawyers and attorneys

Client name in running text, case numbers with atypical form, indirect hints in pleadings.

Doctors and physicians

Patient name in findings text, medical terminology with personal references, diagnosis combinations.

Insurance companies

Claim reports with mixed formats, descriptions with indirect identifiers.

HR departments

Applications with narrative structure (no forms), employment references.

Consultancies

Interview transcripts, research notes, free-form due-diligence reports.

Public administration

Citizen data in prose notices, social data with indirect hints.

What you should know about detection.

Frequently asked questions

Why not one single, large model?

Three reasons: (1) Speed – stage 1 (regex) catches the bulk in milliseconds; the large model only runs on the remaining open cases. (2) Cost – pure prompt-based detection on a large model would be many times more expensive per document. (3) Explainability – for audits we can show in which stage each entity was detected, and with which reasoning.

Start now.
14-day free trial.

All models. All features. No credit card.

We stand behind anymize. And we know – when an AI tool touches client, patient or employee data, a demo video isn't enough. That's why we give you 14 days of full access – all models, all features, no credit card. Enough time to be certain, before you trust us.

Your AI workplace awaits.