When de-identification of sensitive health and personal data becomes critical

Oct. 11, 2019 / By Octavian Weiser, MD

Since the introduction of the General Data Protection Regulation (GDPR), handling and exchanging documents that contain personal and sensitive health data responsibly has become even more critical. Because this data is highly sensitive, third-party data exchange and usage in hospitals need to be reevaluated. Documents that clinicians use for training, external or internal audits, testing module functionality or developing new features in hospital information systems need pseudonymization [4, Recital 28]. Secure storage of patient information in the cloud or in a hospital database needs pseudonymization and encryption of personal data [4, Art. 32(1)(a)].

3M approaches the de-identification of free text in various European languages (e.g. German, French, Flemish, English, Spanish, Italian) using natural language pseudonymization and obfuscation techniques. Pseudonymization is defined as the separation of data from direct identifiers (such as first name or social security number) so that the data cannot be linked to the original identity without additional information. A pseudonymization table is generated and stored separately on highly secure servers for real-time re-identification of patient documents. Additionally, 3M applies natural obfuscation techniques: sensitive or personal data is replaced with data of the same type, gender and language/region (e.g. a female name is replaced by another female name common in the local culture, and Gritzenweg is replaced with Rosengasse). The replacement is applied throughout the entire document, naturally preserving context, medical data and readability.
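To make the idea of a separately stored pseudonymization table concrete, here is a minimal Python sketch. It is not 3M's implementation: the surrogate pools, the function names and the assumption that sensitive terms have already been detected and typed are all hypothetical, chosen only to illustrate consistent same-type replacement and later re-identification.

```python
import re

# Hypothetical surrogate pools keyed by entity type (assumption for this
# sketch; a real system would use much larger, region-specific lists).
SURROGATES = {
    "FEMALE_NAME": ["Anna", "Lena", "Marie"],
    "STREET": ["Rosengasse", "Lindenweg", "Bachstrasse"],
}


def _swap(text, mapping):
    """Replace every whole-word occurrence of a key with its mapped value."""
    if not mapping:
        return text
    # Longest keys first, so a longer term wins over a shorter one it contains.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda match: mapping[match.group(0)], text)


def pseudonymize(text, entities):
    """Replace detected identifiers with same-type surrogates.

    `entities` maps each detected original string to its entity type.
    Returns the rewritten text and the pseudonymization table
    (surrogate -> original), which would be stored apart from the document.
    """
    forward = {}   # original -> surrogate, so repeated mentions stay consistent
    counters = {}  # next unused pool index per entity type
    for original, entity_type in entities.items():
        pool = SURROGATES[entity_type]
        index = counters.get(entity_type, 0)
        if index >= len(pool):
            raise ValueError(f"surrogate pool exhausted for {entity_type}")
        forward[original] = pool[index]
        counters[entity_type] = index + 1
    table = {surrogate: original for original, surrogate in forward.items()}
    return _swap(text, forward), table


def reidentify(text, table):
    """Restore the original identifiers using the stored table."""
    return _swap(text, table)


note = "Erika lives at Gritzenweg 12. Erika returns on Monday."
masked, table = pseudonymize(note, {"Erika": "FEMALE_NAME", "Gritzenweg": "STREET"})
restored = reidentify(masked, table)
```

Because every mention of the same original maps to the same surrogate, the masked note still reads naturally, and anyone without the table cannot reverse the mapping. The sketch ignores harder cases, such as a surrogate that already occurs in the document or inflected forms of a name.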

The developed algorithms distinguish sensitive health and personal information from the general and medical concepts needed for clinical decisions, clinical study recruitment or coding. They do so by properly handling whitespace, markup, punctuation, nested expressions (e.g. Rue de Wilson) and compositional models (e.g. Carl-Schurz-Platz or Aktivierungsvorrichtung). Preserving local culture and meaning through careful same-type replacements yields identical document annotations in computer-assisted coding systems. An evaluation of English originals and their pseudonymized counterparts shows identical assigned codes and identified evidence in the examined texts. Dates are shifted by a random offset that is applied consistently, preserving time intervals as needed for clinical, research and administrative inferences and decisions.
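Consistent date shifting can be sketched in a few lines of Python. The article does not say how the offset is chosen, so the approach below is an assumption: a per-patient offset is derived from a keyed hash, and only ISO-formatted dates (YYYY-MM-DD) are handled. The names `patient_offset` and `shift_dates` are hypothetical.

```python
import hashlib
import re
from datetime import date, timedelta

# Assumption: dates appear in ISO format; real clinical text needs many
# more patterns (e.g. 11.10.2019 or "Oct. 11, 2019").
DATE_PATTERN = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def patient_offset(patient_key, secret, max_days=365):
    """Derive a stable offset in [-max_days, max_days] for one patient."""
    digest = hashlib.sha256(f"{secret}:{patient_key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % (2 * max_days + 1) - max_days


def shift_dates(text, offset_days):
    """Shift every valid ISO date in the text by the same number of days."""

    def _shift(match):
        year, month, day = (int(part) for part in match.groups())
        try:
            original = date(year, month, day)
        except ValueError:
            return match.group(0)  # not a real calendar date; leave untouched
        return (original + timedelta(days=offset_days)).isoformat()

    return DATE_PATTERN.sub(_shift, text)


offset = patient_offset("patient-001", "demo-secret")
shifted = shift_dates("Admitted 2019-03-01, discharged 2019-03-08.", offset)
```

Since every date in a patient's documents moves by the same offset, the seven-day stay in the example remains seven days after shifting. A production system would also need to exclude an offset of zero and keep the secret as well protected as the pseudonymization table.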

The de-identification process in various languages is based on a training database of over 3 million elements, with natural substitutions that are not detectable as such by human assessors.

Quality assurance shows a current baseline recall of over 95 percent across 19 distinct data elements (including detailed addresses). The process is initially tuned to recognize the various types of sensitive healthcare data aggressively, maximizing recall. In a subsequent development phase, which fine-tunes for precision, proper names are disambiguated in context without affecting recall.
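The recall-first, precision-second tuning described above can be illustrated with a small Python sketch. The span representation (start, end, type) and the example annotations are invented for illustration and are not taken from the evaluation reported here.

```python
def precision_recall(predicted, gold):
    """Compute exact-match precision and recall over annotated spans.

    Each span is a (start, end, entity_type) tuple. Precision is the share
    of predicted spans that are correct; recall is the share of gold spans
    that were found.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold annotations for one document.
gold = [(0, 5, "NAME"), (15, 25, "STREET"), (40, 50, "DATE"), (60, 64, "NAME")]

# Aggressive first pass: all four gold spans are found, plus one false
# positive, e.g. a medical eponym mistaken for a patient name.
aggressive = gold + [(70, 79, "NAME")]

# After disambiguating proper names in context, the false positive is
# dropped while every gold span is still found.
tuned = list(gold)

first_pass = precision_recall(aggressive, gold)
second_pass = precision_recall(tuned, gold)
```

In this toy case the first pass has perfect recall but a precision of 0.8, and the tuned pass raises precision to 1.0 without losing recall, which is the trade-off the two development phases aim for.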

Pseudonymized documents are mainly used for training, auditing, research, support and development, and generally need only a one-way transformation from common HIS formats (PDF, RTF, DOCX, etc.) to plain text, XML (CDA, FHIR) or HTML. Therefore, plain text is extracted from electronic documents of diverse formats for processing. When re-identification of stored documents is needed, e.g. for documents retrieved in real time from a HIS database or the cloud, a bidirectional conversion is imperative to render the re-identified document “as originally created”, respecting its layout.

Dr. Octavian Weiser is a senior analyst within the European Clinical and Economic Research group for 3M Health Information Systems.


  1. ACHARYA, Subrata; PATEL, Anoli. Towards the design of a comprehensive data de-identification solution. In: 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2017. p. 1-8.
  2. NEAMATULLAH, Ishna, et al. Automated de-identification of free-text medical records. BMC Medical Informatics and Decision Making, 2008, vol. 8, no. 1, p. 32.
  3. NEUBAUER, Th, et al. Pseudonymisierung für die datenschutzkonforme Speicherung medizinischer Daten. e & i Elektrotechnik und Informationstechnik, 2010, vol. 127, no. 5, p. 135-142.
  4. REGULATION (EU) 2016/679 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) – https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32016R0679&from=DE – accessed on Sept 19th, 2019
  5. WERMTER, J.; TOMANEK, K.; BALZER, F. Automatische Erkennung und effiziente Annotation von anonymisierungsrelevanten Begriffen in klinischen Freitexten. In: Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie.