phi-shield
Detect, mask, redact, or de-identify Protected Health Information (PHI) and Personally Identifiable Information (PII) from any file or text, in compliance with HIPAA Safe Harbor (45 CFR §164.514). Use this skill whenever the user wants to: redact PHI or PII from documents, de-identify patient data, anonymize health records, mask sensitive fields before sharing data, check whether a file contains PHI, scrub clinical notes or EHR exports, prepare a dataset for research or analytics, comply with HIPAA de-identification requirements, or sanitize CSV/Excel/text/PDF/DOCX files of patient identifiers. Triggers on: PHI, PII, HIPAA, de-identify, anonymize, redact, mask, scrub, sanitize, patient data, health records, clinical notes, EHR, medical records, safe harbor, 18 identifiers, protected health information, personally identifiable.
PHI Shield — HIPAA-Compliant De-identification Skill
Detect, mask, and redact PHI/PII from structured data (CSV, Excel), unstructured text (clinical notes, emails, reports), and documents (DOCX, PDF), using a two-layer approach: regex-based pattern matching for structured identifiers + NLP-based NER for names and contextual entities.
Legal disclaimer: This skill implements HIPAA Safe Harbor de-identification (45 CFR §164.514(b)) as a technical control. It is NOT a substitute for legal counsel or a formal Expert Determination assessment. Always have qualified personnel review outputs before sharing or publishing de-identified data.
Mode selection
Ask the user which mode they need if not already specified:
| Mode | What it does | Use when |
|---|---|---|
| detect | Scan and report which PHI/PII is found, without changing the file | Auditing a file |
| mask | Replace PHI with type labels: [PATIENT_NAME], [SSN] | Readable output needed |
| redact | Replace PHI with █████ or [REDACTED] | Strongest privacy |
| pseudonymize | Replace with consistent fake values | Downstream analytics need structure |
| safe-harbor | Full HIPAA Safe Harbor: remove all 18 identifiers | Research/sharing compliance |
Default to mask if unspecified.
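As a rough illustration of how the modes differ for a single detected value, here is a minimal sketch. The function name `replace_phi`, the `pseudonyms` dictionary, and the `CATEGORY_0001` pseudonym format are hypothetical and are not taken from the skill's scripts. The safe-harbor mode is left out because the table does not say which replacement style it uses.

```python
from typing import Dict


def replace_phi(value: str, category: str, mode: str,
                pseudonyms: Dict[str, str]) -> str:
    """Return the replacement text for one detected PHI value.

    Hypothetical helper: the real scripts may behave differently.
    """
    if mode == "detect":
        # Report-only mode: the text itself is left unchanged.
        return value
    if mode == "mask":
        # Type label keeps the output readable.
        return f"[{category}]"
    if mode == "redact":
        return "[REDACTED]"
    if mode == "pseudonymize":
        # Assumed scheme: reuse one stand-in per original value so
        # repeated occurrences stay consistent within a run.
        if value not in pseudonyms:
            pseudonyms[value] = f"{category}_{len(pseudonyms) + 1:04d}"
        return pseudonyms[value]
    raise ValueError(f"unsupported mode: {mode}")
```

Passing the same `pseudonyms` dictionary across calls is what keeps the stand-in values stable for downstream analytics.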
Step-by-step workflow
Step 1: Identify input type
file /mnt/user-data/uploads/<filename>
stat -c '%s bytes' /mnt/user-data/uploads/<filename>
Route by extension:
- .csv / .tsv → structured pipeline (redact_structured.py)
- .xlsx / .xls → structured pipeline (redact_structured.py)
- .txt / .md / .log → unstructured pipeline (redact_text.py)
- .docx → unstructured pipeline (redact_text.py with docx support)
- .pdf → extract text first, then unstructured pipeline
- Raw text pasted in chat → run inline detection (redact_text.py with stdin)
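The routing rules can be sketched as a small lookup. The function name `route` and the choice to raise on unknown extensions are assumptions made for illustration; only the extension-to-script mapping comes from the list above.

```python
from pathlib import Path

# Extension sets taken from the routing list; matching is case-insensitive.
STRUCTURED_EXTS = {".csv", ".tsv", ".xlsx", ".xls"}
UNSTRUCTURED_EXTS = {".txt", ".md", ".log", ".docx", ".pdf"}


def route(path: str) -> str:
    """Return the script name that should handle the given file."""
    ext = Path(path).suffix.lower()
    if ext in STRUCTURED_EXTS:
        return "redact_structured.py"
    if ext in UNSTRUCTURED_EXTS:
        # PDFs still need their text extracted before this script runs.
        return "redact_text.py"
    raise ValueError(f"no pipeline defined for extension {ext!r}")
```

Raising on an unrecognized extension, rather than guessing a pipeline, avoids silently passing a file through without scanning it.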
Step 2: Install dependencies
pip install pandas openpyxl python-docx pdfminer.six \
presidio-analyzer presidio-anonymizer spacy \
--break-system-packages -q
python -m spacy download en_core_web_lg --quiet 2>/dev/null || \
python -m spacy download en_core_web_sm --quiet
Step 3: Run the appropriate script
Structured data (CSV/Excel):
python /path/to/phi-shield/scripts/redact_structured.py \
"<input_path>" \
"<output_path>" \
--mode mask \
--audit /tmp/phi_audit.json
Unstructured text/DOCX/PDF:
python /path/to/phi-shield/scripts/redact_text.py \
"<input_path>" \
"<output_path>" \
--mode mask \
--audit /tmp/phi_audit.json
Inline text (pasted in chat): Write the text to /tmp/input.txt first, then run
redact_text.py on it.
Step 4: Read and present the audit report
Read /tmp/phi_audit.json after the script completes. Always show the user:
- Total PHI instances found (by category)
- Which columns/sections were affected
- Confidence breakdown (high / medium / low)
- Any items flagged for manual review
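A minimal sketch of building that summary from the parsed audit JSON follows. The key names (`detections`, `category`, `confidence`, `location`) and the 0.85 cut-off for "high" are assumptions; check them against references/audit_schema.md before relying on this. Only the 0.6 manual-review threshold comes from the skill itself.

```python
from collections import Counter
from typing import Any, Dict, List

REVIEW_THRESHOLD = 0.6  # from the skill: below this, flag for manual review
HIGH_THRESHOLD = 0.85   # assumption: boundary between medium and high


def confidence_band(score: float) -> str:
    """Bucket a confidence score into high / medium / low."""
    if score >= HIGH_THRESHOLD:
        return "high"
    if score >= REVIEW_THRESHOLD:
        return "medium"
    return "low"


def summarize_audit(audit: Dict[str, Any]) -> Dict[str, Any]:
    """Aggregate an audit dict (assumed schema) into the four items to show."""
    detections: List[Dict[str, Any]] = audit.get("detections", [])
    return {
        "total": len(detections),
        "by_category": dict(Counter(d["category"] for d in detections)),
        "by_confidence": dict(
            Counter(confidence_band(d["confidence"]) for d in detections)
        ),
        "locations": sorted({d["location"] for d in detections}),
        "needs_review": [
            d for d in detections if d["confidence"] < REVIEW_THRESHOLD
        ],
    }
```

The input would typically come from `json.load` on /tmp/phi_audit.json. The summary only aggregates audit fields, so it never needs the original PHI values.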
Step 5: Save and present output
cp <output_path> /mnt/user-data/outputs/<original_name>_deidentified.<ext>
Call present_files on the output and the audit JSON.
PHI categories detected
See references/phi_categories.md for the full pattern library and NER labels.
The 18 HIPAA Safe Harbor identifiers covered:
- Names (patient, relative, employer) — NER + patterns
- Geographic subdivisions < state (address, city, county, ZIP) — patterns + NER
- Dates (except year): birth, admission, discharge, death — patterns
- Phone numbers — patterns
- Fax numbers — patterns
- Email addresses — patterns
- Social Security numbers — patterns
- Medical record numbers — patterns
- Health plan beneficiary numbers — patterns
- Account numbers — patterns
- Certificate / license numbers — patterns
- Vehicle identifiers and license plates — patterns
- Device identifiers and serial numbers — patterns
- URLs — patterns
- IP addresses — patterns
- Biometric identifiers (finger/voice prints) — keyword detection
- Full-face photos and comparable images — flagged (cannot auto-redact image content)
- Any other unique identifying code — heuristic + patterns
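To make the "patterns" layer concrete, here is a simplified sketch for three of the categories above. These regexes are illustrative only and are not the ones in references/phi_categories.md. The names `PATTERNS` and `find_phi` are hypothetical, and real patterns would need stricter validation (for example, IP octet ranges).

```python
import re
from typing import List, Tuple

# Deliberately simple patterns; they favor recall over precision.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def find_phi(text: str) -> List[Tuple[str, int, int]]:
    """Return (category, start, end) spans, ordered by position.

    Only offsets are returned, so callers can report findings
    without echoing the matched values.
    """
    spans = []
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((category, match.start(), match.end()))
    return sorted(spans, key=lambda span: span[1])
```

Names and other contextual entities are not covered by this kind of matching; those rely on the NER layer.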
Additional PII (non-HIPAA but commonly needed):
- Passport numbers
- Credit card numbers (PAN)
- Bank account / routing numbers
- National ID numbers (non-US)
- Gender / race / ethnicity (quasi-identifier, flagged with low confidence)
- Employer names (quasi-identifier)
Output quality standards
- Masks must be consistent within a document: the same name always maps to the same token ([PATIENT_NAME_1], [PATIENT_NAME_2], etc.)
- Dates must follow the HIPAA rule: remove month and day and keep the year, unless the date indicates an age over 89, in which case the year is removed as well and the age is reported as "90+"
- ZIP codes: keep the first 3 digits only if that 3-digit area has > 20,000 people, else replace with 000 (see references/phi_categories.md for the rule)
- Audit report must list every detection with: category, confidence, line/column reference, action taken, and a non-reversible token (salted hash) referencing the value. Original PHI must never be persisted in audit logs (see references/audit_schema.md)
- Never log or echo original PHI values to stdout in production mode
- If confidence < 0.6 on any detection, flag it in audit as "needs manual review"
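Three of these standards can be sketched as follows: consistent numbered tokens, the ZIP rule, and the non-reversible audit token. All names here (`TokenMap`, `generalize_zip`, `audit_token`) are hypothetical. The audit token uses HMAC-SHA256 with a secret key as a stand-in for the salted hash; that choice is an assumption, not something the skill specifies. The set of low-population ZIP prefixes must come from census data and is passed in rather than hard-coded.

```python
import hashlib
import hmac
from typing import Dict, Set, Tuple


class TokenMap:
    """Give every occurrence of the same value the same numbered token."""

    def __init__(self) -> None:
        self._tokens: Dict[Tuple[str, str], str] = {}
        self._counts: Dict[str, int] = {}

    def token_for(self, category: str, value: str) -> str:
        # Normalize case and surrounding whitespace so trivial
        # variants of a value share one token.
        key = (category, value.strip().lower())
        if key not in self._tokens:
            self._counts[category] = self._counts.get(category, 0) + 1
            self._tokens[key] = f"[{category}_{self._counts[category]}]"
        return self._tokens[key]


def generalize_zip(zip_code: str, low_population_prefixes: Set[str]) -> str:
    """Keep the 3-digit prefix unless its area has 20,000 or fewer people."""
    prefix = zip_code.strip()[:3]
    return "000" if prefix in low_population_prefixes else prefix


def audit_token(value: str, salt: bytes) -> str:
    """Keyed hash for the audit log (assumed HMAC-SHA256).

    The salt must be kept secret and stored apart from the audit file.
    """
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()
```

Keeping one `TokenMap` per document matches the "consistent within a document" requirement; sharing it across documents would link records between files, which may or may not be wanted.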
Reference files
- references/phi_categories.md: Full regex pattern library + NER label mappings
- references/safe_harbor_rules.md: Exact HIPAA 45 CFR §164.514 rules with implementation guidance for edge cases (ZIP, dates, ages, re-ID codes)
- references/gdpr_extensions.md: Additional rules for GDPR Art. 4 / UK GDPR when European patient data is involved
- references/audit_schema.md: JSON schema for the audit report output

