Sanitizing Code and Data Before Sending to AI

Why This Matters

Most AI services process your inputs on remote infrastructure. Anthropic, OpenAI, and others have data handling policies, but sending sensitive data still creates real risks:

NDA and IP exposure — proprietary algorithms, unreleased product names, and business logic can appear in prompts without you realising.
Personal data regulations — GDPR, CCPA, and similar laws apply to personal data you control. Submitting it to a third party requires a lawful basis and, usually, a data processing agreement.
Credential leaks — API keys, database connection strings, and tokens pasted from config files or logs are a persistent risk.
Customer trust — your users gave you their data, not your toolchain.

None of this means AI tools are off-limits. It means developing a habit: scan before you paste.

The Four Categories to Watch

1. Credentials and Secrets

These are the highest-risk items because they are immediately actionable if leaked.

API keys (sk-..., AKIA..., bearer tokens)
Database connection strings (postgres://user:password@host/db)
Private keys, certificates, JWT secrets
OAuth client secrets, webhook signing keys
Anything inside .env files

Non-Obvious Risk

Credentials often appear in stack traces and logs, not just config files. A connection refused error can include the full database URL. A failed HTTP request can include an Authorization header. Always check log snippets before pasting.
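
As a rough illustration of the point above, the sketch below masks two of the patterns mentioned: a password embedded in a connection URL and the value of an Authorization header. The function name and both regular expressions are assumptions made for this example; they will not catch every format (headers logged as quoted dictionary entries, for instance), so treat it as a starting point rather than a complete filter.

```python
import re

# Hypothetical patterns for illustration; extend them for your own stack.
# Password part of URLs such as postgres://user:password@host/db
URL_PASSWORD = re.compile(
    r"(\b[a-z][a-z0-9+.-]*://[^\s:/@]+:)[^\s/@]+(@)", re.IGNORECASE
)
# Value of an Authorization header, with an optional Bearer/Basic scheme
AUTH_HEADER = re.compile(
    r"(authorization\s*[:=]\s*)(?:(bearer|basic)\s+)?\S+", re.IGNORECASE
)


def _mask_auth(match: re.Match) -> str:
    """Keep the header name and scheme, replace only the secret value."""
    scheme = f"{match.group(2)} " if match.group(2) else ""
    return f"{match.group(1)}{scheme}REDACTED"


def redact_credentials(text: str) -> str:
    """Return text with URL passwords and Authorization values replaced."""
    text = URL_PASSWORD.sub(r"\1REDACTED\2", text)
    return AUTH_HEADER.sub(_mask_auth, text)


print(redact_credentials(
    "connect failed: postgres://app:s3cr3t@db.internal:5432/orders"
))
```

The host name and port survive this pass, so internal hostnames still need a manual check before pasting.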

2. Personally Identifiable Information (PII)

Any data that identifies a real person falls into this category:

Names, email addresses, phone numbers
IP addresses (often PII under GDPR)
User IDs or session tokens that map to real people
Health or financial data — doubly regulated
Geolocation data

A common pattern: you're debugging a failing query and paste the query with real data from a recent test. The query itself is fine to share; the embedded customer email is not.

3. Business Logic and Trade Secrets

Harder to define, but worth considering for anything under NDA or in an unreleased product:

Pricing algorithms, margin calculations
Unreleased feature names or product roadmap signals
Proprietary scoring models
Internal system names, codenames, or architecture diagrams

The test: would you paste this in a public GitHub issue? If not, it may not belong in an AI prompt either.

4. Customer Data in Sample Payloads

Developers frequently paste JSON payloads, database rows, or API responses to ask AI for help parsing or transforming them. The shape of the data is usually what you need — not the values.

// Don't paste this to ask about schema design:
{
  "user_id": "u_4829",
  "email": "jane.smith@acme.com",
  "created_at": "2024-03-15",
  "balance": 4200.00
}

// Paste this instead:
{
  "user_id": "u_XXXXX",
  "email": "user@example.com",
  "created_at": "YYYY-MM-DD",
  "balance": 0.00
}
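
If editing payloads by hand gets tedious, the same idea can be scripted. The following is a minimal sketch, assuming the payload is plain JSON: it keeps every key and the type of every value but replaces the values with neutral placeholders. The function name to_shape is made up for this example, and keys are left untouched, so it does not help when the field names themselves are sensitive.

```python
import json
from typing import Any


def to_shape(value: Any) -> Any:
    """Replace every leaf value with a placeholder of the same JSON type."""
    if isinstance(value, dict):
        return {key: to_shape(item) for key, item in value.items()}
    if isinstance(value, list):
        return [to_shape(item) for item in value]
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return False
    if isinstance(value, int):
        return 0
    if isinstance(value, float):
        return 0.0
    if isinstance(value, str):
        return "string"
    return None  # JSON null stays null


payload = json.loads(
    '{"user_id": "u_4829", "email": "jane.smith@acme.com",'
    ' "created_at": "2024-03-15", "balance": 4200.00}'
)
print(json.dumps(to_shape(payload), indent=2))
```

When the format of a value matters to the question (a date layout or an ID prefix, say), substitute a hand-written placeholder such as u_XXXXX for the generic one.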

Sanitizing Code Before Pasting

Quick Manual Pass (30 seconds)

Before pasting any code file or snippet, do a quick scan for:

Anything hardcoded that looks like a secret — search for = followed by a long string
Real email addresses, IDs, or names in comments, test data, or constants
Connection strings in config imports or environment access patterns

Replace with placeholders: YOUR_API_KEY, user@example.com, REDACTED. You don't need to anonymize variable names or business logic — just the values.
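
The manual pass can be backed by a small helper that points at lines worth a second look. This is a sketch under assumed heuristics (a long quoted value after = or :, an email address, a URL with embedded credentials); the rule names and thresholds are illustrative, it will produce both false positives and misses, and it is not a substitute for a dedicated secret scanner.

```python
import re

# Assumed heuristics for illustration, not an exhaustive rule set.
SUSPICIOUS = {
    "long quoted value after = or :": re.compile(
        r"""[=:]\s*["'][A-Za-z0-9+/_\-]{20,}["']"""
    ),
    "email address": re.compile(
        r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
    ),
    "URL with embedded credentials": re.compile(r"://[^\s:/@]+:[^\s/@]+@"),
}


def scan(text: str) -> list[tuple[int, str]]:
    """Return (line_number, reason) pairs for lines worth a second look."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for reason, pattern in SUSPICIOUS.items():
            if pattern.search(line):
                findings.append((number, reason))
    return findings


snippet = (
    'API_KEY = "abcdefghijklmnopqrstuvwxyz123456"\n'
    'name = "ok"\n'
    "# contact jane.smith@acme.com\n"
)
for number, reason in scan(snippet):
    print(f"line {number}: {reason}")
```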

Automated Pre-Scan (for teams)

Tools like TruffleHog and Gitleaks scan files for credentials. Running them as a pre-commit hook catches secrets before they reach version control; running the same scan by hand on a file you are about to paste catches them before they reach an AI prompt.

For log files specifically, a simple script approach works well:

# Strip common credential patterns from a log file before pasting
# (adapt the patterns to your stack; assumes GNU sed for the i flag and \b)
sed -E \
  -e 's/(password|secret|token|key)=[^ &]*/\1=REDACTED/gi' \
  -e 's/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/user@example.com/g' \
  -e 's/\b[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\b/0.0.0.0/g' \
  production.log > sanitized.log

Sanitizing Log Files

Log files are one of the most common things developers share with AI — "here's the error, what's wrong?" — and one of the most commonly over-shared.

What Logs Often Contain

Full HTTP request headers, including Authorization tokens
SQL queries with embedded values (not just parameters)
User IDs, session IDs, or request IDs traceable to individuals
Internal service URLs, internal hostnames, internal IP addresses
Stack traces with internal file paths

What to Share vs What to Redact

The error message, stack trace structure, and timing are usually what matter. Values rarely do.

Good Mental Model

Share the shape of the problem, not the content of the data. "The query returns no rows when filtering by user_id = 12345" can become "the query returns no rows when filtering by a valid user ID." The AI can diagnose the query logic without knowing the actual ID.
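
When the relationship between log lines matters (the same user appearing in several entries, for example), one option is to replace each real ID with a stable placeholder so the pattern survives while the value does not. The sketch below assumes IDs appear as user_id=<digits>; that format, the function name, and the USER_n aliases are all choices made for this illustration.

```python
import re

# Assumed log format: identifiers appear as user_id=<digits>.
USER_ID = re.compile(r"\buser_id=(\d+)")


def pseudonymize(text: str) -> str:
    """Replace each distinct user ID with a stable alias (USER_1, USER_2, ...)."""
    aliases: dict[str, str] = {}

    def replace(match: re.Match) -> str:
        real_id = match.group(1)
        # The same real ID always maps to the same alias within one call.
        alias = aliases.setdefault(real_id, f"USER_{len(aliases) + 1}")
        return f"user_id={alias}"

    return USER_ID.sub(replace, text)


log = (
    "user_id=12345 login ok\n"
    "user_id=67890 query failed\n"
    "user_id=12345 logout"
)
print(pseudonymize(log))
```

The mapping lives only in memory for the duration of the call, so the shared text carries no way back to the real IDs.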

Working With Customer Data

Sometimes the data itself is the problem — a specific customer record triggers a bug, a particular order causes a calculation to fail. You need to share something real to diagnose it.

Option 1: Synthetic Reconstruction

Recreate the data with fake values that preserve the structure causing the bug. If a null middle_name breaks your name formatter, create a test record with null — you don't need the actual customer's name.
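
To make the middle_name example concrete, here is a small sketch. The formatter is a hypothetical stand-in for the code under investigation, not anyone's real implementation, and the record contains invented values; only the null middle_name is carried over from the failing case.

```python
def format_name(record: dict) -> str:
    """Hypothetical formatter standing in for the code being debugged."""
    initial = record["middle_name"][0]  # fails when middle_name is None
    return f'{record["first_name"]} {initial}. {record["last_name"]}'


# Synthetic record: invented values, same structure as the failing one.
synthetic_record = {
    "first_name": "Test",
    "middle_name": None,
    "last_name": "User",
}

try:
    format_name(synthetic_record)
except TypeError as error:
    print(f"Reproduced with synthetic data: {error}")
```

The function and the synthetic record together are enough to ask for help; the real customer's name never enters the prompt.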

Option 2: Minimal Reproduction

Isolate the minimal data structure that triggers the problem. Often this reveals what the AI actually needs — the schema and the edge case — rather than a full production record.

Option 3: Local Model for Production Data

If you genuinely need to work with real production data and the data is sensitive enough that it cannot leave your environment, a local model is the right tool. See the Local and Private AI Models guide for setup options.

When to Use a Local Model Instead

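A local model runs entirely on your machine, so your prompts are not sent to a third-party service (check that the tool you run it with has no telemetry or cloud fallback enabled). It's the right choice when: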

You're working under an NDA that restricts sharing code externally
Your organization's security policy prohibits sending code to external services
You're working with health data, financial data, or other heavily regulated information
You need to analyze actual production logs, queries, or payloads
You want to run AI assistance in an air-gapped or offline environment

Local models have real tradeoffs — they're smaller and less capable than frontier models. But for many coding tasks — explaining a function, suggesting variable names, generating tests for well-defined logic — they're more than adequate.

Building a Team Habit

Individual vigilance is fragile. The more reliable approach is a short team agreement:

Define a default position — which categories of data are always off-limits for external AI (e.g., production PII, credentials, NDA-covered code)?
Provide a sanitization pattern — a one-page reference with concrete before/after examples for your specific stack.
Name the alternative — if someone needs to work with sensitive data, what's the approved path? Local model? Sanitize first? Escalate?
Add it to onboarding — new developers should hear about this on day one, alongside the git workflow and deployment process.

Low-Effort Team Win

Add a one-paragraph data sanitization note to your developer handbook and link to it from the section that introduces AI coding tools. Most developers want to do the right thing — they just need to know what that is.

The Short Version

Scan for credentials before pasting anything — they appear in logs and stack traces, not just config files.
Replace PII with placeholder values. The AI needs the data shape, not the data itself.
Business logic and proprietary algorithms under NDA warrant a local model, not a workaround.
Log files are high-risk — strip auth headers, connection strings, and user IDs before sharing.
When in doubt: reproduce the problem with synthetic data. If you can't, use a local model.
