Security Guide

Sanitizing Code and Data Before Sending to AI

AI assistants are genuinely useful for debugging, refactoring, and reviewing code — but the moment you paste a real database query, a log file, or a customer record, you're sending data to a third-party service. This guide explains what to scrub, how to scrub it, and when a local model is the right choice instead.

Last reviewed: May 26, 2026


Why This Matters

Most AI services process your inputs on remote infrastructure. Anthropic, OpenAI, and others publish data handling policies, but sending sensitive data still creates real risks.

None of this means AI tools are off-limits. It means developing a habit: scan before you paste.


The Four Categories to Watch

1. Credentials and Secrets

These are the highest-risk items because a leaked credential can be used immediately, with no further information needed.

Non-Obvious Risk

Credentials often appear in stack traces and logs, not just config files. A connection refused error can include the full database URL. A failed HTTP request can include an Authorization header. Always check log snippets before pasting.
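
As a rough illustration of that check, the sketch below redacts two of the cases mentioned above: credentials embedded in a connection URL and the value of an Authorization header. The function name and both patterns are assumptions made for this example, not a complete detector. Passwords containing `@` or `/`, or header formats other than `Authorization: <scheme> <value>`, would need their own patterns.

```python
import re

# Hypothetical patterns; extend them for the log formats in your own stack.
# Matches "user:password@" inside a URL such as scheme://user:password@host/db
URL_CREDENTIALS = re.compile(
    r"(?P<scheme>[a-z][a-z0-9+.-]*://)[^/\s:@]+:[^/\s@]+@", re.IGNORECASE
)
# Matches "Authorization: <scheme> <value>", e.g. a Bearer or Basic header
AUTH_HEADER = re.compile(
    r"(?P<name>Authorization:\s*)(?P<kind>\w+)\s+\S+", re.IGNORECASE
)


def redact_log_line(line: str) -> str:
    """Replace URL credentials and Authorization header values with REDACTED."""
    line = URL_CREDENTIALS.sub(r"\g<scheme>REDACTED:REDACTED@", line)
    line = AUTH_HEADER.sub(r"\g<name>\g<kind> REDACTED", line)
    return line


print(redact_log_line("connection refused: postgres://app:s3cret@db.internal:5432/orders"))
print(redact_log_line("GET /v1/items Authorization: Bearer abc.def.ghi"))
```

The host, port, and path survive redaction on purpose: they are usually what the AI needs to reason about the error, while the username and password are not.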

2. Personally Identifiable Information (PII)

Any data that identifies a real person falls into this category, including names, email addresses, and account IDs.

A common pattern: you're debugging a failing query and paste the query with real data from a recent test. The query itself is fine to share; the embedded customer email is not.

3. Business Logic and Trade Secrets

This category is harder to define, but it is worth considering for anything under NDA or part of an unreleased product.

The test: would you paste this in a public GitHub issue? If not, it may not belong in an AI prompt either.

4. Customer Data in Sample Payloads

Developers frequently paste JSON payloads, database rows, or API responses to ask AI for help parsing or transforming them. The shape of the data is usually what you need — not the values.

// Don't paste this to ask about schema design:
{
  "user_id": "u_4829",
  "email": "jane.smith@acme.com",
  "created_at": "2024-03-15",
  "balance": 4200.00
}

// Paste this instead:
{
  "user_id": "u_XXXXX",
  "email": "user@example.com",
  "created_at": "YYYY-MM-DD",
  "balance": 0.00
}
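
If you do this often, the replacement can be automated. The following is a minimal sketch, assuming the payload is already valid JSON; the function name `to_shape` and the placeholder strings are hypothetical choices for this example. It keeps keys and nesting and replaces every leaf value with a type marker. Field names are left as-is, so check separately that the keys themselves reveal nothing sensitive.

```python
import json
from typing import Any


def to_shape(value: Any) -> Any:
    """Return a copy of a parsed JSON value with every leaf replaced by a type marker."""
    if isinstance(value, dict):
        return {key: to_shape(item) for key, item in value.items()}
    if isinstance(value, list):
        # One representative element is enough to show the shape of a list.
        return [to_shape(value[0])] if value else []
    if value is None:
        return None  # keep nulls: they are often the edge case being debugged
    if isinstance(value, bool):  # check bool first; bool is a subclass of int
        return "<bool>"
    if isinstance(value, (int, float)):
        return "<number>"
    return "<string>"


payload = json.loads(
    '{"user_id": "u_4829", "email": "jane.smith@acme.com", "balance": 4200.00}'
)
print(json.dumps(to_shape(payload), indent=2))
```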

Sanitizing Code Before Pasting

Quick Manual Pass (30 seconds)

Before pasting any code file or snippet, do a quick scan for:

  1. Anything hardcoded that looks like a secret — search for = followed by a long string
  2. Real email addresses, IDs, or names in comments, test data, or constants
  3. Connection strings in config imports or environment access patterns

Replace with placeholders: YOUR_API_KEY, user@example.com, REDACTED. You don't need to anonymize variable names or business logic — just the values.
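
The manual pass can be backed by a small helper that flags lines worth a second look. This is a heuristic sketch, not a replacement for a real secret scanner: the 20-character threshold, the regular expressions, and the function name `find_suspects` are all assumptions made for illustration, and both false positives and misses are expected.

```python
import re

# Heuristic: an assignment (= or :) whose right-hand side is a quoted string of
# 20+ characters drawn from typical key/token alphabets. The threshold is an assumption.
LONG_LITERAL = re.compile(r"""[=:]\s*["']([A-Za-z0-9_\-+/=.]{20,})["']""")
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_suspects(source: str) -> list[tuple[int, str]]:
    """Return (line_number, reason) pairs for lines that deserve a manual check."""
    suspects = []
    for number, line in enumerate(source.splitlines(), start=1):
        if LONG_LITERAL.search(line):
            suspects.append((number, "long string literal"))
        # Addresses on example.* domains are treated as placeholders already.
        if EMAIL.search(line) and "@example." not in line:
            suspects.append((number, "email address"))
    return suspects


snippet = '''API_KEY = "abcd1234abcd1234abcd1234"
name = "demo"
# contact jane.smith@acme.com
support = "user@example.com"
'''
for number, reason in find_suspects(snippet):
    print(f"line {number}: {reason}")
```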

Automated Pre-Scan (for teams)

Tools like TruffleHog and Gitleaks can scan files for credentials before commit. Running them as a pre-commit hook catches secrets before they reach version control, and running the same scanner on a file before you paste it applies the same check to AI prompts.

For log files specifically, a simple script approach works well:

# Strip common credential patterns from a log file before pasting
# (adapt the patterns to your stack).
# Assumes GNU sed: the case-insensitive flag and \b are GNU extensions.
# Digits are written as [0-9] because sed -E does not support \d.
sed -E \
  -e 's/(password|secret|token|key)=[^ &]*/\1=REDACTED/gi' \
  -e 's/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/user@example.com/g' \
  -e 's/\b[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\b/0.0.0.0/g' \
  production.log > sanitized.log

Sanitizing Log Files

Log files are one of the most common things developers share with AI — "here's the error, what's wrong?" — and one of the most commonly over-shared.

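What Logs Often Contain

Beyond the error itself, logs often carry connection strings, Authorization headers, tokens and passwords in query strings, email addresses, and IP addresses.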

What to Share vs What to Redact

The error message, stack trace structure, and timing are usually what matter. Values rarely do.

Good Mental Model

Share the shape of the problem, not the content of the data. "The query returns no rows when filtering by user_id = 12345" can become "the query returns no rows when filtering by a valid user ID." The AI can diagnose the query logic without knowing the actual ID.


Working With Customer Data

Sometimes the data itself is the problem — a specific customer record triggers a bug, a particular order causes a calculation to fail. You need to share something real to diagnose it.

Option 1: Synthetic Reconstruction

Recreate the data with fake values that preserve the structure causing the bug. If a null middle_name breaks your name formatter, create a test record with null — you don't need the actual customer's name.
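
One way to approach this, sketched under the assumption that the record is a flat dictionary: keep the traits that tend to trigger bugs (missing values, empty strings, string lengths, value types) and throw away everything else. The function name `synthesize` and the filler values are hypothetical. Nested structures and format-sensitive fields such as dates would need handling by hand.

```python
from typing import Any


def synthesize(record: dict[str, Any]) -> dict[str, Any]:
    """Build a fake record that keeps the structural traits of the original.

    Kept: keys, None values, empty strings, string lengths, value types.
    Dropped: the actual values.
    """
    fake: dict[str, Any] = {}
    for key, value in record.items():
        if value is None or value == "":
            fake[key] = value              # null or empty is often the bug trigger
        elif isinstance(value, bool):
            fake[key] = value              # a flag on its own identifies no one
        elif isinstance(value, (int, float)):
            fake[key] = type(value)(0)     # same numeric type, neutral value
        elif isinstance(value, str):
            fake[key] = "x" * len(value)   # same length, no content
        else:
            fake[key] = "REDACTED"         # nested or unknown types: handle by hand
    return fake


customer = {"first_name": "Jane", "middle_name": None, "last_name": "Smith", "age": 41}
print(synthesize(customer))
```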

Option 2: Minimal Reproduction

Isolate the minimal data structure that triggers the problem. Often this reveals what the AI actually needs — the schema and the edge case — rather than a full production record.
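
The isolation step can be done mechanically when you have a function that reports whether a record still triggers the failure. The sketch below greedily drops one field at a time and keeps the removal whenever the bug still reproduces. `minimize` and `triggers_bug` are hypothetical names, and the name formatter inside `triggers_bug` is a stand-in for your own failing code.

```python
from typing import Any, Callable


def minimize(
    record: dict[str, Any],
    triggers_bug: Callable[[dict[str, Any]], bool],
) -> dict[str, Any]:
    """Greedily drop fields while the failure still reproduces."""
    if not triggers_bug(record):
        raise ValueError("the full record does not reproduce the bug")
    current = dict(record)
    for key in list(record):
        candidate = {k: v for k, v in current.items() if k != key}
        if triggers_bug(candidate):
            current = candidate  # the field was irrelevant, so leave it out
    return current


def triggers_bug(record: dict[str, Any]) -> bool:
    """Stand-in for the failing code: a formatter that cannot handle None."""
    parts = [record.get(k, "") for k in ("first_name", "middle_name", "last_name")]
    try:
        " ".join(parts)
    except TypeError:
        return True
    return False


record = {"first_name": "A", "middle_name": None, "last_name": "B", "email": "a@example.com"}
print(minimize(record, triggers_bug))  # {'middle_name': None}
```

The fields that survive are the ones the bug depends on. Their values may still be real, so sanitize them before sharing.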

Option 3: Local Model for Production Data

If you genuinely need to work with real production data and the data is sensitive enough that it cannot leave your environment, a local model is the right tool. See the Local and Private AI Models guide for setup options.


When to Use a Local Model Instead

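A model running entirely on your machine processes prompts locally, so your data is not sent to a third-party service. It's the right choice when you need to work with real production data that is too sensitive to leave your environment.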

Local models have real tradeoffs — they're smaller and less capable than frontier models. But for many coding tasks — explaining a function, suggesting variable names, generating tests for well-defined logic — they're more than adequate.


Building a Team Habit

Individual vigilance is fragile. The more reliable approach is a short team agreement:

  1. Define a default position: which categories of data are always off-limits for external AI (e.g., production PII, credentials, NDA-covered code)?
  2. Provide a sanitization pattern: a one-page reference with concrete before/after examples for your specific stack.
  3. Name the alternative: if someone needs to work with sensitive data, what's the approved path? Local model? Sanitize first? Escalate?
  4. Add it to onboarding: new developers should hear about this on day one, alongside the git workflow and deployment process.

Low-Effort Team Win

Add a one-paragraph data sanitization note to your developer handbook and link to it from the section that introduces AI coding tools. Most developers want to do the right thing — they just need to know what that is.


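The Short Version

Scan before you paste. Share the shape of the problem, not the content of the data. Replace real values with placeholders, and use a local model when the data cannot leave your environment.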

Related Guides

Local and Private AI Models

When data can't leave your environment: Ollama, LM Studio, and local model selection for privacy-constrained development.

Manual Ch. 11 — Security and Risks

SQL injection, XSS, hardcoded secrets, and the security checklist for AI-generated code.

Claude API for Developers

Building responsibly with the API: data handling, system prompts, and production patterns.
