Why This Matters
Most AI services process your inputs on remote infrastructure. Anthropic, OpenAI, and others have data handling policies, but sending sensitive data still creates real risks:
- NDA and IP exposure — proprietary algorithms, unreleased product names, and business logic can appear in prompts without you realising.
- Personal data regulations — GDPR, CCPA, and similar laws apply to personal data you control. Submitting it to a third party requires a lawful basis and, usually, a data processing agreement.
- Credential leaks — API keys, database connection strings, and tokens pasted from config files or logs are a persistent risk.
- Customer trust — your users gave you their data, not your toolchain.
None of this means AI tools are off-limits. It means developing a habit: scan before you paste.
The Four Categories to Watch
1. Credentials and Secrets
These are the highest-risk items because they are immediately actionable if leaked.
- API keys (sk-..., AKIA..., bearer tokens)
- Database connection strings (postgres://user:password@host/db)
- Private keys, certificates, JWT secrets
- OAuth client secrets, webhook signing keys
- Anything inside .env files
Credentials often appear in stack traces and logs, not just config files. A connection refused error can include the full database URL. A failed HTTP request can include an Authorization header. Always check log snippets before pasting.
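As a rough illustration of checking a log snippet before pasting, the sketch below flags lines that match a few well-known credential shapes. The pattern list and the find_secrets name are assumptions made for this example, not a complete detector; real key formats vary by provider.

```python
import re

# Illustrative patterns only; extend them for the providers in your stack.
SECRET_PATTERNS = {
    "sk-style API key": re.compile(r"\bsk-[A-Za-z0-9_-]{16,}"),
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "bearer token": re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/-]{8,}=*"),
    "URL with embedded password": re.compile(
        r"\b[a-z][a-z0-9+.-]*://[^\s:/@]+:[^\s@/]+@"
    ),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for every line that looks risky."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits
```

An empty result is not proof that the text is clean. It only means none of these particular shapes appeared, so a quick read-through is still worthwhile.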
2. Personally Identifiable Information (PII)
Any data that identifies a real person falls into this category:
- Names, email addresses, phone numbers
- IP addresses (often treated as personal data under GDPR)
- User IDs or session tokens that map to real people
- Health or financial data — doubly regulated
- Geolocation data
A common pattern: you're debugging a failing query and paste the query with real data from a recent test. The query itself is fine to share; the embedded customer email is not.
3. Business Logic and Trade Secrets
Harder to define, but worth considering for anything under NDA or in an unreleased product:
- Pricing algorithms, margin calculations
- Unreleased feature names or product roadmap signals
- Proprietary scoring models
- Internal system names, codenames, or architecture diagrams
The test: would you paste this in a public GitHub issue? If not, it may not belong in an AI prompt either.
4. Customer Data in Sample Payloads
Developers frequently paste JSON payloads, database rows, or API responses to ask AI for help parsing or transforming them. The shape of the data is usually what you need — not the values.
// Don't paste this to ask about schema design:
{
"user_id": "u_4829",
"email": "jane.smith@acme.com",
"created_at": "2024-03-15",
"balance": 4200.00
}
// Paste this instead:
{
"user_id": "u_XXXXX",
"email": "user@example.com",
"created_at": "YYYY-MM-DD",
"balance": 0.00
}
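For larger payloads, hand-editing every value gets tedious. The sketch below reduces a parsed JSON document to its shape by replacing each leaf with the name of its type. The to_shape helper is an assumption for illustration; note that it keeps the keys, so it only helps when the field names themselves are safe to share.

```python
import json


def to_shape(value):
    """Recursively replace every leaf value with a placeholder for its type."""
    if isinstance(value, dict):
        return {key: to_shape(item) for key, item in value.items()}
    if isinstance(value, list):
        # One representative element is usually enough to show the structure.
        return [to_shape(value[0])] if value else []
    if value is None:
        # Keep nulls: a missing value is often the edge case being debugged.
        return None
    return f"<{type(value).__name__}>"


def shape_of_json(raw: str) -> str:
    """Parse a JSON string and return its shape as pretty-printed JSON."""
    return json.dumps(to_shape(json.loads(raw)), indent=2)
```

Collapsing lists to one element assumes the items are homogeneous. If a bug depends on a mixed list, keep every element and mask each one instead.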
Sanitizing Code Before Pasting
Quick Manual Pass (30 seconds)
Before pasting any code file or snippet, do a quick scan for:
- Anything hardcoded that looks like a secret: search for = followed by a long string
- Real email addresses, IDs, or names in comments, test data, or constants
- Connection strings in config imports or environment access patterns
Replace with placeholders: YOUR_API_KEY, user@example.com, REDACTED. You don't need to anonymize variable names or business logic — just the values.
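The "= followed by a long string" check can be approximated with a short script. This is a sketch under stated assumptions: the 20-character threshold is arbitrary, the flag_hardcoded_values name is hypothetical, and the pattern will produce false positives (long URLs, for instance) as well as misses.

```python
import re

# An assignment (= or :) followed by a quoted literal of 20+ characters
# with no whitespace. The threshold is an arbitrary starting point.
LONG_LITERAL = re.compile(r"""[=:]\s*(["'])([^"'\s]{20,})\1""")


def flag_hardcoded_values(source: str) -> list[int]:
    """Return 1-based line numbers that assign a long quoted literal."""
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if LONG_LITERAL.search(line)
    ]
```

Treat the returned line numbers as places to look, not as a verdict. A human still decides whether each value is a secret.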
Automated Pre-Scan (for teams)
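Tools like TruffleHog and Gitleaks scan files for credentials. Running them as a pre-commit hook catches secrets before they reach version control. A hook only sees what you commit, though, so it also helps to run the same scanner directly against a file or directory before pasting from it into an AI prompt.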
For log files specifically, a simple script approach works well:
# Strip common credential patterns from a log file before pasting
# (adapt the patterns to your stack; the case-insensitive flag and \b are GNU sed extensions)
# Note: sed -E does not understand \d, so digits are written as [0-9].
sed -E \
  -e 's/(password|secret|token|key)=[^ &]*/\1=REDACTED/gi' \
  -e 's/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/user@example.com/g' \
  -e 's/\b[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\b/0.0.0.0/g' \
  production.log > sanitized.log
Sanitizing Log Files
Log files are one of the most common things developers share with AI — "here's the error, what's wrong?" — and one of the most commonly over-shared.
What Logs Often Contain
- Full HTTP request headers, including Authorization tokens
- SQL queries with embedded values (not just parameters)
- User IDs, session IDs, or request IDs traceable to individuals
- Internal service URLs, internal hostnames, internal IP addresses
- Stack traces with internal file paths
What to Share vs What to Redact
The error message, stack trace structure, and timing are usually what matter. Values rarely do.
Share the shape of the problem, not the content of the data. "The query returns no rows when filtering by user_id = 12345" can become "the query returns no rows when filtering by a valid user ID." The AI can diagnose the query logic without knowing the actual ID.
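A small rule table can apply this "shape, not content" idea to log lines. The sketch below is illustrative only: the field names (user_id, session_id), the .internal hostname suffix, and the redact function are assumptions to replace with whatever your own logs contain.

```python
import re

# (pattern, replacement) pairs applied in order. All patterns are examples.
REDACTIONS = [
    # Authorization header: drop the scheme and the credential together.
    (
        re.compile(r"(?i)(authorization:\s*)(?:(?:bearer|basic)\s+)?\S+"),
        r"\1REDACTED",
    ),
    # Identifiers traceable to a person, e.g. user_id=123 or user_id = 123.
    (re.compile(r"(?i)\b(user_id|session_id)\s*=\s*[\w-]+"), r"\1=REDACTED"),
    # Hostnames under an assumed internal domain suffix.
    (re.compile(r"\b[\w.-]+\.internal\b"), "INTERNAL_HOST"),
]


def redact(line: str) -> str:
    """Apply every redaction rule to a single log line."""
    for pattern, replacement in REDACTIONS:
        line = pattern.sub(replacement, line)
    return line
```

The status code, endpoint path, and message text pass through untouched, so the sanitized line still describes the failure.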
Working With Customer Data
Sometimes the data itself is the problem — a specific customer record triggers a bug, a particular order causes a calculation to fail. You need to share something real to diagnose it.
Option 1: Synthetic Reconstruction
Recreate the data with fake values that preserve the structure causing the bug. If a null middle_name breaks your name formatter, create a test record with null — you don't need the actual customer's name.
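The null middle_name case might be reconstructed like this. Everything here is invented for illustration: format_name stands in for your real formatter, and the record holds fake values that keep only the null that matters.

```python
def format_name(first, middle, last):
    # Hypothetical formatter containing the bug: str.join rejects None.
    return " ".join([first, middle, last])


# Synthetic record: made-up names, same null field as the failing customer.
synthetic_record = {"first_name": "Test", "middle_name": None, "last_name": "User"}


def reproduces_bug(record) -> bool:
    """Return True if formatting this record raises the TypeError under study."""
    try:
        format_name(record["first_name"], record["middle_name"], record["last_name"])
    except TypeError:
        return True
    return False
```

If reproduces_bug(synthetic_record) returns True, the synthetic record is enough to share and the real customer's name never needs to leave your environment.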
Option 2: Minimal Reproduction
Isolate the minimal data structure that triggers the problem. Often this reveals what the AI actually needs — the schema and the edge case — rather than a full production record.
Option 3: Local Model for Production Data
If you genuinely need to work with real production data and the data is sensitive enough that it cannot leave your environment, a local model is the right tool. See the Local and Private AI Models guide for setup options.
When to Use a Local Model Instead
A local model runs inference on your own machine, so your prompts and data do not leave it. It's the right choice when:
- You're working under an NDA that restricts sharing code externally
- Your organization's security policy prohibits sending code to external services
- You're working with health data, financial data, or other heavily regulated information
- You need to analyze actual production logs, queries, or payloads
- You want to run AI assistance in an air-gapped or offline environment
Local models have real tradeoffs — they're smaller and less capable than frontier models. But for many coding tasks — explaining a function, suggesting variable names, generating tests for well-defined logic — they're more than adequate.
Building a Team Habit
Individual vigilance is fragile. The more reliable approach is a short team agreement:
- Define a default position — which categories of data are always off-limits for external AI (e.g., production PII, credentials, NDA-covered code)?
- Provide a sanitization pattern — a one-page reference with concrete before/after examples for your specific stack.
- Name the alternative — if someone needs to work with sensitive data, what's the approved path? Local model? Sanitize first? Escalate?
- Add it to onboarding — new developers should hear about this on day one, alongside the git workflow and deployment process.
Add a one-paragraph data sanitization note to your developer handbook and link to it from the section that introduces AI coding tools. Most developers want to do the right thing — they just need to know what that is.
The Short Version
- Scan for credentials before pasting anything — they appear in logs and stack traces, not just config files.
- Replace PII with placeholder values. The AI needs the data shape, not the data itself.
- Business logic and proprietary algorithms under NDA warrant a local model, not a workaround.
- Log files are high-risk — strip auth headers, connection strings, and user IDs before sharing.
- When in doubt: reproduce the problem with synthetic data. If you can't, use a local model.
Related Guides
Local and Private AI Models
When data can't leave your environment: Ollama, LM Studio, and local model selection for privacy-constrained development.
Manual Ch. 11 — Security and Risks
SQL injection, XSS, hardcoded secrets, and the security checklist for AI-generated code.
Claude API for Developers
Building responsibly with the API: data handling, system prompts, and production patterns.