What your developers are sending to their AI tools, and why your legal team would be upset.

Let's get the obvious out of the way: this article was written with the assistance of an AI tool. Specifically, Claude, made by Anthropic, one of the providers whose data handling practices this article is going to examine. The irony is deliberate and instructive. The author of this piece is a staunch critic of the AI industry's trajectory, uses AI tools under protest because the market now demands it, and is using one right now to write an article about the risks of using them. That tension is not unique to this author. It is the professional reality of every security practitioner, every engineering leader, and every developer working in 2026. You are using these tools. Your developers are using them more than you know. The question is not whether they are sending data. It is what they are sending, where it is going, and what happens to it when it gets there.

A note on format. This article uses a Q&A structure in sections where the most useful framing is the question a legal, security, or engineering leader would actually ask walking into this problem. The questions are not rhetorical. They are the questions you should already be asking.

§ 01

What is actually being sent

Before the legal and security implications, establish the factual baseline: what goes into an AI tool prompt, and how much of it is sensitive.

The obvious case is Samsung's semiconductor division in early 2023, where engineers pasted proprietary source code and confidential meeting notes directly into ChatGPT to fix bugs and generate summaries. Samsung's response was to implement restrictions and ultimately ban generative AI tools on company-owned devices and networks. The incident became a textbook example because it was disclosed. Most incidents are not.

The less obvious cases are the ones happening continuously, in every engineering organization running AI coding assistants, without anyone noticing because no single event is dramatic enough to trigger an incident response:

A developer asks an AI assistant to help debug a function. To get a useful answer, they paste the surrounding context — which includes the function signature, the data model it operates on, the API it calls, and the environment variable names it reads from. Some of those environment variables are credentials. None of this is intentional. All of it leaves the developer's machine.

A developer asks an AI assistant to review a pull request. The PR contains business logic that represents years of engineering investment and competitive differentiation. The AI assistant processes it, generates feedback, stores the interaction. The developer gets useful feedback. The proprietary logic is now in a third-party system's logs.

A developer uses an AI assistant to write a test for a function that processes customer PII. The test scaffolding includes representative data — not production data, the developer would never do that — but data realistic enough to be structurally identical to production records. Names, emails, identifiers. All of it goes in the prompt.
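One way to reduce this kind of accidental leakage is to scrub a snippet before it is pasted or sent. The following is a minimal sketch, not a complete detector: the function name `redact_prompt` and the three patterns are illustrative assumptions, and real secret and PII scanners use far larger rule sets.

```python
import re

# Illustrative patterns only; real scanners use far larger rule sets.
PATTERNS = [
    # NAME=value assignments whose variable name suggests a credential
    ("secret_assignment",
     re.compile(r"(?i)\b([A-Z0-9_]*(?:SECRET|TOKEN|PASSWORD|API_KEY)[A-Z0-9_]*)\s*=\s*(\S+)")),
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits
    ("aws_access_key_id", re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
    # Simple email matcher for realistic-looking test data
    ("email", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")),
]

def redact_prompt(text: str) -> tuple[str, list[str]]:
    """Return the redacted text and the labels of the patterns that fired."""
    findings = []
    for label, pattern in PATTERNS:
        def _replace(match, label=label):
            findings.append(label)
            if label == "secret_assignment":
                # Keep the variable name so the snippet stays readable
                return f"{match.group(1)}=[REDACTED]"
            return f"[REDACTED_{label.upper()}]"
        text = pattern.sub(_replace, text)
    return text, findings

snippet = "DB_PASSWORD=hunter2\nowner = jane.doe@example.com\nkey AKIAIOSFODNN7EXAMPLE"
clean, findings = redact_prompt(snippet)
print(clean)
print(findings)
```

The variable names survive while the values do not, which is usually enough context for a debugging question. Pattern matching of this kind misses anything it has no rule for, so it reduces exposure rather than removing it.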

Q: How widespread is this?

GitGuardian's State of Secrets Sprawl 2026 report found 28.65 million new hardcoded secrets in public GitHub commits during 2025 — a 34% year-over-year increase representing the largest single-year jump ever recorded. AI-assisted commits showed a 3.2% secret-leak rate compared to a 1.5% baseline across all public GitHub commits, indicating that AI-generated code roughly doubles the baseline credential exposure rate.

That is not a rounding error. It is a measurable, documented increase in credential exposure associated with AI-assisted development, although the comparison alone does not establish causation. Leaks of AI-service credentials (API keys for LLM providers, embedding services, and AI platforms) increased 81% year-over-year in 2025, reaching over 1.2 million detected leaks.

In 2024, developers pushed over 23 million hardcoded secrets into public GitHub commits. Repositories using tools like Copilot and Claude showed much higher leak rates. The tools that are supposed to help developers write better code are, empirically, helping them leak more secrets.
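The "roughly doubles" comparison between AI-assisted and baseline commits can be checked with two lines of arithmetic. The sketch below uses only the two rates quoted above; the variable names are illustrative.

```python
ai_assisted_rate = 0.032   # 3.2% of AI-assisted commits leaked a secret
baseline_rate = 0.015      # 1.5% across all public GitHub commits

# Ratio of the two rates, and the absolute difference per 10,000 commits
relative_rate = ai_assisted_rate / baseline_rate
extra_leaks_per_10k = (ai_assisted_rate - baseline_rate) * 10_000

print(f"relative rate: {relative_rate:.2f}x")
print(f"additional leaking commits per 10,000: {extra_leaks_per_10k:.0f}")
```

That works out to about 2.13 times the baseline, or roughly 170 additional leaking commits for every 10,000 commits.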

§ 02

Where it goes

Q: When a developer pastes code into an AI tool, where does that data go?

Into the provider's infrastructure, where it is processed to generate a response. What happens after that is where the providers' language becomes carefully ambiguous.

The standard framing is a distinction between using your data to provide the service and using your data to train models. Most enterprise tiers of major AI tools offer contractual guarantees that your data will not be used for training. Most consumer and free tiers do not. The gap between these two positions is where most actual developer usage lives — developers using personal accounts, free tiers, or tools that their organization has not formally evaluated or contracted.

GitHub announced in March 2026 that from April 24 onward, interaction data from Copilot Free, Pro, and Pro+ users — specifically inputs, outputs, code snippets, and associated context — would be used to train and improve AI models unless users opted out. The framing deserves scrutiny: GitHub stated that Copilot does process code from private repositories when users are actively using Copilot, and that this interaction data could be used for model training unless the user opts out.

The distinction between "repository content at rest" and "interaction data generated while working in a private repository" is doing significant legal and technical work in that framing. If a developer asks Copilot to refactor a proprietary algorithm and Copilot processes the input, generates output, and stores the interaction, the proprietary algorithm is eligible for the training pipeline unless the developer has opted out, regardless of what label is applied to the storage category.

The community response was pointed: the announcement drew 232 downvotes in GitHub's community discussion, and the only one of the 39 commenters to support the plan was a GitHub VP. There is a regulatory problem as well. Under EU GDPR standards, consent to data processing must be freely given, specific, informed, and unambiguous, and an opt-out default does not meet that bar.

GitHub's defense was to note that this is industry standard practice. In its FAQs, GitHub noted that Anthropic, JetBrains, and corporate parent Microsoft operate similar opt-out data use policies. This is accurate. It is also not an argument. The industry normalizing a practice that disadvantages users is not evidence that the practice is acceptable — it is evidence that the incentives of the providers are not aligned with the interests of the organizations whose data they are processing.

Q: Does opting out fix this?

Partially, for the training data question. Not at all for the processing question. Your data is processed to generate responses regardless of your training opt-out status. Processing means it traverses the provider's infrastructure, sits in request logs, and is potentially accessible to the provider's systems and staff under their data access policies. The opt-out governs one downstream use of that data. It does not govern the exposure that happens during processing.

For on-premises or private deployment options, the exposure surface shrinks but does not disappear. The model still processes the input. The infrastructure still logs requests. The data still leaves the developer's local environment.

§ 03

The IP question

Q: If a developer pastes proprietary code into an AI tool and that code ends up in the training data, what are the IP implications?

This is the question your legal team will ask, and the honest answer is: currently unresolved, actively litigated, and likely to remain that way for several years.

The IP exposure operates on two vectors.

The first is the outbound vector: your proprietary code, architecture, or business logic is used to train a model that is then used by your competitors. The training data becomes embedded in model weights in a distributed, non-recoverable way. You cannot audit whether your IP is present in a model. You cannot remove it if it is. The contractual protection here depends entirely on whether the provider's terms actually prohibit training on your data, whether those terms are enforceable in your jurisdiction, and whether you can prove that a model output derives from your specific training input — which is technically extremely difficult.

The second is the inbound vector: AI-generated code suggestions carry their own IP provenance questions. Models trained on public code repositories reproduce patterns from that code. The legal status of AI-generated code that derives from copyrighted training data is not settled. GitHub Copilot has faced litigation over exactly this question. If your engineering team is shipping AI-generated code, your organization may be incurring copyright liability without knowing it. Your legal team's exposure here is not hypothetical — it is a function of how much AI-generated code is in your codebase and how carefully its provenance has been tracked, which in most organizations is: a lot, and not at all.

Q: What about confidential information beyond code — architecture diagrams, internal documentation, incident postmortems, customer data?

All of it is subject to the same analysis. The Samsung incident was notable for including meeting notes alongside source code. The meeting notes are arguably more sensitive — they contain strategic intent, competitive analysis, product roadmap information, and personnel details. None of that has a clean IP framework. All of it is confidential. All of it goes into the same processing pipeline when a developer pastes it into a prompt.

The PII dimension is where the regulatory exposure becomes acute. A developer who pastes a database query result into an AI tool to debug a data processing issue may be transmitting customer names, email addresses, phone numbers, or health information to a third-party processor with no data processing agreement (DPA) in place, no consent from the data subjects, and no notification to the privacy team. Under GDPR, CCPA, HIPAA, and most other privacy frameworks, this is not a minor compliance gap. Depending on the data and the jurisdiction, it can be a reportable incident.

§ 04

The secrets problem

The credential leak data above is the quantitative frame. The qualitative reality is worse.

AI coding tools are trained on public codebases filled with hardcoded credentials, poor validation logic, and insecure defaults. When these tools generate suggestions, they often reproduce those same patterns. The tool that is supposed to help your developers write secure code is, in a measurable number of cases, suggesting the same insecure patterns it learned from codebases that already had the problem.

Researchers at Truffle Security scanned Common Crawl's December 2024 archive — covering 2.67 billion web pages — and discovered nearly 12,000 valid live secrets: API keys, passwords, and tokens for AWS, Mailchimp, Slack, GitHub, and more, embedded in HTML, JavaScript, and configuration files. This data is commonly used to train or fine-tune large language models, meaning models may inadvertently learn insecure patterns or even embed these credentials in outputs.

The failure mode is circular: secrets get committed to public repositories, those repositories are ingested into training data, the trained model suggests patterns that include or normalize credential handling that looks like what it learned, developers accept suggestions without scrutinizing the credential handling, more secrets get committed. The tool accelerates the cycle it was supposed to break.

The MCP (Model Context Protocol) ecosystem has added a new dimension to this. GitGuardian found 24,008 unique secrets in MCP-related configuration files on public GitHub, of which 2,117 were still valid credentials. A significant structural contributor is that official MCP quickstart documentation presents API keys hardcoded directly in configuration examples, creating patterns that developers replicate without always recognizing the security implications.

When the official documentation for a new integration standard demonstrates insecure credential handling, the insecure pattern becomes the default. This is not a developer error. It is a systemic failure in how the AI tooling ecosystem is being built and documented.
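To make the pattern concrete, the sketch below checks an MCP-style client configuration for literal secrets in `env` blocks. It assumes the `mcpServers` JSON layout used by several MCP clients. The function name, the server package names, the sample key, and the `${VAR}` placeholder convention are all hypothetical, and whether a given client expands such placeholders should be checked against that client's documentation.

```python
import json
import re

# Treat values like "${NOTES_TOKEN}" as references rather than literals.
PLACEHOLDER = re.compile(r"^\$\{[A-Za-z_][A-Za-z0-9_]*\}$")
# Variable names that suggest the value is a credential
SENSITIVE_NAME = re.compile(r"(?i)(key|token|secret|password)")

def find_hardcoded_env_values(config_text: str) -> list[tuple[str, str]]:
    """Return (server, variable) pairs whose value looks like a literal secret."""
    config = json.loads(config_text)
    hits = []
    for server, spec in config.get("mcpServers", {}).items():
        for name, value in spec.get("env", {}).items():
            if SENSITIVE_NAME.search(name) and not PLACEHOLDER.match(str(value)):
                hits.append((server, name))
    return hits

# Hypothetical configuration: one hardcoded key, one placeholder reference
sample = """
{
  "mcpServers": {
    "search": {"command": "npx", "args": ["-y", "example-search-server"],
               "env": {"SEARCH_API_KEY": "sk-live-0123456789abcdef"}},
    "notes":  {"command": "npx", "args": ["-y", "example-notes-server"],
               "env": {"NOTES_TOKEN": "${NOTES_TOKEN}"}}
  }
}
"""
print(find_hardcoded_env_values(sample))
```

Only the `search` server is flagged, because its key is written into the file. A check like this catches the copy-from-quickstart case before the file reaches a public repository, but it only inspects variable names and cannot tell whether a flagged value is a live credential.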

§ 05

The shadow AI problem

Q: We have an enterprise license for an approved AI tool. Are we covered?

For the developers using the approved tool, under the enterprise terms, for work-related tasks: partially. For everything else: no.

The ease of access to countless free and specialized AI tools encourages employees to bypass sanctioned software. Each of these unvetted platforms has its own data privacy policy, security posture, and vulnerability profile. Security teams have no visibility into what data is being shared, with which platform, or by whom.

Your enterprise license covers the tool you contracted for. It does not cover the tool a developer discovered on a Product Hunt list last week, the AI assistant built into their personal IDE configuration, the browser extension that rewrites their prompts before sending them, or the consumer ChatGPT account they have been using since before your enterprise tool was available. The developer is not acting maliciously. They are using the tool that is most convenient for the task in front of them.

The organizational detection gap is significant. Standard data loss prevention (DLP) tools do not capture AI prompt content. Cloud access security broker (CASB) solutions that monitor web traffic can flag connections to known AI endpoints, but the list of AI endpoints is expanding faster than most security teams can track, and many AI integrations run inside developer tools whose traffic looks like legitimate IDE traffic. As a result, AI interactions often remain invisible to standard security monitoring: logs may not capture the content of prompts and responses, and the system may not label the data as sensitive.
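As a rough illustration of endpoint matching and its limits, the sketch below counts requests per user to a short list of AI API hostnames. The log line format, the function name, and the sample log entries are assumptions for illustration. The host list is deliberately incomplete, which is exactly the maintenance problem described above, and hostname matching reveals nothing about what a prompt contained.

```python
from collections import Counter

# Deliberately incomplete, illustrative list; a real list needs continuous upkeep.
AI_HOSTS = {
    "api.openai.com",
    "api.anthropic.com",
    "generativelanguage.googleapis.com",
}

def count_ai_requests(log_lines):
    """Count requests per (user, host) for hosts in AI_HOSTS.

    Assumes a hypothetical whitespace-separated format:
    timestamp user host bytes_sent
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed lines rather than guess
        _, user, host, _ = parts
        if host.lower() in AI_HOSTS:
            counts[(user, host.lower())] += 1
    return counts

# Hypothetical proxy log entries
logs = [
    "2026-04-01T09:00:00Z alice api.openai.com 18234",
    "2026-04-01T09:00:05Z alice api.openai.com 2201",
    "2026-04-01T09:01:12Z bob internal.example.com 512",
    "2026-04-01T09:02:40Z bob api.anthropic.com 40960",
]
print(count_ai_requests(logs))
```

This answers "who is talking to which known provider, and how often", which is useful for an inventory. It does not answer "what did they send", and it is blind to any provider missing from the list.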

You do not know what your developers are sending. Unless you have built specific instrumentation for this problem, which almost no organization has, you are operating on the assumption that the tools you approved are the tools being used and that the data being sent is appropriately scoped. Both assumptions are wrong with high probability.

§ 06

The provider accountability problem

Q: Can we trust what the providers say about data handling?

You can trust that the current terms reflect the current policy. You cannot trust that the current policy will remain unchanged.

The history of AI provider data policies is a history of terms that expand over time, announced via blog post, with opt-out windows measured in days. GitHub's April 2026 change was announced March 25 and took effect April 24, leaving 30 days to identify the change, evaluate its implications, and make an organizational decision about opt-out configuration across all developer accounts. Organizations that did not have someone monitoring provider policy changes missed that window entirely.

OpenAI, Anthropic, Google, Microsoft, and GitHub have all revised their data use policies since their tools launched. The revisions have not consistently moved in the direction of greater user control. The pattern is: launch with restrictive terms to build enterprise trust, expand data collection as the user base grows and the competitive pressure to improve models increases, frame the expansion as an opt-out rather than an opt-in, cite industry standard practice when challenged.

This is not a conspiracy. It is the predictable output of incentive structures. These companies need training data. Your developers' prompts are training data. The contractual and legal frameworks that would prevent that data from being used are less robust than the marketing materials suggest, and the enforcement mechanisms available to an organization that believes its IP has been misused are limited and expensive.

Q: What should we actually do?

The honest answer is that there is no solution that preserves full AI tool utility while eliminating the data exposure risk. The exposure is structural. The tool's value proposition requires sending data to the tool. The mitigation is risk management, not risk elimination.

The practical controls, in rough order of impact:

Establish what your developers are actually using. Inventory AI tools in use across the organization — approved and unapproved. This means network monitoring, developer surveys, and IDE policy enforcement. You cannot manage what you have not enumerated.

Differentiate by data sensitivity. Not all prompts carry the same risk. A developer asking an AI tool to explain a sorting algorithm is not a meaningful exposure event. A developer pasting a database schema with customer identifiers into an AI tool is. Build guidance that distinguishes between these cases and gives developers clear rules for where the line is, rather than blanket policies that get ignored because they are unenforceable.

Treat credentials in AI-assisted code as a separate risk category. The data on AI-assisted commits doubling baseline secret leak rates is not an argument for banning AI tools. It is an argument for ensuring that secret scanning is running on every commit, that pre-commit hooks are enforced, and that the code review process has not been shortened to the point where AI-generated credential handling patterns escape without scrutiny.

Read the contracts. Your enterprise terms with AI providers contain data processing agreements, data retention policies, and training opt-out provisions. Most organizations have signed these agreements without legal review of the specific provisions that govern what happens to the code their developers send. If your legal team has not reviewed these terms in the last six months, they are operating on outdated information.

Accept that shadow AI is a policy problem, not a technology problem. You will not block your way to compliance. Developers will find routes around network-level AI blocks because the tools are genuinely useful and the incentive to use them is real. The control that works is policy plus culture: clear guidance on what can and cannot be sent, why the rule exists, and consequences for violation — combined with making the approved tools good enough that developers do not feel compelled to route around them.
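Of the controls above, the commit-time secret scan is the most mechanical to illustrate. The sketch below checks only the added lines of a unified diff against a few patterns. The function name, the pattern set, and the sample diff are illustrative assumptions, and a dedicated secret scanner will be far more thorough than three regular expressions.

```python
import re

# Illustrative patterns only; dedicated scanners ship hundreds of detectors.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
    # A credential-like name assigned a quoted literal of 8+ characters
    "generic_assignment": re.compile(
        r"(?i)(?:password|secret|token|api[_-]?key)\w*\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}

def scan_added_lines(diff_text: str) -> list[tuple[str, str]]:
    """Return (pattern_name, line) for each added line in a unified diff that matches."""
    findings = []
    for line in diff_text.splitlines():
        # Added lines start with "+"; "+++" marks the file header, not content.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((name, line[1:].strip()))
    return findings

# Hypothetical staged diff
sample_diff = """\
+++ b/settings.py
+API_KEY = "sk-test-0123456789abcdef"
+DEBUG = True
-OLD_PASSWORD = "not-scanned-because-removed"
"""
print(scan_added_lines(sample_diff))
```

In a pre-commit hook, the diff text could come from `git diff --cached`, with a non-empty result causing the hook to exit non-zero and block the commit. Scanning only added lines keeps the check fast and avoids flagging secrets that the commit is removing, though a removed secret still needs rotating because it remains in history.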

The part where I acknowledge what I said at the beginning

This article was produced with AI assistance. The irony stands.

The position here is not that AI tools should not be used. It is that the industry has systematically obscured what using them costs — not in subscription fees, but in data. Your code, your architecture, your secrets, and your customers' information are the actual product being exchanged when your developers use these tools for free or at consumer pricing. The enterprise tier is an attempt to buy back control over that exchange at a price that reflects its value to the provider.

Whether that exchange is worth it is a business decision. The failure mode is not making the decision consciously — defaulting into tool adoption because the tools are useful and visible and the exposure is invisible, and discovering the cost in a legal proceeding, a breach notification, or a competitive intelligence briefing that quotes your internal architecture back to you.

Your developers are sending things right now. You should know what.