Robert Važan

Practical privacy with LLMs

Most paid cloud LLM APIs come with a guarantee that your data will not be used for training (and thus leaked), but I want to discuss several issues that make the matter more complicated. Local LLMs are of course the perfect, ideal solution, but we need a practical and gradual path forward instead of a false dichotomy between the presently unattainable local LLM ideal on one side and despair and resignation on the other.

The Gemini mess

Let's start with Gemini, because Google made a serious mess of privacy with Gemini. Gemini, like other cloud LLMs, has an API that comes with a promise to not train on your data. Gemini also has a free public app that gathers data for training. But Gemini also has a free API tier that does collect your data.

You might think it is fair for Google to use data on the free API tier, but how do you know you are on the paid tier? API keys for the free and paid tiers look exactly the same. The same API key actually works with both the free and the paid tier. There's no way to configure the client to fail hard when it's not on the paid tier. It will silently keep using the free tier until it leaks all your data.
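
The best workaround I can think of is a manual assertion, since the SDK itself offers no tier check. Below is a minimal sketch in Python using the google-generativeai package; the GEMINI_PAID_TIER_CONFIRMED variable is my own invention, not a Google feature, and it only records that a human checked the billing setup by hand.

```python
# Hypothetical guard: the SDK cannot verify the tier, so the only option is to
# refuse to run until someone manually confirms the key bills to a paid project.
import os

import google.generativeai as genai

def paid_gemini_model(name: str = "gemini-1.5-pro") -> genai.GenerativeModel:
    # GEMINI_PAID_TIER_CONFIRMED is a made-up convention, not a Google setting,
    # and the model name is just an example.
    if os.environ.get("GEMINI_PAID_TIER_CONFIRMED") != "1":
        raise RuntimeError(
            "Refusing to call Gemini: verify in Cloud Console that this key "
            "bills to a paid project, then set GEMINI_PAID_TIER_CONFIRMED=1.")
    genai.configure(api_key=os.environ["GEMINI_API_KEY"])
    return genai.GenerativeModel(name)
```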

Of course, given this risk, you might want to configure your Gemini subscription carefully. But there are more traps there. Google's Cloud Console is seemingly infinitely complicated. Did you associate your Gemini service with the right billing account? Did you associate a payment method with the billing account? How do you know the setup works when it's an invoiced service and you are weeks away from the first payment? Data in Cloud Console reports is delayed by hours. If you don't see your test requests in the reports, how do you know it's just a reporting delay and not a misconfiguration? What if the payment card cannot be charged? Will Google automatically switch your API key to the free tier and harvest all your data?

Google gives you cloud credit when you sign up. Does this credit count as the free tier or the paid tier? Or does it depend on whether you have associated a payment method with your account? Do you have to create a separate billing account without free credit to avoid the free tier and the associated data harvesting?

The legal mess

OpenAI and Anthropic will give you a clear guarantee that data submitted to their APIs is not retained and not used for training. But even here, things are complicated. There is still up to a month of retention "for security reasons". And the privacy guarantee excludes legal obligations. OpenAI was recently ordered by a court to retain all chats without exception as possible evidence in a copyright infringement lawsuit started by some newspaper nobody cares about. Chances are these chats will make their way into the hands of the newspaper's lawyers and analysts, who couldn't care less about the confidentiality of your data. Your data might even end up being presented publicly in court.

All three companies mentioned so far are American. I am 100% sure that all the data you send to their APIs is immediately forwarded straight to the NSA. The American government has the means to coerce cloud companies to do so, and it can also prohibit them from speaking about it. We don't know what the NSA does with our data. It could be used for extortion, it could be used to embarrass you, or it could be secretly handed over to your American competitor.

Finally, even though the marketing message clearly says that data of paying customers won't be used for anything, the actual legal language is far less clear. Legal language covering data privacy is long, full of exceptions, and weirdly specific. The specificity is a problem, because it opens the door to non-obvious gaps in the privacy guarantee. For example, the service terms can prohibit training by the one subsidiary of the company that is party to the service agreement while silently permitting training by other subsidiaries. You wouldn't notice this without a lawyer's help. And you wouldn't know it is happening until some future model happens to cough up your data verbatim.

The scoundrels

Crafting terms of service that let the provider steal your data is of course dirty business. Sometimes it's just lawyers being lawyers and considering only the interests of their client. But I have no doubt that many cloud companies are deliberately poking holes in their service terms in order to steal your data without having to admit it openly. This is especially the case with the smaller SaaS startups that wrap LLM vendors' APIs.

Many of these companies don't even bother to guarantee data privacy. They simply won't provide the service unless you agree to data collection. Or they offer some vague informal privacy guarantees while their terms clearly state that the data is now theirs.

And then there's China. Chinese LLMs are now competitive, especially on price, and I am sure millions of people are switching to Chinese LLM APIs. I would never do that myself. There is zero respect for intellectual property in China. It's a matter of course that they will keep all your data regardless of whether they admit it in their terms or not. They will even trade customer data among themselves. Plus of course, all your data will be sent to the Chinese government to be used for IP theft, for growing Chinese competitors to your business, and for influence operations around the world.

Does it matter for work on opensource projects?

I have a lot of opensource projects. I don't mind sharing that source code with LLMs, because it's already available to everyone. Prompts used during development are not public, but I wouldn't mind publishing them, including whole chat logs, if I can find a reasonable way to do so. So could I use data-sharing LLMs for opensource work?

The problem with this idea is that it's common to frequently transition between opensource and private tasks. There's no easy way to isolate the two from each other. If I were to accept a data-leaking LLM for opensource work, it would inevitably end up being accidentally used for private work too. And once the data is leaked, there's no way back.

What about local LLMs?

Local LLMs offer perfect privacy, but using local LLMs for everything is obviously impractical. The models that run on local hardware are too weak for a lot of tasks, including programming. Buying hardware to run a frontier model locally would be too expensive.

As of mid-2025, what a reasonably priced local LLM setup can run is quite limited.

That's nowhere near what frontier models need. Even if we consider that MoE models save bandwidth and compute, the memory capacity of local hardware is still only a fraction of what frontier models require.

Local hardware will improve, but how long will that take? My guess is that it will take 5-10 years to run something equivalent to Claude Sonnet locally. And that will only happen if there is a big enough market for specialized local LLM hardware. If we were to proceed at the pace of regular PC hardware upgrades, we would get there in 20+ years.

Pragmatic approach

Being a perfectionist about it is not a workable solution. If we cannot give up the productivity gains from LLMs and we cannot justify spending a lot of money on local hardware, then the inevitable choice is to sacrifice some privacy. There is, however, no need to give up all privacy forever.

The easiest thing to substitute is the software layered on top of LLM APIs. There's no need to use closed-source coding assistants or opaque cloud services that are just wrappers around LLM APIs. Opensource can do the same. Lots of people are working on opensource LLM tools. The only companies you then have to share your data with are the LLM vendors themselves.

Even among cloud-based LLMs, you can afford to be picky. I wouldn't touch the Chinese LLM APIs. You can near-shore easier tasks to a local vendor. For a European like me, that means Mistral. I am already doing that to an extent.
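
To illustrate, here is a rough sketch of what such routing could look like: easier tasks go to Mistral, harder ones to a frontier vendor, all through the same opensource client code. It assumes Mistral's API accepts OpenAI-style chat completions; the model names and the easy/hard split are placeholders, not recommendations.

```python
# Minimal task routing sketch: same client code, different vendor per task tier.
import os

from openai import OpenAI

VENDORS = {
    # Easier, less sensitive tasks are near-shored to a regional vendor.
    "easy": ("https://api.mistral.ai/v1", "MISTRAL_API_KEY", "mistral-small-latest"),
    # Harder tasks still go to a frontier vendor with a no-training API guarantee.
    "hard": ("https://api.openai.com/v1", "OPENAI_API_KEY", "gpt-4o"),
}

def complete(tier: str, prompt: str) -> str:
    base_url, key_var, model = VENDORS[tier]
    client = OpenAI(base_url=base_url, api_key=os.environ[key_var])
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```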

There are tasks that already run well locally, specifically FIM (auto-complete in IDEs) and summarization, as long as you have decent hardware. My hardware and software stack is not there yet, so this is one direction in which I will be improving my setup.
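
As an example of what already works, summarization against a local Ollama server boils down to a single HTTP call. The model tag below is just an example; any small instruct model that fits your hardware will do.

```python
# Purely local summarization via Ollama's HTTP API on its default port.
import requests

def summarize_locally(text: str, model: str = "llama3.1:8b") -> str:
    response = requests.post(
        "http://localhost:11434/api/generate",  # Ollama's default local endpoint
        json={
            "model": model,
            "prompt": "Summarize the following text in a few sentences:\n\n" + text,
            "stream": False,  # single JSON response instead of a token stream
        },
        timeout=300)
    response.raise_for_status()
    return response.json()["response"]
```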

And finally, let's not forget that this is all temporary. There is a point of diminishing returns for every type of task. As local and regional LLMs improve, we can gradually transition tasks from frontier models (Claude, Gemini, GPT) to regional models (Mistral) and then to local hardware. In the meantime, some privacy and likely also some intellectual property will be lost, but I treat that as a limited and temporary cost, and frontier LLMs are still worth it after accounting for this hidden cost.