Secrets Management Early: Designing for Production Before You Think You Need It
I used to treat secrets management as a “later” problem until I got burned by it in a multi-tenant AI system.
TL;DR
- I built a multi-tenant API platform used by custom GPTs and realized secrets sprawl becomes a real risk almost immediately.
- I learned that early secrets discipline is cheaper than late-stage cleanup.
- This is for product engineers and founders building SaaS or AI-backed systems that will touch real customer data.
Goal
My goal with this lab was to rethink how I handle secrets from day one when building products that are meant to go to production, even if they start as scrappy experiments.
In my case, I was building a multi-tenant API layer on Vercel that exposes OpenAPI endpoints for custom GPTs. Those GPTs could trigger actions against data lakes, warehouses, and third-party services. That meant API keys, database credentials, signing secrets, and tenant-specific tokens were all in play much earlier than in a typical CRUD app.
Success for me meant three things. First, no secrets in source control, ever. Second, a system where rotating a secret would not require a full redeploy or code change. Third, a mental model and workflow that scaled from “solo builder” to “team with real compliance requirements.”
Context
I came into this with a bias toward speed. Like many builders, I had historically thrown secrets into .env files, wired them into Vercel or another host, and moved on. That works for a while. It even works in production for small projects. But the moment you add multi-tenancy and AI agents that can take actions, the blast radius changes.
In this system, each tenant could have:
- Their own API credentials for third-party services
- Their own data destinations
- Their own GPTs configured to call my APIs
That means secrets are not just “my secrets,” but also “their secrets that I store.” That’s a fundamentally different responsibility.
I also had constraints. I didn’t want to introduce massive operational overhead. I wasn’t ready to run HashiCorp Vault clusters or build a full zero-trust internal platform. I was using Vercel for deployment and managed databases for storage. I needed pragmatic solutions that fit a lean stack.
Prior art I leaned on included:
- Cloud provider secret managers
- 12-factor app methodology
- Security postmortems from public breaches
- Docs from AWS, GCP, and Vercel on secret handling
Approach
My approach evolved from “where do I store secrets?” to “how do secrets flow through the system?” That shift moved my focus from a single storage decision to the whole lifecycle of each secret.
Instead of focusing only on storage, I started mapping:
- Where secrets originate
- Where they are stored
- Where they are used
- Where they might leak
I decided early on that I would separate:
- Platform-level secrets (my infrastructure)
- Tenant-level secrets (customer credentials)
- Ephemeral secrets (tokens, short-lived keys)
I also decided what I would NOT do. I would not build my own encryption scheme. I would not invent a homegrown vault. And I would not rely on developers to “just remember” good practices.
My strategy was to combine:
- Managed secret stores
- Strict environment separation
- Minimal secret surface area in code
- Aggressive rotation policies
Steps
1) Setup
I started by auditing what secrets even existed. That sounds trivial, but it’s surprisingly clarifying.
I listed:
- Database URLs
- JWT signing secrets
- OpenAI or LLM provider keys
- Email/SMS provider tokens
- Internal service-to-service tokens
- Tenant-provided API keys
Then I categorized them by sensitivity and rotation frequency.
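One way to make that audit durable is to keep the inventory in code. The following is a minimal sketch: the secret names, scopes, sensitivity levels, and rotation intervals are illustrative assumptions, and `overdueForRotation` is a hypothetical helper.

```javascript
// Hypothetical inventory: every name, level, and interval here is an
// illustrative assumption, not a recommendation.
const SECRET_INVENTORY = [
  { name: "DATABASE_URL", scope: "platform", sensitivity: "high", rotateDays: 90 },
  { name: "JWT_SIGNING_SECRET", scope: "platform", sensitivity: "high", rotateDays: 30 },
  { name: "LLM_PROVIDER_KEY", scope: "platform", sensitivity: "medium", rotateDays: 90 },
  { name: "tenant_api_key", scope: "tenant", sensitivity: "high", rotateDays: 180 },
  { name: "service_token", scope: "ephemeral", sensitivity: "medium", rotateDays: 1 },
];

// Returns the names of secrets whose age exceeds their rotation interval.
// `lastRotated` maps secret name -> Date of the last rotation; a secret
// with no recorded rotation is treated as overdue.
function overdueForRotation(inventory, lastRotated, now = new Date()) {
  const msPerDay = 24 * 60 * 60 * 1000;
  return inventory
    .filter((secret) => {
      const rotatedAt = lastRotated[secret.name];
      if (!rotatedAt) return true;
      return (now - rotatedAt) / msPerDay > secret.rotateDays;
    })
    .map((secret) => secret.name);
}
```

Even a list this small makes it obvious which secrets have no owner or no rotation plan.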
For infrastructure, I used the hosting provider’s environment variable system as a baseline. On Vercel, that meant encrypted env vars scoped per environment (dev, preview, prod).
Checklist I followed:
- No secrets in repo
- .env in .gitignore
- Separate dev/prod credentials
- Principle of least privilege for API keys
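The "separate dev/prod credentials" item can be enforced with a small startup guard. This is a sketch under assumptions: it relies on Vercel's `VERCEL_ENV` system variable being exposed to the function, while `PROD_DB_HOST` and `checkEnvSeparation` are hypothetical names introduced only for illustration.

```javascript
// Returns a list of problems; an empty list means the check passed.
// VERCEL_ENV is set by Vercel; PROD_DB_HOST is a hypothetical variable
// holding the production database hostname (a hostname, not a secret).
function checkEnvSeparation(env) {
  const stage = env.VERCEL_ENV || "development";
  const problems = [];
  if (stage === "production" || !env.PROD_DB_HOST || !env.DATABASE_URL) {
    return problems;
  }
  let host;
  try {
    host = new URL(env.DATABASE_URL).hostname;
  } catch {
    problems.push("DATABASE_URL is not a valid URL");
    return problems;
  }
  if (host === env.PROD_DB_HOST) {
    // Report the stage only, never the connection string itself.
    problems.push(`${stage} environment is pointing at the production database`);
  }
  return problems;
}
```

A deployment could call this at startup with `process.env` and refuse to boot if the list is non-empty.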
2) Implementation
The first real change I made was to remove secrets from local config files that were shared across the team. Instead, each developer had their own local .env.local that never left their machine.
Then I centralized secret access behind small utility modules. Instead of calling process.env.X everywhere, I created a config layer. That let me validate required secrets at startup and fail fast.
That validation ran at startup: if a required secret was missing, the app refused to start. This prevented half-configured deployments from limping along in unsafe states.
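A config layer like this can be very small. The sketch below is one possible shape, not the exact module from this project; the variable names in `REQUIRED` and the `loadConfig` function are assumptions for illustration.

```javascript
// Names of required variables are assumptions for illustration.
const REQUIRED = ["DATABASE_URL", "JWT_SIGNING_SECRET", "LLM_PROVIDER_KEY"];

// Reads every required secret in one place and fails fast if any is
// missing or blank. The rest of the code imports the returned object
// instead of touching process.env directly.
function loadConfig(env = process.env) {
  const missing = REQUIRED.filter(
    (name) => typeof env[name] !== "string" || env[name].trim() === ""
  );
  if (missing.length > 0) {
    // Report names only, never values.
    throw new Error(`Missing required secrets: ${missing.join(", ")}`);
  }
  return Object.freeze({
    databaseUrl: env.DATABASE_URL,
    jwtSigningSecret: env.JWT_SIGNING_SECRET,
    llmProviderKey: env.LLM_PROVIDER_KEY,
  });
}
```

Calling `loadConfig()` once at module load means a misconfigured deployment throws before it serves a request. Passing `env` as a parameter also keeps the function easy to test.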
For tenant secrets, I stored them encrypted at rest in the database using managed encryption features and strict access controls. I made sure they were never logged, never returned to clients, and only decrypted in narrow execution paths.
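To keep decrypted tenant values from leaking through logs or API responses, one option is to wrap them as soon as they are decrypted. This is a sketch of that idea; `SealedSecret` is a hypothetical class, and the decryption itself is assumed to happen elsewhere via the database's managed encryption.

```javascript
// Wraps a decrypted value so that accidental string conversion, JSON
// serialization, or console.log prints a placeholder instead of the value.
class SealedSecret {
  #value; // private field: not reachable from outside the class

  constructor(value) {
    this.#value = value;
  }

  // The only way to read the plaintext. Call it at the point of use.
  reveal() {
    return this.#value;
  }

  toString() {
    return "[REDACTED]";
  }

  toJSON() {
    return "[REDACTED]";
  }

  // Node's console.log and util.inspect honor this well-known symbol.
  [Symbol.for("nodejs.util.inspect.custom")]() {
    return "[REDACTED]";
  }
}
```

With this wrapper, the plaintext appears only where `reveal()` is called, which makes the narrow execution paths easy to find in code review.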
One big lesson: logging is a major leak vector. I added redaction logic so known secret fields were masked automatically.
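Redaction by field name can be sketched in a few lines. The field list and the `redact` and `logSafe` helpers below are assumptions for illustration; a real system would tune the list to its own payloads.

```javascript
// Field names treated as secret. Compared case-insensitively.
const SECRET_FIELDS = new Set([
  "password", "token", "apikey", "api_key", "secret", "authorization",
]);

const isPlainObject = (v) =>
  v !== null &&
  typeof v === "object" &&
  [Object.prototype, null].includes(Object.getPrototypeOf(v));

// Returns a deep copy with known secret fields masked. Assumes JSON-like,
// non-circular input; other object types are passed through unchanged.
function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (!isPlainObject(value)) return value;
  const out = {};
  for (const [key, inner] of Object.entries(value)) {
    out[key] = SECRET_FIELDS.has(key.toLowerCase()) ? "[REDACTED]" : redact(inner);
  }
  return out;
}

// Structured log helper that always redacts before writing.
function logSafe(message, fields = {}) {
  console.log(JSON.stringify({ message, ...redact(fields) }));
}
```

Name-based masking only catches fields it knows about, so it complements, rather than replaces, keeping secrets out of log calls in the first place.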
3) Validation
I validated my setup by simulating failure and compromise scenarios.
For example, I would:
- Rotate a key and confirm the system still worked
- Remove a secret and ensure startup failed loudly
- Scan logs for accidental exposures
Example command to check that a variable is set and non-empty, without printing its value:
node -e "console.log(!!process.env.DATABASE_URL)"
Expected output when the variable is set:
true
I also used secret scanning tools in CI to catch accidental commits containing tokens or keys.
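Both log scanning and CI scanning come down to pattern matching on text. The sketch below shows the idea with three illustrative rules; it is not a substitute for a dedicated scanner, and `TOKEN_PATTERNS` and `findSuspectedSecrets` are hypothetical names.

```javascript
// Illustrative patterns only; real scanners ship far larger rule sets.
const TOKEN_PATTERNS = [
  { name: "aws-access-key-id", regex: /\bAKIA[0-9A-Z]{16}\b/ },
  { name: "sk-prefixed-key", regex: /\bsk-[A-Za-z0-9_-]{20,}/ },
  { name: "private-key-block", regex: /-----BEGIN [A-Z ]*PRIVATE KEY-----/ },
];

// Returns one finding per (line, rule) match. Only the line number and
// rule name are reported, so the report never repeats the secret itself.
function findSuspectedSecrets(text) {
  const findings = [];
  text.split("\n").forEach((line, index) => {
    for (const { name, regex } of TOKEN_PATTERNS) {
      if (regex.test(line)) findings.push({ line: index + 1, rule: name });
    }
  });
  return findings;
}
```

Running a check like this over a log export or a diff gives a quick signal. Expect false positives and misses with rules this simple.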
Results
What worked:
- Centralized config modules reduced mistakes
- Managed secret stores removed operational burden
- Early discipline saved refactor time later
What didn’t:
- Over-abstracting secrets too early slowed iteration
- Some dev friction from stricter rules
Gotchas / Notes
One edge case I hit was that background jobs and serverless functions did not always see the same environment variables. I had to standardize how secrets were injected into both.
Tradeoff-wise, secret managers can add latency or cost. For many paths, caching secrets in memory after retrieval was a good balance.
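In-memory caching with an expiry can look like the sketch below. `fetchSecret` stands in for whatever secret manager client is in use and is an assumption, not a real library call. The five-minute default TTL is also just an example value.

```javascript
// Wraps a slow secret lookup with a per-process, time-limited cache.
// `fetchSecret(name)` must return a Promise of the secret value.
// `now` is injectable so the expiry logic can be tested without waiting.
function createSecretCache(fetchSecret, { ttlMs = 5 * 60 * 1000, now = Date.now } = {}) {
  const cache = new Map(); // name -> { value, expiresAt }

  return async function getSecret(name) {
    const hit = cache.get(name);
    if (hit && hit.expiresAt > now()) return hit.value;

    // Miss or expired entry: go back to the source and refresh the cache.
    const value = await fetchSecret(name);
    cache.set(name, { value, expiresAt: now() + ttlMs });
    return value;
  };
}
```

The TTL is the tradeoff knob. A longer TTL means fewer calls to the secret manager, while a shorter one means a rotated secret is picked up sooner without a redeploy.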
Another note: humans are the weak link. Tooling helps, but culture matters. I documented patterns and made them default.
Next
Next steps for me include:
- Automated rotation pipelines
- Short-lived credentials everywhere possible
- Deeper audit logging on secret access