“The best way to find out if you can trust somebody is to trust them.” - Ernest Hemingway. Sometimes in DevOps, the best way to find configuration issues is to trust the system will tell you exactly what’s wrong.
Troubleshooting Journey
This troubleshooting session demonstrates a systematic approach to resolving complex Kubernetes certificate management issues. The problem started with a simple error message but revealed multiple configuration mismatches that required careful investigation and coordinated fixes.
📋 Complete Thinking Logs & Command History
```
> Think harder, there must be naming issue, the webapp-gateway deployed in istio-system namespace, couldn't find secret named istio-system/cuizhanming-com-tls for TLS.

✻ Initial Analysis
  - webapp-gateway in istio-system namespace can't find secret "istio-system/cuizhanming-com-tls"
  - Could be: secret doesn't exist, wrong name, wrong namespace, or gateway misconfiguration
  - Need to investigate cluster state and cert-manager configuration

⏺ kubectl get secrets -n istio-system
  ⎿ Found: cuizhanming-com-cert-nmpzb (Opaque type) - not the expected TLS secret
    Missing: cuizhanming-com-tls (kubernetes.io/tls type)

✻ Root Cause Discovery
  - No kubernetes.io/tls type secrets in istio-system namespace
  - cert-manager created temporary Opaque secret instead of final TLS secret
  - Certificate resource exists but stuck in "False" ready state

⏺ kubectl describe certificate cuizhanming-com-cert -n istio-system
  ⎿ Certificate in "Issuing" state with ACME challenges pending

⏺ kubectl describe challenge cuizhanming-com-cert-1-3026261606-2464681531 -n istio-system
  ⎿ RBAC Error: "linode.acme.slicen.com is forbidden: User 'system:serviceaccount:cert-manager:cert-manager' cannot create resource 'linode' in API group 'acme.slicen.com'"

✻ Configuration Mismatch Analysis
  - ClusterRole permissions: *.acme.slicen.me
  - ClusterIssuer configuration: acme.slicen.com
  - APIService registration: v1alpha1.acme.slicen.me
  - Domain extensions don't match (.com vs .me)

✻ Solution Strategy
  - Option 1: Fix APIService to use .com (managed by Helm, would get overwritten)
  - Option 2: Fix ClusterIssuer template to use .me (better long-term solution)
  - Choose Option 2: Update local template to match webhook

⏺ Edit kubernetes/helm/cert-manager-config/templates/letsencrypt-production.yaml
  ⎿ Changed: groupName: acme.slicen.com → groupName: acme.slicen.me

⏺ kubectl patch clusterrole cert-manager-webhook-linode:domain-solver
⏺ helm upgrade cert-manager-config ./kubernetes/helm/cert-manager-config -n cert-manager
⏺ helm upgrade istio-gateway-api ./kubernetes/helm/istio-gateway-api -n istio-system

✻ Additional Secret Issues Discovered
  - Webhook expected: linode-credentials secret with "token" key
  - Configured: linode-api-token secret with "api-token" key
  - Created workaround secret with expected format

⏺ kubectl create secret generic linode-credentials -n cert-manager --from-literal=token=...
⏺ kubectl delete pod -n cert-manager -l app=cert-manager-webhook-linode

✅ Final Success
  ⎿ DNS-01 challenges now working: "Waiting for DNS-01 challenge propagation"
    System functioning normally, certificates will be issued after DNS propagation
```

**Key Commands Used:**
- `kubectl get secrets -n istio-system --field-selector type=kubernetes.io/tls`
- `kubectl describe certificate cuizhanming-com-cert -n istio-system`
- `kubectl get challenges -n istio-system`
- `kubectl describe clusterrole cert-manager-webhook-linode:domain-solver`
- `kubectl get apiservice | grep linode`
- `helm upgrade cert-manager-config ./kubernetes/helm/cert-manager-config -n cert-manager`

Initial Problem Investigation
Problem Statement: webapp-gateway deployed in the istio-system namespace couldn’t find the secret named `istio-system/cuizhanming-com-tls` for TLS.
First Discovery: Found secret `cuizhanming-com-cert-nmpzb` (Opaque type) instead of the expected TLS secret
- No `kubernetes.io/tls` type secrets existed in the istio-system namespace
- Gateway configuration expected `cuizhanming-com-tls` but found different naming
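The check behind this discovery can be sketched as a small shell helper. The `--field-selector type=kubernetes.io/tls` filter comes from the session's command list; the function name `list_tls_secrets` is a hypothetical wrapper added here for illustration.

```shell
# List only TLS-type secrets in a namespace (defaults to istio-system).
# list_tls_secrets is a hypothetical helper name, not a kubectl feature.
list_tls_secrets() {
  ns="${1:-istio-system}"
  kubectl get secrets -n "$ns" --field-selector type=kubernetes.io/tls
}

# Usage (requires cluster access):
#   list_tls_secrets istio-system
```

An empty result here, alongside an Opaque secret in the unfiltered listing, is the pattern described above.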
Gateway Analysis: Examining the webapp-gateway configuration revealed the exact issue
- Status message: “invalid certificate reference /Secret/cuizhanming-com-tls., secret istio-system/cuizhanming-com-tls not found”
- Confirmed naming mismatch between expected and actual secrets
Certificate Management Investigation
cert-manager Discovery: Found Certificate resource `cuizhanming-com-cert` in “False” ready state
- Certificate was configured to create the `cuizhanming-com-tls` secret (correct name)
- Certificate stuck in “Issuing” state with temporary secret `cuizhanming-com-cert-nmpzb`
ACME Challenge Analysis: CertificateRequest approved but not ready
- Challenges were pending: `dev.cuizhanming.com` and `api.cuizhanming.com`
- Root cause identified: RBAC permission errors
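The drill-down from Certificate to Challenge can be sketched as one helper. The `describe certificate` and `get challenges` commands are taken from the session; the `certificaterequests` listing and the function name `trace_certificate` are additions for illustration.

```shell
# Walk the cert-manager resource chain for one certificate:
# Certificate -> CertificateRequest -> Challenge.
# trace_certificate is a hypothetical helper name.
trace_certificate() {
  cert="$1"
  ns="${2:-istio-system}"
  kubectl describe certificate "$cert" -n "$ns"
  kubectl get certificaterequests -n "$ns"
  kubectl get challenges -n "$ns"
}

# Usage (requires cluster access):
#   trace_certificate cuizhanming-com-cert istio-system
```

Once a pending challenge name is known, `kubectl describe challenge <name> -n istio-system` shows the reason it is stuck, which is where the RBAC error surfaced in this session.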
Root Cause Analysis
RBAC Permission Issue: Challenge failing with forbidden error
"linode.acme.slicen.com is forbidden: User 'system:serviceaccount:cert-manager:cert-manager'
cannot create resource 'linode' in API group 'acme.slicen.com' at the cluster scope"
Domain Mismatch Discovery: The ClusterRole granted permissions for a different API group than the one the challenge requested
- ClusterRole: permissions for `*.acme.slicen.me`
- ClusterIssuer: configured for `acme.slicen.com`
- APIService: registered as `v1alpha1.acme.slicen.me`
Configuration Inconsistency: The three components did not agree on the group name
- APIService: `acme.slicen.me` (from webhook Helm chart)
- ClusterIssuer: `acme.slicen.com` (from local template)
- ClusterRole: `acme.slicen.me` (after investigation)
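The comparison above is simple enough to script. The following is a minimal sketch, assuming the three group names have already been read out by hand (for example from `kubectl get apiservice | grep linode` and `kubectl describe clusterrole cert-manager-webhook-linode:domain-solver`); `check_group_names` is a hypothetical helper name.

```shell
# Report whether the three components agree on the webhook API group.
# Arguments: group name seen in the APIService, ClusterIssuer, and ClusterRole.
# check_group_names is a hypothetical helper; it returns non-zero on mismatch.
check_group_names() {
  apiservice="$1"
  issuer="$2"
  role="$3"
  if [ "$apiservice" = "$issuer" ] && [ "$issuer" = "$role" ]; then
    echo "consistent: $apiservice"
  else
    echo "mismatch: APIService=$apiservice ClusterIssuer=$issuer ClusterRole=$role"
    return 1
  fi
}

# Usage with the values found in this session:
#   check_group_names acme.slicen.me acme.slicen.com acme.slicen.me
```

With the session's values the helper reports a mismatch and makes the odd one out (the ClusterIssuer) easy to spot.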
Solution Implementation
Strategy Decision: Fix the ClusterIssuer template to match the webhook APIService domain
- Updated `kubernetes/helm/cert-manager-config/templates/letsencrypt-production.yaml`
- Changed `groupName: acme.slicen.com` → `groupName: acme.slicen.me`
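In the session the template was edited by hand. An equivalent scripted change might look like the sketch below; `fix_group_name` is a hypothetical helper that prints the corrected template to stdout so the result can be reviewed before the original file is overwritten.

```shell
# Rewrite the solver groupName in a ClusterIssuer template so it matches
# the group the webhook registers. Prints the result; the input file is
# left untouched. fix_group_name is a hypothetical helper name.
fix_group_name() {
  sed 's/groupName: acme\.slicen\.com/groupName: acme.slicen.me/' "$1"
}

# Usage:
#   fix_group_name kubernetes/helm/cert-manager-config/templates/letsencrypt-production.yaml
```

Printing rather than editing in place keeps the change reviewable, which matters for a file that Helm will render and apply cluster-wide.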
Coordinated Fixes Applied:
- Patched ClusterRole permissions to use correct domain
- Upgraded cert-manager-config Helm chart with corrected template
- Recreated certificate by upgrading istio-gateway-api Helm chart
- Resolved webhook secret naming issues
Secret Management Issues: Webhook expected a different secret name and key
- Expected: `linode-credentials` secret with `token` key
- Configured: `linode-api-token` secret with `api-token` key
- Solution: Created a properly formatted secret and restarted the webhook
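The workaround can be sketched as a single helper built from the two commands in the session log. Reading the token from a `LINODE_API_TOKEN` environment variable and the function name `create_webhook_secret` are assumptions made for this example.

```shell
# Create the secret in the shape the webhook expects, then restart the
# webhook pod so it picks the secret up.
# create_webhook_secret is a hypothetical helper; LINODE_API_TOKEN is
# assumed to be exported by the caller.
create_webhook_secret() {
  : "${LINODE_API_TOKEN:?LINODE_API_TOKEN must be set}"
  kubectl create secret generic linode-credentials \
    -n cert-manager --from-literal=token="$LINODE_API_TOKEN"
  kubectl delete pod -n cert-manager -l app=cert-manager-webhook-linode
}

# Usage (requires cluster access):
#   LINODE_API_TOKEN=... create_webhook_secret
```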
Verification and Success
Final Status: DNS-01 challenges working correctly
cuizhanming-com-cert-1-3026261606-1038090168 pending dev.cuizhanming.com Waiting for DNS-01 challenge propagation
cuizhanming-com-cert-1-3026261606-2824149092 pending api.cuizhanming.com Waiting for DNS-01 challenge propagation
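To keep an eye on progress, the challenge listing can be filtered down to the domains still waiting. This is a minimal sketch that assumes the column order shown above (name, state, domain, reason); `pending_challenges` is a hypothetical helper name.

```shell
# Read challenge rows on stdin and print the domain of each pending one.
# Assumes whitespace-separated columns: NAME STATE DOMAIN REASON...
# pending_challenges is a hypothetical helper name.
pending_challenges() {
  awk '$2 == "pending" { print $3 }'
}

# Usage (requires cluster access):
#   kubectl get challenges -n istio-system --no-headers | pending_challenges
```

When the helper prints nothing, no challenge is left in the pending state and the Certificate can be re-checked for readiness.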
Pipeline Recommendation: Linode API token should be stored as:
- Namespace: `cert-manager`
- Secret Name: `linode-api-token`
- Key: `api-token`
- Command: `kubectl create secret generic linode-api-token --from-literal=api-token="$LINODE_API_TOKEN" -n cert-manager`
Exploring the topic through three questions: what, why, and how
What
- webapp-gateway in istio-system namespace couldn't find cuizhanming-com-tls secret for TLS
- cert-manager Certificate resource was stuck in "False" ready state
- ACME challenges were failing with RBAC permission errors
Why
- Domain mismatch: ClusterIssuer used acme.slicen.com but webhook APIService was registered for acme.slicen.me
- Secret naming conflict: Webhook expected linode-credentials but ClusterIssuer referenced linode-api-token
- Key naming issue: Webhook looked for "token" key but secret contained "api-token" key
How
- Fixed ClusterIssuer template to use correct domain (acme.slicen.me)
- Updated Helm releases: cert-manager-config and istio-gateway-api charts
- Created proper secret with expected name and key format in cert-manager namespace
- Restarted webhook pod to refresh configuration and pick up new secrets