
Migrating Applications from Rancher to vSphere Kubernetes Service on VMware Cloud Foundation

by Ken Rider on May 1, 2026 in Kubernetes, Platform Engineering, vSphere

A practitioner’s guide to planning, executing, and validating a production-grade platform migration

Executive Summary

As organizations accelerate their private-cloud consolidation strategies, many are evaluating whether to remain on SUSE Rancher for Kubernetes orchestration or move to VMware’s first-party Kubernetes offering — vSphere Kubernetes Service (VKS) — delivered through VMware Cloud Foundation (VCF). The decision is rarely straightforward, but the migration path itself, once chosen, follows a repeatable pattern that this article documents end-to-end.

This post walks through the key phases of a Rancher-to-VKS migration: pre-migration assessment, cluster provisioning on VCF, workload portability, traffic cutover, and post-migration hardening. It draws on production experience across multi-cluster, multi-subscription environments running enterprise workloads with strict compliance requirements.

This guide assumes VCF 5.x with VKS (formerly Tanzu Kubernetes Grid on vSphere) and Rancher v2.7+. Concepts apply broadly, but specific CLI flags and API versions should be validated against your installed releases.

Why Teams Are Moving from Rancher to VKS

Rancher has been a popular choice for managing heterogeneous Kubernetes fleets, but several converging factors are driving enterprise teams to consolidate onto VKS within a VCF stack:

Single-vendor operational model:

VCF bundles vSphere, vSAN, NSX, and VKS into a single engineered system with unified lifecycle management via SDDC Manager. For teams already invested in VMware licensing, consolidating the Kubernetes control plane eliminates a separate Rancher management layer.

NSX network integration:

VKS clusters natively consume NSX for pod networking, load balancing (NSX or Avi), and micro-segmentation policies — capabilities that require third-party plugins or manual CNI configuration under Rancher.

vSAN storage classes out of the box:

StorageClasses backed by vSAN policies are provisioned automatically, eliminating the need to manage external CSI drivers for on-premises persistent volumes.

Integrated identity and RBAC:

VKS plugs directly into vCenter SSO and Active Directory, enabling consistent identity governance across virtual machines and Kubernetes workloads without maintaining a separate Rancher authentication backend.

Support lifecycle alignment:

VMware (now Broadcom) provides a single support SLA for VCF + VKS, which simplifies vendor engagement for regulated industries.

Phase 1: Pre-Migration Assessment

A thorough assessment prevents surprises mid-migration. Invest time here — it directly determines the scope of workload changes required.

1.1  Cluster and Workload Inventory

Begin by generating a complete inventory of your Rancher-managed clusters:

  • Run kubectl get nodes -o wide across every downstream cluster to document node counts, OS images, and Kubernetes versions.
  • Export all Namespaces, Deployments, StatefulSets, DaemonSets, ConfigMaps, Secrets, and CRDs using Velero backups (velero backup create plus velero backup describe --details) or a kubectl -o yaml dump; note that kubectl get all omits ConfigMaps, Secrets, and CRDs, so list those resource types explicitly. A collection sketch follows this list.
  • Identify workloads using Rancher-specific features: Project Network Isolation, Rancher monitoring (Prometheus Operator deployed by Rancher), Rancher Pipelines (deprecated), Fleet GitOps, and cluster-level PodSecurityPolicies (removed in Kubernetes v1.25, so they must be re-expressed as Pod Security Admission or policy-engine rules on the target).
  • Flag any use of Longhorn for persistent storage — this will require data migration to vSAN-backed PVCs.
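
A short shell loop can collect this inventory consistently across clusters. This is a minimal sketch, assuming each downstream cluster is reachable as a kubeconfig context; the context names are placeholders:

# Minimal inventory sketch: iterate over downstream cluster contexts (names are placeholders)
for ctx in rancher-prod-01 rancher-prod-02 rancher-staging-01; do
  mkdir -p "inventory/${ctx}"
  # Node counts, OS images, and Kubernetes versions
  kubectl --context "${ctx}" get nodes -o wide > "inventory/${ctx}/nodes.txt"
  # Full resource dump for offline review (Secrets included; store the output securely)
  kubectl --context "${ctx}" get namespaces,deployments,statefulsets,daemonsets,configmaps,secrets -A -o yaml \
    > "inventory/${ctx}/workloads.yaml"
  # CRDs are not covered by the dump above and are listed separately
  kubectl --context "${ctx}" get crds -o name > "inventory/${ctx}/crds.txt"
done
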
1.2  Dependency Mapping

Map every external dependency each workload has outside the Kubernetes cluster boundary:

  • Ingress endpoints — hostnames, TLS certificates, upstream load balancer rules.
  • Egress rules — firewall policies, private DNS zones, service endpoints that reference cluster node IPs or ClusterIPs.
  • External datastores — databases, message queues, blob storage accounts.
  • Service mesh policies — if Istio or Linkerd is deployed, export VirtualServices, DestinationRules, and PeerAuthentication objects.
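
Much of this mapping can be seeded from the clusters themselves. A minimal discovery sketch, assuming kubectl access to each source cluster; the output still needs to be reconciled against firewall rules and DNS records by hand:

# List every Ingress host and its TLS secret per namespace
kubectl get ingress -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.rules[*].host}{"\t"}{.spec.tls[*].secretName}{"\n"}{end}'

# List Services that expose external addresses (candidates for hard-coded endpoints elsewhere)
kubectl get svc -A -o wide | grep LoadBalancer

# Export service mesh traffic policies if Istio is installed
kubectl get virtualservices,destinationrules,peerauthentications -A -o yaml > mesh-policies.yaml
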
1.3  Gap Analysis: Rancher Features vs. VKS Equivalents

Rancher Feature | VKS / VCF Equivalent | Migration Effort
Fleet GitOps | Argo CD / Tanzu Mission Control Continuous Delivery | Low — reuse existing GitOps manifests
Rancher Monitoring | VCF Operations and Logging | Medium — re-deploy monitoring stack
Rancher Pipelines | GitHub Actions / Tekton / Azure DevOps | Low — pipelines are already external in modern stacks
Project Network Isolation | NSX Micro-segmentation / NetworkPolicy | Medium — translate policies to NSX constructs
Longhorn PVCs | vSAN StorageClass PVCs | High — data migration required
Rancher Auth (GitHub / LDAP) | vCenter SSO + AD group sync | Medium — RBAC role mapping required
RancherD / k3s edge | Not applicable (VKS is vSphere-native) | Scope exclusion — migrate to VCF Edge
Cluster Templates | VKS TanzuKubernetesCluster CRD / Supervisor cluster | Low — re-template in YAML

 

Phase 2: Prepare the VCF / VKS Target Environment

2.1  Supervisor Cluster and Namespace Configuration

VKS workload clusters are provisioned through the vSphere Supervisor — the Kubernetes control plane built into the vSphere cluster and managed through vCenter, which exposes Kubernetes APIs for cluster lifecycle management.

  1. Enable the Workload Management feature in vCenter under the target vSphere cluster. Assign a storage policy backed by your vSAN datastore to serve as the default for supervisor namespaces.
  2. Create a vSphere Namespace per application team or environment tier (dev / staging / prod). Apply resource quotas (CPU, memory, storage) and storage policy bindings at the namespace level.
  3. Configure the NSX Load Balancer (Avi/NSX ALB) integration so that Kubernetes Services of type LoadBalancer receive routable VIPs automatically.
  4. Bind vCenter SSO groups to namespace roles using the vSphere Namespace permissions UI or the Supervisor RBAC API.
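
Before authoring cluster manifests, it helps to confirm what the Supervisor namespace actually exposes. A quick verification sketch, assuming the kubectl vsphere plugin is installed; the server address, username, and namespace name are placeholders:

# Log in to the Supervisor control plane (server address and username are placeholders)
kubectl vsphere login --server=https://supervisor.example.internal --vsphere-username admin@vsphere.local
kubectl config use-context ns-production

# Confirm what the namespace can consume before writing cluster manifests
kubectl get virtualmachineclasses        # VM classes (must also be associated with the namespace in vCenter)
kubectl get storageclasses               # should include the vSAN-backed policy you assigned
kubectl get tanzukubernetesreleases      # Kubernetes versions available for workload clusters
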
2.2  Provision TanzuKubernetesClusters

Each application workload cluster is described as a TanzuKubernetesCluster (TKC) custom resource applied to the Supervisor. Below is a representative manifest pattern:

apiVersion: run.tanzu.vmware.com/v1alpha3
kind: TanzuKubernetesCluster
metadata:
  name: prod-cluster-01
  namespace: ns-production
spec:
  topology:
    controlPlane:
      replicas: 3
      vmClass: best-effort-medium
      storageClass: vsan-default
    nodePools:
      - name: worker-pool-01
        replicas: 5
        vmClass: best-effort-xlarge
        storageClass: vsan-default
  settings:
    network:
      cni:
        name: antrea
      services:
        cidrBlocks: ["198.51.100.0/22"]
      pods:
        cidrBlocks: ["192.168.0.0/16"]
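
Applying the manifest and retrieving credentials for the new cluster follows the standard Supervisor workflow. A minimal sketch, reusing the placeholder names from the manifest above and assuming it is saved as prod-cluster-01.yaml:

# Create the cluster through the Supervisor and watch provisioning
kubectl apply -f prod-cluster-01.yaml
kubectl -n ns-production get tanzukubernetescluster prod-cluster-01 -w

# Once the control plane is ready, log in to obtain a kubeconfig context for the new workload cluster
kubectl vsphere login --server=https://supervisor.example.internal \
  --vsphere-username admin@vsphere.local \
  --tanzu-kubernetes-cluster-namespace ns-production \
  --tanzu-kubernetes-cluster-name prod-cluster-01
kubectl config use-context prod-cluster-01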

2.3  Integrate with Your GitOps and CI/CD Toolchain

Register the new TKC kubeconfig in your Argo CD or Flux management cluster just as you would any other downstream cluster. If you were using Rancher Fleet, the same manifest repositories can be consumed by Argo CD ApplicationSets with minimal changes — the GitOps layer is cluster-agnostic.

For CI/CD pipelines (GitHub Actions, Azure DevOps, Tekton), update kubeconfig secrets to reference the new cluster API endpoints. Federated identity (OIDC / Workload Identity) is recommended over long-lived service account tokens.
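
For Argo CD, registration is the same one-command step used for any other cluster. A sketch assuming the argocd CLI is logged in to your Argo CD instance and the workload cluster context from the previous step exists; the friendly name is a placeholder:

# Register the VKS workload cluster with Argo CD under a friendly name
argocd cluster add prod-cluster-01 --name vks-prod-cluster-01
argocd cluster list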

 

Phase 3: Workload Migration

3.1  Stateless Workloads

Stateless services (Deployments without PVCs) are the simplest to migrate and should be tackled first to build team confidence and validate the pipeline before touching stateful workloads.

  • Export manifests from Rancher clusters using kubectl get <resource> -n <namespace> -o yaml and strip cluster-specific annotations (cattle.io/*, rancher.io/*) using a kustomize patch or sed pipeline; a jq-based alternative is sketched after this list.
  • Re-apply to the target TKC namespace. Validate pod scheduling, liveness/readiness probes, and HPA behavior.
  • Update image pull secrets if your container registry authentication differs between environments.
  • Verify NetworkPolicies are re-applied (Antrea enforces them natively on VKS).
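
The annotation-stripping step mentioned above can also be done with jq, which is less fragile than sed against multi-line values. A minimal sketch; the context names, namespace, and resource kinds are placeholders, and real exports usually need further per-resource cleanup (for example NodePorts or webhook CA bundles):

NS=my-app   # placeholder namespace; repeat per namespace
# Export, strip Rancher-injected annotations/labels plus server-generated fields, then re-apply on the VKS cluster
kubectl --context rancher-prod-01 get deployments,services,configmaps -n "$NS" -o json \
  | jq '.items[].metadata.annotations |= (if . == null then . else with_entries(select(.key | test("cattle\\.io|rancher\\.io") | not)) end)
      | .items[].metadata.labels |= (if . == null then . else with_entries(select(.key | test("cattle\\.io|rancher\\.io") | not)) end)
      | del(.items[].metadata.resourceVersion, .items[].metadata.uid, .items[].metadata.creationTimestamp, .items[].metadata.managedFields, .items[].status)
      | (.items[] | select(.kind == "Service")).spec |= del(.clusterIP, .clusterIPs)' \
  > "${NS}-clean.json"
kubectl --context vks-prod-cluster-01 apply -n "$NS" -f "${NS}-clean.json"
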
3.2  Stateful Workloads and Persistent Volume Migration

PVC migration is the highest-risk step. The recommended approach for Longhorn-to-vSAN migrations uses Velero with a restic/kopia backend to snapshot and restore volume data:

  1. Install Velero on the source Rancher cluster with an S3-compatible object store backend (MinIO, Azure Blob, or AWS S3) as the migration staging target.
  2. Run velero backup create <backup-name> --include-namespaces <ns> --default-volumes-to-restic to capture both resource definitions and volume data.
  3. Install Velero on the target TKC with the same backend configuration. Ensure the vSAN StorageClass is set as the default so Velero restores PVCs against it.
  4. Run velero restore create --from-backup <backup-name>. Monitor the restore for PVC binding and volume population.
  5. Validate data integrity at the application layer — do not rely solely on filesystem-level checksums for databases. Run application-native consistency checks (e.g., pg_dumpall for PostgreSQL or db.collection.validate() for MongoDB).
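
Consolidated, steps 2 to 4 look like the sketch below. The backup name and namespace are placeholders, and flag names should be checked against your Velero release (the restic flag shown matches the pre-1.10 syntax used above; 1.10+ renames it to --default-volumes-to-fs-backup):

# On the source Rancher cluster: back up resources plus volume data
velero backup create app-migration-01 --include-namespaces my-app --default-volumes-to-restic
velero backup describe app-migration-01 --details   # wait until Phase: Completed

# On the target VKS cluster (same object-store configuration): restore
velero restore create app-migration-01-restore --from-backup app-migration-01
velero restore describe app-migration-01-restore --details

# Confirm PVCs bound against the vSAN StorageClass and pods are running
kubectl get pvc,pods -n my-app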

For large volumes (>500 GB), consider a staged approach: migrate the application in read-only mode, sync data via database replication or rsync, then perform a final cutover with a brief maintenance window to minimize RPO.

3.3  Ingress and Certificate Migration

If you are running cert-manager with ACME or an internal CA, the configuration is portable without changes to Issuer/ClusterIssuer resources. Key steps:

  • Deploy cert-manager on the target TKC and re-apply ClusterIssuer manifests.
  • Re-create Certificate resources — cert-manager will re-issue TLS certificates automatically. For short-lived ACME certs, this is seamless. For internal CA certs, ensure the CA secret is pre-seeded on the target cluster (a copy sketch follows this list).
  • Update your external DNS / load balancer (NSX ALB) to point hostnames at new Ingress VIPs after validation.
  • If using Azure Front Door or a global load balancer, update the origin group backends after target cluster validation passes.
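
Pre-seeding an internal CA and confirming re-issuance can be scripted. A minimal sketch, assuming cert-manager keeps its CA key pair in a Secret named internal-ca in the cert-manager namespace (a placeholder; adjust to your Issuer configuration) and the cert-manager namespace already exists on the target:

# Copy the CA key pair to the VKS cluster before applying ClusterIssuers (strip server-generated fields first)
kubectl --context rancher-prod-01 -n cert-manager get secret internal-ca -o json \
  | jq 'del(.metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp, .metadata.managedFields)' \
  | kubectl --context vks-prod-cluster-01 apply -f -

# After re-applying Certificate resources, wait for cert-manager to mark them Ready
kubectl --context vks-prod-cluster-01 wait --for=condition=Ready certificate --all -A --timeout=10m
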
3.4  Service Mesh Migration (Istio)

If your Rancher workloads use Istio, the VKS target cluster should receive an equivalent Istio installation before workload migration:

  • Install Istio using istioctl or the Helm chart on the target TKC, matching the version in use on the source cluster.
  • Export and re-apply VirtualServices, DestinationRules, Gateways, and PeerAuthentication policies.
  • Label namespaces for sidecar injection (istio-injection=enabled) before workload pods are scheduled to avoid needing pod restarts post-migration.
  • Validate mTLS peer authentication and traffic management rules using istioctl analyze and kiali (if deployed).
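
A minimal command sketch of that sequence, assuming istioctl is installed locally and the exported policy manifests are available (mesh-policies.yaml from the dependency-mapping sketch, or your own export; cluster and namespace names are placeholders):

# Install the same Istio minor version on the target cluster
istioctl install --context vks-prod-cluster-01 --set profile=default -y

# Enable sidecar injection before any workloads are scheduled
kubectl --context vks-prod-cluster-01 label namespace my-app istio-injection=enabled

# Re-apply exported mesh policy objects, then validate
kubectl --context vks-prod-cluster-01 apply -f mesh-policies.yaml
istioctl --context vks-prod-cluster-01 analyze -A
istioctl --context vks-prod-cluster-01 proxy-status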

 

Phase 4: Traffic Cutover Strategy

Never perform a hard cutover in production. Use a progressive traffic shift to reduce blast radius:

4.1  Blue/Green Cutover via DNS

The simplest cutover approach for most workloads is a DNS weight shift:

  1. Run both clusters in parallel with equivalent workloads deployed and healthy.
  2. Create a weighted DNS record (Route 53, Azure Traffic Manager, or NSX ALB GSLB) with 95% traffic to the Rancher cluster and 5% to the VKS cluster; a weight-update sketch follows this list.
  3. Monitor error rates, latency, and application logs on the VKS cluster for 24–48 hours.
  4. Progressively shift weights: 50/50, then 10/90, then 0/100.
  5. Keep the Rancher cluster running (read-only, no new deployments) for a rollback window of at least 7 days.
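
With Azure Traffic Manager (one of the options named in step 2), each weight shift is a one-line change per endpoint. A sketch with placeholder resource, profile, and endpoint names; Route 53 weighted records and NSX ALB GSLB expose equivalent weight parameters:

# Start the canary: 95% of traffic stays on the Rancher origin, 5% moves to VKS
az network traffic-manager endpoint update \
  --resource-group rg-platform --profile-name tm-app-frontend \
  --type externalEndpoints --name vks-origin --weight 5
az network traffic-manager endpoint update \
  --resource-group rg-platform --profile-name tm-app-frontend \
  --type externalEndpoints --name rancher-origin --weight 95

# Repeat at 50/50, 10/90, and finally 0/100 (Rancher/VKS) as each validation gate passes
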
4.2  Observability During Cutover

Instrument the migration with targeted observability:

  • Deploy a Prometheus scrape job on both clusters and federate metrics into a central Thanos or Grafana Mimir instance so you can compare golden signals side-by-side.
  • Use Loki or your existing log aggregator to ingest application logs from the VKS cluster prior to cutover.
  • Set SLO-based alerts (error rate > 0.5%, p99 latency > 2x baseline) that auto-page during the weight-shift window.

 

Phase 5: Post-Migration Hardening

5.1  Security Posture Validation

Run a Kubernetes security assessment against the new TKC before declaring migration complete. A three-tool combination gives broad coverage:

  • Kubescape (ARMO) — scans against NSA/CISA, MITRE ATT&CK, and CIS Kubernetes benchmarks. Run: kubescape scan framework nsa --submit --cluster prod-cluster-01
  • Trivy — vulnerability scanning for container images and Kubernetes misconfigurations. Run: trivy k8s --report summary cluster
  • kube-bench — CIS Kubernetes Benchmark compliance validation. Run as a Job against each node pool.
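
For the kube-bench step, the upstream project ships a Job manifest that can be applied directly. A minimal sketch; the URL points at the aquasecurity/kube-bench repository and should be pinned to a specific release in practice:

# Run the CIS benchmark as a Job and collect its report
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
kubectl wait --for=condition=complete job/kube-bench --timeout=300s
kubectl logs job/kube-bench > kube-bench-report.txt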

VKS clusters provisioned through VCF inherit NSX micro-segmentation at the hypervisor layer, which provides an additional defense-in-depth boundary that is absent in typical Rancher deployments on bare-metal or non-NSX vSphere.

5.2  RBAC Rationalization

VKS RBAC should mirror your organization’s IAM model rather than reproducing Rancher project boundaries verbatim. Map vCenter SSO groups to Kubernetes ClusterRoles and Roles, and enforce least-privilege access. Audit unused Service Accounts and rotate tokens.
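
A small audit sketch helps with the least-privilege mapping and Service Account cleanup described above. The SSO group name below is a placeholder whose exact format depends on how your vCenter SSO / AD integration surfaces groups:

# Bind an SSO/AD group to the built-in view role in a workload namespace (group name is a placeholder)
kubectl create rolebinding app-team-view --clusterrole=view \
  --group='sso:app-team@vsphere.local' -n my-app

# List Service Accounts actually referenced by running pods, then compare against all Service Accounts
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.spec.serviceAccountName}{"\n"}{end}' | sort -u > sa-in-use.txt
kubectl get serviceaccounts -A --no-headers | awk '{print $1"/"$2}' | sort > sa-all.txt
comm -13 sa-in-use.txt sa-all.txt   # Service Accounts defined but not used by any pod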

5.3  Decommission Rancher Infrastructure

Once all workloads are validated on VKS and the rollback window has passed:

  1. Drain and cordon Rancher downstream cluster nodes.
  2. Delete the downstream cluster registration from the Rancher management server.
  3. Decommission the Rancher management server and its backing datastore (embedded etcd, or the external MySQL/PostgreSQL database used by k3s-based installs).
  4. Release IP address ranges, firewall rules, and load balancer VIPs previously used by Rancher-managed clusters.
  5. Archive GitOps repositories used exclusively for Rancher Fleet and document the canonical Argo CD Application paths for all migrated workloads.

 

Common Pitfalls to Avoid

▸  Underestimating PVC migration time: Volume copy via restic is CPU- and network-intensive. Run a dry-run backup in staging to measure throughput before committing to a production maintenance window.

▸  Cattle annotations breaking Helm releases: Rancher injects cattle.io annotations and labels into many resources. Strip these before re-applying or Helm will detect drift and attempt to reconcile incorrectly.

▸  NSX IPAM pool exhaustion: VKS services of type LoadBalancer consume NSX ALB VIPs. Size your VIP pools before provisioning clusters — running out mid-migration causes service creation failures.

▸  Forgetting node-level OS hardening: VKS nodes run Photon OS or Ubuntu. Ensure your OS-level hardening playbooks (SELinux/AppArmor, auditd, FIPS mode) are applied via a post-provisioning Job or DaemonSet.

▸  Overlooking admission webhooks: Rancher installs its own admission webhooks. After migration, ensure OPA Gatekeeper, Kyverno, or equivalent policy engines are deployed on VKS before workloads arrive to avoid policy gaps.

 

Conclusion

Migrating from Rancher to VKS on VCF is a well-scoped project when approached methodically. The phased model — assess, provision, migrate stateless, migrate stateful, cut over progressively, harden — minimizes risk and keeps rollback options open at every step.

The most significant work is not in the Kubernetes manifest layer (which is largely portable) but in the storage migration and the operational model shift: replacing Rancher’s multi-cluster management UI with Supervisor-level lifecycle management, vCenter RBAC, and your GitOps toolchain as the new control plane.

Organizations that have made this shift consistently report simplified licensing conversations, reduced operational toil from eliminated translation layers, and improved network security posture from NSX native integration. The upfront migration investment pays dividends in long-term platform stability.

Ready to plan your migration? Capstone-S provides architecture assessments, migration runbooks, and hands-on engineering support for Rancher-to-VKS transitions. Reach out at insights@capstone-s.com or visit capstone-s.com/contact.
