Networking – Knowing the Best Practices, FAQs, and Common Pitfalls

Network communication with the workloads deployed on your SDDC is a key part of the overall user experience and probably one of the most complex areas of the design. Network configuration is under the organization’s control; VMware provides only the underlying network connectivity on the AWS hardware infrastructure.

Let’s highlight the most common network misconfigurations:

  • Insufficient connection between on-premises and the VMware Cloud on AWS SDDC

It’s a common practice to initially configure an IPsec VPN over the internet to achieve basic connectivity between on-premises and the SDDC and to secure the traffic flow. However, a VPN tunnel over the internet is not suitable for a mass migration of workloads. Live vMotion over the internet is not supported, and unpredictable bandwidth and latency make the migration timeline hard to forecast. For a large-scale migration and/or a hybrid cloud use case, you need to plan for a dedicated private connection to your SDDC.
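To see why variable bandwidth makes the timeline hard to plan, a rough back-of-the-envelope estimate can help. The following is a minimal sketch, not a sizing tool: the function name, the 80% efficiency factor, the 50 TB data volume, and the bandwidth figures are all illustrative assumptions, and real migrations also depend on latency, change rate, and HCX behavior that this arithmetic ignores.

```python
def transfer_hours(data_tb: float, bandwidth_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate the hours needed to copy data_tb (decimal terabytes) over a link.

    efficiency is an assumed fraction of the nominal bandwidth that is
    actually usable after protocol and tunnel overhead.
    """
    if bandwidth_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("bandwidth must be positive and efficiency in (0, 1]")
    bits = data_tb * 1e12 * 8                     # terabytes -> bits
    usable_bps = bandwidth_mbps * 1e6 * efficiency
    return bits / usable_bps / 3600               # seconds -> hours


if __name__ == "__main__":
    data_tb = 50  # hypothetical amount of data to migrate
    scenarios = [
        ("VPN, congested", 100),    # assumed effective Mbps
        ("VPN, quiet hours", 500),
        ("dedicated 1 Gbps", 1000),
    ]
    for label, mbps in scenarios:
        print(f"{label:>18}: {transfer_hours(data_tb, mbps):8.1f} h")
```

The spread between the best and worst VPN scenario is the planning problem: the same data set can take several times longer depending on conditions you do not control.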

  • Underestimating Layer 2 network extension complexity

HCX and/or NSX Standalone Edge provide a unique feature: the ability to stretch a Layer 2 broadcast domain for a selected VLAN and allow workloads to retain their original IP addresses. This feature greatly simplifies migrating applications without any impact on client configuration. On the other hand, it comes with several trade-offs that affect workload availability and/or performance:

    • For workloads deployed on a Layer 2 extended segment (even with the MON feature enabled), all traffic sent to destinations residing outside of the SDDC network first reaches the default gateway, which is located on-premises. This may cause unexpectedly high latency when accessing workloads residing in a native AWS VPC, including the connected VPC.
    • Workloads have a clear dependency on the on-premises default gateway. If the link between on-premises and the SDDC stops functioning, workloads on the extended leg of the segment cannot reach the default gateway or communicate with external destinations.
    • Undersized HCX Layer 2 extension appliances: All broadcast traffic within the VLAN must traverse the extension appliances on both sides of the tunnel. If an appliance is overloaded and/or does not have enough resources, the workloads residing in the SDDC drop all external connections. This scenario is often observed with entry-level clusters based on the i3.metal host type. To avoid it, you can scale out by deploying multiple extension appliance pairs and distributing the extended segments among them.
    • Extension appliance availability: As mentioned earlier, the Layer 2 extension has a direct dependency on the HCX appliance. If the appliance stops working, becomes corrupted, or restarts, network communication is affected. If you plan to maintain the extension after the migration is complete, use the HA feature of HCX extension appliances. Bear in mind that in a complex environment with many extended VLANs, configuring HA consumes additional compute and storage resources on both sides of the environment, including the SDDC. You may need to scale out the vSphere cluster hosting the appliances on the SDDC side, incurring additional costs.
    • Security concerns: Many security teams tend not to allow a Layer 2 extension over the public internet because it poses security risks and exposes sensitive broadcast traffic to the internet. If this is not addressed in the design phase, it might drastically affect your migration plans if you were planning to live migrate and retain the IP addresses. The best solution is to use a dedicated AWS Direct Connect (DX) line and pass the extension traffic over it, which should address most of the security team's concerns.
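
The scale-out advice for undersized extension appliances can be illustrated with a simple placement heuristic. This is a minimal sketch under stated assumptions, not HCX behavior: the function name, the per-segment load figures, and the `max_per_appliance` limit are hypothetical inputs you would replace with your own measurements and the limits documented for your HCX version. It places the busiest segments first, each on the currently least-loaded appliance pair.

```python
def distribute_segments(segment_load, appliance_count, max_per_appliance):
    """Greedily spread extended segments across extension appliance pairs.

    segment_load: dict of segment name -> estimated load (any consistent unit).
    Returns a dict of appliance index -> list of segment names.
    """
    if appliance_count * max_per_appliance < len(segment_load):
        raise ValueError("not enough appliance capacity for all segments")

    assignment = {i: [] for i in range(appliance_count)}
    totals = [0.0] * appliance_count

    # Heaviest segments first, so the large ones are spread out early.
    for name, load in sorted(segment_load.items(), key=lambda kv: kv[1], reverse=True):
        # Only appliances that still have a free slot are candidates.
        candidates = [i for i in range(appliance_count)
                      if len(assignment[i]) < max_per_appliance]
        target = min(candidates, key=lambda i: totals[i])
        assignment[target].append(name)
        totals[target] += load
    return assignment


if __name__ == "__main__":
    # Hypothetical segments with estimated peak traffic in Mbps.
    loads = {"vlan10": 400, "vlan20": 300, "vlan30": 200, "vlan40": 100}
    for appliance, segments in distribute_segments(loads, 2, 8).items():
        print(f"appliance pair {appliance}: {segments}")
```

A greedy placement like this is not optimal in general, but it is usually enough to avoid concentrating the noisiest VLANs on a single appliance pair.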

  • Unidentified network dependencies after migration

Many organizations claim that performance suffers after migrating workloads to the cloud. Some of these concerns result from not following best practices during the migration; in many cases, however, the cause has nothing to do with the SDDC. For a complex distributed application, when not all components were properly identified and migrated to the cloud, traffic may take additional hops across the WAN link(s), adding unforeseen latency to the application. An example is the migration of a SQL Server data warehouse where the centralized integration service (SSIS) was left on-premises, causing all the data to be moved back to on-premises first and then retransmitted to the SDDC. The impact of this configuration on the application was measured at a 300% increase in the OLAP cube generation time. Troubleshooting and searching for the affected traffic flows can be a complex and time-consuming task. VMware Aria Operations for Networks can help you visualize the traffic flow for a selected application.
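
The dependency check described above can be reasoned about with a very small model before you reach for tooling. The following is a minimal sketch, assuming you already have a list of application flows and know where each component runs; the component names, site labels, and the `wan_crossings` helper are hypothetical and are not tied to any export format of VMware Aria Operations for Networks.

```python
def wan_crossings(flows, location):
    """Return the flows whose source and destination live in different sites.

    flows: list of (source, destination) component names.
    location: dict of component name -> site label.
    """
    return [(src, dst) for src, dst in flows if location[src] != location[dst]]


if __name__ == "__main__":
    # Hypothetical layout modeled on the SSIS example: the databases were
    # migrated to the SDDC, but the integration service stayed on-premises.
    location = {"source-db": "sddc", "ssis": "on-prem", "warehouse": "sddc"}
    flows = [("source-db", "ssis"), ("ssis", "warehouse")]

    for src, dst in wan_crossings(flows, location):
        print(f"{src} -> {dst} crosses the WAN "
              f"({location[src]} to {location[dst]})")
```

In this layout both flows cross the WAN, so every load run pays the link latency twice; moving the integration component next to the data removes both crossings.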