Posted to dev@cloudstack.apache.org by Wido den Hollander <wi...@widodh.nl> on 2018/10/23 06:23:19 UTC

VXLAN and KVM experiences

Hi,

I just wanted to know if there are people out there using KVM with
Advanced Networking and using VXLAN for different networks.

Our main goal would be to spawn a VM and based on the network the NIC is
in attach it to a different VXLAN bridge on the KVM host.

It seems to me that this should work, but I just wanted to check and see
if people have experience with it.

Wido

Re: VXLAN and KVM experiences

Posted by Nux! <nu...@li.nux.ro>.
+1 VXLAN works just fine in my testing; the only gotcha I ever hit, as Si mentioned, is setting an IP address of sorts on the interface.
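
For anyone setting this up, the two gotchas mentioned in this thread boil down to something like the following on each host (interface name and addressing are just examples, adjust to your own setup):

# raise the IGMP membership limit (Linux defaults to 20, one is used per VXLAN group);
# persist it in /etc/sysctl.conf as well
sysctl -w net.ipv4.igmp_max_memberships=200
# give the VXLAN parent interface an IP in the same subnet on every host,
# otherwise the hosts cannot join the multicast groups
ip addr add 10.100.0.11/24 dev bond0.950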

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

----- Original Message -----
> From: "Simon Weller" <sw...@ena.com.INVALID>
> To: "dev" <de...@cloudstack.apache.org>
> Sent: Tuesday, 23 October, 2018 12:51:17
> Subject: Re: VXLAN and KVm experiences

> We've also been using VXLAN on KVM for all of our isolated VPC guest networks
> for quite a long time now. As Andrija pointed out, make sure you increase the
> max_igmp_memberships param and also put an IP address on each host's VXLAN
> interface, in the same subnet for all hosts that will share networking, or
> multicast won't work.
> 
> 
> - Si
> 
> 
> ________________________________
> From: Wido den Hollander <wi...@widodh.nl>
> Sent: Tuesday, October 23, 2018 5:21 AM
> To: dev@cloudstack.apache.org
> Subject: Re: VXLAN and KVm experiences
> 
> 
> 
> On 10/23/18 11:21 AM, Andrija Panic wrote:
>> Hi Wido,
>>
>> I have "pioneered" this one in production for last 3 years (and suffered a
>> nasty pain of silent drop of packages on kernel 3.X back in the days
>> because of being unaware of max_igmp_memberships kernel parameters, so I
>> have updated the manual long time ago).
>>
>> I never had any issues (beside above nasty one...) and it works very well.
> 
> That's what I want to hear!
> 
>> To avoid the above issue that I described - you should increase
>> max_igmp_memberships (/proc/sys/net/ipv4/igmp_max_memberships) - otherwise,
>> with more than 20 vxlan interfaces, some of them will stay in a down state
>> and traffic will be dropped hard (with a proper message in agent.log) on
>> kernels > 4.0 (or with a silent, random packet drop on kernel 3.X...) - and
>> also pay attention to the MTU size - anyway, everything is in the manual (I
>> updated everything I thought was missing) - so please check it.
>>
> 
> Yes, the underlying network will all be 9000 bytes MTU.
> 
>> Our example setup:
>>
>> We have e.g. bond0.950 as the main VLAN which will carry all vxlan "tunnels"
>> - so this is defined as the KVM traffic label. In our case it didn't make sense
>> to use a bridge on top of this bond0.950 (as the traffic label) - you can
>> test it on your own - since this bridge is only used to extract the child
>> bond0.950 interface name; then, based on the vxlan ID, ACS will provision
>> vxlanYYY@bond0.xxx and join this new vxlan interface to a NEW bridge it creates
>> (and then of course the vNIC goes to this new bridge), so the original bridge
>> (to which bond0.xxx belonged) is not used for anything.
>>
> 
> Clear, I indeed thought something like that would happen.
> 
>> Here is sample from above for vxlan 867 used for tenant isolation:
>>
>> root@hostname:~# brctl show brvx-867
>>
>> bridge name     bridge id               STP enabled     interfaces
>> brvx-867                8000.2215cfce99ce       no              vnet6
>>                                                                 vxlan867
>>
>> root@hostname:~# ip -d link show vxlan867
>>
>> 297: vxlan867: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8142 qdisc noqueue
>> master brvx-867 state UNKNOWN mode DEFAULT group default qlen 1000
>>     link/ether 22:15:cf:ce:99:ce brd ff:ff:ff:ff:ff:ff promiscuity 1
>>     vxlan id 867 group 239.0.3.99 dev bond0.950 port 0 0 ttl 10 ageing 300
>>
>> root@ix1-c7-2:~# ifconfig bond0.950 | grep MTU
>>           UP BROADCAST RUNNING MULTICAST  MTU:8192  Metric:1
>>
>> So note how the vxlan interface has a 50 bytes smaller MTU than the
>> bond0.950 parent interface (which could affect traffic inside the VM) - so
>> jumbo frames are needed anyway on the parent interface (bond0.950 in the
>> example above, with a minimum MTU of 1550).
>>
> 
> Yes, thanks! We will be using 1500 MTU inside the VMs, so all the
> networks underneath will be ~9k.
> 
>> Ping me if more details needed, happy to help.
>>
> 
> Awesome! We'll be doing a PoC rather soon. I'll come back with our
> experiences later.
> 
> Wido
> 
>> Cheers
>> Andrija
>>
>> On Tue, 23 Oct 2018 at 08:23, Wido den Hollander <wi...@widodh.nl> wrote:
>>
>>> Hi,
>>>
>>> I just wanted to know if there are people out there using KVM with
>>> Advanced Networking and using VXLAN for different networks.
>>>
>>> Our main goal would be to spawn a VM and based on the network the NIC is
>>> in attach it to a different VXLAN bridge on the KVM host.
>>>
>>> It seems to me that this should work, but I just wanted to check and see
>>> if people have experience with it.
>>>
>>> Wido
>>>
>>

Re: VXLAN and KVM experiences

Posted by Simon Weller <sw...@ena.com.INVALID>.
Yeah, being able to handle EVPN within ACS via FRR would be awesome. FRR has added a lot of features since we last tested it. We were having problems with FRR honouring route targets and dynamically creating routes based on labels. If I recall, it was related to LDP 9.3 not functioning correctly.


________________________________
From: Ivan Kudryavtsev <ku...@bw-sw.com>
Sent: Tuesday, October 23, 2018 7:54 AM
To: dev
Subject: Re: VXLAN and KVm experiences

Doesn't a solution like this work seamlessly for large VXLAN networks?

https://vincent.bernat.ch/en/blog/2017-vxlan-bgp-evpn

Tue, 23 Oct 2018, 8:34 Simon Weller <sw...@ena.com.invalid>:

> Linux native VXLAN uses multicast and each host has to participate in
> multicast in order to see the VXLAN networks. We haven't tried using PIM
> across an L3 boundary with ACS, although it will probably work fine.
>
> Another option is to use an L3 VTEP, but right now there is no native
> support for that in CloudStack's VXLAN implementation, although we've
> thought about proposing it as a feature.
>
>
> ________________________________
> From: Wido den Hollander <wi...@widodh.nl>
> Sent: Tuesday, October 23, 2018 7:17 AM
> To: dev@cloudstack.apache.org; Simon Weller
> Subject: Re: VXLAN and KVm experiences
>
>
>
> On 10/23/18 1:51 PM, Simon Weller wrote:
> > We've also been using VXLAN on KVM for all of our isolated VPC guest
> networks for quite a long time now. As Andrija pointed out, make sure you
> increase the max_igmp_memberships param and also put an ip address on each
> interface host VXLAN interface in the same subnet for all hosts that will
> share networking, or multicast won't work.
> >
>
> Thanks! So you are saying that all hypervisors need to be in the same L2
> network or are you routing the multicast?
>
> My idea was that each POD would be an isolated Layer 3 domain and that a
> VNI would span over the different Layer 3 networks.
>
> I don't like STP and other Layer 2 loop-prevention systems.
>
> Wido
>
> >
> > - Si
> >
> >
> > ________________________________
> > From: Wido den Hollander <wi...@widodh.nl>
> > Sent: Tuesday, October 23, 2018 5:21 AM
> > To: dev@cloudstack.apache.org
> > Subject: Re: VXLAN and KVm experiences
> >
> >
> >
> > On 10/23/18 11:21 AM, Andrija Panic wrote:
> >> Hi Wido,
> >>
> >> I have "pioneered" this one in production for last 3 years (and
> suffered a
> >> nasty pain of silent drop of packages on kernel 3.X back in the days
> >> because of being unaware of max_igmp_memberships kernel parameters, so I
> >> have updated the manual long time ago).
> >>
> >> I never had any issues (beside above nasty one...) and it works very
> well.
> >
> > That's what I want to hear!
> >
> >> To avoid above issue that I described - you should increase
> >> max_igmp_memberships (/proc/sys/net/ipv4/igmp_max_memberships)  -
> otherwise
> >> with more than 20 vxlan interfaces, some of them will stay in down state
> >> and have a hard traffic drop (with proper message in agent.log) with
> kernel
> >>> 4.0 (or I silent, bitchy random packet drop on kernel 3.X...) - and
> also
> >> pay attention to MTU size as well - anyway everything is in the manual
> (I
> >> updated everything I though was missing) - so please check it.
> >>
> >
> > Yes, the underlying network will all be 9000 bytes MTU.
> >
> >> Our example setup:
> >>
> >> We have i.e. bond.950 as the main VLAN which will carry all vxlan
> "tunnels"
> >> - so this is defined as KVM traffic label. In our case it didn't make
> sense
> >> to use bridge on top of this bond0.950 (as the traffic label) - you can
> >> test it on your own - since this bridge is used only to extract child
> >> bond0.950 interface name, then based on vxlan ID, ACS will provision
> >> vxlanYYY@bond0.xxx and join this new vxlan interface to NEW bridge
> created
> >> (and then of course vNIC goes to this new bridge), so original bridge
> (to
> >> which bond0.xxx belonged) is not used for anything.
> >>
> >
> > Clear, I indeed thought something like that would happen.
> >
> >> Here is sample from above for vxlan 867 used for tenant isolation:
> >>
> >> root@hostname:~# brctl show brvx-867
> >>
> >> bridge name     bridge id               STP enabled     interfaces
> >> brvx-867                8000.2215cfce99ce       no              vnet6
> >>
> >>      vxlan867
> >>
> >> root@hostname:~# ip -d link show vxlan867
> >>
> >> 297: vxlan867: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8142 qdisc noqueue
> >> master brvx-867 state UNKNOWN mode DEFAULT group default qlen 1000
> >>     link/ether 22:15:cf:ce:99:ce brd ff:ff:ff:ff:ff:ff promiscuity 1
> >>     vxlan id 867 group 239.0.3.99 dev bond0.950 port 0 0 ttl 10 ageing
> 300
> >>
> >> root@ix1-c7-2:~# ifconfig bond0.950 | grep MTU
> >>           UP BROADCAST RUNNING MULTICAST  MTU:8192  Metric:1
> >>
> >> So note how the vxlan interface has by 50 bytes smaller MTU than the
> >> bond0.950 parent interface (which could affects traffic inside VM) - so
> >> jumbo frames are needed anyway on the parent interface (bond.950 in
> example
> >> above with minimum of 1550 MTU)
> >>
> >
> > Yes, thanks! We will be using 1500 MTU inside the VMs, so all the
> > networks underneath will be ~9k.
> >
> >> Ping me if more details needed, happy to help.
> >>
> >
> > Awesome! We'll be doing a PoC rather soon. I'll come back with our
> > experiences later.
> >
> > Wido
> >
> >> Cheers
> >> Andrija
> >>
> >> On Tue, 23 Oct 2018 at 08:23, Wido den Hollander <wi...@widodh.nl>
> wrote:
> >>
> >>> Hi,
> >>>
> >>> I just wanted to know if there are people out there using KVM with
> >>> Advanced Networking and using VXLAN for different networks.
> >>>
> >>> Our main goal would be to spawn a VM and based on the network the NIC
> is
> >>> in attach it to a different VXLAN bridge on the KVM host.
> >>>
> >>> It seems to me that this should work, but I just wanted to check and
> see
> >>> if people have experience with it.
> >>>
> >>> Wido
> >>>
> >>
> >>
> >
>

Re: VXLAN and KVM experiences

Posted by Wido den Hollander <wi...@widodh.nl>.

On 10/23/18 2:54 PM, Ivan Kudryavtsev wrote:
> Doesn't solution like this works seamlessly for large VXLAN networks?
> 
> https://vincent.bernat.ch/en/blog/2017-vxlan-bgp-evpn

We are using that with CloudStack right now. We have a modified version
of 'modifyvxlan.sh':
https://github.com/PCextreme/cloudstack/blob/vxlan-bgp-evpn/scripts/vm/network/vnet/modifyvxlan.sh

Your 'tunnelip' needs to be set on 'lo'; in our case this is
10.255.255.X.

We have the script in /usr/share/modifyvxlan.sh so that it's found by
the Agent and we don't overwrite the existing script (which might break
after an upgrade).
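
As a rough sketch, per VNI the EVPN variant boils down to something like
this (device names and the VNI are placeholders; the 'local' address is the
tunnel IP on 'lo'):

# VXLAN device without a multicast group; BGP-EVPN via FRR takes care of
# VTEP/MAC distribution, so local flood-and-learn is disabled
ip link add vxlan1000 type vxlan id 1000 local 10.255.255.9 dstport 4789 nolearning
ip link add brvx-1000 type bridge
ip link set vxlan1000 master brvx-1000
ip link set vxlan1000 up
ip link set brvx-1000 up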

Our frr conf on the hypervisor:

frr version 7.1
frr defaults traditional
hostname myfirsthypervisor
log syslog informational
no ipv6 forwarding
service integrated-vtysh-config
!
interface enp81s0f0
 no ipv6 nd suppress-ra
!
interface enp81s0f1
 no ipv6 nd suppress-ra
!
interface lo
 ip address 10.255.255.9/32
 ipv6 address 2001:db8:100::9/128
!
router bgp 4200100123
 bgp router-id 10.255.255.9
 no bgp default ipv4-unicast
 neighbor uplinks peer-group
 neighbor uplinks remote-as external
 neighbor uplinks ebgp-multihop 255
 neighbor enp81s0f0 interface peer-group uplinks
 neighbor enp81s0f1 interface peer-group uplinks
 !
 address-family ipv4 unicast
  network 10.255.255.9/32
  neighbor uplinks activate
  neighbor uplinks next-hop-self
 exit-address-family
 !
 address-family ipv6 unicast
  network 2001:db8:100::9/128
  neighbor uplinks activate
 exit-address-family
 !
 address-family l2vpn evpn
  neighbor uplinks activate
  advertise-all-vni
 exit-address-family
!
line vty
!

Both enp81s0f0 and enp81s0f1 are 100G interfaces connected to Cumulus
Linux routers/switches and they use BGP Unnumbered (IPv6 Link Local) for
their BGP sessions.
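
To verify things came up you can use the usual FRR and iproute2 tooling,
for example (vxlan1000 again being a placeholder device name):

vtysh -c 'show bgp l2vpn evpn summary'   # EVPN sessions to the routers
vtysh -c 'show evpn vni'                 # VNIs known locally
bridge fdb show dev vxlan1000            # remote MACs/VTEPs learned via BGP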

Hope this helps!

Wido

> 
> вт, 23 окт. 2018 г., 8:34 Simon Weller <sw...@ena.com.invalid>:
> 
>> Linux native VXLAN uses multicast and each host has to participate in
>> multicast in order to see the VXLAN networks. We haven't tried using PIM
>> across a L3 boundary with ACS, although it will probably work fine.
>>
>> Another option is to use a L3 VTEP, but right now there is no native
>> support for that in CloudStack's VXLAN implementation, although we've
>> thought about proposing it as feature.
>>
>>
>> ________________________________
>> From: Wido den Hollander <wi...@widodh.nl>
>> Sent: Tuesday, October 23, 2018 7:17 AM
>> To: dev@cloudstack.apache.org; Simon Weller
>> Subject: Re: VXLAN and KVm experiences
>>
>>
>>
>> On 10/23/18 1:51 PM, Simon Weller wrote:
>>> We've also been using VXLAN on KVM for all of our isolated VPC guest
>> networks for quite a long time now. As Andrija pointed out, make sure you
>> increase the max_igmp_memberships param and also put an ip address on each
>> interface host VXLAN interface in the same subnet for all hosts that will
>> share networking, or multicast won't work.
>>>
>>
>> Thanks! So you are saying that all hypervisors need to be in the same L2
>> network or are you routing the multicast?
>>
>> My idea was that each POD would be an isolated Layer 3 domain and that a
>> VNI would span over the different Layer 3 networks.
>>
>> I don't like STP and other Layer 2 loop-prevention systems.
>>
>> Wido
>>
>>>
>>> - Si
>>>
>>>
>>> ________________________________
>>> From: Wido den Hollander <wi...@widodh.nl>
>>> Sent: Tuesday, October 23, 2018 5:21 AM
>>> To: dev@cloudstack.apache.org
>>> Subject: Re: VXLAN and KVm experiences
>>>
>>>
>>>
>>> On 10/23/18 11:21 AM, Andrija Panic wrote:
>>>> Hi Wido,
>>>>
>>>> I have "pioneered" this one in production for last 3 years (and
>> suffered a
>>>> nasty pain of silent drop of packages on kernel 3.X back in the days
>>>> because of being unaware of max_igmp_memberships kernel parameters, so I
>>>> have updated the manual long time ago).
>>>>
>>>> I never had any issues (beside above nasty one...) and it works very
>> well.
>>>
>>> That's what I want to hear!
>>>
>>>> To avoid above issue that I described - you should increase
>>>> max_igmp_memberships (/proc/sys/net/ipv4/igmp_max_memberships)  -
>> otherwise
>>>> with more than 20 vxlan interfaces, some of them will stay in down state
>>>> and have a hard traffic drop (with proper message in agent.log) with
>> kernel
>>>>> 4.0 (or I silent, bitchy random packet drop on kernel 3.X...) - and
>> also
>>>> pay attention to MTU size as well - anyway everything is in the manual
>> (I
>>>> updated everything I though was missing) - so please check it.
>>>>
>>>
>>> Yes, the underlying network will all be 9000 bytes MTU.
>>>
>>>> Our example setup:
>>>>
>>>> We have i.e. bond.950 as the main VLAN which will carry all vxlan
>> "tunnels"
>>>> - so this is defined as KVM traffic label. In our case it didn't make
>> sense
>>>> to use bridge on top of this bond0.950 (as the traffic label) - you can
>>>> test it on your own - since this bridge is used only to extract child
>>>> bond0.950 interface name, then based on vxlan ID, ACS will provision
>>>> vxlanYYY@bond0.xxx and join this new vxlan interface to NEW bridge
>> created
>>>> (and then of course vNIC goes to this new bridge), so original bridge
>> (to
>>>> which bond0.xxx belonged) is not used for anything.
>>>>
>>>
>>> Clear, I indeed thought something like that would happen.
>>>
>>>> Here is sample from above for vxlan 867 used for tenant isolation:
>>>>
>>>> root@hostname:~# brctl show brvx-867
>>>>
>>>> bridge name     bridge id               STP enabled     interfaces
>>>> brvx-867                8000.2215cfce99ce       no              vnet6
>>>>
>>>>      vxlan867
>>>>
>>>> root@hostname:~# ip -d link show vxlan867
>>>>
>>>> 297: vxlan867: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8142 qdisc noqueue
>>>> master brvx-867 state UNKNOWN mode DEFAULT group default qlen 1000
>>>>     link/ether 22:15:cf:ce:99:ce brd ff:ff:ff:ff:ff:ff promiscuity 1
>>>>     vxlan id 867 group 239.0.3.99 dev bond0.950 port 0 0 ttl 10 ageing
>> 300
>>>>
>>>> root@ix1-c7-2:~# ifconfig bond0.950 | grep MTU
>>>>           UP BROADCAST RUNNING MULTICAST  MTU:8192  Metric:1
>>>>
>>>> So note how the vxlan interface has by 50 bytes smaller MTU than the
>>>> bond0.950 parent interface (which could affects traffic inside VM) - so
>>>> jumbo frames are needed anyway on the parent interface (bond.950 in
>> example
>>>> above with minimum of 1550 MTU)
>>>>
>>>
>>> Yes, thanks! We will be using 1500 MTU inside the VMs, so all the
>>> networks underneath will be ~9k.
>>>
>>>> Ping me if more details needed, happy to help.
>>>>
>>>
>>> Awesome! We'll be doing a PoC rather soon. I'll come back with our
>>> experiences later.
>>>
>>> Wido
>>>
>>>> Cheers
>>>> Andrija
>>>>
>>>> On Tue, 23 Oct 2018 at 08:23, Wido den Hollander <wi...@widodh.nl>
>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I just wanted to know if there are people out there using KVM with
>>>>> Advanced Networking and using VXLAN for different networks.
>>>>>
>>>>> Our main goal would be to spawn a VM and based on the network the NIC
>> is
>>>>> in attach it to a different VXLAN bridge on the KVM host.
>>>>>
>>>>> It seems to me that this should work, but I just wanted to check and
>> see
>>>>> if people have experience with it.
>>>>>
>>>>> Wido
>>>>>
>>>>
>>>>
>>>
>>
> 

Re: VXLAN and KVM experiences

Posted by Wido den Hollander <wi...@widodh.nl>.
Hi,

On 12/28/18 5:43 PM, Ivan Kudryavtsev wrote:
> Wido, that's interesting. 
> 
> Do you think that the Cumulus-based switches with BGP inside have
> advantage over classic OSPF-based routing switches and separate multihop
> MP BGP route-servers for VNI propagation? 
> 

I don't know. We do not use OSPF anywhere in our network. We are an
(i)BGP network only.

We want to use as much open software as possible: buy the switches we like
and then add ONIE-based software like Cumulus.

> I'm thinking about pure L3 OSPF-based backend networks for management
> and storage where cloudstack uses bridges on dummy interfaces with IP
> assigned while real NICS use utility IP-addresses in several OSPF
> networks and all those target IPs are distributed with OSPF. 
> 
> Next, VNI-s are created over bridges and their information is
> distributed over BGP. 
> 
> This approach helps to implement fault tolerance and multi-path routes
> with standard L3 stack without xSTP, VCS, etc, decrease broadcast domains.
> 
> Any thoughts?
> 

I wouldn't know for sure, we haven't looked into this yet.

Again, our plan, but not set in stone is:

- Unnumbered BGP (IPv6 Link Local) to all Hypervisors
- Link balancing using ECMP
- BGP+EVPN for VXLAN VNI distribution
- Use a static VNI for CloudStack POD IPv4
- Adapt the *modifyvxlan.sh* script to suit our needs

This way the transport of traffic will all be done in an IPv6-only
fashion.

IPv4 to the hypervisors (POD traffic and NFS secondary storage) is all
handled by a VXLAN device we create manually on them.
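
Roughly something like this (VNI, device name and address below are made up
for illustration):

ip link add vxlan500 type vxlan id 500 local 10.255.255.9 dstport 4789 nolearning
ip addr add 10.100.0.11/24 dev vxlan500   # POD IPv4 for mgmt and NFS traffic
ip link set vxlan500 up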

Wido

> 
> пт, 28 дек. 2018 г. в 05:34, Wido den Hollander <wido@widodh.nl
> <ma...@widodh.nl>>:
> 
> 
> 
>     On 10/23/18 2:54 PM, Ivan Kudryavtsev wrote:
>     > Doesn't solution like this works seamlessly for large VXLAN networks?
>     >
>     > https://vincent.bernat.ch/en/blog/2017-vxlan-bgp-evpn
>     >
> 
>     This is what we are looking into right now.
> 
>     As CloudStack executes *modifyvxlan.sh* prior to starting an Instance it
>     would be just a matter of replacing this script with a version which
>     does the EVPN for us.
> 
>     Our routers will probably be 36x100G SuperMicro Bare Matel switches
>     running Cumulus.
> 
>     Using unnumbered BGP over IPv6 we'll provide network connectivity to the
>     Hypervisors.
> 
>     Using FFR and EVPN we'll be able to enable VXLAN on the hypervisors and
>     route traffic.
> 
>     As these things seem to be very use-case specific I don't see how we can
>     integrate this into CloudStack in a generic way.
> 
>     The *modifyvxlan.sh* script gets the VNI as a argument, so anybody can
>     adapt it to their own needs for their specific environment.
> 
>     Wido
> 
>     > вт, 23 окт. 2018 г., 8:34 Simon Weller <sw...@ena.com.invalid>:
>     >
>     >> Linux native VXLAN uses multicast and each host has to participate in
>     >> multicast in order to see the VXLAN networks. We haven't tried
>     using PIM
>     >> across a L3 boundary with ACS, although it will probably work fine.
>     >>
>     >> Another option is to use a L3 VTEP, but right now there is no native
>     >> support for that in CloudStack's VXLAN implementation, although we've
>     >> thought about proposing it as feature.
>     >>
>     >>
>     >> ________________________________
>     >> From: Wido den Hollander <wido@widodh.nl <ma...@widodh.nl>>
>     >> Sent: Tuesday, October 23, 2018 7:17 AM
>     >> To: dev@cloudstack.apache.org <ma...@cloudstack.apache.org>;
>     Simon Weller
>     >> Subject: Re: VXLAN and KVm experiences
>     >>
>     >>
>     >>
>     >> On 10/23/18 1:51 PM, Simon Weller wrote:
>     >>> We've also been using VXLAN on KVM for all of our isolated VPC guest
>     >> networks for quite a long time now. As Andrija pointed out, make
>     sure you
>     >> increase the max_igmp_memberships param and also put an ip
>     address on each
>     >> interface host VXLAN interface in the same subnet for all hosts
>     that will
>     >> share networking, or multicast won't work.
>     >>>
>     >>
>     >> Thanks! So you are saying that all hypervisors need to be in the
>     same L2
>     >> network or are you routing the multicast?
>     >>
>     >> My idea was that each POD would be an isolated Layer 3 domain and
>     that a
>     >> VNI would span over the different Layer 3 networks.
>     >>
>     >> I don't like STP and other Layer 2 loop-prevention systems.
>     >>
>     >> Wido
>     >>
>     >>>
>     >>> - Si
>     >>>
>     >>>
>     >>> ________________________________
>     >>> From: Wido den Hollander <wido@widodh.nl <ma...@widodh.nl>>
>     >>> Sent: Tuesday, October 23, 2018 5:21 AM
>     >>> To: dev@cloudstack.apache.org <ma...@cloudstack.apache.org>
>     >>> Subject: Re: VXLAN and KVm experiences
>     >>>
>     >>>
>     >>>
>     >>> On 10/23/18 11:21 AM, Andrija Panic wrote:
>     >>>> Hi Wido,
>     >>>>
>     >>>> I have "pioneered" this one in production for last 3 years (and
>     >> suffered a
>     >>>> nasty pain of silent drop of packages on kernel 3.X back in the
>     days
>     >>>> because of being unaware of max_igmp_memberships kernel
>     parameters, so I
>     >>>> have updated the manual long time ago).
>     >>>>
>     >>>> I never had any issues (beside above nasty one...) and it works
>     very
>     >> well.
>     >>>
>     >>> That's what I want to hear!
>     >>>
>     >>>> To avoid above issue that I described - you should increase
>     >>>> max_igmp_memberships (/proc/sys/net/ipv4/igmp_max_memberships)  -
>     >> otherwise
>     >>>> with more than 20 vxlan interfaces, some of them will stay in
>     down state
>     >>>> and have a hard traffic drop (with proper message in agent.log)
>     with
>     >> kernel
>     >>>>> 4.0 (or I silent, bitchy random packet drop on kernel 3.X...)
>     - and
>     >> also
>     >>>> pay attention to MTU size as well - anyway everything is in the
>     manual
>     >> (I
>     >>>> updated everything I though was missing) - so please check it.
>     >>>>
>     >>>
>     >>> Yes, the underlying network will all be 9000 bytes MTU.
>     >>>
>     >>>> Our example setup:
>     >>>>
>     >>>> We have i.e. bond.950 as the main VLAN which will carry all vxlan
>     >> "tunnels"
>     >>>> - so this is defined as KVM traffic label. In our case it
>     didn't make
>     >> sense
>     >>>> to use bridge on top of this bond0.950 (as the traffic label) -
>     you can
>     >>>> test it on your own - since this bridge is used only to extract
>     child
>     >>>> bond0.950 interface name, then based on vxlan ID, ACS will
>     provision
>     >>>> vxlanYYY@bond0.xxx and join this new vxlan interface to NEW bridge
>     >> created
>     >>>> (and then of course vNIC goes to this new bridge), so original
>     bridge
>     >> (to
>     >>>> which bond0.xxx belonged) is not used for anything.
>     >>>>
>     >>>
>     >>> Clear, I indeed thought something like that would happen.
>     >>>
>     >>>> Here is sample from above for vxlan 867 used for tenant isolation:
>     >>>>
>     >>>> root@hostname:~# brctl show brvx-867
>     >>>>
>     >>>> bridge name     bridge id               STP enabled     interfaces
>     >>>> brvx-867                8000.2215cfce99ce       no             
>     vnet6
>     >>>>
>     >>>>      vxlan867
>     >>>>
>     >>>> root@hostname:~# ip -d link show vxlan867
>     >>>>
>     >>>> 297: vxlan867: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8142 qdisc
>     noqueue
>     >>>> master brvx-867 state UNKNOWN mode DEFAULT group default qlen 1000
>     >>>>     link/ether 22:15:cf:ce:99:ce brd ff:ff:ff:ff:ff:ff
>     promiscuity 1
>     >>>>     vxlan id 867 group 239.0.3.99 dev bond0.950 port 0 0 ttl 10
>     ageing
>     >> 300
>     >>>>
>     >>>> root@ix1-c7-2:~# ifconfig bond0.950 | grep MTU
>     >>>>           UP BROADCAST RUNNING MULTICAST  MTU:8192  Metric:1
>     >>>>
>     >>>> So note how the vxlan interface has by 50 bytes smaller MTU
>     than the
>     >>>> bond0.950 parent interface (which could affects traffic inside
>     VM) - so
>     >>>> jumbo frames are needed anyway on the parent interface (bond.950 in
>     >> example
>     >>>> above with minimum of 1550 MTU)
>     >>>>
>     >>>
>     >>> Yes, thanks! We will be using 1500 MTU inside the VMs, so all the
>     >>> networks underneath will be ~9k.
>     >>>
>     >>>> Ping me if more details needed, happy to help.
>     >>>>
>     >>>
>     >>> Awesome! We'll be doing a PoC rather soon. I'll come back with our
>     >>> experiences later.
>     >>>
>     >>> Wido
>     >>>
>     >>>> Cheers
>     >>>> Andrija
>     >>>>
>     >>>> On Tue, 23 Oct 2018 at 08:23, Wido den Hollander
>     <wido@widodh.nl <ma...@widodh.nl>>
>     >> wrote:
>     >>>>
>     >>>>> Hi,
>     >>>>>
>     >>>>> I just wanted to know if there are people out there using KVM with
>     >>>>> Advanced Networking and using VXLAN for different networks.
>     >>>>>
>     >>>>> Our main goal would be to spawn a VM and based on the network
>     the NIC
>     >> is
>     >>>>> in attach it to a different VXLAN bridge on the KVM host.
>     >>>>>
>     >>>>> It seems to me that this should work, but I just wanted to
>     check and
>     >> see
>     >>>>> if people have experience with it.
>     >>>>>
>     >>>>> Wido
>     >>>>>
>     >>>>
>     >>>>
>     >>>
>     >>
>     >
> 
> 
> 
> -- 
> With best regards, Ivan Kudryavtsev
> Bitworks LLC
> Cell RU: +7-923-414-1515
> Cell USA: +1-201-257-1512
> WWW: http://bitworks.software/ <http://bw-sw.com/>
> 

Re: VXLAN and KVM experiences

Posted by Ivan Kudryavtsev <ku...@bw-sw.com>.
Wido, that's interesting.

Do you think that the Cumulus-based switches with BGP inside have an
advantage over classic OSPF-based routing switches and separate multihop
MP-BGP route servers for VNI propagation?

I'm thinking about pure L3, OSPF-based backend networks for management and
storage, where CloudStack uses bridges on dummy interfaces with IPs assigned,
while the real NICs use utility IP addresses in several OSPF networks and all
those target IPs are distributed with OSPF.

Next, VNIs are created over the bridges and their information is distributed
over BGP.

This approach helps to implement fault tolerance and multi-path routing with
a standard L3 stack, without xSTP, VCS, etc., and decreases broadcast domains.

Any thoughts?


Fri, 28 Dec 2018 at 05:34, Wido den Hollander <wi...@widodh.nl>:

>
>
> On 10/23/18 2:54 PM, Ivan Kudryavtsev wrote:
> > Doesn't solution like this works seamlessly for large VXLAN networks?
> >
> > https://vincent.bernat.ch/en/blog/2017-vxlan-bgp-evpn
> >
>
> This is what we are looking into right now.
>
> As CloudStack executes *modifyvxlan.sh* prior to starting an Instance it
> would be just a matter of replacing this script with a version which
> does the EVPN for us.
>
> Our routers will probably be 36x100G SuperMicro Bare Matel switches
> running Cumulus.
>
> Using unnumbered BGP over IPv6 we'll provide network connectivity to the
> Hypervisors.
>
> Using FFR and EVPN we'll be able to enable VXLAN on the hypervisors and
> route traffic.
>
> As these things seem to be very use-case specific I don't see how we can
> integrate this into CloudStack in a generic way.
>
> The *modifyvxlan.sh* script gets the VNI as a argument, so anybody can
> adapt it to their own needs for their specific environment.
>
> Wido
>
> > вт, 23 окт. 2018 г., 8:34 Simon Weller <sw...@ena.com.invalid>:
> >
> >> Linux native VXLAN uses multicast and each host has to participate in
> >> multicast in order to see the VXLAN networks. We haven't tried using PIM
> >> across a L3 boundary with ACS, although it will probably work fine.
> >>
> >> Another option is to use a L3 VTEP, but right now there is no native
> >> support for that in CloudStack's VXLAN implementation, although we've
> >> thought about proposing it as feature.
> >>
> >>
> >> ________________________________
> >> From: Wido den Hollander <wi...@widodh.nl>
> >> Sent: Tuesday, October 23, 2018 7:17 AM
> >> To: dev@cloudstack.apache.org; Simon Weller
> >> Subject: Re: VXLAN and KVm experiences
> >>
> >>
> >>
> >> On 10/23/18 1:51 PM, Simon Weller wrote:
> >>> We've also been using VXLAN on KVM for all of our isolated VPC guest
> >> networks for quite a long time now. As Andrija pointed out, make sure
> you
> >> increase the max_igmp_memberships param and also put an ip address on
> each
> >> interface host VXLAN interface in the same subnet for all hosts that
> will
> >> share networking, or multicast won't work.
> >>>
> >>
> >> Thanks! So you are saying that all hypervisors need to be in the same L2
> >> network or are you routing the multicast?
> >>
> >> My idea was that each POD would be an isolated Layer 3 domain and that a
> >> VNI would span over the different Layer 3 networks.
> >>
> >> I don't like STP and other Layer 2 loop-prevention systems.
> >>
> >> Wido
> >>
> >>>
> >>> - Si
> >>>
> >>>
> >>> ________________________________
> >>> From: Wido den Hollander <wi...@widodh.nl>
> >>> Sent: Tuesday, October 23, 2018 5:21 AM
> >>> To: dev@cloudstack.apache.org
> >>> Subject: Re: VXLAN and KVm experiences
> >>>
> >>>
> >>>
> >>> On 10/23/18 11:21 AM, Andrija Panic wrote:
> >>>> Hi Wido,
> >>>>
> >>>> I have "pioneered" this one in production for last 3 years (and
> >> suffered a
> >>>> nasty pain of silent drop of packages on kernel 3.X back in the days
> >>>> because of being unaware of max_igmp_memberships kernel parameters,
> so I
> >>>> have updated the manual long time ago).
> >>>>
> >>>> I never had any issues (beside above nasty one...) and it works very
> >> well.
> >>>
> >>> That's what I want to hear!
> >>>
> >>>> To avoid above issue that I described - you should increase
> >>>> max_igmp_memberships (/proc/sys/net/ipv4/igmp_max_memberships)  -
> >> otherwise
> >>>> with more than 20 vxlan interfaces, some of them will stay in down
> state
> >>>> and have a hard traffic drop (with proper message in agent.log) with
> >> kernel
> >>>>> 4.0 (or I silent, bitchy random packet drop on kernel 3.X...) - and
> >> also
> >>>> pay attention to MTU size as well - anyway everything is in the manual
> >> (I
> >>>> updated everything I though was missing) - so please check it.
> >>>>
> >>>
> >>> Yes, the underlying network will all be 9000 bytes MTU.
> >>>
> >>>> Our example setup:
> >>>>
> >>>> We have i.e. bond.950 as the main VLAN which will carry all vxlan
> >> "tunnels"
> >>>> - so this is defined as KVM traffic label. In our case it didn't make
> >> sense
> >>>> to use bridge on top of this bond0.950 (as the traffic label) - you
> can
> >>>> test it on your own - since this bridge is used only to extract child
> >>>> bond0.950 interface name, then based on vxlan ID, ACS will provision
> >>>> vxlanYYY@bond0.xxx and join this new vxlan interface to NEW bridge
> >> created
> >>>> (and then of course vNIC goes to this new bridge), so original bridge
> >> (to
> >>>> which bond0.xxx belonged) is not used for anything.
> >>>>
> >>>
> >>> Clear, I indeed thought something like that would happen.
> >>>
> >>>> Here is sample from above for vxlan 867 used for tenant isolation:
> >>>>
> >>>> root@hostname:~# brctl show brvx-867
> >>>>
> >>>> bridge name     bridge id               STP enabled     interfaces
> >>>> brvx-867                8000.2215cfce99ce       no              vnet6
> >>>>
> >>>>      vxlan867
> >>>>
> >>>> root@hostname:~# ip -d link show vxlan867
> >>>>
> >>>> 297: vxlan867: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8142 qdisc
> noqueue
> >>>> master brvx-867 state UNKNOWN mode DEFAULT group default qlen 1000
> >>>>     link/ether 22:15:cf:ce:99:ce brd ff:ff:ff:ff:ff:ff promiscuity 1
> >>>>     vxlan id 867 group 239.0.3.99 dev bond0.950 port 0 0 ttl 10 ageing
> >> 300
> >>>>
> >>>> root@ix1-c7-2:~# ifconfig bond0.950 | grep MTU
> >>>>           UP BROADCAST RUNNING MULTICAST  MTU:8192  Metric:1
> >>>>
> >>>> So note how the vxlan interface has by 50 bytes smaller MTU than the
> >>>> bond0.950 parent interface (which could affects traffic inside VM) -
> so
> >>>> jumbo frames are needed anyway on the parent interface (bond.950 in
> >> example
> >>>> above with minimum of 1550 MTU)
> >>>>
> >>>
> >>> Yes, thanks! We will be using 1500 MTU inside the VMs, so all the
> >>> networks underneath will be ~9k.
> >>>
> >>>> Ping me if more details needed, happy to help.
> >>>>
> >>>
> >>> Awesome! We'll be doing a PoC rather soon. I'll come back with our
> >>> experiences later.
> >>>
> >>> Wido
> >>>
> >>>> Cheers
> >>>> Andrija
> >>>>
> >>>> On Tue, 23 Oct 2018 at 08:23, Wido den Hollander <wi...@widodh.nl>
> >> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I just wanted to know if there are people out there using KVM with
> >>>>> Advanced Networking and using VXLAN for different networks.
> >>>>>
> >>>>> Our main goal would be to spawn a VM and based on the network the NIC
> >> is
> >>>>> in attach it to a different VXLAN bridge on the KVM host.
> >>>>>
> >>>>> It seems to me that this should work, but I just wanted to check and
> >> see
> >>>>> if people have experience with it.
> >>>>>
> >>>>> Wido
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >
>


-- 
With best regards, Ivan Kudryavtsev
Bitworks LLC
Cell RU: +7-923-414-1515
Cell USA: +1-201-257-1512
WWW: http://bitworks.software/ <http://bw-sw.com/>

Re: VXLAN and KVM experiences

Posted by Wido den Hollander <wi...@widodh.nl>.

On 10/23/18 2:54 PM, Ivan Kudryavtsev wrote:
> Doesn't solution like this works seamlessly for large VXLAN networks?
> 
> https://vincent.bernat.ch/en/blog/2017-vxlan-bgp-evpn
> 

This is what we are looking into right now.

As CloudStack executes *modifyvxlan.sh* prior to starting an Instance it
would be just a matter of replacing this script with a version which
does the EVPN for us.

Our routers will probably be 36x100G SuperMicro bare metal switches
running Cumulus.

Using unnumbered BGP over IPv6 we'll provide network connectivity to the
Hypervisors.

Using FRR and EVPN we'll be able to enable VXLAN on the hypervisors and
route traffic.

As these things seem to be very use-case specific I don't see how we can
integrate this into CloudStack in a generic way.

The *modifyvxlan.sh* script gets the VNI as an argument, so anybody can
adapt it to their own needs for their specific environment.

Wido

> вт, 23 окт. 2018 г., 8:34 Simon Weller <sw...@ena.com.invalid>:
> 
>> Linux native VXLAN uses multicast and each host has to participate in
>> multicast in order to see the VXLAN networks. We haven't tried using PIM
>> across a L3 boundary with ACS, although it will probably work fine.
>>
>> Another option is to use a L3 VTEP, but right now there is no native
>> support for that in CloudStack's VXLAN implementation, although we've
>> thought about proposing it as feature.
>>
>>
>> ________________________________
>> From: Wido den Hollander <wi...@widodh.nl>
>> Sent: Tuesday, October 23, 2018 7:17 AM
>> To: dev@cloudstack.apache.org; Simon Weller
>> Subject: Re: VXLAN and KVm experiences
>>
>>
>>
>> On 10/23/18 1:51 PM, Simon Weller wrote:
>>> We've also been using VXLAN on KVM for all of our isolated VPC guest
>> networks for quite a long time now. As Andrija pointed out, make sure you
>> increase the max_igmp_memberships param and also put an ip address on each
>> interface host VXLAN interface in the same subnet for all hosts that will
>> share networking, or multicast won't work.
>>>
>>
>> Thanks! So you are saying that all hypervisors need to be in the same L2
>> network or are you routing the multicast?
>>
>> My idea was that each POD would be an isolated Layer 3 domain and that a
>> VNI would span over the different Layer 3 networks.
>>
>> I don't like STP and other Layer 2 loop-prevention systems.
>>
>> Wido
>>
>>>
>>> - Si
>>>
>>>
>>> ________________________________
>>> From: Wido den Hollander <wi...@widodh.nl>
>>> Sent: Tuesday, October 23, 2018 5:21 AM
>>> To: dev@cloudstack.apache.org
>>> Subject: Re: VXLAN and KVm experiences
>>>
>>>
>>>
>>> On 10/23/18 11:21 AM, Andrija Panic wrote:
>>>> Hi Wido,
>>>>
>>>> I have "pioneered" this one in production for last 3 years (and
>> suffered a
>>>> nasty pain of silent drop of packages on kernel 3.X back in the days
>>>> because of being unaware of max_igmp_memberships kernel parameters, so I
>>>> have updated the manual long time ago).
>>>>
>>>> I never had any issues (beside above nasty one...) and it works very
>> well.
>>>
>>> That's what I want to hear!
>>>
>>>> To avoid above issue that I described - you should increase
>>>> max_igmp_memberships (/proc/sys/net/ipv4/igmp_max_memberships)  -
>> otherwise
>>>> with more than 20 vxlan interfaces, some of them will stay in down state
>>>> and have a hard traffic drop (with proper message in agent.log) with
>> kernel
>>>>> 4.0 (or I silent, bitchy random packet drop on kernel 3.X...) - and
>> also
>>>> pay attention to MTU size as well - anyway everything is in the manual
>> (I
>>>> updated everything I though was missing) - so please check it.
>>>>
>>>
>>> Yes, the underlying network will all be 9000 bytes MTU.
>>>
>>>> Our example setup:
>>>>
>>>> We have i.e. bond.950 as the main VLAN which will carry all vxlan
>> "tunnels"
>>>> - so this is defined as KVM traffic label. In our case it didn't make
>> sense
>>>> to use bridge on top of this bond0.950 (as the traffic label) - you can
>>>> test it on your own - since this bridge is used only to extract child
>>>> bond0.950 interface name, then based on vxlan ID, ACS will provision
>>>> vxlanYYY@bond0.xxx and join this new vxlan interface to NEW bridge
>> created
>>>> (and then of course vNIC goes to this new bridge), so original bridge
>> (to
>>>> which bond0.xxx belonged) is not used for anything.
>>>>
>>>
>>> Clear, I indeed thought something like that would happen.
>>>
>>>> Here is sample from above for vxlan 867 used for tenant isolation:
>>>>
>>>> root@hostname:~# brctl show brvx-867
>>>>
>>>> bridge name     bridge id               STP enabled     interfaces
>>>> brvx-867                8000.2215cfce99ce       no              vnet6
>>>>
>>>>      vxlan867
>>>>
>>>> root@hostname:~# ip -d link show vxlan867
>>>>
>>>> 297: vxlan867: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8142 qdisc noqueue
>>>> master brvx-867 state UNKNOWN mode DEFAULT group default qlen 1000
>>>>     link/ether 22:15:cf:ce:99:ce brd ff:ff:ff:ff:ff:ff promiscuity 1
>>>>     vxlan id 867 group 239.0.3.99 dev bond0.950 port 0 0 ttl 10 ageing
>> 300
>>>>
>>>> root@ix1-c7-2:~# ifconfig bond0.950 | grep MTU
>>>>           UP BROADCAST RUNNING MULTICAST  MTU:8192  Metric:1
>>>>
>>>> So note how the vxlan interface has by 50 bytes smaller MTU than the
>>>> bond0.950 parent interface (which could affects traffic inside VM) - so
>>>> jumbo frames are needed anyway on the parent interface (bond.950 in
>> example
>>>> above with minimum of 1550 MTU)
>>>>
>>>
>>> Yes, thanks! We will be using 1500 MTU inside the VMs, so all the
>>> networks underneath will be ~9k.
>>>
>>>> Ping me if more details needed, happy to help.
>>>>
>>>
>>> Awesome! We'll be doing a PoC rather soon. I'll come back with our
>>> experiences later.
>>>
>>> Wido
>>>
>>>> Cheers
>>>> Andrija
>>>>
>>>> On Tue, 23 Oct 2018 at 08:23, Wido den Hollander <wi...@widodh.nl>
>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I just wanted to know if there are people out there using KVM with
>>>>> Advanced Networking and using VXLAN for different networks.
>>>>>
>>>>> Our main goal would be to spawn a VM and based on the network the NIC
>> is
>>>>> in attach it to a different VXLAN bridge on the KVM host.
>>>>>
>>>>> It seems to me that this should work, but I just wanted to check and
>> see
>>>>> if people have experience with it.
>>>>>
>>>>> Wido
>>>>>
>>>>
>>>>
>>>
>>
> 

Re: VXLAN and KVM experiences

Posted by Ivan Kudryavtsev <ku...@bw-sw.com>.
Doesn't a solution like this work seamlessly for large VXLAN networks?

https://vincent.bernat.ch/en/blog/2017-vxlan-bgp-evpn

Tue, 23 Oct 2018, 8:34 Simon Weller <sw...@ena.com.invalid>:

> Linux native VXLAN uses multicast and each host has to participate in
> multicast in order to see the VXLAN networks. We haven't tried using PIM
> across a L3 boundary with ACS, although it will probably work fine.
>
> Another option is to use a L3 VTEP, but right now there is no native
> support for that in CloudStack's VXLAN implementation, although we've
> thought about proposing it as feature.
>
>
> ________________________________
> From: Wido den Hollander <wi...@widodh.nl>
> Sent: Tuesday, October 23, 2018 7:17 AM
> To: dev@cloudstack.apache.org; Simon Weller
> Subject: Re: VXLAN and KVm experiences
>
>
>
> On 10/23/18 1:51 PM, Simon Weller wrote:
> > We've also been using VXLAN on KVM for all of our isolated VPC guest
> networks for quite a long time now. As Andrija pointed out, make sure you
> increase the max_igmp_memberships param and also put an ip address on each
> interface host VXLAN interface in the same subnet for all hosts that will
> share networking, or multicast won't work.
> >
>
> Thanks! So you are saying that all hypervisors need to be in the same L2
> network or are you routing the multicast?
>
> My idea was that each POD would be an isolated Layer 3 domain and that a
> VNI would span over the different Layer 3 networks.
>
> I don't like STP and other Layer 2 loop-prevention systems.
>
> Wido
>
> >
> > - Si
> >
> >
> > ________________________________
> > From: Wido den Hollander <wi...@widodh.nl>
> > Sent: Tuesday, October 23, 2018 5:21 AM
> > To: dev@cloudstack.apache.org
> > Subject: Re: VXLAN and KVm experiences
> >
> >
> >
> > On 10/23/18 11:21 AM, Andrija Panic wrote:
> >> Hi Wido,
> >>
> >> I have "pioneered" this one in production for last 3 years (and
> suffered a
> >> nasty pain of silent drop of packages on kernel 3.X back in the days
> >> because of being unaware of max_igmp_memberships kernel parameters, so I
> >> have updated the manual long time ago).
> >>
> >> I never had any issues (beside above nasty one...) and it works very
> well.
> >
> > That's what I want to hear!
> >
> >> To avoid above issue that I described - you should increase
> >> max_igmp_memberships (/proc/sys/net/ipv4/igmp_max_memberships)  -
> otherwise
> >> with more than 20 vxlan interfaces, some of them will stay in down state
> >> and have a hard traffic drop (with proper message in agent.log) with
> kernel
> >>> 4.0 (or I silent, bitchy random packet drop on kernel 3.X...) - and
> also
> >> pay attention to MTU size as well - anyway everything is in the manual
> (I
> >> updated everything I though was missing) - so please check it.
> >>
> >
> > Yes, the underlying network will all be 9000 bytes MTU.
> >
> >> Our example setup:
> >>
> >> We have i.e. bond.950 as the main VLAN which will carry all vxlan
> "tunnels"
> >> - so this is defined as KVM traffic label. In our case it didn't make
> sense
> >> to use bridge on top of this bond0.950 (as the traffic label) - you can
> >> test it on your own - since this bridge is used only to extract child
> >> bond0.950 interface name, then based on vxlan ID, ACS will provision
> >> vxlanYYY@bond0.xxx and join this new vxlan interface to NEW bridge
> created
> >> (and then of course vNIC goes to this new bridge), so original bridge
> (to
> >> which bond0.xxx belonged) is not used for anything.
> >>
> >
> > Clear, I indeed thought something like that would happen.
> >
> >> Here is sample from above for vxlan 867 used for tenant isolation:
> >>
> >> root@hostname:~# brctl show brvx-867
> >>
> >> bridge name     bridge id               STP enabled     interfaces
> >> brvx-867                8000.2215cfce99ce       no              vnet6
> >>
> >>      vxlan867
> >>
> >> root@hostname:~# ip -d link show vxlan867
> >>
> >> 297: vxlan867: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8142 qdisc noqueue
> >> master brvx-867 state UNKNOWN mode DEFAULT group default qlen 1000
> >>     link/ether 22:15:cf:ce:99:ce brd ff:ff:ff:ff:ff:ff promiscuity 1
> >>     vxlan id 867 group 239.0.3.99 dev bond0.950 port 0 0 ttl 10 ageing
> 300
> >>
> >> root@ix1-c7-2:~# ifconfig bond0.950 | grep MTU
> >>           UP BROADCAST RUNNING MULTICAST  MTU:8192  Metric:1
> >>
> >> So note how the vxlan interface has by 50 bytes smaller MTU than the
> >> bond0.950 parent interface (which could affects traffic inside VM) - so
> >> jumbo frames are needed anyway on the parent interface (bond.950 in
> example
> >> above with minimum of 1550 MTU)
> >>
> >
> > Yes, thanks! We will be using 1500 MTU inside the VMs, so all the
> > networks underneath will be ~9k.
> >
> >> Ping me if more details needed, happy to help.
> >>
> >
> > Awesome! We'll be doing a PoC rather soon. I'll come back with our
> > experiences later.
> >
> > Wido
> >
> >> Cheers
> >> Andrija
> >>
> >> On Tue, 23 Oct 2018 at 08:23, Wido den Hollander <wi...@widodh.nl>
> wrote:
> >>
> >>> Hi,
> >>>
> >>> I just wanted to know if there are people out there using KVM with
> >>> Advanced Networking and using VXLAN for different networks.
> >>>
> >>> Our main goal would be to spawn a VM and based on the network the NIC
> is
> >>> in attach it to a different VXLAN bridge on the KVM host.
> >>>
> >>> It seems to me that this should work, but I just wanted to check and
> see
> >>> if people have experience with it.
> >>>
> >>> Wido
> >>>
> >>
> >>
> >
>

Re: VXLAN and KVM experiences

Posted by Wido den Hollander <wi...@widodh.nl>.

On 11/14/18 6:25 PM, Simon Weller wrote:
> Wido,
> 
> 
> Here is the original document on the implementation of VXLAN in ACS
> - https://cwiki.apache.org/confluence/display/CLOUDSTACK/Linux+native+VXLAN+support+on+KVM+hypervisor
> 
> It may shed some light on the reasons for the different multicast groups.
> 

Yes, I see now. It is to prevent a single multicast group from being flooded
with the traffic of all VNIs.
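
For reference, the per-VNI group from the modifyvxlan.sh line quoted below
works out like this (plain bash, VNI 1000 as an example):

vxlanId=1000
echo "239.$(( ($vxlanId >> 16) % 256 )).$(( ($vxlanId >> 8) % 256 )).$(( $vxlanId % 256 ))"
# -> 239.0.3.232  (1000 >> 16 = 0, (1000 >> 8) % 256 = 3, 1000 % 256 = 232)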

Thanks!

Wido

>  
> - Si
> 
> ------------------------------------------------------------------------
> *From:* Wido den Hollander <wi...@widodh.nl>
> *Sent:* Tuesday, November 13, 2018 4:40 AM
> *To:* dev@cloudstack.apache.org; Simon Weller
> *Subject:* Re: VXLAN and KVm experiences
>  
> 
> 
> On 10/23/18 2:34 PM, Simon Weller wrote:
>> Linux native VXLAN uses multicast and each host has to participate in multicast in order to see the VXLAN networks. We haven't tried using PIM across a L3 boundary with ACS, although it will probably work fine.
>> 
>> Another option is to use a L3 VTEP, but right now there is no native support for that in CloudStack's VXLAN implementation, although we've thought about proposing it as feature.
>> 
> 
> Getting back to this I see CloudStack does this:
> 
> local mcastGrp="239.$(( ($vxlanId >> 16) % 256 )).$(( ($vxlanId >> 8) % 256 )).$(( $vxlanId % 256 ))"
>
> VNI 1000 would use group 239.0.3.232 and VNI 1001 would use 239.0.3.233.
> 
> Why are we using a different mcast group for every VNI? As the VNI is
> encoded in the packet this should just work in one group, right?
> 
> Because this way you need to configure all those groups on your
> Router(s) as each VNI will use a different Multicast Group.
> 
> I'm just looking for the reason why we have these different multicast groups.
>
> I was thinking that we might want to add an option to agent.properties
> where we allow users to set a fixed multicast group for all traffic.
> 
> Wido
> 
> [0]:
> https://github.com/apache/cloudstack/blob/master/scripts/vm/network/vnet/modifyvxlan.sh#L33
> 
> 
> 
>> 
>> ________________________________
>> From: Wido den Hollander <wi...@widodh.nl>
>> Sent: Tuesday, October 23, 2018 7:17 AM
>> To: dev@cloudstack.apache.org; Simon Weller
>> Subject: Re: VXLAN and KVm experiences
>> 
>> 
>> 
>> On 10/23/18 1:51 PM, Simon Weller wrote:
>>> We've also been using VXLAN on KVM for all of our isolated VPC guest networks for quite a long time now. As Andrija pointed out, make sure you increase the max_igmp_memberships param and also put an ip address on each interface host VXLAN interface in the same subnet for all hosts that will share networking, or multicast
> won't work.
>>>
>> 
>> Thanks! So you are saying that all hypervisors need to be in the same L2
>> network or are you routing the multicast?
>> 
>> My idea was that each POD would be an isolated Layer 3 domain and that a
>> VNI would span over the different Layer 3 networks.
>> 
>> I don't like STP and other Layer 2 loop-prevention systems.
>> 
>> Wido
>> 
>>>
>>> - Si
>>>
>>>
>>> ________________________________
>>> From: Wido den Hollander <wi...@widodh.nl>
>>> Sent: Tuesday, October 23, 2018 5:21 AM
>>> To: dev@cloudstack.apache.org
>>> Subject: Re: VXLAN and KVm experiences
>>>
>>>
>>>
>>> On 10/23/18 11:21 AM, Andrija Panic wrote:
>>>> Hi Wido,
>>>>
>>>> I have "pioneered" this one in production for last 3 years (and suffered a
>>>> nasty pain of silent drop of packages on kernel 3.X back in the days
>>>> because of being unaware of max_igmp_memberships kernel parameters, so I
>>>> have updated the manual long time ago).
>>>>
>>>> I never had any issues (beside above nasty one...) and it works very well.
>>>
>>> That's what I want to hear!
>>>
>>>> To avoid above issue that I described - you should increase
>>>> max_igmp_memberships (/proc/sys/net/ipv4/igmp_max_memberships)  - otherwise
>>>> with more than 20 vxlan interfaces, some of them will stay in down state
>>>> and have a hard traffic drop (with proper message in agent.log) with kernel
>>>>> 4.0 (or I silent, bitchy random packet drop on kernel 3.X...) - and also
>>>> pay attention to MTU size as well - anyway everything is in the manual (I
>>>> updated everything I though was missing) - so please check it.
>>>>
>>>
>>> Yes, the underlying network will all be 9000 bytes MTU.
>>>
>>>> Our example setup:
>>>>
>>>> We have i.e. bond.950 as the main VLAN which will carry all vxlan "tunnels"
>>>> - so this is defined as KVM traffic label. In our case it didn't make sense
>>>> to use bridge on top of this bond0.950 (as the traffic label) - you can
>>>> test it on your own - since this bridge is used only to extract child
>>>> bond0.950 interface name, then based on vxlan ID, ACS will provision
>>>> vxlanYYY@bond0.xxx and join this new vxlan interface to NEW bridge created
>>>> (and then of course vNIC goes to this new bridge), so original bridge (to
>>>> which bond0.xxx belonged) is not used for anything.
>>>>
>>>
>>> Clear, I indeed thought something like that would happen.
>>>
>>>> Here is sample from above for vxlan 867 used for tenant isolation:
>>>>
>>>> root@hostname:~# brctl show brvx-867
>>>>
>>>> bridge name     bridge id               STP enabled     interfaces
>>>> brvx-867                8000.2215cfce99ce       no              vnet6
>>>>
>>>>      vxlan867
>>>>
>>>> root@hostname:~# ip -d link show vxlan867
>>>>
>>>> 297: vxlan867: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8142 qdisc noqueue
>>>> master brvx-867 state UNKNOWN mode DEFAULT group default qlen 1000
>>>>     link/ether 22:15:cf:ce:99:ce brd ff:ff:ff:ff:ff:ff promiscuity 1
>>>>     vxlan id 867 group 239.0.3.99 dev bond0.950 port 0 0 ttl 10 ageing 300
>>>>
>>>> root@ix1-c7-2:~# ifconfig bond0.950 | grep MTU
>>>>           UP BROADCAST RUNNING MULTICAST  MTU:8192  Metric:1
>>>>
>>>> So note how the vxlan interface has by 50 bytes smaller MTU than the
>>>> bond0.950 parent interface (which could affects traffic inside VM) - so
>>>> jumbo frames are needed anyway on the parent interface (bond.950 in example
>>>> above with minimum of 1550 MTU)
>>>>
>>>
>>> Yes, thanks! We will be using 1500 MTU inside the VMs, so all the
>>> networks underneath will be ~9k.
>>>
>>>> Ping me if more details needed, happy to help.
>>>>
>>>
>>> Awesome! We'll be doing a PoC rather soon. I'll come back with our
>>> experiences later.
>>>
>>> Wido
>>>
>>>> Cheers
>>>> Andrija
>>>>
>>>> On Tue, 23 Oct 2018 at 08:23, Wido den Hollander <wi...@widodh.nl> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I just wanted to know if there are people out there using KVM with
>>>>> Advanced Networking and using VXLAN for different networks.
>>>>>
>>>>> Our main goal would be to spawn a VM and based on the network the NIC is
>>>>> in attach it to a different VXLAN bridge on the KVM host.
>>>>>
>>>>> It seems to me that this should work, but I just wanted to check and see
>>>>> if people have experience with it.
>>>>>
>>>>> Wido
>>>>>
>>>>
>>>>
>>>
>> 

Re: VXLAN and KVM experiences

Posted by Andrija Panic <an...@gmail.com>.
http://docs.cloudstack.apache.org/en/4.11.1.0/plugins/vxlan.html?highlight=vxlan#important-note-on-mtu-size

 :) It's there from my early days of experimenting with vxlan.

As for LRO, this was a pain on Ubuntu 14.04. I remember compiling the IXGBE
driver (because I was pissed off with the stock one) with LRO explicitly
disabled at compile time, but the kernel just kept re-enabling it on
the NIC, even though it should automatically disable LRO as soon as the NIC
becomes part of a bridge (FYI, it worked just fine on CentOS 6 with the same
stock IXGBE driver version at the time...).
Wonders of Ubuntu...
I used a pre-up line in the interfaces file to make sure it's disabled, for
each and every NIC...
They fixed this in kernel 4.x, so Ubuntu 16.04 basically... and onwards,
which is good.

As for MTU - same story: on 14.04 there was no way to make it work in the
normal, default way inside the interfaces file.
So we relied heavily on rc.local to do all the actions (MTU, some routes,
etc.) after the OS had already booted (all NICs started with the default
MTU, etc.).
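
For illustration, the kind of stanza I mean - a pre-up line for LRO and the
MTU forced afterwards (interface name and MTU are just examples):

auto eth0
iface eth0 inet manual
    # keep LRO off; pre-up runs before the link is brought up
    pre-up ethtool -K eth0 lro off
    # MTU forced here (or from rc.local on 14.04, as described above)
    post-up ip link set dev eth0 mtu 9000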

I'm yet to try installing 18.04 in a software RAID1 setup (2xSSDs, 12xHDDs),
which failed miserably for 14.04 and 16.04 :) to see if they improved this
one (yes, the same server with CentOS 6.x would install just fine) hehe...

Sorry, ranting...

Cheers



On Mon, 19 Nov 2018 at 16:22, Rohit Yadav <ro...@shapeblue.com> wrote:

> Hi Andrija,
>
>
> Thanks for the pointers!!! I managed to fix my issues. It would be great
> to document your experiences on the vxlan/cloudstack docs page.
>
>
> By slow I meant the same behaviour you've described: pings and small file
> downloads are fast (a few MB/s) but large file download/transfer speeds
> fall to a few KB/bytes per second (and sometimes stall).
>
>
> I checked and found that lro was off by default:
>
> # ethtool -k enp2s0 | grep large
> large-receive-offload: off [fixed]
>
>
> Another issue I found was that with Ubuntu 18.04 netplan does not apply
> mtu settings as provided in the yaml file:
> https://bugs.launchpad.net/ubuntu/+source/nplan/+bug/1724895 (facepalm 😞)
>
> For the above issue, I used a workaround described in
> https://djanotes.blogspot.com/2018/01/netplan-setting-mtu-for-bridge-devices.html
> and after rebooting my hosts, the nested VM's were able to download files
> from public Internet per the ISP/network provided speeds.
>
>
> - Rohit
>
> <https://cloudstack.apache.org>
>
>
>
> ________________________________
> From: Andrija Panic <an...@gmail.com>
> Sent: Monday, November 19, 2018 7:46:43 PM
> To: dev
> Cc: Wido den Hollander
> Subject: Re: VXLAN and KVm experiences
>
> Define slow please ? - MTU for parent interface of all vxlan interfaces is
> set to 1550 or more (vxlan interface MTU == 50 bytes less than parent
> interface) ?
>  Can you check if LRO is disabled on physical nics - with LRO issues (spent
> 2 days of my life for this back in the days...) ping is working fine, but
> any larger packer goes to almost zero KB/s... (Ubuntu thing btw...)
>
> Cheers
>
>
>
>
>
>
> On Mon, 19 Nov 2018 at 14:36, Rohit Yadav <ro...@shapeblue.com>
> wrote:
>
> > All,
> >
> > I need some pointers around vxlan debugging and configuration: (sorry for
> > the long email)
> >
> > I'm working on a concept CI system where the idea is to setup CloudStack
> > with kvm hosts and use vxlan isolation for guest, mgmt  and public
> > networks, and then run CI jobs as CloudStack projects where monkeybox VMs
> > (nested kvm VMs) run in isolated networks and are used to test a
> CloudStack
> > build/branch/PR.
> >
> > I've two Ubuntu 18.04.1 based i7 mini pcs running KVM, where there is a
> > single bridge/nic cloudbr0 to carry public, guest and mgmt network that
> is
> > vxlan based. I've set max_igmp_memberships to 200 and to see console
> proxy
> > etc I used vxlan://untagged for the public IP address range. The gigabit
> > switch between them does not support igmp snooping. Now the problem is
> that
> > in the nested VMs in an isolated network (VRs public nic plugs into
> > cloudbr0, and guest nic plugs into a bridge that has vxlan end point for
> > some VNI) , the download speed from public network is very slow. I've
> > enabled the default udp port for vxlan on both hosts. How do I debug
> > vxlans, what's going wrong? (do note that I've a single bridge for all
> > those networks, with no vlans)
> >
> >
> > Regards,
> > Rohit Yadav
> >
> > ________________________________
> > From: Simon Weller <sw...@ena.com.INVALID>
> > Sent: Wednesday, November 14, 2018 10:55:18 PM
> > To: Wido den Hollander; dev@cloudstack.apache.org
> > Subject: Re: VXLAN and KVm experiences
> >
> > Wido,
> >
> >
> > Here is the original document on the implemention for VXLAN in ACS -
> >
> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Linux+native+VXLAN+support+on+KVM+hypervisor
> >
> > It may shed some light on the reasons for the different multicast groups.
> >
> >
> > - Si
> >
> > ________________________________
> > From: Wido den Hollander <wi...@widodh.nl>
> > Sent: Tuesday, November 13, 2018 4:40 AM
> > To: dev@cloudstack.apache.org; Simon Weller
> > Subject: Re: VXLAN and KVm experiences
> >
> >
> >
> > On 10/23/18 2:34 PM, Simon Weller wrote:
> > > Linux native VXLAN uses multicast and each host has to participate in
> > multicast in order to see the VXLAN networks. We haven't tried using PIM
> > across a L3 boundary with ACS, although it will probably work fine.
> > >
> > > Another option is to use a L3 VTEP, but right now there is no native
> > support for that in CloudStack's VXLAN implementation, although we've
> > thought about proposing it as feature.
> > >
> >
> > Getting back to this I see CloudStack does this:
> >
> > local mcastGrp="239.$(( ($vxlanId >> 16) % 256 )).$(( ($vxlanId >> 8) %
> > 256 )).$(( $vxlanId % 256 ))"
> >
> > VNI 1000 would use group 239.0.3.232 and VNI 1001 uses 239.0.3.233 1000.
> >
> > Why are we using a different mcast group for every VNI? As the VNI is
> > encoded in the packet this should just work in one group, right?
> >
> > Because this way you need to configure all those groups on your
> > Router(s) as each VNI will use a different Multicast Group.
> >
> > I'm just looking for the reason why we have this different multicast
> > groups.
> >
> > I was thinking that we might want to add a option to agent.properties
> > where we allow users to set a fixed Multicast group for all traffic.
> >
> > Wido
> >
> > [0]:
> >
> >
> https://github.com/apache/cloudstack/blob/master/scripts/vm/network/vnet/modifyvxlan.sh#L33
> >
> >
> >
> > >
> > > ________________________________
> > > From: Wido den Hollander <wi...@widodh.nl>
> > > Sent: Tuesday, October 23, 2018 7:17 AM
> > > To: dev@cloudstack.apache.org; Simon Weller
> > > Subject: Re: VXLAN and KVm experiences
> > >
> > >
> > >
> > > On 10/23/18 1:51 PM, Simon Weller wrote:
> > >> We've also been using VXLAN on KVM for all of our isolated VPC guest
> > networks for quite a long time now. As Andrija pointed out, make sure you
> > increase the max_igmp_memberships param and also put an ip address on
> each
> > interface host VXLAN interface in the same subnet for all hosts that will
> > share networking, or multicast won't work.
> > >>
> > >
> > > Thanks! So you are saying that all hypervisors need to be in the same
> L2
> > > network or are you routing the multicast?
> > >
> > > My idea was that each POD would be an isolated Layer 3 domain and that
> a
> > > VNI would span over the different Layer 3 networks.
> > >
> > > I don't like STP and other Layer 2 loop-prevention systems.
> > >
> > > Wido
> > >
> > >>
> > >> - Si
> > >>
> > >>
> > >> ________________________________
> > >> From: Wido den Hollander <wi...@widodh.nl>
> > >> Sent: Tuesday, October 23, 2018 5:21 AM
> > >> To: dev@cloudstack.apache.org
> > >> Subject: Re: VXLAN and KVm experiences
> > >>
> > >>
> > >>
> > >> On 10/23/18 11:21 AM, Andrija Panic wrote:
> > >>> Hi Wido,
> > >>>
> > >>> I have "pioneered" this one in production for last 3 years (and
> > suffered a
> > >>> nasty pain of silent drop of packages on kernel 3.X back in the days
> > >>> because of being unaware of max_igmp_memberships kernel parameters,
> so
> > I
> > >>> have updated the manual long time ago).
> > >>>
> > >>> I never had any issues (beside above nasty one...) and it works very
> > well.
> > >>
> > >> That's what I want to hear!
> > >>
> > >>> To avoid above issue that I described - you should increase
> > >>> max_igmp_memberships (/proc/sys/net/ipv4/igmp_max_memberships)  -
> > otherwise
> > >>> with more than 20 vxlan interfaces, some of them will stay in down
> > state
> > >>> and have a hard traffic drop (with proper message in agent.log) with
> > kernel
> > >>>> 4.0 (or I silent, bitchy random packet drop on kernel 3.X...) - and
> > also
> > >>> pay attention to MTU size as well - anyway everything is in the
> manual
> > (I
> > >>> updated everything I though was missing) - so please check it.
> > >>>
> > >>
> > >> Yes, the underlying network will all be 9000 bytes MTU.
> > >>
> > >>> Our example setup:
> > >>>
> > >>> We have i.e. bond.950 as the main VLAN which will carry all vxlan
> > "tunnels"
> > >>> - so this is defined as KVM traffic label. In our case it didn't make
> > sense
> > >>> to use bridge on top of this bond0.950 (as the traffic label) - you
> can
> > >>> test it on your own - since this bridge is used only to extract child
> > >>> bond0.950 interface name, then based on vxlan ID, ACS will provision
> > >>> vxlanYYY@bond0.xxx and join this new vxlan interface to NEW bridge
> > created
> > >>> (and then of course vNIC goes to this new bridge), so original bridge
> > (to
> > >>> which bond0.xxx belonged) is not used for anything.
> > >>>
> > >>
> > >> Clear, I indeed thought something like that would happen.
> > >>
> > >>> Here is sample from above for vxlan 867 used for tenant isolation:
> > >>>
> > >>> root@hostname:~# brctl show brvx-867
> > >>>
> > >>> bridge name     bridge id               STP enabled     interfaces
> > >>> brvx-867                8000.2215cfce99ce       no              vnet6
> > >>>
> > >>>      vxlan867
> > >>>
> > >>> root@hostname:~# ip -d link show vxlan867
> > >>>
> > >>> 297: vxlan867: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8142 qdisc
> noqueue
> > >>> master brvx-867 state UNKNOWN mode DEFAULT group default qlen 1000
> > >>>     link/ether 22:15:cf:ce:99:ce brd ff:ff:ff:ff:ff:ff promiscuity 1
> > >>>     vxlan id 867 group 239.0.3.99 dev bond0.950 port 0 0 ttl 10
> ageing
> > 300
> > >>>
> > >>> root@ix1-c7-2:~# ifconfig bond0.950 | grep MTU
> > >>>           UP BROADCAST RUNNING MULTICAST  MTU:8192  Metric:1
> > >>>
> > >>> So note how the vxlan interface has by 50 bytes smaller MTU than the
> > >>> bond0.950 parent interface (which could affects traffic inside VM) -
> so
> > >>> jumbo frames are needed anyway on the parent interface (bond.950 in
> > example
> > >>> above with minimum of 1550 MTU)
> > >>>
> > >>
> > >> Yes, thanks! We will be using 1500 MTU inside the VMs, so all the
> > >> networks underneath will be ~9k.
> > >>
> > >>> Ping me if more details needed, happy to help.
> > >>>
> > >>
> > >> Awesome! We'll be doing a PoC rather soon. I'll come back with our
> > >> experiences later.
> > >>
> > >> Wido
> > >>
> > >>> Cheers
> > >>> Andrija
> > >>>
> > >>
> >
> >
> >
> > > On Tue, 23 Oct 2018 at 08:23, Wido den Hollander <wi...@widodh.nl>
> wrote:
> > >>>
> > >>>> Hi,
> > >>>>
> > >>>> I just wanted to know if there are people out there using KVM with
> > >>>> Advanced Networking and using VXLAN for different networks.
> > >>>>
> > >>>> Our main goal would be to spawn a VM and based on the network the
> NIC
> > is
> > >>>> in attach it to a different VXLAN bridge on the KVM host.
> > >>>>
> > >>>> It seems to me that this should work, but I just wanted to check and
> > see
> > >>>> if people have experience with it.
> > >>>>
> > >>>> Wido
> > >>>>
> > >>>
> > >>>
> > >>
> > >
> >
>
>
> --
>
> Andrija Panić
>


-- 

Andrija Panić

Re: VXLAN and KVm experiences

Posted by Rohit Yadav <ro...@shapeblue.com>.
Hi Andrija,


Thanks for the pointers!!! I managed to fix my issues. It would be great to document your experiences on the vxlan/cloudstack docs page.


By slow I meant the same behaviour you've described: pings and small file downloads are fast (a few MB/s), but large file download/transfer speeds fall to a few KB (or bytes) per second and sometimes stall.


I checked and found that LRO was off by default:

# ethtool -k enp2s0 | grep large
large-receive-offload: off [fixed]


Another issue I found was that on Ubuntu 18.04 netplan does not apply MTU settings as provided in the YAML file: https://bugs.launchpad.net/ubuntu/+source/nplan/+bug/1724895 (facepalm 😞)

For the above issue, I used a workaround described in https://djanotes.blogspot.com/2018/01/netplan-setting-mtu-for-bridge-devices.html and after rebooting my hosts, the nested VMs were able to download files from the public Internet at the speeds provided by the ISP/network.
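For reference, this is more or less the shape of it (a sketch, not my exact
files - cloudbr0/enp2s0 match my hosts, but the address and the 9000 MTU are
placeholders, and the runtime commands are just a quick way to verify/fix the
MTU until the boot-time config sticks):

# /etc/netplan/01-cloudbr0.yaml - the mtu key that netplan was not applying
network:
  version: 2
  ethernets:
    enp2s0:
      mtu: 9000
  bridges:
    cloudbr0:
      interfaces: [enp2s0]
      mtu: 9000
      addresses: [192.168.1.11/24]

# verify and, if needed, force the MTU at runtime
ip link show cloudbr0 | grep mtu
ip link set dev cloudbr0 mtu 9000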


- Rohit

<https://cloudstack.apache.org>



________________________________
From: Andrija Panic <an...@gmail.com>
Sent: Monday, November 19, 2018 7:46:43 PM
To: dev
Cc: Wido den Hollander
Subject: Re: VXLAN and KVm experiences

Define slow please ? - MTU for parent interface of all vxlan interfaces is
set to 1550 or more (vxlan interface MTU == 50 bytes less than parent
interface) ?
 Can you check if LRO is disabled on physical nics - with LRO issues (spent
2 days of my life for this back in the days...) ping is working fine, but
any larger packer goes to almost zero KB/s... (Ubuntu thing btw...)

Cheers




On Mon, 19 Nov 2018 at 14:36, Rohit Yadav <ro...@shapeblue.com> wrote:

> All,
>
> I need some pointers around vxlan debugging and configuration: (sorry for
> the long email)
>
> I'm working on a concept CI system where the idea is to setup CloudStack
> with kvm hosts and use vxlan isolation for guest, mgmt  and public
> networks, and then run CI jobs as CloudStack projects where monkeybox VMs
> (nested kvm VMs) run in isolated networks and are used to test a CloudStack
> build/branch/PR.
>
> I've two Ubuntu 18.04.1 based i7 mini pcs running KVM, where there is a
> single bridge/nic cloudbr0 to carry public, guest and mgmt network that is
> vxlan based. I've set max_igmp_memberships to 200 and to see console proxy
> etc I used vxlan://untagged for the public IP address range. The gigabit
> switch between them does not support igmp snooping. Now the problem is that
> in the nested VMs in an isolated network (VRs public nic plugs into
> cloudbr0, and guest nic plugs into a bridge that has vxlan end point for
> some VNI) , the download speed from public network is very slow. I've
> enabled the default udp port for vxlan on both hosts. How do I debug
> vxlans, what's going wrong? (do note that I've a single bridge for all
> those networks, with no vlans)
>
>
> Regards,
> Rohit Yadav
>
> ________________________________
> From: Simon Weller <sw...@ena.com.INVALID>
> Sent: Wednesday, November 14, 2018 10:55:18 PM
> To: Wido den Hollander; dev@cloudstack.apache.org
> Subject: Re: VXLAN and KVm experiences
>
> Wido,
>
>
> Here is the original document on the implemention for VXLAN in ACS -
> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Linux+native+VXLAN+support+on+KVM+hypervisor
>
> It may shed some light on the reasons for the different multicast groups.
>
>
> - Si
>
> ________________________________
> From: Wido den Hollander <wi...@widodh.nl>
> Sent: Tuesday, November 13, 2018 4:40 AM
> To: dev@cloudstack.apache.org; Simon Weller
> Subject: Re: VXLAN and KVm experiences
>
>
>
> On 10/23/18 2:34 PM, Simon Weller wrote:
> > Linux native VXLAN uses multicast and each host has to participate in
> multicast in order to see the VXLAN networks. We haven't tried using PIM
> across a L3 boundary with ACS, although it will probably work fine.
> >
> > Another option is to use a L3 VTEP, but right now there is no native
> support for that in CloudStack's VXLAN implementation, although we've
> thought about proposing it as feature.
> >
>
> Getting back to this I see CloudStack does this:
>
> local mcastGrp="239.$(( ($vxlanId >> 16) % 256 )).$(( ($vxlanId >> 8) %
> 256 )).$(( $vxlanId % 256 ))"
>
> VNI 1000 would use group 239.0.3.232 and VNI 1001 uses 239.0.3.233 1000.
>
> Why are we using a different mcast group for every VNI? As the VNI is
> encoded in the packet this should just work in one group, right?
>
> Because this way you need to configure all those groups on your
> Router(s) as each VNI will use a different Multicast Group.
>
> I'm just looking for the reason why we have this different multicast
> groups.
>
> I was thinking that we might want to add a option to agent.properties
> where we allow users to set a fixed Multicast group for all traffic.
>
> Wido
>
> [0]:
>
> https://github.com/apache/cloudstack/blob/master/scripts/vm/network/vnet/modifyvxlan.sh#L33
>
>
>
> >
> > ________________________________
> > From: Wido den Hollander <wi...@widodh.nl>
> > Sent: Tuesday, October 23, 2018 7:17 AM
> > To: dev@cloudstack.apache.org; Simon Weller
> > Subject: Re: VXLAN and KVm experiences
> >
> >
> >
> > On 10/23/18 1:51 PM, Simon Weller wrote:
> >> We've also been using VXLAN on KVM for all of our isolated VPC guest
> networks for quite a long time now. As Andrija pointed out, make sure you
> increase the max_igmp_memberships param and also put an ip address on each
> interface host VXLAN interface in the same subnet for all hosts that will
> share networking, or multicast won't work.
> >>
> >
> > Thanks! So you are saying that all hypervisors need to be in the same L2
> > network or are you routing the multicast?
> >
> > My idea was that each POD would be an isolated Layer 3 domain and that a
> > VNI would span over the different Layer 3 networks.
> >
> > I don't like STP and other Layer 2 loop-prevention systems.
> >
> > Wido
> >
> >>
> >> - Si
> >>
> >>
> >> ________________________________
> >> From: Wido den Hollander <wi...@widodh.nl>
> >> Sent: Tuesday, October 23, 2018 5:21 AM
> >> To: dev@cloudstack.apache.org
> >> Subject: Re: VXLAN and KVm experiences
> >>
> >>
> >>
> >> On 10/23/18 11:21 AM, Andrija Panic wrote:
> >>> Hi Wido,
> >>>
> >>> I have "pioneered" this one in production for last 3 years (and
> suffered a
> >>> nasty pain of silent drop of packages on kernel 3.X back in the days
> >>> because of being unaware of max_igmp_memberships kernel parameters, so
> I
> >>> have updated the manual long time ago).
> >>>
> >>> I never had any issues (beside above nasty one...) and it works very
> well.
> >>
> >> That's what I want to hear!
> >>
> >>> To avoid above issue that I described - you should increase
> >>> max_igmp_memberships (/proc/sys/net/ipv4/igmp_max_memberships)  -
> otherwise
> >>> with more than 20 vxlan interfaces, some of them will stay in down
> state
> >>> and have a hard traffic drop (with proper message in agent.log) with
> kernel
> >>>> 4.0 (or I silent, bitchy random packet drop on kernel 3.X...) - and
> also
> >>> pay attention to MTU size as well - anyway everything is in the manual
> (I
> >>> updated everything I though was missing) - so please check it.
> >>>
> >>
> >> Yes, the underlying network will all be 9000 bytes MTU.
> >>
> >>> Our example setup:
> >>>
> >>> We have i.e. bond.950 as the main VLAN which will carry all vxlan
> "tunnels"
> >>> - so this is defined as KVM traffic label. In our case it didn't make
> sense
> >>> to use bridge on top of this bond0.950 (as the traffic label) - you can
> >>> test it on your own - since this bridge is used only to extract child
> >>> bond0.950 interface name, then based on vxlan ID, ACS will provision
> >>> vxlanYYY@bond0.xxx and join this new vxlan interface to NEW bridge
> created
> >>> (and then of course vNIC goes to this new bridge), so original bridge
> (to
> >>> which bond0.xxx belonged) is not used for anything.
> >>>
> >>
> >> Clear, I indeed thought something like that would happen.
> >>
> >>> Here is sample from above for vxlan 867 used for tenant isolation:
> >>>
> >>> root@hostname:~# brctl show brvx-867
> >>>
> >>> bridge name     bridge id               STP enabled     interfaces
> >>> brvx-867                8000.2215cfce99ce       no              vnet6
> >>>
> >>>      vxlan867
> >>>
> >>> root@hostname:~# ip -d link show vxlan867
> >>>
> >>> 297: vxlan867: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8142 qdisc noqueue
> >>> master brvx-867 state UNKNOWN mode DEFAULT group default qlen 1000
> >>>     link/ether 22:15:cf:ce:99:ce brd ff:ff:ff:ff:ff:ff promiscuity 1
> >>>     vxlan id 867 group 239.0.3.99 dev bond0.950 port 0 0 ttl 10 ageing
> 300
> >>>
> >>> root@ix1-c7-2:~# ifconfig bond0.950 | grep MTU
> >>>           UP BROADCAST RUNNING MULTICAST  MTU:8192  Metric:1
> >>>
> >>> So note how the vxlan interface has by 50 bytes smaller MTU than the
> >>> bond0.950 parent interface (which could affects traffic inside VM) - so
> >>> jumbo frames are needed anyway on the parent interface (bond.950 in
> example
> >>> above with minimum of 1550 MTU)
> >>>
> >>
> >> Yes, thanks! We will be using 1500 MTU inside the VMs, so all the
> >> networks underneath will be ~9k.
> >>
> >>> Ping me if more details needed, happy to help.
> >>>
> >>
> >> Awesome! We'll be doing a PoC rather soon. I'll come back with our
> >> experiences later.
> >>
> >> Wido
> >>
> >>> Cheers
> >>> Andrija
> >>>
> >>
>
>
>
> > On Tue, 23 Oct 2018 at 08:23, Wido den Hollander <wi...@widodh.nl> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> I just wanted to know if there are people out there using KVM with
> >>>> Advanced Networking and using VXLAN for different networks.
> >>>>
> >>>> Our main goal would be to spawn a VM and based on the network the NIC
> is
> >>>> in attach it to a different VXLAN bridge on the KVM host.
> >>>>
> >>>> It seems to me that this should work, but I just wanted to check and
> see
> >>>> if people have experience with it.
> >>>>
> >>>> Wido
> >>>>
> >>>
> >>>
> >>
> >
>


--

Andrija Panić

Re: VXLAN and KVm experiences

Posted by Andrija Panic <an...@gmail.com>.
Define slow, please? Is the MTU of the parent interface of all vxlan
interfaces set to 1550 or more (the vxlan interface MTU == 50 bytes less than
the parent interface)?
Can you also check whether LRO is disabled on the physical NICs? With LRO
issues (spent 2 days of my life on this back in the days...) ping works fine,
but any larger packet drops to almost zero KB/s... (Ubuntu thing, btw...)
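(For a quick check across all physical NICs, something like this will do - the
NIC names are examples:)

# show the current LRO state for each physical NIC / bond member
for nic in eth0 eth1; do
    echo -n "$nic: "; ethtool -k "$nic" | grep large-receive-offload
done
# turn it off where the driver allows it (not possible if reported as [fixed])
ethtool -K eth0 lro off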

Cheers


On Mon, 19 Nov 2018 at 14:36, Rohit Yadav <ro...@shapeblue.com> wrote:

> All,
>
> I need some pointers around vxlan debugging and configuration: (sorry for
> the long email)
>
> I'm working on a concept CI system where the idea is to setup CloudStack
> with kvm hosts and use vxlan isolation for guest, mgmt  and public
> networks, and then run CI jobs as CloudStack projects where monkeybox VMs
> (nested kvm VMs) run in isolated networks and are used to test a CloudStack
> build/branch/PR.
>
> I've two Ubuntu 18.04.1 based i7 mini pcs running KVM, where there is a
> single bridge/nic cloudbr0 to carry public, guest and mgmt network that is
> vxlan based. I've set max_igmp_memberships to 200 and to see console proxy
> etc I used vxlan://untagged for the public IP address range. The gigabit
> switch between them does not support igmp snooping. Now the problem is that
> in the nested VMs in an isolated network (VRs public nic plugs into
> cloudbr0, and guest nic plugs into a bridge that has vxlan end point for
> some VNI) , the download speed from public network is very slow. I've
> enabled the default udp port for vxlan on both hosts. How do I debug
> vxlans, what's going wrong? (do note that I've a single bridge for all
> those networks, with no vlans)
>
>
> Regards,
> Rohit Yadav
>
> ________________________________
> From: Simon Weller <sw...@ena.com.INVALID>
> Sent: Wednesday, November 14, 2018 10:55:18 PM
> To: Wido den Hollander; dev@cloudstack.apache.org
> Subject: Re: VXLAN and KVm experiences
>
> Wido,
>
>
> Here is the original document on the implemention for VXLAN in ACS -
> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Linux+native+VXLAN+support+on+KVM+hypervisor
>
> It may shed some light on the reasons for the different multicast groups.
>
>
> - Si
>
> ________________________________
> From: Wido den Hollander <wi...@widodh.nl>
> Sent: Tuesday, November 13, 2018 4:40 AM
> To: dev@cloudstack.apache.org; Simon Weller
> Subject: Re: VXLAN and KVm experiences
>
>
>
> On 10/23/18 2:34 PM, Simon Weller wrote:
> > Linux native VXLAN uses multicast and each host has to participate in
> multicast in order to see the VXLAN networks. We haven't tried using PIM
> across a L3 boundary with ACS, although it will probably work fine.
> >
> > Another option is to use a L3 VTEP, but right now there is no native
> support for that in CloudStack's VXLAN implementation, although we've
> thought about proposing it as feature.
> >
>
> Getting back to this I see CloudStack does this:
>
> local mcastGrp="239.$(( ($vxlanId >> 16) % 256 )).$(( ($vxlanId >> 8) %
> 256 )).$(( $vxlanId % 256 ))"
>
> VNI 1000 would use group 239.0.3.232 and VNI 1001 uses 239.0.3.233 1000.
>
> Why are we using a different mcast group for every VNI? As the VNI is
> encoded in the packet this should just work in one group, right?
>
> Because this way you need to configure all those groups on your
> Router(s) as each VNI will use a different Multicast Group.
>
> I'm just looking for the reason why we have this different multicast
> groups.
>
> I was thinking that we might want to add a option to agent.properties
> where we allow users to set a fixed Multicast group for all traffic.
>
> Wido
>
> [0]:
>
> https://github.com/apache/cloudstack/blob/master/scripts/vm/network/vnet/modifyvxlan.sh#L33
>
>
>
> >
> > ________________________________
> > From: Wido den Hollander <wi...@widodh.nl>
> > Sent: Tuesday, October 23, 2018 7:17 AM
> > To: dev@cloudstack.apache.org; Simon Weller
> > Subject: Re: VXLAN and KVm experiences
> >
> >
> >
> > On 10/23/18 1:51 PM, Simon Weller wrote:
> >> We've also been using VXLAN on KVM for all of our isolated VPC guest
> networks for quite a long time now. As Andrija pointed out, make sure you
> increase the max_igmp_memberships param and also put an ip address on each
> interface host VXLAN interface in the same subnet for all hosts that will
> share networking, or multicast won't work.
> >>
> >
> > Thanks! So you are saying that all hypervisors need to be in the same L2
> > network or are you routing the multicast?
> >
> > My idea was that each POD would be an isolated Layer 3 domain and that a
> > VNI would span over the different Layer 3 networks.
> >
> > I don't like STP and other Layer 2 loop-prevention systems.
> >
> > Wido
> >
> >>
> >> - Si
> >>
> >>
> >> ________________________________
> >> From: Wido den Hollander <wi...@widodh.nl>
> >> Sent: Tuesday, October 23, 2018 5:21 AM
> >> To: dev@cloudstack.apache.org
> >> Subject: Re: VXLAN and KVm experiences
> >>
> >>
> >>
> >> On 10/23/18 11:21 AM, Andrija Panic wrote:
> >>> Hi Wido,
> >>>
> >>> I have "pioneered" this one in production for last 3 years (and
> suffered a
> >>> nasty pain of silent drop of packages on kernel 3.X back in the days
> >>> because of being unaware of max_igmp_memberships kernel parameters, so
> I
> >>> have updated the manual long time ago).
> >>>
> >>> I never had any issues (beside above nasty one...) and it works very
> well.
> >>
> >> That's what I want to hear!
> >>
> >>> To avoid above issue that I described - you should increase
> >>> max_igmp_memberships (/proc/sys/net/ipv4/igmp_max_memberships)  -
> otherwise
> >>> with more than 20 vxlan interfaces, some of them will stay in down
> state
> >>> and have a hard traffic drop (with proper message in agent.log) with
> kernel
> >>>> 4.0 (or I silent, bitchy random packet drop on kernel 3.X...) - and
> also
> >>> pay attention to MTU size as well - anyway everything is in the manual
> (I
> >>> updated everything I though was missing) - so please check it.
> >>>
> >>
> >> Yes, the underlying network will all be 9000 bytes MTU.
> >>
> >>> Our example setup:
> >>>
> >>> We have i.e. bond.950 as the main VLAN which will carry all vxlan
> "tunnels"
> >>> - so this is defined as KVM traffic label. In our case it didn't make
> sense
> >>> to use bridge on top of this bond0.950 (as the traffic label) - you can
> >>> test it on your own - since this bridge is used only to extract child
> >>> bond0.950 interface name, then based on vxlan ID, ACS will provision
> >>> vxlanYYY@bond0.xxx and join this new vxlan interface to NEW bridge
> created
> >>> (and then of course vNIC goes to this new bridge), so original bridge
> (to
> >>> which bond0.xxx belonged) is not used for anything.
> >>>
> >>
> >> Clear, I indeed thought something like that would happen.
> >>
> >>> Here is sample from above for vxlan 867 used for tenant isolation:
> >>>
> >>> root@hostname:~# brctl show brvx-867
> >>>
> >>> bridge name     bridge id               STP enabled     interfaces
> >>> brvx-867                8000.2215cfce99ce       no              vnet6
> >>>
> >>>      vxlan867
> >>>
> >>> root@hostname:~# ip -d link show vxlan867
> >>>
> >>> 297: vxlan867: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8142 qdisc noqueue
> >>> master brvx-867 state UNKNOWN mode DEFAULT group default qlen 1000
> >>>     link/ether 22:15:cf:ce:99:ce brd ff:ff:ff:ff:ff:ff promiscuity 1
> >>>     vxlan id 867 group 239.0.3.99 dev bond0.950 port 0 0 ttl 10 ageing
> 300
> >>>
> >>> root@ix1-c7-2:~# ifconfig bond0.950 | grep MTU
> >>>           UP BROADCAST RUNNING MULTICAST  MTU:8192  Metric:1
> >>>
> >>> So note how the vxlan interface has by 50 bytes smaller MTU than the
> >>> bond0.950 parent interface (which could affects traffic inside VM) - so
> >>> jumbo frames are needed anyway on the parent interface (bond.950 in
> example
> >>> above with minimum of 1550 MTU)
> >>>
> >>
> >> Yes, thanks! We will be using 1500 MTU inside the VMs, so all the
> >> networks underneath will be ~9k.
> >>
> >>> Ping me if more details needed, happy to help.
> >>>
> >>
> >> Awesome! We'll be doing a PoC rather soon. I'll come back with our
> >> experiences later.
> >>
> >> Wido
> >>
> >>> Cheers
> >>> Andrija
> >>>
> >>
>
>
>
> > On Tue, 23 Oct 2018 at 08:23, Wido den Hollander <wi...@widodh.nl> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> I just wanted to know if there are people out there using KVM with
> >>>> Advanced Networking and using VXLAN for different networks.
> >>>>
> >>>> Our main goal would be to spawn a VM and based on the network the NIC
> is
> >>>> in attach it to a different VXLAN bridge on the KVM host.
> >>>>
> >>>> It seems to me that this should work, but I just wanted to check and
> see
> >>>> if people have experience with it.
> >>>>
> >>>> Wido
> >>>>
> >>>
> >>>
> >>
> >
>


-- 

Andrija Panić

Re: VXLAN and KVm experiences

Posted by Rohit Yadav <ro...@shapeblue.com>.
All,

I need some pointers around vxlan debugging and configuration: (sorry for the long email)

I'm working on a proof-of-concept CI system where the idea is to set up CloudStack with KVM hosts and use vxlan isolation for the guest, mgmt and public networks, and then run CI jobs as CloudStack projects where monkeybox VMs (nested KVM VMs) run in isolated networks and are used to test a CloudStack build/branch/PR.

I've got two Ubuntu 18.04.1 based i7 mini PCs running KVM, with a single bridge/NIC cloudbr0 carrying the public, guest and mgmt networks, all vxlan based. I've set max_igmp_memberships to 200, and to reach the console proxy etc. I used vxlan://untagged for the public IP address range. The gigabit switch between the hosts does not support IGMP snooping. Now the problem is that in the nested VMs in an isolated network (the VR's public NIC plugs into cloudbr0, and the guest NIC plugs into a bridge that has a vxlan endpoint for some VNI), the download speed from the public network is very slow. I've enabled the default UDP port for vxlan on both hosts. How do I debug vxlans, and what is going wrong? (Do note that I have a single bridge for all those networks, with no VLANs.)
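In case it helps frame the question, the low-level checks I can think of are
along these lines (a sketch - vxlan1001 and eth0 are placeholder names, 4789
is the standard vxlan UDP port):

# is the IGMP membership limit actually raised?
sysctl net.ipv4.igmp_max_memberships

# does the vxlan device show the expected VNI, multicast group, parent dev and MTU?
ip -d link show vxlan1001

# are MAC addresses from the other host being learned over the tunnel?
bridge fdb show dev vxlan1001

# is encapsulated traffic actually leaving/arriving on the physical NIC?
tcpdump -ni eth0 udp port 4789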


Regards,
Rohit Yadav

________________________________
From: Simon Weller <sw...@ena.com.INVALID>
Sent: Wednesday, November 14, 2018 10:55:18 PM
To: Wido den Hollander; dev@cloudstack.apache.org
Subject: Re: VXLAN and KVm experiences

Wido,


Here is the original document on the implemention for VXLAN in ACS - https://cwiki.apache.org/confluence/display/CLOUDSTACK/Linux+native+VXLAN+support+on+KVM+hypervisor

It may shed some light on the reasons for the different multicast groups.


- Si

________________________________
From: Wido den Hollander <wi...@widodh.nl>
Sent: Tuesday, November 13, 2018 4:40 AM
To: dev@cloudstack.apache.org; Simon Weller
Subject: Re: VXLAN and KVm experiences



On 10/23/18 2:34 PM, Simon Weller wrote:
> Linux native VXLAN uses multicast and each host has to participate in multicast in order to see the VXLAN networks. We haven't tried using PIM across a L3 boundary with ACS, although it will probably work fine.
>
> Another option is to use a L3 VTEP, but right now there is no native support for that in CloudStack's VXLAN implementation, although we've thought about proposing it as feature.
>

Getting back to this I see CloudStack does this:

local mcastGrp="239.$(( ($vxlanId >> 16) % 256 )).$(( ($vxlanId >> 8) %
256 )).$(( $vxlanId % 256 ))"

VNI 1000 would use group 239.0.3.232 and VNI 1001 uses 239.0.3.233 1000.

Why are we using a different mcast group for every VNI? As the VNI is
encoded in the packet this should just work in one group, right?

Because this way you need to configure all those groups on your
Router(s) as each VNI will use a different Multicast Group.

I'm just looking for the reason why we have this different multicast groups.

I was thinking that we might want to add a option to agent.properties
where we allow users to set a fixed Multicast group for all traffic.

Wido

[0]:
https://github.com/apache/cloudstack/blob/master/scripts/vm/network/vnet/modifyvxlan.sh#L33



>
> ________________________________
> From: Wido den Hollander <wi...@widodh.nl>
> Sent: Tuesday, October 23, 2018 7:17 AM
> To: dev@cloudstack.apache.org; Simon Weller
> Subject: Re: VXLAN and KVm experiences
>
>
>
> On 10/23/18 1:51 PM, Simon Weller wrote:
>> We've also been using VXLAN on KVM for all of our isolated VPC guest networks for quite a long time now. As Andrija pointed out, make sure you increase the max_igmp_memberships param and also put an ip address on each interface host VXLAN interface in the same subnet for all hosts that will share networking, or multicast won't work.
>>
>
> Thanks! So you are saying that all hypervisors need to be in the same L2
> network or are you routing the multicast?
>
> My idea was that each POD would be an isolated Layer 3 domain and that a
> VNI would span over the different Layer 3 networks.
>
> I don't like STP and other Layer 2 loop-prevention systems.
>
> Wido
>
>>
>> - Si
>>
>>
>> ________________________________
>> From: Wido den Hollander <wi...@widodh.nl>
>> Sent: Tuesday, October 23, 2018 5:21 AM
>> To: dev@cloudstack.apache.org
>> Subject: Re: VXLAN and KVm experiences
>>
>>
>>
>> On 10/23/18 11:21 AM, Andrija Panic wrote:
>>> Hi Wido,
>>>
>>> I have "pioneered" this one in production for last 3 years (and suffered a
>>> nasty pain of silent drop of packages on kernel 3.X back in the days
>>> because of being unaware of max_igmp_memberships kernel parameters, so I
>>> have updated the manual long time ago).
>>>
>>> I never had any issues (beside above nasty one...) and it works very well.
>>
>> That's what I want to hear!
>>
>>> To avoid above issue that I described - you should increase
>>> max_igmp_memberships (/proc/sys/net/ipv4/igmp_max_memberships)  - otherwise
>>> with more than 20 vxlan interfaces, some of them will stay in down state
>>> and have a hard traffic drop (with proper message in agent.log) with kernel
>>>> 4.0 (or I silent, bitchy random packet drop on kernel 3.X...) - and also
>>> pay attention to MTU size as well - anyway everything is in the manual (I
>>> updated everything I though was missing) - so please check it.
>>>
>>
>> Yes, the underlying network will all be 9000 bytes MTU.
>>
>>> Our example setup:
>>>
>>> We have i.e. bond.950 as the main VLAN which will carry all vxlan "tunnels"
>>> - so this is defined as KVM traffic label. In our case it didn't make sense
>>> to use bridge on top of this bond0.950 (as the traffic label) - you can
>>> test it on your own - since this bridge is used only to extract child
>>> bond0.950 interface name, then based on vxlan ID, ACS will provision
>>> vxlanYYY@bond0.xxx and join this new vxlan interface to NEW bridge created
>>> (and then of course vNIC goes to this new bridge), so original bridge (to
>>> which bond0.xxx belonged) is not used for anything.
>>>
>>
>> Clear, I indeed thought something like that would happen.
>>
>>> Here is sample from above for vxlan 867 used for tenant isolation:
>>>
>>> root@hostname:~# brctl show brvx-867
>>>
>>> bridge name     bridge id               STP enabled     interfaces
>>> brvx-867                8000.2215cfce99ce       no              vnet6
>>>
>>>      vxlan867
>>>
>>> root@hostname:~# ip -d link show vxlan867
>>>
>>> 297: vxlan867: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8142 qdisc noqueue
>>> master brvx-867 state UNKNOWN mode DEFAULT group default qlen 1000
>>>     link/ether 22:15:cf:ce:99:ce brd ff:ff:ff:ff:ff:ff promiscuity 1
>>>     vxlan id 867 group 239.0.3.99 dev bond0.950 port 0 0 ttl 10 ageing 300
>>>
>>> root@ix1-c7-2:~# ifconfig bond0.950 | grep MTU
>>>           UP BROADCAST RUNNING MULTICAST  MTU:8192  Metric:1
>>>
>>> So note how the vxlan interface has by 50 bytes smaller MTU than the
>>> bond0.950 parent interface (which could affects traffic inside VM) - so
>>> jumbo frames are needed anyway on the parent interface (bond.950 in example
>>> above with minimum of 1550 MTU)
>>>
>>
>> Yes, thanks! We will be using 1500 MTU inside the VMs, so all the
>> networks underneath will be ~9k.
>>
>>> Ping me if more details needed, happy to help.
>>>
>>
>> Awesome! We'll be doing a PoC rather soon. I'll come back with our
>> experiences later.
>>
>> Wido
>>
>>> Cheers
>>> Andrija
>>>
>>

> On Tue, 23 Oct 2018 at 08:23, Wido den Hollander <wi...@widodh.nl> wrote:
>>>
>>>> Hi,
>>>>
>>>> I just wanted to know if there are people out there using KVM with
>>>> Advanced Networking and using VXLAN for different networks.
>>>>
>>>> Our main goal would be to spawn a VM and based on the network the NIC is
>>>> in attach it to a different VXLAN bridge on the KVM host.
>>>>
>>>> It seems to me that this should work, but I just wanted to check and see
>>>> if people have experience with it.
>>>>
>>>> Wido
>>>>
>>>
>>>
>>
>

Re: VXLAN and KVm experiences

Posted by Simon Weller <sw...@ena.com.INVALID>.
Wido,


Here is the original document on the implementation of VXLAN in ACS - https://cwiki.apache.org/confluence/display/CLOUDSTACK/Linux+native+VXLAN+support+on+KVM+hypervisor

It may shed some light on the reasons for the different multicast groups.


- Si

________________________________
From: Wido den Hollander <wi...@widodh.nl>
Sent: Tuesday, November 13, 2018 4:40 AM
To: dev@cloudstack.apache.org; Simon Weller
Subject: Re: VXLAN and KVm experiences



On 10/23/18 2:34 PM, Simon Weller wrote:
> Linux native VXLAN uses multicast and each host has to participate in multicast in order to see the VXLAN networks. We haven't tried using PIM across a L3 boundary with ACS, although it will probably work fine.
>
> Another option is to use a L3 VTEP, but right now there is no native support for that in CloudStack's VXLAN implementation, although we've thought about proposing it as feature.
>

Getting back to this I see CloudStack does this:

local mcastGrp="239.$(( ($vxlanId >> 16) % 256 )).$(( ($vxlanId >> 8) %
256 )).$(( $vxlanId % 256 ))"

VNI 1000 would use group 239.0.3.232 and VNI 1001 uses 239.0.3.233 1000.

Why are we using a different mcast group for every VNI? As the VNI is
encoded in the packet this should just work in one group, right?

Because this way you need to configure all those groups on your
Router(s) as each VNI will use a different Multicast Group.

I'm just looking for the reason why we have this different multicast groups.

I was thinking that we might want to add a option to agent.properties
where we allow users to set a fixed Multicast group for all traffic.

Wido

[0]:
https://github.com/apache/cloudstack/blob/master/scripts/vm/network/vnet/modifyvxlan.sh#L33



>
> ________________________________
> From: Wido den Hollander <wi...@widodh.nl>
> Sent: Tuesday, October 23, 2018 7:17 AM
> To: dev@cloudstack.apache.org; Simon Weller
> Subject: Re: VXLAN and KVm experiences
>
>
>
> On 10/23/18 1:51 PM, Simon Weller wrote:
>> We've also been using VXLAN on KVM for all of our isolated VPC guest networks for quite a long time now. As Andrija pointed out, make sure you increase the max_igmp_memberships param and also put an ip address on each interface host VXLAN interface in the same subnet for all hosts that will share networking, or multicast won't work.
>>
>
> Thanks! So you are saying that all hypervisors need to be in the same L2
> network or are you routing the multicast?
>
> My idea was that each POD would be an isolated Layer 3 domain and that a
> VNI would span over the different Layer 3 networks.
>
> I don't like STP and other Layer 2 loop-prevention systems.
>
> Wido
>
>>
>> - Si
>>
>>
>> ________________________________
>> From: Wido den Hollander <wi...@widodh.nl>
>> Sent: Tuesday, October 23, 2018 5:21 AM
>> To: dev@cloudstack.apache.org
>> Subject: Re: VXLAN and KVm experiences
>>
>>
>>
>> On 10/23/18 11:21 AM, Andrija Panic wrote:
>>> Hi Wido,
>>>
>>> I have "pioneered" this one in production for last 3 years (and suffered a
>>> nasty pain of silent drop of packages on kernel 3.X back in the days
>>> because of being unaware of max_igmp_memberships kernel parameters, so I
>>> have updated the manual long time ago).
>>>
>>> I never had any issues (beside above nasty one...) and it works very well.
>>
>> That's what I want to hear!
>>
>>> To avoid above issue that I described - you should increase
>>> max_igmp_memberships (/proc/sys/net/ipv4/igmp_max_memberships)  - otherwise
>>> with more than 20 vxlan interfaces, some of them will stay in down state
>>> and have a hard traffic drop (with proper message in agent.log) with kernel
>>>> 4.0 (or I silent, bitchy random packet drop on kernel 3.X...) - and also
>>> pay attention to MTU size as well - anyway everything is in the manual (I
>>> updated everything I though was missing) - so please check it.
>>>
>>
>> Yes, the underlying network will all be 9000 bytes MTU.
>>
>>> Our example setup:
>>>
>>> We have i.e. bond.950 as the main VLAN which will carry all vxlan "tunnels"
>>> - so this is defined as KVM traffic label. In our case it didn't make sense
>>> to use bridge on top of this bond0.950 (as the traffic label) - you can
>>> test it on your own - since this bridge is used only to extract child
>>> bond0.950 interface name, then based on vxlan ID, ACS will provision
>>> vxlanYYY@bond0.xxx and join this new vxlan interface to NEW bridge created
>>> (and then of course vNIC goes to this new bridge), so original bridge (to
>>> which bond0.xxx belonged) is not used for anything.
>>>
>>
>> Clear, I indeed thought something like that would happen.
>>
>>> Here is sample from above for vxlan 867 used for tenant isolation:
>>>
>>> root@hostname:~# brctl show brvx-867
>>>
>>> bridge name     bridge id               STP enabled     interfaces
>>> brvx-867                8000.2215cfce99ce       no              vnet6
>>>
>>>      vxlan867
>>>
>>> root@hostname:~# ip -d link show vxlan867
>>>
>>> 297: vxlan867: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8142 qdisc noqueue
>>> master brvx-867 state UNKNOWN mode DEFAULT group default qlen 1000
>>>     link/ether 22:15:cf:ce:99:ce brd ff:ff:ff:ff:ff:ff promiscuity 1
>>>     vxlan id 867 group 239.0.3.99 dev bond0.950 port 0 0 ttl 10 ageing 300
>>>
>>> root@ix1-c7-2:~# ifconfig bond0.950 | grep MTU
>>>           UP BROADCAST RUNNING MULTICAST  MTU:8192  Metric:1
>>>
>>> So note how the vxlan interface has by 50 bytes smaller MTU than the
>>> bond0.950 parent interface (which could affects traffic inside VM) - so
>>> jumbo frames are needed anyway on the parent interface (bond.950 in example
>>> above with minimum of 1550 MTU)
>>>
>>
>> Yes, thanks! We will be using 1500 MTU inside the VMs, so all the
>> networks underneath will be ~9k.
>>
>>> Ping me if more details needed, happy to help.
>>>
>>
>> Awesome! We'll be doing a PoC rather soon. I'll come back with our
>> experiences later.
>>
>> Wido
>>
>>> Cheers
>>> Andrija
>>>
>>> On Tue, 23 Oct 2018 at 08:23, Wido den Hollander <wi...@widodh.nl> wrote:
>>>
>>>> Hi,
>>>>
>>>> I just wanted to know if there are people out there using KVM with
>>>> Advanced Networking and using VXLAN for different networks.
>>>>
>>>> Our main goal would be to spawn a VM and based on the network the NIC is
>>>> in attach it to a different VXLAN bridge on the KVM host.
>>>>
>>>> It seems to me that this should work, but I just wanted to check and see
>>>> if people have experience with it.
>>>>
>>>> Wido
>>>>
>>>
>>>
>>
>

Re: VXLAN and KVm experiences

Posted by Wido den Hollander <wi...@widodh.nl>.

On 10/23/18 2:34 PM, Simon Weller wrote:
> Linux native VXLAN uses multicast and each host has to participate in multicast in order to see the VXLAN networks. We haven't tried using PIM across a L3 boundary with ACS, although it will probably work fine.
> 
> Another option is to use a L3 VTEP, but right now there is no native support for that in CloudStack's VXLAN implementation, although we've thought about proposing it as feature.
> 

Getting back to this I see CloudStack does this:

local mcastGrp="239.$(( ($vxlanId >> 16) % 256 )).$(( ($vxlanId >> 8) %
256 )).$(( $vxlanId % 256 ))"

VNI 1000 would use group 239.0.3.232 and VNI 1001 uses 239.0.3.233.
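(A quick sanity check of that mapping, same arithmetic typed into a shell -
the VNI values are just examples:)

for vxlanId in 867 1000 1001; do
    echo "VNI $vxlanId -> 239.$(( ($vxlanId >> 16) % 256 )).$(( ($vxlanId >> 8) % 256 )).$(( $vxlanId % 256 ))"
done
# prints: VNI 867 -> 239.0.3.99, VNI 1000 -> 239.0.3.232, VNI 1001 -> 239.0.3.233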

Why are we using a different mcast group for every VNI? As the VNI is
encoded in the packet this should just work in one group, right?

The drawback of this is that you need to configure all those groups on your
router(s), as each VNI will use a different multicast group.

I'm just looking for the reason why we have these different multicast groups.

I was thinking that we might want to add an option to agent.properties
that allows users to set a fixed multicast group for all traffic.
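Something like this, purely as an illustration of what I mean (the property
name is made up - no such setting exists today):

# hypothetical agent.properties entry - NOT an existing setting
# if set, every VXLAN VNI on this host would join this single multicast group
# instead of the per-VNI 239.x.y.z group derived from the VNI
vxlan.multicast.group=239.0.0.100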

Wido

[0]:
https://github.com/apache/cloudstack/blob/master/scripts/vm/network/vnet/modifyvxlan.sh#L33



> 
> ________________________________
> From: Wido den Hollander <wi...@widodh.nl>
> Sent: Tuesday, October 23, 2018 7:17 AM
> To: dev@cloudstack.apache.org; Simon Weller
> Subject: Re: VXLAN and KVm experiences
> 
> 
> 
> On 10/23/18 1:51 PM, Simon Weller wrote:
>> We've also been using VXLAN on KVM for all of our isolated VPC guest networks for quite a long time now. As Andrija pointed out, make sure you increase the max_igmp_memberships param and also put an ip address on each interface host VXLAN interface in the same subnet for all hosts that will share networking, or multicast won't work.
>>
> 
> Thanks! So you are saying that all hypervisors need to be in the same L2
> network or are you routing the multicast?
> 
> My idea was that each POD would be an isolated Layer 3 domain and that a
> VNI would span over the different Layer 3 networks.
> 
> I don't like STP and other Layer 2 loop-prevention systems.
> 
> Wido
> 
>>
>> - Si
>>
>>
>> ________________________________
>> From: Wido den Hollander <wi...@widodh.nl>
>> Sent: Tuesday, October 23, 2018 5:21 AM
>> To: dev@cloudstack.apache.org
>> Subject: Re: VXLAN and KVm experiences
>>
>>
>>
>> On 10/23/18 11:21 AM, Andrija Panic wrote:
>>> Hi Wido,
>>>
>>> I have "pioneered" this one in production for last 3 years (and suffered a
>>> nasty pain of silent drop of packages on kernel 3.X back in the days
>>> because of being unaware of max_igmp_memberships kernel parameters, so I
>>> have updated the manual long time ago).
>>>
>>> I never had any issues (beside above nasty one...) and it works very well.
>>
>> That's what I want to hear!
>>
>>> To avoid above issue that I described - you should increase
>>> max_igmp_memberships (/proc/sys/net/ipv4/igmp_max_memberships)  - otherwise
>>> with more than 20 vxlan interfaces, some of them will stay in down state
>>> and have a hard traffic drop (with proper message in agent.log) with kernel
>>>> 4.0 (or I silent, bitchy random packet drop on kernel 3.X...) - and also
>>> pay attention to MTU size as well - anyway everything is in the manual (I
>>> updated everything I though was missing) - so please check it.
>>>
>>
>> Yes, the underlying network will all be 9000 bytes MTU.
>>
>>> Our example setup:
>>>
>>> We have i.e. bond.950 as the main VLAN which will carry all vxlan "tunnels"
>>> - so this is defined as KVM traffic label. In our case it didn't make sense
>>> to use bridge on top of this bond0.950 (as the traffic label) - you can
>>> test it on your own - since this bridge is used only to extract child
>>> bond0.950 interface name, then based on vxlan ID, ACS will provision
>>> vxlanYYY@bond0.xxx and join this new vxlan interface to NEW bridge created
>>> (and then of course vNIC goes to this new bridge), so original bridge (to
>>> which bond0.xxx belonged) is not used for anything.
>>>
>>
>> Clear, I indeed thought something like that would happen.
>>
>>> Here is sample from above for vxlan 867 used for tenant isolation:
>>>
>>> root@hostname:~# brctl show brvx-867
>>>
>>> bridge name     bridge id               STP enabled     interfaces
>>> brvx-867                8000.2215cfce99ce       no              vnet6
>>>
>>>      vxlan867
>>>
>>> root@hostname:~# ip -d link show vxlan867
>>>
>>> 297: vxlan867: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8142 qdisc noqueue
>>> master brvx-867 state UNKNOWN mode DEFAULT group default qlen 1000
>>>     link/ether 22:15:cf:ce:99:ce brd ff:ff:ff:ff:ff:ff promiscuity 1
>>>     vxlan id 867 group 239.0.3.99 dev bond0.950 port 0 0 ttl 10 ageing 300
>>>
>>> root@ix1-c7-2:~# ifconfig bond0.950 | grep MTU
>>>           UP BROADCAST RUNNING MULTICAST  MTU:8192  Metric:1
>>>
>>> So note how the vxlan interface has by 50 bytes smaller MTU than the
>>> bond0.950 parent interface (which could affects traffic inside VM) - so
>>> jumbo frames are needed anyway on the parent interface (bond.950 in example
>>> above with minimum of 1550 MTU)
>>>
>>
>> Yes, thanks! We will be using 1500 MTU inside the VMs, so all the
>> networks underneath will be ~9k.
>>
>>> Ping me if more details needed, happy to help.
>>>
>>
>> Awesome! We'll be doing a PoC rather soon. I'll come back with our
>> experiences later.
>>
>> Wido
>>
>>> Cheers
>>> Andrija
>>>
>>> On Tue, 23 Oct 2018 at 08:23, Wido den Hollander <wi...@widodh.nl> wrote:
>>>
>>>> Hi,
>>>>
>>>> I just wanted to know if there are people out there using KVM with
>>>> Advanced Networking and using VXLAN for different networks.
>>>>
>>>> Our main goal would be to spawn a VM and based on the network the NIC is
>>>> in attach it to a different VXLAN bridge on the KVM host.
>>>>
>>>> It seems to me that this should work, but I just wanted to check and see
>>>> if people have experience with it.
>>>>
>>>> Wido
>>>>
>>>
>>>
>>
> 

Re: VXLAN and KVm experiences

Posted by Simon Weller <sw...@ena.com.INVALID>.
Linux native VXLAN uses multicast, and each host has to participate in multicast in order to see the VXLAN networks. We haven't tried using PIM across an L3 boundary with ACS, although it will probably work fine.

Another option is to use an L3 VTEP, but right now there is no native support for that in CloudStack's VXLAN implementation, although we've thought about proposing it as a feature.


________________________________
From: Wido den Hollander <wi...@widodh.nl>
Sent: Tuesday, October 23, 2018 7:17 AM
To: dev@cloudstack.apache.org; Simon Weller
Subject: Re: VXLAN and KVm experiences



On 10/23/18 1:51 PM, Simon Weller wrote:
> We've also been using VXLAN on KVM for all of our isolated VPC guest networks for quite a long time now. As Andrija pointed out, make sure you increase the max_igmp_memberships param and also put an ip address on each interface host VXLAN interface in the same subnet for all hosts that will share networking, or multicast won't work.
>

Thanks! So you are saying that all hypervisors need to be in the same L2
network or are you routing the multicast?

My idea was that each POD would be an isolated Layer 3 domain and that a
VNI would span over the different Layer 3 networks.

I don't like STP and other Layer 2 loop-prevention systems.

Wido

>
> - Si
>
>
> ________________________________
> From: Wido den Hollander <wi...@widodh.nl>
> Sent: Tuesday, October 23, 2018 5:21 AM
> To: dev@cloudstack.apache.org
> Subject: Re: VXLAN and KVm experiences
>
>
>
> On 10/23/18 11:21 AM, Andrija Panic wrote:
>> Hi Wido,
>>
>> I have "pioneered" this one in production for last 3 years (and suffered a
>> nasty pain of silent drop of packages on kernel 3.X back in the days
>> because of being unaware of max_igmp_memberships kernel parameters, so I
>> have updated the manual long time ago).
>>
>> I never had any issues (beside above nasty one...) and it works very well.
>
> That's what I want to hear!
>
>> To avoid above issue that I described - you should increase
>> max_igmp_memberships (/proc/sys/net/ipv4/igmp_max_memberships)  - otherwise
>> with more than 20 vxlan interfaces, some of them will stay in down state
>> and have a hard traffic drop (with proper message in agent.log) with kernel
>>> 4.0 (or I silent, bitchy random packet drop on kernel 3.X...) - and also
>> pay attention to MTU size as well - anyway everything is in the manual (I
>> updated everything I though was missing) - so please check it.
>>
>
> Yes, the underlying network will all be 9000 bytes MTU.
>
>> Our example setup:
>>
>> We have i.e. bond.950 as the main VLAN which will carry all vxlan "tunnels"
>> - so this is defined as KVM traffic label. In our case it didn't make sense
>> to use bridge on top of this bond0.950 (as the traffic label) - you can
>> test it on your own - since this bridge is used only to extract child
>> bond0.950 interface name, then based on vxlan ID, ACS will provision
>> vxlanYYY@bond0.xxx and join this new vxlan interface to NEW bridge created
>> (and then of course vNIC goes to this new bridge), so original bridge (to
>> which bond0.xxx belonged) is not used for anything.
>>
>
> Clear, I indeed thought something like that would happen.
>
>> Here is sample from above for vxlan 867 used for tenant isolation:
>>
>> root@hostname:~# brctl show brvx-867
>>
>> bridge name     bridge id               STP enabled     interfaces
>> brvx-867                8000.2215cfce99ce       no              vnet6
>>
>>      vxlan867
>>
>> root@hostname:~# ip -d link show vxlan867
>>
>> 297: vxlan867: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8142 qdisc noqueue
>> master brvx-867 state UNKNOWN mode DEFAULT group default qlen 1000
>>     link/ether 22:15:cf:ce:99:ce brd ff:ff:ff:ff:ff:ff promiscuity 1
>>     vxlan id 867 group 239.0.3.99 dev bond0.950 port 0 0 ttl 10 ageing 300
>>
>> root@ix1-c7-2:~# ifconfig bond0.950 | grep MTU
>>           UP BROADCAST RUNNING MULTICAST  MTU:8192  Metric:1
>>
>> So note how the vxlan interface has by 50 bytes smaller MTU than the
>> bond0.950 parent interface (which could affects traffic inside VM) - so
>> jumbo frames are needed anyway on the parent interface (bond.950 in example
>> above with minimum of 1550 MTU)
>>
>
> Yes, thanks! We will be using 1500 MTU inside the VMs, so all the
> networks underneath will be ~9k.
>
>> Ping me if more details needed, happy to help.
>>
>
> Awesome! We'll be doing a PoC rather soon. I'll come back with our
> experiences later.
>
> Wido
>
>> Cheers
>> Andrija
>>
>> On Tue, 23 Oct 2018 at 08:23, Wido den Hollander <wi...@widodh.nl> wrote:
>>
>>> Hi,
>>>
>>> I just wanted to know if there are people out there using KVM with
>>> Advanced Networking and using VXLAN for different networks.
>>>
>>> Our main goal would be to spawn a VM and based on the network the NIC is
>>> in attach it to a different VXLAN bridge on the KVM host.
>>>
>>> It seems to me that this should work, but I just wanted to check and see
>>> if people have experience with it.
>>>
>>> Wido
>>>
>>
>>
>

Re: VXLAN and KVm experiences

Posted by Wido den Hollander <wi...@widodh.nl>.

On 10/23/18 1:51 PM, Simon Weller wrote:
> We've also been using VXLAN on KVM for all of our isolated VPC guest networks for quite a long time now. As Andrija pointed out, make sure you increase the max_igmp_memberships param and also put an ip address on each interface host VXLAN interface in the same subnet for all hosts that will share networking, or multicast won't work.
> 

Thanks! So you are saying that all hypervisors need to be in the same L2
network or are you routing the multicast?

My idea was that each POD would be an isolated Layer 3 domain and that a
VNI would span over the different Layer 3 networks.

I don't like STP and other Layer 2 loop-prevention systems.

Wido


Re: VXLAN and KVm experiences

Posted by Simon Weller <sw...@ena.com.INVALID>.
We've also been using VXLAN on KVM for all of our isolated VPC guest networks for quite a long time now. As Andrija pointed out, make sure you increase the max_igmp_memberships param and also put an IP address on each host's VXLAN interface, in the same subnet for all hosts that will share networking, or multicast won't work.
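
(As a sketch - the interface name and addressing below are assumptions, not
from this thread; the address goes on whichever interface carries the vxlan
traffic, bond0.950 in Andrija's example, and must be in the same subnet on
every host:)

# put an address from the shared subnet on the vxlan-carrying interface
ip addr add 10.10.50.11/24 dev bond0.950

# raise the IGMP membership limit as mentioned above (example value)
sysctl -w net.ipv4.igmp_max_memberships=200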


- Si



Re: VXLAN and KVm experiences

Posted by Wido den Hollander <wi...@widodh.nl>.

On 10/23/18 11:21 AM, Andrija Panic wrote:
> Hi Wido,
> 
> I have "pioneered" this one in production for last 3 years (and suffered a
> nasty pain of silent drop of packages on kernel 3.X back in the days
> because of being unaware of max_igmp_memberships kernel parameters, so I
> have updated the manual long time ago).
> 
> I never had any issues (besides the above nasty one...) and it works very well.

That's what I want to hear!

> To avoid the above issue that I described, you should increase
> max_igmp_memberships (/proc/sys/net/ipv4/igmp_max_memberships) - otherwise,
> with more than 20 vxlan interfaces, some of them will stay in a down state
> and hard-drop traffic (with a proper message in agent.log) on kernel >4.0
> (or silently drop random packets on kernel 3.X...). Also pay attention to
> the MTU size - anyway, everything is in the manual (I updated everything I
> thought was missing) - so please check it.
> 

Yes, the underlying network will all be 9000 bytes MTU.

> Our example setup:
> 
> We have e.g. bond0.950 as the main VLAN which will carry all vxlan "tunnels"
> - so this is defined as the KVM traffic label. In our case it didn't make
> sense to use a bridge on top of this bond0.950 (as the traffic label) - you
> can test it on your own - since this bridge is used only to extract the
> child bond0.950 interface name; then, based on the vxlan ID, ACS will
> provision vxlanYYY@bond0.xxx and join this new vxlan interface to the NEW
> bridge it creates (and then of course the vNIC goes to this new bridge), so
> the original bridge (to which bond0.xxx belonged) is not used for anything.
> 

Clear, I indeed thought something like that would happen.

> Here is sample from above for vxlan 867 used for tenant isolation:
> 
> root@hostname:~# brctl show brvx-867
> 
> bridge name     bridge id               STP enabled     interfaces
> brvx-867                8000.2215cfce99ce       no              vnet6
> 
>      vxlan867
> 
> root@hostname:~# ip -d link show vxlan867
> 
> 297: vxlan867: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8142 qdisc noqueue
> master brvx-867 state UNKNOWN mode DEFAULT group default qlen 1000
>     link/ether 22:15:cf:ce:99:ce brd ff:ff:ff:ff:ff:ff promiscuity 1
>     vxlan id 867 group 239.0.3.99 dev bond0.950 port 0 0 ttl 10 ageing 300
> 
> root@ix1-c7-2:~# ifconfig bond0.950 | grep MTU
>           UP BROADCAST RUNNING MULTICAST  MTU:8192  Metric:1
> 
> So note how the vxlan interface has a 50-byte smaller MTU than the
> bond0.950 parent interface (which could affect traffic inside the VM) - so
> jumbo frames are needed anyway on the parent interface (bond0.950 in the
> example above, with a minimum MTU of 1550)
> 

Yes, thanks! We will be using 1500 MTU inside the VMs, so all the
networks underneath will be ~9k.

> Ping me if more details needed, happy to help.
> 

Awesome! We'll be doing a PoC rather soon. I'll come back with our
experiences later.

Wido

> Cheers
> Andrija
> 
> On Tue, 23 Oct 2018 at 08:23, Wido den Hollander <wi...@widodh.nl> wrote:
> 
>> Hi,
>>
>> I just wanted to know if there are people out there using KVM with
>> Advanced Networking and using VXLAN for different networks.
>>
>> Our main goal would be to spawn a VM and based on the network the NIC is
>> in attach it to a different VXLAN bridge on the KVM host.
>>
>> It seems to me that this should work, but I just wanted to check and see
>> if people have experience with it.
>>
>> Wido
>>
> 
> 

Re: VXLAN and KVm experiences

Posted by Andrija Panic <an...@gmail.com>.
Hi Wido,

I have "pioneered" this one in production for last 3 years (and suffered a
nasty pain of silent drop of packages on kernel 3.X back in the days
because of being unaware of max_igmp_memberships kernel parameters, so I
have updated the manual long time ago).

I never had any issues (besides the above nasty one...) and it works very well.
To avoid the above issue that I described, you should increase
max_igmp_memberships (/proc/sys/net/ipv4/igmp_max_memberships) - otherwise,
with more than 20 vxlan interfaces, some of them will stay in a down state
and hard-drop traffic (with a proper message in agent.log) on kernel >4.0
(or silently drop random packets on kernel 3.X...). Also pay attention to
the MTU size - anyway, everything is in the manual (I updated everything I
thought was missing) - so please check it.
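
(For reference, a minimal sketch of that sysctl change - the value 200 is
only an example, size it to the number of vxlan interfaces you expect per
host:)

# check the current limit (the kernel default is 20)
sysctl net.ipv4.igmp_max_memberships

# raise it at runtime
sysctl -w net.ipv4.igmp_max_memberships=200

# and persist it across reboots
echo "net.ipv4.igmp_max_memberships = 200" > /etc/sysctl.d/99-vxlan.conf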

Our example setup:

We have e.g. bond0.950 as the main VLAN which will carry all vxlan "tunnels"
- so this is defined as the KVM traffic label. In our case it didn't make
sense to use a bridge on top of this bond0.950 (as the traffic label) - you
can test it on your own - since this bridge is used only to extract the
child bond0.950 interface name; then, based on the vxlan ID, ACS will
provision vxlanYYY@bond0.xxx and join this new vxlan interface to the NEW
bridge it creates (and then of course the vNIC goes to this new bridge), so
the original bridge (to which bond0.xxx belonged) is not used for anything.
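
(Roughly what the agent does can be reproduced by hand like this - a sketch
only, using the vxlan 867 values from the example below, mainly useful for
troubleshooting:)

# create the vxlan interface on top of the parent VLAN interface
ip link add vxlan867 type vxlan id 867 group 239.0.3.99 dev bond0.950 ttl 10

# create the per-network bridge and enslave the vxlan interface
ip link add brvx-867 type bridge
ip link set vxlan867 master brvx-867
ip link set vxlan867 up
ip link set brvx-867 up

# the guest vNIC (vnet6 below) then gets attached to brvx-867 as well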

Here is sample from above for vxlan 867 used for tenant isolation:

root@hostname:~# brctl show brvx-867

bridge name     bridge id               STP enabled     interfaces
brvx-867                8000.2215cfce99ce       no              vnet6

     vxlan867

root@hostname:~# ip -d link show vxlan867

297: vxlan867: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8142 qdisc noqueue
master brvx-867 state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 22:15:cf:ce:99:ce brd ff:ff:ff:ff:ff:ff promiscuity 1
    vxlan id 867 group 239.0.3.99 dev bond0.950 port 0 0 ttl 10 ageing 300

root@ix1-c7-2:~# ifconfig bond0.950 | grep MTU
          UP BROADCAST RUNNING MULTICAST  MTU:8192  Metric:1

So note how the vxlan interface has a 50-byte smaller MTU than the
bond0.950 parent interface (which could affect traffic inside the VM) - so
jumbo frames are needed anyway on the parent interface (bond0.950 in the
example above, with a minimum MTU of 1550)
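
(A sketch of the MTU part, assuming the NICs and switch ports already allow
jumbo frames - the names and the 9000 value are just examples, and you would
normally make this persistent in your distro's network configuration:)

ip link set dev bond0 mtu 9000
ip link set dev bond0.950 mtu 9000

# a vxlan interface created on top of bond0.950 then comes up 50 bytes
# lower, as in the output above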

Ping me if more details needed, happy to help.

Cheers
Andrija

On Tue, 23 Oct 2018 at 08:23, Wido den Hollander <wi...@widodh.nl> wrote:

> Hi,
>
> I just wanted to know if there are people out there using KVM with
> Advanced Networking and using VXLAN for different networks.
>
> Our main goal would be to spawn a VM and based on the network the NIC is
> in attach it to a different VXLAN bridge on the KVM host.
>
> It seems to me that this should work, but I just wanted to check and see
> if people have experience with it.
>
> Wido
>


-- 

Andrija Panić