Getting started with Ansible and Azure

This post started, like most of my posts, with me making notes as I went along trying to accomplish a technical task. It quickly became a bit of a rant as I encountered numerous issues with Ansible’s Azure integration.

Beginning with a moan

When I started this post, I thought I had found a minor problem that I would resolve (or find a workaround for) and carry on. It was not to be, and in the end (as you will see, if you continue reading from the next heading) I adopted a completely different approach.

The first thing I tried to do was to deploy a VM - this is a regular task, as I create a new VM for every user (so they have their own sandboxed workstation, amongst other reasons). Automating this process better, particularly to reduce duplication between the deployment and configuration scripts, will significantly reduce the maintenance overhead (and free up my time).

Almost immediately, I hit a slight snag with the azure.azcollection.azure_rm_networkinterface module. In order to create the VM, I first needed to create a NIC for it with the IP configuration I wanted. Initially I thought this might just be unclear documentation rather than a bug. The documentation for the option create_with_security_group says:

Whether a security group should be be created with the NIC. If this flag set to True and no security_group set, a default security group will be created.

To me, this reads as though setting it to True means a new security group will be created. This is not the behaviour I wanted, as I wanted to associate an existing security group. The documentation for the option security_group says:

An existing security group with which to associate the network interface. If not provided, a default security group will be created when create_with_security_group=true. It can be the name of security group. Make sure the security group is in the same resource group when you only give its name. It can be the resource id. It can be a dict contains security_group’s name and resource_group.

This seems straightforward enough - if create_with_security_group is enabled then a new security group will be created (“Whether a security group should be be created”(sic)) and security_group on its own would just associate an existing security group. However, this is not what happens: unless create_with_security_group is set to True, the interface’s network security group is set to None. That’s right, the existing group is removed if there is one. No matter what I set for security_group, with create_with_security_group set to False the NSG on the interface is removed.

I suspect the documentation for create_with_security_group should read:

Whether a security group should be associated with the NIC. If this flag is set to False, the NIC will have any associated security group removed. If this flag is set to True and no security_group is set, a default security group will be created.
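
For illustration, the combination that kept an existing NSG attached in the behaviour described above might look something like the task below. This is a minimal sketch rather than my actual playbook: the resource group, network, subnet, NIC and NSG names are all hypothetical placeholders, and only the two security group options are taken from the discussion above.

```yaml
# Sketch only: every name below (rg-workstations, vnet-main, etc.) is a
# hypothetical placeholder, not taken from a real environment.
- name: Create a NIC attached to an existing network security group
  azure.azcollection.azure_rm_networkinterface:
    resource_group: rg-workstations
    name: nic-user-workstation
    virtual_network: vnet-main
    subnet_name: snet-workstations
    # Despite its name, this has to be true or the NIC ends up with no NSG.
    create_with_security_group: true
    # Existing NSG given by name, so it must be in the same resource group.
    security_group: nsg-workstations
    ip_configurations:
      - name: ipconfig1
        primary: true
        private_ip_allocation_method: Dynamic
```

With create_with_security_group set to false instead, the same task would be expected to strip the NSG from the interface, whatever security_group says.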

This seems to be an old issue from before the collection was spun out of Ansible’s bundled modules.

However, once I had resolved that, with create_with_security_group set to True and security_group correctly defined, the module started throwing an exception that I was unable to resolve: Error creating or updating network interface nicName - Parameter 'FlowLog.target_resource_id' can not be None. (I was testing this against an existing VM with its configuration as it currently stood, so there should have been no change.) I opened an issue for this and started thinking about another approach that would let me get on with my work. (Update on the ticket: the developers originally responded “works for me”; however, another user said they have the same problem in their environment, and there are now duplicate issue reports for it.)

I have also repeatedly encountered conflicts between the versions of Microsoft’s Python modules required by the Ansible collection and those required by the azure-cli tool.

Another approach (“plan B”)

Having failed to get the Ansible “native” Azure modules working properly, I came up with another plan: use ansible.builtin.command to do the operations via the az (azure-cli) command. This is less than satisfactory for a number of reasons:

It is not idempotent - even for commands that should change nothing (such as a deployment with a template which matches the existing environment), az will effectively replace existing resources
The tasks involved will always show changed even if no changes are actually made. Ansible cannot tell whether anything changed, so it assumes all commands have side-effects; I also used temporary files, which are always created and deleted, resulting in changes
It seems like a hack to use command instead of Ansible native modules (not least because of the above properties of doing so)
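
To make the shape of plan B concrete, here is a rough sketch of driving az from ansible.builtin.command. It is not the playbook I ran: the resource group, VM name, template path and the vmName template parameter are hypothetical. The existence check in front of the deployment is one possible way to blunt the "always changed" problem, not something I have tested against the issues above.

```yaml
# Sketch only: resource names, the template path and the vmName parameter
# are hypothetical placeholders.
- name: Check whether the VM already exists
  ansible.builtin.command:
    argv: [az, vm, show, --resource-group, rg-workstations, --name, vm-user-workstation]
  register: vm_check
  changed_when: false   # a read-only query never changes anything
  failed_when: false    # any non-zero exit is treated as "VM not found"

- name: Deploy the VM from a template only when it is missing
  ansible.builtin.command:
    argv:
      - az
      - deployment
      - group
      - create
      - --resource-group
      - rg-workstations
      - --template-file
      - templates/workstation.json
      - --parameters
      - vmName=vm-user-workstation
  when: vm_check.rc != 0
```

Even with the guard, this only avoids re-running the deployment; it does nothing to detect an existing VM that has drifted from the template. It also treats every failure of az vm show (including an expired login) as "not found", which underlines why this approach feels like a hack.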

Sanity prevails (“plan C”)

In the end, I decided to separate the deployment (infrastructure-as-code) from configuration (configuration management/desired state configuration). Conceptually this has some advantages, logically separating the in-band OS configuration from the out-of-band “bare metal” (if we can call it that in the cloud/VM world) creation. However, it also means that the infrastructure is not maintained in the same way. The configuration management tools should be run regularly and so prevent configuration drift, but the infrastructure is at higher risk of drifting from its defined (and, presumably, desired) state unless there is good discipline or a robust CI/CD pipeline to keep everything consistent with the definition.