In most solutions, alerting and monitoring is an afterthought. This thinking can lead to production issues and to organisations missing the OLAs/SLAs they agreed with their business or other parties.
When it comes to Azure, we have a range of services to cover all aspects of monitoring-
- Azure Monitor-
A consolidated alerting and monitoring service for Azure first-party/core services.
- Azure Alerts-
By using alerts, you can configure conditions over data and get notified when the conditions match the latest monitoring data.
- Service Health-
It is a suite of experiences that provide personalized guidance and support when issues in Azure services affect you.
- Azure Security Center-
It provides unified security management and advanced threat protection across hybrid cloud workloads.
- Azure Advisor-
It is a personalized cloud consultant that helps you follow best practices to optimize your Azure deployments.
- Application Insights-
It is an extensible Application Performance Management (APM) service for web developers building and managing apps on multiple platforms.
- Log Analytics-
It plays a central role by consolidating monitoring data from different sources and providing a powerful query language (Kusto) for consolidation and analysis.
- Azure Mobile App-
An app to manage your Azure resources; in essence, the portal on your phone.
Although the above services can cover alerting and monitoring fully, it can be an overwhelming exercise to make them work together and to understand their individual contributions to the overall solution.
To help you understand how these services are connected and how they can benefit you as a consumer of Azure services, I’ve put together the following diagram-
…Next up, surface information for different stakeholders, essentially create a dashboard as a product for each role.
You will often see this error pretty much right away when your Service Fabric cluster comes up if you are using a VMSS with VMs having smaller temporary disk sizes (d:\).
So what’s going on here?
By default, the Service Fabric reliable collections logs for the reliable system services are stored in D:\SvcFab. To verify this, you can remote desktop into one of the VMs in the VMSS that raises the warning. Most people will only see this warning on the primary node type, because the services you create as a customer are generally stateless, and hence no stateful data logs are present on the non-primary node types.
The default size of the replicator log (reliable collections) is 8192 MB, so if your temporary disk is 7 GB (Standard_D2_v2), for example, you will see the warning message below in the cluster explorer-
Unhealthy event: SourceId=’FabricDCA’, Property=’DataCollectionAgent.DiskSpaceAvailable’, HealthState=’Warning’, ConsiderWarningAsError=false. The Data Collection Agent (DCA) does not have enough disk space to operate. Diagnostics information will be left uncollected if this continues to happen.
How to fix this?
You can change the default replicator log size by adding a fabric setting named “KtlLogger” to the ARM template, as highlighted below. Once configured, this file does not change in size (it neither grows nor shrinks)-
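As a rough sketch of what such a setting could look like, the Python below checks a log size against the temporary disk and builds the corresponding `fabricSettings` entry. The parameter name `SharedLogSizeInMB` is how I understand the Service Fabric cluster settings to name it, and the 4096 MB value is purely an assumed example, so verify both against your own template and VM size.

```python
import json

DEFAULT_SHARED_LOG_MB = 8192  # default replicator log size quoted above


def fits_on_temp_disk(log_mb: int, temp_disk_gb: float) -> bool:
    """Return True if a log of log_mb megabytes is smaller than the temp disk."""
    return log_mb < temp_disk_gb * 1024


def ktl_logger_setting(log_mb: int) -> dict:
    """Build a 'KtlLogger' entry for the cluster's fabricSettings array.

    'SharedLogSizeInMB' is assumed to be the relevant parameter name.
    """
    return {
        "name": "KtlLogger",
        "parameters": [{"name": "SharedLogSizeInMB", "value": str(log_mb)}],
    }


if __name__ == "__main__":
    # The default log does not fit on a 7 GB temporary disk...
    print(fits_on_temp_disk(DEFAULT_SHARED_LOG_MB, 7))
    # ...whereas an assumed 4096 MB log would.
    print(fits_on_temp_disk(4096, 7))
    # JSON fragment to merge into the cluster resource of the ARM template.
    print(json.dumps({"fabricSettings": [ktl_logger_setting(4096)]}, indent=2))
```

The size check is deliberately simplistic: it ignores whatever else lives on the temporary disk, so leave real headroom when choosing a value.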
For VM temporary disk sizes and specs, see here- https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-general
More info around configuring reliable services\manifest is here-
These are some interesting benefits of using Service Fabric (SF) (over Cloud Services and in general)-
- High Density- Unlike cloud services, you can run multiple services on a single VM, saving you both cost and management overhead. SF will re-balance or scale out the cluster if resource contention is predicted or occurs.
- Any Cloud or Data Center- A Service Fabric cluster can be deployed in Azure, on-premises or even in a 3rd party cloud if you need to, for example due to an unforeseen change in the company’s direction or in regulatory requirements. It just runs better in Azure. Why? Because certain services are provided in Azure as a value addition, e.g. the upgrade service.
- Any OS- A Service Fabric cluster can run on both Windows and Linux. In the near future, you will be able to have a single cluster running both Windows and Linux workloads in parallel.
- Faster Deployments- As you do not create a new VM per service like in cloud services, new services can be deployed to the existing VMs as per the configured placement constraints, making deployments much faster.
- Application Concept- In the microservices world, multiple services working together form a business function or an application. SF understands the concept of an application, rather than just the individual services which constitute a business function. SF treats and manages an application and its services as one entity to maintain the health and consistency of the platform, unlike cloud services.
- Generic Orchestration Engine- Service Fabric can orchestrate both at process and container level should you need to. One technology to learn to rule them all.
- Stateful Services- A managed programming model to develop stateful services following the OOP principle of encapsulation, i.e. keeping state and operations together as a unit. Other container orchestration engines cannot do this. And of course you can develop reliable stateless services as well.
- Multi-tenancy- Deploy multiple versions of the same application for multiple clients side by side or do a canary testing.
- Rolling Upgrades- Upgrade both applications and platform without any downtime with a sophisticated rolling upgrade feature set.
- Scalable- Scale to hundreds or thousands of VMs if you need to, with automatic or manual scaling.
- Secure- Inter-VM encryption, cluster management authentication/authorization (RBAC) and network level isolation are just a few ways to secure your cluster in an enterprise-grade manner.
- Monitoring- Unlike cloud services, SF comes with a built-in OMS solution which understands the events raised by the SF cluster and takes appropriate action. Both in-proc and out-of-proc logging are supported.
- Resource Manager Deployments- Unlike cloud services, which still run in the classic deployment model, SF clusters and applications use the resource manager way of deployment, which is much more flexible and deploys only the artefacts you need.
- Pace of Innovation- Cloud services is an old platform, still used by many large organisations for critical workloads but it is not the platform which will get new innovative features in future.
More technical differences are here.
Notes from the field on Azure Service Fabric (ASF) and some less known facts-
- Why do we need a durability level of Silver or Gold? The Silver and Gold tiers allow SF to integrate with the underlying VMSS, resulting in the following features-
- You can scale back the underlying VMSS after scaling it out and ASF will recognise this change and will not mark the cluster as unhealthy. Also note, you cannot scale back the nodes to anything below 5 even though when you create a cluster a Silver/Gold node type has only 3 nodes.
- Allows ASF to intercept and delay VM level actions requested by the platform or cluster admin to allow stateful services to maintain the minimum replicaset/quorum at any point in time.
- Can I change durability tier of the existing cluster/node type?
- Yes you can upgrade from lower levels to higher and from Gold -> Silver.
- Why do we need minimum of 5 nodes in primary node type? Because-
- To maintain quorum, a majority of the nodes in the primary node type must be running at any point in time. If you are upgrading the ASF binaries to a new version and Microsoft decides to update the host machine that hosts one of your 3 VMs, you are 2 VMs down out of 3, which will impact the stateful system services. If you had 5 VMs, taking out 2 of 5 would still leave 3 (a majority) available.
- Does Microsoft support ASF cluster spanning across the multiple Azure DCs?
- Generally speaking, it is not supported.
- Can I add/remove Node Types after the cluster is created?
- Can I scale ASF?
- Yes, via VMSS auto/manual scale mechanism only at present.
- Can I make unsecure cluster a secure cluster?
- Can I scale stateful services?
- Yes, by partitioning the data which allows multiple parallel service type instances receiving the requests for their respective partitions.
- Each application instance runs in isolation, with its own work directory and process.
- The Service Fabric process runs in kernel mode, hence applications running under it will not be able to crash it easily.
- By default, the cluster certificates are added to the allowed Admin certificates list, hence the template here secures both node-node and client-node comms. You can, though, add separate client certs for the read-only and admin cluster roles.
- Scaling out a VMSS causes other nodes in the same VMSS to show a stopping state. This is a superficial UI bug and will be fixed soon; the VMs in the VMSS do not stop in reality.
- Can I create a SF cluster with small size VMs?
- Yes, you can, but please bear in mind that when you do, the cluster may start to throw warnings. See this post to remove those warnings.
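The quorum arithmetic behind the 5-node minimum discussed above can be sketched in a few lines of Python. This is a simplified model that assumes one replica of a system service per node and a plain majority quorum; the function names are mine, not part of any Service Fabric API.

```python
def quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1


def survives(nodes: int, nodes_down: int) -> bool:
    """True if the nodes still running form a majority of the node type."""
    return nodes - nodes_down >= quorum(nodes)


if __name__ == "__main__":
    # Two nodes down at once: one for the ASF upgrade, one for a host update.
    for nodes in (3, 5):
        print(nodes, quorum(nodes), survives(nodes, nodes_down=2))
```

With 3 nodes, losing 2 leaves 1 node, which is below the majority of 2; with 5 nodes, losing 2 leaves exactly the majority of 3.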
Recently, I was working with a large healthcare provider to design their public-facing, mission-critical web platform. The customer already had an ExpressRoute connection with a 50 Mbps line; not super fast, but it doesn’t always need to be.
Given that the design placed each environment in its own vNet, we ended up creating around 9 vNets in Azure. Connecting each vNet to the on-premises network via ExpressRoute would mean consuming 9 connections/links out of 10 (20 for the premium SKU). This was suboptimal, as it would leave hardly any room for future projects to use the same circuit for on-premises connectivity, and it would also limit the addition of further environments to the platform.
So how could you avoid this problem?
VNet Peering comes to the rescue here. The following diagram depicts the topology that can be used to achieve the above design-
- You can also use transitive vnet (‘transitive’ is just a given name) for NVAs, implementing hub and spoke model.
- vNet Peering does not allow transitive routing, i.e. if three vnets are peered in a chain (vnet1, vnet2 and vnet3), then vnet1 cannot talk to vnet3.
- A vNet Peering must always be created from both vNets to work, hence the above diagram has two arrows for each vNet Peering.
- As all the vnets are connected to a single gateway via ExpressRoute, the bandwidth between on-prem and the Azure vNets will be limited by the gateway SKU selected for the ExpressRoute connection.
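The two peering rules above (links must exist in both directions, and routing is not transitive) can be modelled with a small Python sketch. The vNet names and the `can_talk` helper are hypothetical and exist only for illustration; this is a toy model, not a representation of how Azure implements peering.

```python
def can_talk(links: set, a: str, b: str) -> bool:
    """Two vNets can talk only if a peering link exists in BOTH directions.

    There is no path search here on purpose: peering is not transitive.
    """
    return a == b or ((a, b) in links and (b, a) in links)


# A chain vnet1 <-> vnet2 <-> vnet3, each peering created from both sides.
chain = {
    ("vnet1", "vnet2"), ("vnet2", "vnet1"),
    ("vnet2", "vnet3"), ("vnet3", "vnet2"),
}

# A peering that was only created from one side.
half_open = {("vnet1", "vnet2")}

if __name__ == "__main__":
    print(can_talk(chain, "vnet1", "vnet2"))      # directly peered
    print(can_talk(chain, "vnet1", "vnet3"))      # only reachable via vnet2
    print(can_talk(half_open, "vnet1", "vnet2"))  # one-sided peering
```

Only the first check succeeds; the other two fail because vnet1 and vnet3 are not directly peered and because the half-open peering is missing its return link.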
Like many of you, I’ve come from the classic SOAP world, and the lack of standardisation in the RESTful world used to give me shivers. But that’s not the case anymore; it’s all hanging together rather nicely now.
When you start to work on the RESTful design of your APIs, the following questions arise very quickly-
- How do I describe my API to the external parties without having to maintain the documentation all the time which can go out of sync very quickly?
- How do I share only the contracts with the other development teams, allowing them to work independently in a decoupled manner?
- How do I validate the request-response from/to the client?
- How do I create a test harness with load tests automatically?
So here are the answers which will hopefully give you some comfort-
Describe your API-
Swagger is the WSDL of the RESTful world. Swagger has recently been adopted by the Open API Initiative (OAI) for its OpenAPI Specification, making it a vendor-neutral industry standard. OAI is backed by key industry players like APIGEE, IBM, Google and of course Microsoft. Notably, AWS is not a member yet, but its API Gateway is compatible with Swagger.
JSON Schema is the XSD of the RESTful world. The OAI specification (formerly known as Swagger) currently uses the standard type system/schema language defined by JSON Schema Draft 4. JSON Schema is backed by the IETF, again making it a vendor-neutral industry standard.
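To give a feel for how a schema answers the request/response validation question, here is a minimal Python sketch. It hand-rolls a tiny subset of JSON Schema Draft 4 (`type`, `required` and `properties` only) so that it stays self-contained; the `order_schema` contract is a made-up example, and a real project would use a full validator library rather than this toy.

```python
# Maps a few JSON Schema type names to Python types (subset only).
TYPE_MAP = {
    "object": dict,
    "array": list,
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
}


def validate(instance, schema, path="$"):
    """Return a list of error strings; an empty list means valid."""
    expected = schema.get("type")
    if expected is not None:
        ok = isinstance(instance, TYPE_MAP[expected])
        # bool is a subclass of int in Python, but not a JSON number.
        if expected in ("integer", "number") and isinstance(instance, bool):
            ok = False
        if not ok:
            return [f"{path}: expected {expected}"]
    errors = []
    if isinstance(instance, dict):
        for name in schema.get("required", []):
            if name not in instance:
                errors.append(f"{path}.{name}: required property missing")
        for name, sub_schema in schema.get("properties", {}).items():
            if name in instance:
                errors.extend(validate(instance[name], sub_schema, f"{path}.{name}"))
    return errors


# Hypothetical contract for an 'order' request body.
order_schema = {
    "type": "object",
    "required": ["id", "customer"],
    "properties": {
        "id": {"type": "integer"},
        "customer": {"type": "string"},
    },
}

if __name__ == "__main__":
    print(validate({"id": 1, "customer": "Contoso"}, order_schema))
    print(validate({"id": "1"}, order_schema))
```

The first call returns an empty list; the second reports the missing `customer` property and the wrongly typed `id`.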
In Azure world, Swagger is fully supported as well. As an example- API Management, API Apps and Logic Apps, they all support Swagger (now known as OpenAPI Specification).
Equally, ARM Template Schemas use the JSON Schema Draft 4 specification to describe the resource templates, making them open and vendor-neutral to integrate with any external tooling. This is also a great example if you want to see a production-grade schema and its design, e.g. types, constraints etc.
That said, further improvements/refinements in this space are inevitable, but I am confident the direction of travel will not change much anymore, so it’s safe (in my opinion) to use the above for your RESTful design and evolve together as an industry.
My customer recently ran into this problem, which will come up when you try to configure your environment properly i.e. create a resource group and give only the required access to the resources in your organisation, following the principle of least privilege. The structure looks like below-
What’s going on here?
Objective: Anthony is a subscription admin and he wants to ensure that role based access control is applied to the resource groups. He takes the following steps to achieve this-
- He creates a resource group called A and gives ‘contributor’ access to the user called ‘Ben’.
- He then tells Ben to go ahead and start using the resource group for the project.
- Ben logs into the portal with his credentials and tries to create the resource.
- Resource creation fails with the error which looks like below- Registering the resource providers has failed. Additional details from the underlying API that might be helpful: ‘AuthorizationFailed’ – The client email@example.com’ with object id ‘b8fe1401-2d54-4fa2-b2dd-26c0b8eb69f9’ does not have authorization to perform action ‘Microsoft.Compute/register/action’ over scope ‘/subscriptions/dw78b73d-ca8e-34b9-89f4-6f716ecf833e’. (Code: AuthorizationFailed)
This will stump most people, as expected. Why? Because if you have contributor access to a resource group, surely you can create a resource, e.g. a virtual machine.
What went wrong here? Look carefully at the error message and focus on ‘Microsoft.Compute/register/action’ over scope ‘/subscriptions/dw78b73d-ca8e-34b9-89f4-6f716ecf833e’. What does this say? It’s not an authorisation error for creating a resource, it is an authorisation error for registering a resource provider. This is expected; we don’t want a resource group level identity to register/unregister the resource providers at the subscription level.
So how do we solve it?
Option 1
- Log into Azure with an identity which has a subscription level access to register a resource provider e.g. admin/owner.
- Using PowerShell (PoSh) register the resource providers you need at the subscription level. You can also see which providers are available and registered already. Sample script is given below-
Login-AzureRmAccount
$subscriptionId = "<Subscription Id>"
Select-AzureRmSubscription -SubscriptionId $subscriptionId
# List all available providers and register them
Get-AzureRmResourceProvider -ListAvailable | Register-AzureRmResourceProvider
Option 2
- Let the subscription admin/owner create the resource e.g. a VM.
- This will implicitly register the resource provider for the resources created.
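The scope mismatch behind the error above can be illustrated with a short Python sketch. It assumes the simplified rule that a role assignment applies to its own scope and everything beneath it; `scope_covers` and the placeholder IDs are hypothetical and are not an Azure API.

```python
def scope_covers(assignment_scope: str, action_scope: str) -> bool:
    """True if an assignment at assignment_scope applies at action_scope.

    Simplified model: access is inherited downwards only, never upwards.
    """
    granted = assignment_scope.rstrip("/").lower()
    target = action_scope.rstrip("/").lower()
    return target == granted or target.startswith(granted + "/")


SUBSCRIPTION = "/subscriptions/<subscription-id>"   # placeholder id
RESOURCE_GROUP_A = SUBSCRIPTION + "/resourceGroups/A"
VM_IN_A = RESOURCE_GROUP_A + "/providers/Microsoft.Compute/virtualMachines/vm1"

if __name__ == "__main__":
    # Ben (contributor on resource group A) creating a VM inside A.
    print(scope_covers(RESOURCE_GROUP_A, VM_IN_A))
    # Ben registering Microsoft.Compute, which happens at subscription scope.
    print(scope_covers(RESOURCE_GROUP_A, SUBSCRIPTION))
    # Anthony (admin at subscription scope) registering the provider.
    print(scope_covers(SUBSCRIPTION, SUBSCRIPTION))
```

Ben’s assignment covers the VM but not the subscription-level registration, which is why either option has someone with subscription-level access trigger the registration.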
Hope this was helpful.
I’ll be talking to the engineering to see if we can improve this user experience.
Microsoft Azure allows segmentation of applications, their tiers and the respective staging environments using the controls already built into Azure. Segmentation can be achieved at both the network and the user access level in Azure.
Segmentation is introduced for the following primary reasons-
Isolated/Safe Innovation Environment
To secure the platforms/applications it is important to-
Separate the application tiers in different Azure virtual network subnets with NSG (Network Security Group/Firewall).
This helps mitigate the impact of a breach, as only limited access is provided to the layer further down the stack. It also provides a safe container for the components within a tier/subnet to interact with each other, for example a SQL/MongoDB cluster.
User Defined Routes (UDR) in Azure can enforce traffic to route via a virtual intrusion detection appliance for enhanced security.
Employ the principle of least privilege (POLP) using Azure ARM.
This helps ensure that only the minimum required access is provided to the users supporting the application/platform. For example, only the infosec team will have access to manage the credentials; applications will only have read access and will not store credentials on the file system at any time. This also limits the impact of a breach in any tier.
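As a rough illustration of tier segmentation with NSGs, the Python sketch below evaluates rules in priority order (lowest number first, first match wins), which is how NSG rules are processed. Everything else is a simplification of my own: real rules use address prefixes or service tags and are attached per subnet or NIC, whereas here tiers are just names, and the tier names and ports are assumed examples.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    priority: int          # lower number = evaluated first
    source: str            # tier name, or "*" for any
    destination: str       # tier name, or "*" for any
    port: Optional[int]    # None means any port
    allow: bool


# Hypothetical three-tier layout: web -> app -> data, everything else denied.
RULES = [
    Rule(100, "web", "app", 443, True),
    Rule(110, "app", "data", 1433, True),
    Rule(4096, "*", "*", None, False),
]


def is_allowed(rules, source: str, destination: str, port: int) -> bool:
    """Return the decision of the first matching rule in priority order."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if (rule.source in ("*", source)
                and rule.destination in ("*", destination)
                and rule.port in (None, port)):
            return rule.allow
    return False  # nothing matched: deny


if __name__ == "__main__":
    print(is_allowed(RULES, "web", "app", 443))    # web tier calls app tier
    print(is_allowed(RULES, "app", "data", 1433))  # app tier calls database
    print(is_allowed(RULES, "web", "data", 1433))  # web tier skips a layer
```

The last call is denied by the catch-all rule, which is the breach-containment effect described above: a compromised web tier cannot reach the data tier directly.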
To ensure that each application tier individually, and the applications themselves, provide guaranteed performance and quality of service (QoS), it’s important to-
Implement SoC (Separation of Concerns) principle and avoid mixing different workloads on the same tier/VM.
Understand disk IOPS thresholds and segment the storage accounts accordingly.
Understand networking/bandwidth thresholds and separate production traffic from dev\test to maintain network QoS.
Azure fully supports agile methodology natively, including the concepts of continuous integration/deployment, blue-green deployments, A/B testing and canary releases. By clearly segmenting and demarcating the boundaries of the services/APIs and their environments, following microservices principles, we can deploy and upgrade the services with minimal impact on the platform. Azure natively supports microservices architecture and provides a fully managed platform (Service Fabric) for running highly sophisticated microservices in the cloud.
Isolated/Safe Innovation Environment
To ensure developers, testers and release management teams get a secure environment to deploy applications and platforms, it’s important to implement the security concepts mentioned above, i.e. NSG, UDR and ARM RBAC/Policies. A well designed environment gives developers a safe place to try out new technologies without hindrance, so they can continue to innovate and deliver business value.
Recently, I came across a requirement from my customer to migrate data from the AWS RDS/SQL service to Azure for some Big Data analysis. The obvious choice for this sort of activity in Azure is Azure Data Factory (ADF). There are many examples of ADF on MSDN covering various data sources and destinations, but a few are missing, and one of those is AWS RDS.
So how do you achieve it? Simple, treat AWS RDS/SQL as an on-prem SQL Server and follow the guidance for this specific scenario using Data Management Gateway.
Essentially you need to do the following from a very high level perspective-
- Create an instance on EC2 in AWS and configure relevant firewall rules (as specified in guidance)
- Deploy Data Management Gateway on the above instance.
- Test the RDS/SQL access via Data Management Gateway tool from the above instance.
- Create a data factory in ADF to read from the SQL Server linked service via the Gateway.
- Do the mapping of data.
- Store it in the destination of your choice (e.g. Blob storage)
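As a hedged sketch of the linked service step above, the Python below builds the kind of JSON definition that points ADF at the RDS instance through the gateway. The `OnPremisesSqlServer` type and the `gatewayName` property reflect the on-premises SQL Server linked service format as I understand it, so treat them as assumptions and confirm them against the guidance; every name, the endpoint and the credentials are placeholders.

```python
import json


def sql_linked_service(name: str, gateway_name: str, connection_string: str) -> dict:
    """Build a linked service definition that reaches SQL Server via a gateway.

    Property names are assumed from the on-premises SQL Server scenario.
    """
    return {
        "name": name,
        "properties": {
            "type": "OnPremisesSqlServer",
            "typeProperties": {
                "connectionString": connection_string,
                "gatewayName": gateway_name,
            },
        },
    }


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    definition = sql_linked_service(
        name="AwsRdsSqlLinkedService",
        gateway_name="Ec2DataManagementGateway",
        connection_string=(
            "Data Source=<rds-endpoint>;Initial Catalog=<database>;"
            "User ID=<user>;Password=<password>;"
        ),
    )
    print(json.dumps(definition, indent=2))
```

The point of the sketch is that nothing in the definition is AWS-specific: from ADF’s point of view the RDS endpoint is just a SQL Server reachable from the machine hosting the gateway.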