Alerting and Monitoring in Azure

Alerting and monitoring are an afterthought in most solutions. This thinking can lead to production issues and to organisations missing the O/SLAs they agreed with their business or other parties.

When it comes to Azure, we have a range of services to cover all aspects of monitoring-

  1. Azure Monitor-
    A consolidated alerting and monitoring service for Azure 1st party/core services.
  2. Azure Alerts-
    By using alerts, you can configure conditions over data and get notified when the conditions match the latest monitoring data.
  3. Service Health-
    It is a suite of experiences that provide personalized guidance and support when issues in Azure services affect you.
  4. Azure Security Center-
    It provides unified security management and advanced threat protection across hybrid cloud workloads.
  5. Azure Advisor-
    It is a personalized cloud consultant that helps you follow best practices to optimize your Azure deployments.
  6. Application Insights-
    It is an extensible Application Performance Management (APM) service for web developers building and managing apps on multiple platforms.
  7. Log Analytics-
    It plays a central role by consolidating monitoring data from different sources and providing a powerful query language (Kusto) for consolidation and analysis.
  8. Azure Mobile App-
    App to manage your Azure resources, in essence, a portal in a phone.

Although the above services can cover alerting and monitoring fully, it can be an overwhelming exercise to make them work together and to understand their individual contributions to the overall solution.

To help you understand how these services are connected and how they can benefit you as a consumer of Azure services, I’ve put together the following diagram-

…Next up, surface information for different stakeholders, essentially create a dashboard as a product for each role.

Unhealthy event: SourceId=’FabricDCA’, Property=’DataCollectionAgent.DiskSpaceAvailable’, HealthState=’Warning’, ConsiderWarningAsError=false. The Data Collection Agent (DCA) does not have enough disk space to operate. Diagnostics information will be left uncollected if this continues to happen.

You will often see this error pretty much right away when your Service Fabric cluster comes up, if you are using a VMSS whose VMs have small temporary disks (D:\).

So what’s going on here?

By default, Service Fabric stores the reliable collections logs for its reliable system services in D:\SvcFab. To verify this, you can remote desktop in to one of the VMs in the VMSS that is raising this warning. Most people will only see this warning on the primary node type, because the services you create as a customer are generally stateless, so no stateful data logs are present on the non-primary node types.

The default size of the replicator log (reliable collections) is 8192 MB, so if your temporary disk is 7 GB (Standard_D2_v2), for example, you will see the warning message in the cluster explorer as below-

Unhealthy event: SourceId=’FabricDCA’, Property=’DataCollectionAgent.DiskSpaceAvailable’, HealthState=’Warning’, ConsiderWarningAsError=false. The Data Collection Agent (DCA) does not have enough disk space to operate. Diagnostics information will be left uncollected if this continues to happen.

How to fix this?

You can change the default replicator log size by adding a fabric setting named “KtlLogger” to the ARM template, as shown below. Note that this file does not grow or shrink once it has been configured-

"fabricSettings": [
  {
    "name": "Security",
    "parameters": [
      { "name": "ClusterProtectionLevel", "value": "EncryptAndSign" }
    ]
  },
  {
    "name": "KtlLogger",
    "parameters": [
      { "name": "SharedLogSizeInMB", "value": "4096" }
    ]
  }
]
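
As a rough sanity check on the value you pick, here is a minimal Python sketch that compares a candidate SharedLogSizeInMB with the size of the temporary disk. The function name and the headroom figure are my own assumptions for illustration; they are not Service Fabric settings.

```python
def shared_log_fits(temp_disk_gb: float, shared_log_mb: int, headroom_mb: int = 1024) -> bool:
    """Return True if the shared log plus some headroom fits on the temporary disk.

    headroom_mb is a hypothetical allowance for everything else stored on the
    temporary disk; adjust it to whatever you observe on your own nodes.
    """
    temp_disk_mb = temp_disk_gb * 1024
    return shared_log_mb + headroom_mb <= temp_disk_mb


# Default 8192 MB log versus the reduced 4096 MB log on a 7 GB temporary disk.
print(shared_log_fits(7, 8192))
print(shared_log_fits(7, 4096))
```

The point is only that 8192 MB is already larger than a 7 GB (7168 MB) disk, whereas 4096 MB leaves room to spare.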


For VM temporary disk sizes and specs, see here-

More info around configuring reliable services\manifest is here-


Service Fabric Log

Notes from the field on Azure Service Fabric (ASF) and some lesser-known facts-


  1. Why do we need a durability level of Silver or Gold? The Silver and Gold tiers allow SF to integrate with the underlying VMSS, resulting in the following features-
    • You can scale the underlying VMSS back in after scaling it out; ASF will recognise this change and will not mark the cluster as unhealthy. Also note that you cannot scale the nodes back to fewer than 5, even though a Silver/Gold node type has only 3 nodes when you create the cluster.
    • It allows ASF to intercept and delay VM-level actions requested by the platform or the cluster admin, so that stateful services can maintain the minimum replica set/quorum at any point in time.
  2. Can I change the durability tier of an existing cluster/node type?
    • Yes, you can upgrade from lower levels to higher ones, and move from Gold -> Silver.
  3. Why do we need a minimum of 5 nodes in the primary node type? Because-
    • To maintain quorum, you need a majority of the nodes in the primary node type running at any point in time. If you are upgrading the ASF binaries to a new version (which takes one VM down) and Microsoft decides to update the host machine that hosts another of your 3 VMs, you are 2 VMs down out of 3, which will impact the stateful system services. If you had 5 VMs, taking out 2 of 5 would still leave 3 (a majority) available.
  4. Does Microsoft support an ASF cluster spanning multiple Azure DCs?
    • Generally speaking, it is not supported.
  5. Can I add/remove Node Types after the cluster is created?
    • Yes.
  6. Can I scale ASF?
    • Yes, at present only via the VMSS auto/manual scale mechanism.
  7. Can I make an unsecure cluster a secure cluster?
    • No.
  8. Can I scale stateful services?
    • Yes, by partitioning the data, which allows multiple parallel instances of the service type to receive the requests for their respective partitions.
  9. Each application instance runs in isolation, with its own work directory and process.
  10. The Service Fabric process runs in kernel mode, hence applications running under it will not be able to crash it easily.
  11. By default, the cluster certificates are added to the allowed admin certificates list, hence the template here secures both node-to-node and client-to-node comms. You can, though, add separate client certs for the read-only and admin cluster roles.
  12. Scaling out a VMSS causes other nodes in the same VMSS to show a stopping state. This is a superficial UI bug and will be fixed soon; the VMs in the VMSS do not stop in reality.
  13. Can I create an SF cluster with small VMs?
    • Yes, you can, but please bear in mind that the cluster may start to throw warnings when you do that. See this post to remove those warnings.
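
The quorum arithmetic behind the 5-node minimum (point 3 above) can be sketched in a few lines of Python. This is a simplified majority check written for illustration, assuming one replica per node; it is not how Service Fabric itself computes quorum.

```python
def has_quorum(total_nodes: int, nodes_down: int) -> bool:
    """Majority quorum: more than half of the nodes must still be running."""
    nodes_up = total_nodes - nodes_down
    return nodes_up >= total_nodes // 2 + 1


# One node down for an ASF upgrade plus one down for a host update.
for total in (3, 5):
    state = "quorum kept" if has_quorum(total, 2) else "quorum lost"
    print(total, "nodes, 2 down:", state)
```

With 3 nodes the cluster cannot tolerate two simultaneous losses, while with 5 nodes it can.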

Migrate from AWS RDS/SQL using ADF

Recently, I came across a requirement from my customer to migrate data from the AWS RDS/SQL service to Azure for some Big Data analysis. The obvious choice for this sort of activity in Azure is Azure Data Factory (ADF). There are many examples of ADF on MSDN covering various data sources and destinations, but a few are missing, and one of them is AWS RDS.

So how do you achieve it? Simple: treat AWS RDS/SQL as an on-prem SQL Server and follow the guidance for that specific scenario using the Data Management Gateway.

Essentially, you need to do the following from a very high-level perspective-

  1. Create an instance on EC2 in AWS and configure the relevant firewall rules (as specified in the guidance).
  2. Deploy the Data Management Gateway on the above instance.
  3. Test access to RDS/SQL from the above instance using the Data Management Gateway tool.
  4. Create an ADF data factory that reads from a SQL Server linked service via the gateway.
  5. Do the mapping of data.
  6. Store it in the destination of your choice (e.g. Blob storage).
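
For step 4, the linked service is just a small JSON document. Below is a minimal Python sketch that builds one. The shape (type "OnPremisesSqlServer" with connectionString and gatewayName) is what I understand the original (version 1) ADF on-prem SQL Server linked service to look like, so please check it against the guidance; the helper name, the linked service name and every value in angle brackets are placeholders I made up.

```python
import json


def sql_linked_service(name: str, connection_string: str, gateway_name: str) -> dict:
    """Build an ADF linked service definition for a SQL Server reached via a gateway."""
    return {
        "name": name,
        "properties": {
            # Assumed type name for a SQL Server behind the Data Management Gateway.
            "type": "OnPremisesSqlServer",
            "typeProperties": {
                "connectionString": connection_string,
                "gatewayName": gateway_name,
            },
        },
    }


definition = sql_linked_service(
    name="AwsRdsSqlLinkedService",  # hypothetical name
    connection_string=(
        "Data Source=<rds-endpoint>;Initial Catalog=<database>;"
        "Integrated Security=False;User ID=<user>;Password=<password>;"
    ),
    gateway_name="<gateway-name>",
)
print(json.dumps(definition, indent=2))
```

The only AWS-specific part is the Data Source, which points at the RDS endpoint instead of a server on your own network.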

Adding Authentication via ARM for API Apps/Gateway

[API Apps Preview 2 has changed the auth model defined below, please refer here for details about what’s changed]

This one was left out for a long time, I must admit. Since I joined Microsoft, I have been kept very busy learning about my new role, the organisation and the on-boarding process. Today is the first weekend I have had some breathing space to revisit this, but in the meantime I had some excellent pointers on it from Gozalo Ruiz (Lead CSA in my team), which led me to resolve it faster than I would have otherwise.

Here’s the problem: I had a fully automated ALM pipeline configured to build, test and deploy an API App to Azure from VS Team Services (previously known as VS Online), except that there was no easy way to configure the authentication identity for the gateway. For those who don’t know how API App authentication works (this is set to change now; the gateway will not be a requirement in future), each API App is fronted by a gateway, which manages authentication for every API App within the same Resource Group. I needed to secure my API via Azure AD, so I used Azure Active Directory as a provider in the gateway (see this post if you want to learn a bit about the authentication mechanism in API Apps; it’s a topic in itself though).

Here’s a screenshot of the configuration that the ARM deployment should have populated the gateway with.


The solution is simple: populate the relevant appSettings for this configuration when you create the API App with the gateway. They weren’t easy to find (I wish they were), but here they are for your use. Refer to the complete template here.

"appSettings": [
  { "name": "ApiAppsGateway_EXTENSION_VERSION", "value": "latest" },
  { "name": "EmaStorage", "value": "D:\\home\\data\\apiapps" },
  { "value": "1" },
  { "name": "MS_AadClientID", "value": "21EC2020-3AEA-4069-A2DD-08002B30309D" },
  { "name": "MS_AadTenants", "value": "" }
]

If you are using an identity provider other than AAD, you could use one of these instead (I’ve not tested these, but they should work in theory)





Api App ZuMo Authentication\Testing in CI

[API Apps Preview 2 has changed the auth model defined below, please refer here for details about what’s changed]

Recently, I ran into a situation where one of my in-house development teams wanted to run a load test in the CI pipeline against the Api App they had developed and deployed in Azure. The Api App was using Azure AD as an identity provider via the App Service gateway.

To solve this problem, you first need to understand how OAuth works in the Api App case. A good explanation of this is provided here by Tom Dykstra.

We are using the client flow authentication mechanism in this instance, as our scenario was based on service-to-service interaction with no user/password prompts. I’ve used this service-to-service flow in the past for Web API apps (the pre-App Service incarnation). The flow is defined here and uses the OAuth 2.0 Client Credentials Grant Flow. I was very keen on using the same workflow for Api Apps, as it allows me to authenticate with a different client Azure AD app without impacting my Api App service.

Please follow the article mentioned above to set up the client and service (Api App) apps in Azure AD. Once done, we should logically have something like the below.


Let’s cut to the chase: here is how the client flow for Api App authentication works at the HTTP level-


Here is what the HTTP requests and responses look like (via Fiddler)-

  1. POST HTTP/1.1

    Accept: application/json
    Content-Type: application/x-www-form-urlencoded



    Explanation- the client_id and client_secret here are for the client AD app and not the API App AD app. Because the client AD app has been given access to the API App AD app in Azure AD, the token returned will be valid for the API App AD app. This way you can remove a client’s access to your API App without changing any config in the API App AD app.

  2. HTTP/1.1 200 OK

    Content-Type: application/json; charset=utf-8

    {
      "token_type": "Bearer",
      "expires_in": "3600",
      "expires_on": "1445096696",
      "not_before": "1445092796",
      "resource": "",
      "access_token": "eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6Ik1uQ19WWmNBVGZNNXBPWWlKSE1iYTlnb0VLWSIsImtpZCI6Ik1uQ19...."
    }
  3. POST HTTP/1.1

    Content-Type: application/json; charset=utf-8

    {
      "access_token": "eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6Ik1uQ19WWmNBVGZNNXBPWWlKSE1iYTlnb0VLWSIsImtpZCI6Ik1uQ19...."
    }
  4. This step happens behind the scenes, hence no Fiddler trace is available.
  5. HTTP/1.1 200 OK

    Content-Type: application/json; charset=utf-8

    {
      "user": { "userId": "sid:171BC49224A24531BDF480132959DD54" },
      "authenticationToken": "eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmdWxscm93IjoiYWxsIiwiRGJnMiI6ImxvZ2luIiwidmVyIjoiMyIsIn...."
    }
  6. GET HTTP/1.1

    X-ZUMO-AUTH: eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmdWxscm93IjoiYWxsIiwiRGJnMiI6ImxvZ2luIiwidmVyIjoiMyIsIn....
  7. HTTP/1.1 200 OK

    Content-Type: application/json; charset=utf-8

    {
      "PostcodeFull": "KT19 8QJ",
      "PostTown": "EPSOM",
      "DependentLocality": null,
      "DoubleDependentLocality": null,
      "ThoroughfareAndDescriptor": "Parkview Way",
      "DependentThoroughfareAndDescriptor": null,
      "BuildingNumber": "34",
      "BuildingName": null,
      "SubBuildingName": null,
      "POBox": null,
      "DepartmentName": null,
      "OrganisationName": null,
      "UDPRN": "51946386",
      "PostcodeType": "S"
    }

As you can see, you can easily replicate this communication using HttpClient in .NET (or in any other language, for that matter) to get the ZuMo token for calling the authenticated operations on the Api App. This is exactly what we did: we placed that logic in a WebTest plugin for a Visual Studio web performance test to automate the process in the CI pipeline. The load test was then built on the web performance test, which placed the retrieved ZuMo token in the X-ZUMO-AUTH header of each HTTP request to the Api App in real time.
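
To make the sequence concrete, here is a minimal Python sketch of the same three calls using only the standard library (it is not the WebTest plugin itself). All URLs are passed in as parameters because they depend on your tenant, gateway and Api App; the function names are mine, and the request and response field names are taken from the traces above. The "resource" form field is an assumption based on the Azure AD client credentials flow.

```python
import json
import urllib.parse
import urllib.request


def build_token_request(token_url, client_id, client_secret, resource):
    """Step 1: client credentials grant against the Azure AD token endpoint."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,          # client AD app, not the Api App AD app
        "client_secret": client_secret,
        "resource": resource,            # assumed: identifies the Api App AD app
    }).encode("utf-8")
    headers = {
        "Accept": "application/json",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return urllib.request.Request(token_url, data=body, headers=headers, method="POST")


def build_gateway_login_request(gateway_login_url, aad_access_token):
    """Step 3: hand the Azure AD token to the gateway to obtain a ZuMo token."""
    body = json.dumps({"access_token": aad_access_token}).encode("utf-8")
    headers = {"Content-Type": "application/json; charset=utf-8"}
    return urllib.request.Request(gateway_login_url, data=body, headers=headers, method="POST")


def build_api_request(api_url, zumo_token):
    """Step 6: call the Api App with the ZuMo token in the X-ZUMO-AUTH header."""
    return urllib.request.Request(api_url, headers={"X-ZUMO-AUTH": zumo_token}, method="GET")


def send_json(request):
    """Send a request and decode the JSON response (this does network I/O)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def call_api(token_url, gateway_login_url, api_url, client_id, client_secret, resource):
    """Run steps 1 to 7 end to end and return the Api App's JSON response."""
    aad = send_json(build_token_request(token_url, client_id, client_secret, resource))
    zumo = send_json(build_gateway_login_request(gateway_login_url, aad["access_token"]))
    return send_json(build_api_request(api_url, zumo["authenticationToken"]))
```

Keeping the request builders separate from send_json makes the flow easy to unit test without touching the network, which is handy in a CI pipeline.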

Advantages of this approach-

  1. You don’t have to share your service (Api App) master key with the clients. Clients use the app secret/key from their own Azure AD app, which has access to the service they want to use. You can use this approach in a production environment.
  2. You are testing the application authentication flow as it would be in production for the users.

If you want the plugin code for this, please give me a shout (I’ll put it on GitHub anyway once I get the opportunity). Here’s the code until then if you need it (caveat: you will need to take care of refreshing the token after 1 hour in this instance; more on that here).

If you are using a server flow for authentication instead of a client flow, you can take the approach mentioned here (by Yossi Dahan) as you may not have client flow authentication code in your app. But there is no reason why you could not use this approach there as well.

Azure Stream Analytics Lag (and LAG)

Recently, I started to look at the Azure PaaS model for MI/BI/analytics solutions for one of my low latency/high throughput projects here at work. After much thought, and after considering options like Storm, Redis Cache and Stream Analytics, I came to the conclusion that the latter is the most suitable technology for my requirements. So off I went and started to prototype it. The experience has been largely pleasant, except for a few things which I will mention shortly in this blog.

My key requirement was to analyse the data stream as it flows through and highlight the confidence of the applicant who is taking the driving theory test (for IPR reasons I’ve changed the example here); my objective is to find the applicants who are not confident of their answers when sitting the test. This analysis could be a major subject of discussion in itself, but I will leave that for another day and focus on the technology side of things here. After looking at examples on the web (like here), I was convinced that I could use an inner join in the query to meet my requirements (which are described below), until I bumped into another function called LAG, which is discussed later.

Here are the requirements-

Data is received in the JSON format as described below-

{
  "ActivityId": "c8c64b10-f029-4dc6-93c5-31373ed72a31",
  "Timestamp": "2015-05-15T08:04:26.9445222Z",
  "ProcessingServer": "RD000D3A21C503",
  "ApplicantId": "C0001",
  "Attributes": {
    "BrakingDistanceAt20MPH": "40",
    "BrakingDistanceAt30MPH": "75"
  }
}
{
  "ActivityId": "c8c64b10-f029-4dc6-93c5-31373ed72a31",
  "Timestamp": "2015-05-15T08:04:26.9445222Z",
  "ProcessingServer": "RD000D3A21C503",
  "ApplicantId": "C0001",
  "Attributes": {
    "BrakingDistanceAt20MPH": "40",
    "BrakingDistanceAt30MPH": "80"
  }
}

In the above example, if someone changes their answers often because they are not sure of them, we would like to highlight that, or even dynamically load questions that are more focused on this specific area. Technically, we want to emit an event every time an applicant changes an answer within a certain time window (let’s say within an hour in our case). That event will then be stored in a cache along with the past events to provide a single applicant view at any point in time.

One way to achieve this in Stream Analytics is to use an inner self-join like this (inspired by this blog)-

SELECT
 System.Timestamp AS ReceivedOn,
 QMM1.ApplicantID,
 QMM1.Attributes.BrakingDistanceAt20MPH AS FromBrakingDistanceAt20MPH,
 QMM2.Attributes.BrakingDistanceAt20MPH AS ToBrakingDistanceAt20MPH,
 QMM1.Attributes.BrakingDistanceAt30MPH AS FromBrakingDistanceAt30MPH,
 QMM2.Attributes.BrakingDistanceAt30MPH AS ToBrakingDistanceAt30MPH
FROM Input QMM1 TIMESTAMP BY [TimeStamp]
JOIN Input QMM2 TIMESTAMP BY [TimeStamp]
 ON QMM1.ApplicantID = QMM2.ApplicantID
 AND DATEDIFF(ss, QMM1, QMM2) BETWEEN 1 AND 3600 -- for 1 hour
WHERE QMM1.Attributes.BrakingDistanceAt20MPH != QMM2.Attributes.BrakingDistanceAt20MPH
 OR QMM1.Attributes.BrakingDistanceAt30MPH != QMM2.Attributes.BrakingDistanceAt30MPH

This will work as expected, but there are two minor issues with it-

  1. Output is not relative, i.e. it shows the state change from the initial state to each later state, e.g. 75->80, 75->85 for the 30mph braking distance (ideally I would need 75->80, 80->85; see the output below).
  2. Output is not real-time (the reason is not known, possibly optimisation). I could only see the emitted events in the sink after 2-3 minutes, which would defeat the purpose of using Stream Analytics/Event Hub for my scenario. This was recently explained by Zhong Chen here.

Output from the above query looks like this-

2015-05-15T08:05:26.944Z C0001 40 40 75 80
2015-05-15T08:05:26.944Z C0001 40 40 75 85

Here comes help from another analytic function in Stream Analytics called LAG. This function lets you look up the previous record in the defined time window, so you can compare certain field(s) of the current record with it and see each change relative to the previous value. This is exactly what I wanted, and the syntax is very concise as well, so I started to get on with it. Here’s the query-

WITH FlatInput AS
(
 SELECT ApplicantID,
 Attributes.BrakingDistanceAt20MPH AS BrakingDistanceAt20MPH,
 Attributes.BrakingDistanceAt30MPH AS BrakingDistanceAt30MPH
 FROM Input
)
SELECT ApplicantID, BrakingDistanceAt20MPH AS ToBrakingDistanceAt20MPH,
  LAG(BrakingDistanceAt20MPH) OVER (LIMIT DURATION(ss, 3600)) AS FromBrakingDistanceAt20MPH,
  BrakingDistanceAt30MPH AS ToBrakingDistanceAt30MPH,
  LAG(BrakingDistanceAt30MPH) OVER (LIMIT DURATION(ss, 3600)) AS FromBrakingDistanceAt30MPH
 FROM FlatInput

This was not as simple as it seems here. The problem is that the LAG function does not like qualified (nested) fields in JSON, i.e. I could not use Attributes.BrakingDistanceAt20MPH in the LAG function above directly; it would either throw an error saying the field name is invalid or give a compile-time error. This error is a bit misleading, as the field name is a perfectly valid qualified name, so I spoke to the MS CSA (Rupert Benbrook), who asked me to try flattening the rows first using WITH and then use the aliases in the LAG function in a second query (as defined above), and guess what, it started to work. So this is something to keep in mind when you use the LAG function.

Here is the output of the above query, now displaying the relative changes (30mph)-

C0001 40 75
C0001 40 40 80 75
C0001 40 40 85 80
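
If it helps to see what LAG with a LIMIT DURATION is doing, here is a small Python simulation of the idea. It is only a sketch of the semantics as I understand them, not Stream Analytics code; the function name, the short field name bd30 and the one-minute spacing between events are made up for the example.

```python
from datetime import datetime, timedelta


def lag_within(events, field, limit=timedelta(hours=1)):
    """Pair each event's value with the previous event's value.

    The previous value is only used when it arrived within `limit` of the
    current event; otherwise the "from" value is None. Events are assumed
    to be ordered by timestamp.
    """
    rows = []
    previous = None
    for event in events:
        from_value = None
        if previous is not None and event["ts"] - previous["ts"] <= limit:
            from_value = previous[field]
        rows.append((event["applicant"], from_value, event[field]))
        previous = event
    return rows


start = datetime(2015, 5, 15, 8, 4, 26)
events = [
    {"applicant": "C0001", "ts": start, "bd30": "75"},
    {"applicant": "C0001", "ts": start + timedelta(minutes=1), "bd30": "80"},
    {"applicant": "C0001", "ts": start + timedelta(minutes=2), "bd30": "85"},
]
for row in lag_within(events, "bd30"):
    print(row)
```

Each row carries the previous value alongside the new one, which gives the relative 75->80 and 80->85 steps rather than every change measured from the initial 75.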

In the end, I settled on the LAG function, which is a much cleaner way to emit the events as per my requirements. Hope this helps someone who is working on a similar activity.

Azure Cloud Service Configuration

This is a bit of a surprise: when you run a web role using “Use IIS Webserver” from VS 2013, it does not run under the cloud service context, hence you will need to migrate the settings from .cscfg to the web.config appSettings. You will, however, still be able to use the CloudConfigurationManager.GetSetting method to get the settings from appSettings. If you run the same web role using the “Use IIS Express” setting, your service runs under the cloud service context, hence it reads the config data from the .cscfg file, even though both settings run the service under the compute emulator.