Unhealthy event: SourceId='FabricDCA', Property='DataCollectionAgent.DiskSpaceAvailable', HealthState='Warning', ConsiderWarningAsError=false. The Data Collection Agent (DCA) does not have enough disk space to operate. Diagnostics information will be left uncollected if this continues to happen.

You will often see this warning almost immediately after your Service Fabric cluster comes up if your virtual machine scale set (VMSS) uses VMs with smaller temporary disks (D:\).

So what’s going on here?

By default, Service Fabric's reliable collections logs for the reliable system services are stored in D:\SvcFab. To verify this, you can remote desktop into one of the VMs in the VMSS where this warning is coming from. Most people will only see this warning on the primary node type, because the services you create as a customer are generally stateless, so no stateful data logs are present on the non-primary node types.

The default replicator (reliable collections) log size is 8192 MB, so if your temporary disk is 7 GB (Standard_D2_v2, for example), you will see the warning message below in Service Fabric Explorer:

Unhealthy event: SourceId='FabricDCA', Property='DataCollectionAgent.DiskSpaceAvailable', HealthState='Warning', ConsiderWarningAsError=false. The Data Collection Agent (DCA) does not have enough disk space to operate. Diagnostics information will be left uncollected if this continues to happen.

How to fix this?

You can change the default replicator log size by adding a fabric setting named "KtlLogger" to the ARM template, as shown below. Note that this log file size does not grow or shrink once configured:

"fabricSettings": [
  {
    "name": "Security",
    "parameters": [
      {
        "name": "ClusterProtectionLevel",
        "value": "EncryptAndSign"
      }
    ]
  },
  {
    "name": "KtlLogger",
    "parameters": [
      {
        "name": "SharedLogSizeInMB",
        "value": "4096"
      }
    ]
  }
]

 

For VM temporary disk sizes and specs, see here- https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-general

More info on configuring reliable services and the manifest is here.

 

Cloud Services to Service Fabric, Why?

These are some of the interesting benefits of using Service Fabric (SF), both over Cloud Services and in general:

  1. High Density- Unlike Cloud Services, you can run multiple services on a single VM, saving both cost and management overhead. SF will re-balance or scale out the cluster if resource contention is predicted or occurs.
  2. Any Cloud or Data Center- A Service Fabric cluster can be deployed in Azure, on-premises or even in a third-party cloud if you need to because of an unforeseen change in your company's direction or regulatory requirements. It just runs better in Azure. Why? Because certain services are provided in Azure as a value addition, e.g. the upgrade service.
  3. Any OS- A Service Fabric cluster can run on both Windows and Linux. In the near future, you will be able to have a single cluster running both Windows and Linux workloads in parallel.
  4. Faster Deployments- As you do not create a new VM per service like in Cloud Services, new services can be deployed to the existing VMs as per the configured placement constraints (see the placement sketch after this list), making deployments much faster.
  5. Application Concept- In the microservices world, multiple services working together form a business function or an application. SF understands the concept of an application, not just the individual services that constitute a business function. SF treats and manages an application and its services as one entity to maintain the health and consistency of the platform, unlike Cloud Services.
  6. Generic Orchestration Engine- Service Fabric can orchestrate at both the process and container level should you need to. One technology to learn to rule them all.
  7. Stateful Services- A managed programming model to develop stateful services following the OOP principle of encapsulation, i.e. keeping state and operations together as a unit. Other container orchestration engines cannot do this. And of course you can develop reliable stateless services as well.
  8. Multi-tenancy- Deploy multiple versions of the same application for multiple clients side by side, or do canary testing (see the side-by-side deployment sketch after this list).
  9. Rolling Upgrades- Upgrade both applications and the platform without any downtime using a sophisticated rolling upgrade feature set.
  10. Scalable- Scale to hundreds or thousands of VMs if you need to, with auto scaling or manual scaling.
  11. Secure- Inter-VM encryption, cluster management authentication/authorization (RBAC) and network-level isolation are just a few ways to secure your cluster in an enterprise-grade manner.
  12. Monitoring- Unlike Cloud Services, SF comes with a built-in OMS solution which understands the events raised by the SF cluster and takes appropriate action. Both in-process and out-of-process logging are supported.
  13. Resource Manager Deployments- Unlike Cloud Services, which still run in the classic deployment model, SF clusters and applications use the Resource Manager deployment model, which is much more flexible and deploys only the artefacts you need.
  14. Pace of Innovation- Cloud Services is an old platform, still used by many large organisations for critical workloads, but it is not the platform that will get new innovative features in the future.
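
To make the placement constraints in item 4 concrete, here is a minimal sketch, assuming you deploy services as ARM resources (Microsoft.ServiceFabric/clusters/applications/services): a placement property is declared on a node type in the cluster resource, and the service is constrained to nodes carrying that property. The node type, property, application and service names are hypothetical, the API version may differ for your setup, non-essential properties are omitted, and the // comments are for illustration only.

// inside Microsoft.ServiceFabric/clusters -> properties -> nodeTypes
{
  "name": "BackEndNodeType",
  "isPrimary": false,
  "vmInstanceCount": 5,
  "placementProperties": {
    "HasSSD": "true"
  }
  // other required node type settings (ports, durability, etc.) omitted
}

// a service pinned to that node type via the property
{
  "type": "Microsoft.ServiceFabric/clusters/applications/services",
  "apiVersion": "2017-07-01-preview",
  "name": "myCluster/MyApp/MyApp~StatelessBackendService",
  "properties": {
    "serviceKind": "Stateless",
    "serviceTypeName": "StatelessBackendServiceType",
    "instanceCount": -1,
    "partitionDescription": { "partitionScheme": "Singleton" },
    // only place instances on nodes that advertise HasSSD == true
    "placementConstraints": "(HasSSD == true)"
  }
}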
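
For the side-by-side/canary deployment in item 8, a sketch along the same lines: two named application instances of the same (hypothetical) application type, pinned to different type versions, assuming both versions have already been registered on the cluster.

"resources": [
  {
    "type": "Microsoft.ServiceFabric/clusters/applications",
    "apiVersion": "2017-07-01-preview",
    "name": "myCluster/ShopApp-ClientA",
    "properties": {
      "typeName": "ShopAppType",
      // stable version for client A
      "typeVersion": "1.0.0",
      "parameters": { "Tenant": "ClientA" }
    }
  },
  {
    "type": "Microsoft.ServiceFabric/clusters/applications",
    "apiVersion": "2017-07-01-preview",
    "name": "myCluster/ShopApp-ClientB",
    "properties": {
      "typeName": "ShopAppType",
      // newer (canary) version of the same application type for client B
      "typeVersion": "1.1.0",
      "parameters": { "Tenant": "ClientB" }
    }
  }
]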

More technical differences are here.

Service Fabric Log

Notes from the field on Azure Service Fabric (ASF) and some lesser-known facts.

FAQs:

  1. Why do we need durability level Silver or Gold? The Silver and Gold tiers allow SF to integrate with the underlying VMSS, resulting in the following features (see the durability/reliability sketch after this list):
    • You can scale back the underlying VMSS after scaling it out, and ASF will recognise this change and will not mark the cluster as unhealthy. Also note that you cannot scale the nodes back to anything below 5, even though a Silver/Gold node type has only 3 nodes when you create the cluster.
    • It allows ASF to intercept and delay VM-level actions requested by the platform or the cluster admin, so that stateful services can maintain the minimum replica set/quorum at any point in time.
  2. Can I change the durability tier of an existing cluster/node type?
    • Yes, you can upgrade from lower levels to higher, and go from Gold -> Silver.
  3. Why do we need a minimum of 5 nodes in the primary node type? Because:
    • To maintain quorum, you need a majority of the nodes in the primary node type running at any point in time. If you are upgrading the ASF binaries to a new version and Microsoft decides to update the host machine which hosts one of your 3 VMs, then you are 2 VMs down out of 3, which will impact the stateful system services. If you had 5 VMs, taking out 2 of 5 still leaves 3 (a majority) available.
  4. Does Microsoft support an ASF cluster spanning multiple Azure DCs?
    • Generally speaking, it is not supported.
  5. Can I add/remove Node Types after the cluster is created?
    • Yes.
  6. Can I scale ASF?
    • Yes, via VMSS auto/manual scale mechanism only at present.
  7. Can I make an unsecure cluster a secure cluster?
    • No.
  8. Can I scale stateful services?
    • Yes, by partitioning the data, which allows multiple parallel instances of a service type to receive requests for their respective partitions (see the partitioned service sketch after this list).
  9. Each application instance runs in isolation, with its own work directory and process.
  10. The Service Fabric process runs in kernel mode, hence applications running under it will not be able to crash it easily.
  11. By default, the cluster certificates are added to the allowed Admin certificates list, hence the template here secures both node-to-node and client-to-node communication. You can, though, add separate client certs for the read-only and admin cluster roles (see the client certificate sketch after this list).
  12. Scaling out a VMSS causes other nodes in the same VMSS to show a stopping state. This is a superficial UI bug and will be fixed soon; the VMs in the VMSS do not actually stop.
  13. Can I create an SF cluster with small-sized VMs?
    • Yes, you can, but please bear in mind that when you do, the cluster may start to throw warnings; see this post to remove those warnings.
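
As a companion to items 1-3 above, this sketch shows roughly where the durability and reliability tiers sit in an ARM template: durabilityLevel on the node type in the cluster resource and again in the ServiceFabricNode extension of the matching VMSS, and reliabilityLevel on the cluster itself. Names are illustrative, most properties are omitted, and the // comments are only for explanation.

// Microsoft.ServiceFabric/clusters -> properties
"reliabilityLevel": "Silver",          // Silver = system services run with a target replica set size of 5
"nodeTypes": [
  {
    "name": "primaryNodeType",
    "isPrimary": true,
    "vmInstanceCount": 5,
    "durabilityLevel": "Silver"        // must match the VMSS extension setting below
    // other node type settings omitted
  }
]

// matching VMSS -> ServiceFabricNode extension -> settings
"durabilityLevel": "Silver",
"nodeTypeRef": "primaryNodeType"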
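
For item 8 (scaling stateful services by partitioning), a partitioned stateful service could be described like this when deployed as an ARM service resource; the application and service names are hypothetical, and a UniformInt64Range scheme with 5 partitions is assumed.

{
  "type": "Microsoft.ServiceFabric/clusters/applications/services",
  "apiVersion": "2017-07-01-preview",
  "name": "myCluster/MyApp/MyApp~OrderService",
  "properties": {
    "serviceKind": "Stateful",
    "serviceTypeName": "OrderServiceType",
    "hasPersistedState": true,
    "targetReplicaSetSize": 3,
    "minReplicaSetSize": 3,
    // five partitions spread the full Int64 key space across parallel replica sets
    "partitionDescription": {
      "partitionScheme": "UniformInt64Range",
      "count": 5,
      "lowKey": "-9223372036854775808",
      "highKey": "9223372036854775807"
    }
  }
}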
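
And for item 11, separate admin and read-only client certificates can be declared on the cluster resource roughly as below; the thumbprints are placeholders.

// Microsoft.ServiceFabric/clusters -> properties
"certificate": {
  "thumbprint": "<cluster certificate thumbprint>",
  "x509StoreName": "My"
},
"clientCertificateThumbprints": [
  {
    "certificateThumbprint": "<admin client certificate thumbprint>",
    "isAdmin": true                    // full cluster management rights
  },
  {
    "certificateThumbprint": "<read-only client certificate thumbprint>",
    "isAdmin": false                   // user/read-only role
  }
]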