Microsoft Teams, Copilot, Azure Communication Services, and many other Microsoft product offerings rely on a unified messaging platform that powers real-time communication and collaboration at an unprecedented scale. This messaging platform has become critical for enabling boundary-less collaboration, supporting hundreds of millions of users worldwide. To ensure the global discovery, durable storage, and performance needed for real-time communication, the messaging platform relies on Azure Cosmos DB as one of its data stores. It has data distributed across most Azure regions, holds several petabytes of data, and performs trillions of database transactions per day to power mission-critical messaging scenarios. In this article, we will share why we chose Azure Cosmos DB and some of the lessons we have learned from running it at scale.
Why we chose Azure Cosmos DB
To support our mission of enabling real-time communication and collaboration for hundreds of millions of users globally, we needed a data store that could meet our stringent requirements. Some of the most critical requirements were:
- Global distribution with seamless replication across regions in public and sovereign clouds
- Fully managed with automatic scale-out to reduce operational overhead
- Multi-region reads and writes for effective global user and group discovery and routing
- Built-in resiliency and automatic backups for better fault tolerance and disaster recovery
- Ultra-low latency for both reads and writes to meet real-time needs
- Planet-scale throughput to handle massive and spiky traffic patterns
Azure Cosmos DB meets all of these needs and more. It powers several core components in our pipeline. We use partitioned collections to store user and group metadata and messages, partitioned and denormalized to serve our queries efficiently. The change feed drives our downstream subscriber pipeline, ensuring reliable delivery and supporting fan-out to multiple processing layers. During the early days of the COVID-19 pandemic, our storage infrastructure scaled seamlessly to meet the sudden surge in traffic, ensuring uninterrupted service during a time of unprecedented digital demand.
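To make the fan-out idea concrete, here is a minimal, illustrative sketch (not our production pipeline, and not the Cosmos DB SDK): a change feed behaves like an ordered log of document changes, and each downstream subscriber reads it independently by tracking its own continuation token.

```python
# Illustrative only: an in-memory stand-in for a change feed, showing why
# multiple subscribers can fan out from the same feed without coordinating.

class ChangeFeed:
    def __init__(self):
        self._log = []  # ordered list of changed documents

    def append(self, doc):
        self._log.append(doc)

    def read(self, continuation=0, max_items=100):
        """Return changes after `continuation` plus a new continuation token."""
        batch = self._log[continuation:continuation + max_items]
        return batch, min(continuation + max_items, len(self._log))

feed = ChangeFeed()
feed.append({"id": "m1", "text": "hello"})
feed.append({"id": "m2", "text": "world"})

# Two independent subscribers, each holding its own continuation token,
# see the same ordered changes.
batch_a, token_a = feed.read(continuation=0)
batch_b, token_b = feed.read(continuation=0)
```

In the real service the continuation token is supplied by Cosmos DB and persisted by each consumer, which is what lets processing layers scale and recover independently.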
Scaling Lessons & Optimizations
- Partitioning strategy – We’ve learned that thoughtful partition design is critical: suboptimal choices can lead to hot partitions, throttling, and degraded performance. To avoid cross-partition queries, we use fine-grained logical partitions. For example:
  - We use user IDs and group IDs as partition keys for storing metadata and messages, which provide sticky partitions ideal for user- and group-centric access patterns.
  - For our delivery pipeline, we use event IDs to create non-sticky partitions that support high-throughput fan-out.
To further optimize for diverse query patterns, we store some data in a denormalized and duplicated form across containers, each configured with a different partition key. This approach allows us to tailor data access for specific scenarios, such as message rendering, roster lookups, or user-centric left-rail rendering, while minimizing latency and avoiding expensive cross-partition operations.
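The two partitioning styles above can be sketched in Cosmos DB's container-definition shape. The container and property names here are hypothetical, chosen only to illustrate sticky versus non-sticky keys:

```python
# Hypothetical container definitions illustrating the two partitioning styles.

messages_container = {
    "id": "messages",
    # All messages for one chat or group land in one logical partition
    # ("sticky"), so rendering a conversation is a single-partition query.
    "partitionKey": {"paths": ["/groupId"], "kind": "Hash"},
}

delivery_container = {
    "id": "delivery-events",
    # A fresh event ID per delivery spreads writes evenly across partitions
    # ("non-sticky"), which suits high-throughput fan-out.
    "partitionKey": {"paths": ["/eventId"], "kind": "Hash"},
}
```

The trade-off is the usual one: sticky keys make reads cheap but concentrate write load, while non-sticky keys spread writes but make "read everything for one user" a cross-partition query, which is why denormalized copies with different keys are kept per scenario.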
- Indexing policies – We apply tailored indexing strategies to support low-latency queries across various user experiences. To optimize both performance and storage efficiency, we disable the default indexing on all properties and selectively enable indexes only on fields required by our query patterns. Additionally, we leverage composite indexes where appropriate, which significantly enhance query performance as data volume grows within partitions.
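A hedged example of this approach, using Cosmos DB's indexing-policy shape (the property names are hypothetical): exclude everything by default, include only the paths queries actually filter or sort on, and add a composite index for a filter-plus-sort pattern.

```python
# Illustrative indexing policy: opt out of the index-everything default,
# then selectively index only what the query patterns need.

indexing_policy = {
    "indexingMode": "consistent",
    "includedPaths": [
        {"path": "/groupId/?"},
        {"path": "/createdAt/?"},
    ],
    "excludedPaths": [
        {"path": "/*"},  # disable the default "index all properties"
    ],
    "compositeIndexes": [
        [
            # serves: WHERE groupId = @id ORDER BY createdAt DESC
            {"path": "/groupId", "order": "ascending"},
            {"path": "/createdAt", "order": "descending"},
        ]
    ],
}
```

Narrowing the indexed paths reduces both the RU cost of writes (fewer index updates per document) and index storage, which compounds at petabyte scale.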
- Multi-write support – The multi-write capability is crucial for applications that require low-latency reads and writes from multiple regions and geographies, all while operating on the same globally distributed data store. We use this capability to store user and group routing information for effective global routing, enabling users across the globe to instantly discover chats, meetings, and other groups and start messaging. While using multi-writes, we recommend:
  - Maintaining regional affinity in the application as much as possible when performing writes, and
  - Avoiding patterns of rapid, repeated writes to the same documents, which complicate conflict resolution.
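The conflict concern is worth making concrete. With multi-region writes, concurrent updates to the same document in different regions can conflict; Cosmos DB's default resolution is last-writer-wins on a numeric path (the `_ts` server timestamp unless you configure another). A minimal sketch of both the policy shape and the resolution rule:

```python
# Cosmos DB's default conflict resolution policy shape for multi-write
# accounts: last-writer-wins on the server timestamp.
conflict_resolution_policy = {
    "mode": "LastWriterWins",
    "conflictResolutionPath": "/_ts",
}

def resolve_lww(doc_a, doc_b, path="_ts"):
    """Illustration of last-writer-wins: the higher value on the
    resolution path survives; the other write is silently discarded."""
    return doc_a if doc_a[path] >= doc_b[path] else doc_b

# Two regions update the same routing document almost simultaneously.
winner = resolve_lww(
    {"id": "u1", "homeRegion": "westus", "_ts": 100},
    {"id": "u1", "homeRegion": "eastus", "_ts": 105},
)
```

This is why rapid repeated writes to the same document are best avoided: under last-writer-wins, one of the concurrent updates is dropped, so the pattern only works when losing an intermediate write is acceptable.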
- Resilience by design – Building applications that can withstand regional outages is essential for mission-critical systems. Azure Cosmos DB offers several building blocks that support this goal, including:
- Automatic read hedging to route reads to healthy replicas even across regions
- Write redirection for multi-write accounts to alternate regions in case of failures
- Per-partition automatic failover (currently in preview) for more granular resilience
By adopting these capabilities, your application can continue to operate reliably even during regional disruptions. However, it’s important to design these mechanisms with your application’s consistency requirements in mind.
- Autoscaling for spiky traffic – We observe noticeable traffic spikes at the top and bottom of each hour, along with uneven load distribution across geographies throughout the day. If your application experiences similar patterns, enabling autoscaling—especially dynamic autoscaling (also known as per-region and per-partition autoscaling)—can help manage capacity more efficiently and cost-effectively. This approach ensures that your system continues to serve requests reliably, while Azure Cosmos DB automatically scales resources up or down based on actual demand. Importantly, it targets only the specific regions and partitions that require scaling. After rolling out dynamic autoscaling across several microservices, we observed cost savings ranging from 10% to 45%, depending on the workload.
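The autoscale cost model can be summarized in a short sketch (behavioral illustration, not an API call): you configure only a maximum RU/s, Cosmos DB scales instantly between 10% of that maximum and the maximum, and each hour is billed at the highest throughput the system scaled to, never below the 10% floor.

```python
# Illustration of the autoscale throughput range and hourly billing rule.

def autoscale_range(max_throughput):
    """Autoscale scales between 10% of the configured max and the max."""
    assert max_throughput >= 1000 and max_throughput % 1000 == 0
    return max_throughput // 10, max_throughput

def billed_throughput(observed_peak_rus, max_throughput):
    """An hour is billed at the peak RU/s used, clamped to the range."""
    lo, hi = autoscale_range(max_throughput)
    return min(max(observed_peak_rus, lo), hi)

low, high = autoscale_range(10_000)   # scales between 1,000 and 10,000 RU/s
```

This is why the savings are workload-dependent: a service that idles far below 10% of its configured maximum still pays the floor, while one with short hourly spikes, like the top-of-hour pattern described above, benefits the most.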
Overall, leveraging Azure Cosmos DB has enabled us to deliver reliable, real-time messaging at a global scale, meeting the dynamic needs of users worldwide. As we continue to evolve, the lessons learned from operating at scale guide our ongoing optimizations for performance and resilience.
Leave a review
Tell us about your Azure Cosmos DB experience! Leave a review on PeerSpot and we’ll gift you $50. Get started here.
About Azure Cosmos DB
Azure Cosmos DB is a fully managed and serverless NoSQL and vector database for modern app development, including AI applications. With its SLA-backed speed and availability as well as instant dynamic scalability, it is ideal for real-time NoSQL and MongoDB applications that require high performance and distributed computing over massive volumes of NoSQL and vector data.
To stay in the loop on Azure Cosmos DB updates, follow us on X, YouTube, and LinkedIn.
To quickly build your first database, watch our Get Started videos on YouTube and explore ways to dev/test free.