DMSC Working Group                                               H. Yang
Internet-Draft                                                     T. Yu
Intended status: Standards Track                                  Q. Yao
Expires: 4 September 2025                                       Z. Zhang
                      Beijing University of Posts and Telecommunications
                                                            3 March 2025


Microservice Communication Resource Scheduling for Distributed AI Model
                  draft-yang-dmsc-distributed-model-03

Abstract

   This document describes the architecture of microservice
   communication resource scheduling for distributed AI models.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 4 September 2025.

Copyright Notice

   Copyright (c) 2025 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.





Yang, et al.            Expires 4 September 2025                [Page 1]

Internet-Draft              DMSC Architecture                 March 2025


Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
   2.  Conventions used in this document . . . . . . . . . . . . . .   4
   3.  Terminology . . . . . . . . . . . . . . . . . . . . . . . . .   4
   4.  Scenarios and requirements  . . . . . . . . . . . . . . . . .   4
     4.1.  AI Microservice model scenario requirements . . . . . . .   5
     4.2.  Distributed Micro model Service Flow  . . . . . . . . . .   6
   5.  Key issues and challenges . . . . . . . . . . . . . . . . . .   6
     5.1.  Balancing Compute and Network Resources under
           Constraints . . . . . . . . . . . . . . . . . . . . . . .   7
     5.2.  Data Collaboration Challenges under Block Isolation . . .   7
   6.  Distributed solution based on model segmentation  . . . . . .   8
     6.1.  Business layer  . . . . . . . . . . . . . . . . . . . . .  12
       6.1.1.  Microservices and Micromodels . . . . . . . . . . . .  12
       6.1.2.  Microservice Gateway and API Gateway  . . . . . . . .  13
       6.1.3.  Service Registration and Discovery Center . . . . . .  13
     6.2.  Control layer . . . . . . . . . . . . . . . . . . . . . .  14
       6.2.1.  Task management module  . . . . . . . . . . . . . . .  14
       6.2.2.  Exception task queue module . . . . . . . . . . . . .  14
       6.2.3.  Log management system . . . . . . . . . . . . . . . .  15
       6.2.4.  Model segmentation interface  . . . . . . . . . . . .  15
       6.2.5.  Model segmentation module . . . . . . . . . . . . . .  15
       6.2.6.  Model segmentation scheduling . . . . . . . . . . . .  17
       6.2.7.  Model segmentation aggregation  . . . . . . . . . . .  18
     6.3.  Computing power layer . . . . . . . . . . . . . . . . . .  19
       6.3.1.  Calculation of micro-model parameters . . . . . . . .  19
       6.3.2.  Distributed computing power parameter update  . . . .  20
       6.3.3.  Distributed unified Collaboration module  . . . . . .  21
       6.3.4.  Load balancing and resource allocation mechanism  . .  22
       6.3.5.  Computing power execution module  . . . . . . . . . .  22
       6.3.6.  Data management module  . . . . . . . . . . . . . . .  22
       6.3.7.  Fault tolerance and recovery module . . . . . . . . .  23
       6.3.8.  Computing resource pool . . . . . . . . . . . . . . .  23
     6.4.  Data layer  . . . . . . . . . . . . . . . . . . . . . . .  23
       6.4.1.  Privacy protection  . . . . . . . . . . . . . . . . .  24
       6.4.2.  Database maintenance and update . . . . . . . . . . .  24
   7.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  25
   8.  Acknowledgement . . . . . . . . . . . . . . . . . . . . . . .  25
   9.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  25
     9.1.  Normative References  . . . . . . . . . . . . . . . . . .  25
     9.2.  Informative References  . . . . . . . . . . . . . . . . .  25
   Appendix A.  An Appendix  . . . . . . . . . . . . . . . . . . . .  26
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  26







Yang, et al.            Expires 4 September 2025                [Page 2]

Internet-Draft              DMSC Architecture                 March 2025


1.  Introduction

   The architecture of microservice communication resource scheduling
   for distributed AI models is a structured framework designed to
   address the challenges of scalability, flexibility, and efficiency in
   modern AI systems.  By integrating model segmentation, micro-model
   deployment, and microservice orchestration, this architecture enables
   the effective allocation and management of computing resources across
   distributed environments.  The primary focus lies in leveraging model
   segmentation to decompose large AI models into smaller, modular
   micro-models, which are executed collaboratively across distributed
   nodes.

   The architecture is organized into four tightly integrated layers,
   each with distinct roles and responsibilities that together ensure
   seamless functionality:

   Business Layer: This layer acts as the interface between the user-
   facing applications and the underlying system.  It encapsulates AI
   capabilities as microservices, enabling modular deployment, elastic
   scaling, and independent version control.  By routing user requests
   through service gateways, it ensures efficient interaction with back-
   end micro-models while balancing workloads.  The business layer also
   facilitates collaboration between multiple micro-models, allowing
   them to function as part of a cohesive distributed system.

   Control Layer: The control layer is the central coordination hub,
   responsible for task scheduling, resource allocation, and the
   implementation of model segmentation strategies.  It decomposes large
   AI models into smaller, manageable components, assigns tasks to
   specific nodes, and ensures synchronized execution across distributed
   environments.  This layer dynamically balances compute and network
   resources while adapting to system demands, ensuring high efficiency
   for training and inference workflows.

   Computing Layer: As the execution core, this layer translates the
   decisions made by the control layer into distributed computation.  It
   executes segmented micro-models on diverse hardware resources such as
   GPUs, CPUs, and accelerators, optimizing parallelism and fault
   tolerance.  By coordinating with the control layer, it ensures that
   tasks are executed efficiently while leveraging distributed
   orchestration frameworks to handle diverse workloads.









Yang, et al.            Expires 4 September 2025                [Page 3]

Internet-Draft              DMSC Architecture                 March 2025


   Data Layer: The data layer underpins the entire system by managing
   secure storage, access, and transmission of data.  It provides the
   necessary datasets, intermediate results, and metadata required for
   executing segmented micro-models.  Privacy protection mechanisms,
   such as federated learning and differential privacy, ensure data
   security and compliance, while distributed database operations
   guarantee consistent access and high availability across nodes.

   At the heart of this architecture is model segmentation, which serves
   as the foundation for effectively distributing computation and
   optimizing resource utilization.  The control layer breaks down
   models into smaller micro-models using strategies such as layer-
   based, business-specific, or block-based segmentation.  These micro-
   models are then deployed as independent services in the business
   layer, where they are dynamically scaled and orchestrated to meet
   real-time demands.  The computing layer executes these tasks using
   parallel processing techniques and advanced scheduling algorithms,
   while the data layer ensures secure and efficient data flow to
   support both training and inference tasks.

   By tightly integrating these layers, the architecture addresses
   critical challenges such as balancing compute and network resources,
   synchronizing distributed micro-models, and minimizing communication
   overhead.  This cohesive design enables AI systems to achieve high
   performance, scalability, and flexibility across dynamic and
   resource-intensive workloads.

   This document outlines the design principles, key components, and
   operational advantages of the microservice communication resource
   scheduling architecture for distributed AI models, emphasizing how
   model segmentation, micro-models, and microservices form the
   foundation for scalable and efficient distributed AI systems.

2.  Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

3.  Terminology

   TBD

4.  Scenarios and requirements







Yang, et al.            Expires 4 September 2025                [Page 4]

Internet-Draft              DMSC Architecture                 March 2025


4.1.  AI Microservice model scenario requirements

   As artificial intelligence technology evolves at an accelerated
   pace, the scale and intricacy of AI models continue to expand.
   Traditional monolithic applications and centralized inference and
   training deployments are progressively becoming inadequate to meet
   swiftly changing business demands.
   Encapsulating AI capabilities within a microservices architecture can
   confer substantial advantages in terms of system flexibility,
   scalability, and service governance.  By decoupling models through
   microservices, an independent AI model service can circumvent
   potential bottlenecks that arise from deep coupling with other
   business logic components, and it can also achieve elastic scaling
   during surges in requests or training loads.  Given the rapid
   iteration and upgrade cycles of AI models, a microservice
   architecture facilitates the coexistence of multiple model versions,
   enables gray-scale releases, and supports rapid rollbacks, thereby
   minimizing the impact on the overall system.

   The computing power requirements of AI microservice models are
   often extremely demanding.  On the one hand, the training or
   inference process usually involves massive data processing and
   high-density parallel computing, requiring the collaborative work
   of heterogeneous hardware resources such as GPUs, CPUs, FPGAs, and
   NPUs.  On the other hand, when the model is large or the request
   volume is high, the computing power of a single machine is often
   insufficient to meet business needs; parallel computing must then
   be performed across multiple nodes in a distributed mode, and
   resources should be released during idle periods to improve
   utilization.  Such distributed training or inference typically
   relies on efficient communication strategies to synchronize model
   parameters or gradients, and collectives such as AllReduce or
   All-to-All are often used to reduce communication overhead and
   ensure model consistency.
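
   The following non-normative Python sketch illustrates how such a
   gradient synchronization step could look with an AllReduce
   collective; the use of torch.distributed, the NCCL backend, and the
   tensor shapes are illustrative assumptions rather than requirements
   of this architecture.

      # Non-normative sketch: gradient synchronization via AllReduce.
      # Assumes a torch.distributed process group has already been
      # initialized (for example with the NCCL backend) and that every
      # worker holds a gradient tensor of identical shape.
      import torch
      import torch.distributed as dist

      def sync_gradients(grad: torch.Tensor, world_size: int):
          # Sum the local gradients of all workers in place ...
          dist.all_reduce(grad, op=dist.ReduceOp.SUM)
          # ... and average them so every worker applies the same update.
          grad /= world_size
          return grad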

   In a distributed system, the network plays a crucial role.  A large
   number of model parameters and gradients must be exchanged
   frequently during computation, which places high demands on network
   bandwidth and latency.  In large-scale cluster scenarios, the
   design of the network topology and the choice of communication
   framework cannot be ignored.  Only in a high-bandwidth, low-latency
   network environment, combined with an appropriate communication
   library (such as NCCL or MPI), can the cluster fully exploit its
   computing potential and prevent communication from becoming the
   bottleneck of global performance.







Yang, et al.            Expires 4 September 2025                [Page 5]

Internet-Draft              DMSC Architecture                 March 2025


4.2.  Distributed Micro model Service Flow

   In the microservice communication resource scheduling architecture
   for distributed AI models, the core of the business process is how
   to realize the multi-node placement and collaborative operation of
   the model so as to ensure efficient parameter synchronization and
   communication.  Typically, a model is trained and evaluated with a
   deep learning framework during development and is then
   containerized, packaging the model and its dependencies into an
   image that can be deployed as an independent service.  These
   encapsulated model services are then registered with the system's
   microservice management platform for subsequent unified scheduling
   and access.

   Micro-models are deployed to a distributed cluster, where computing
   power orchestration and resource scheduling allocate computing
   resources such as GPUs or CPUs according to real-time load,
   business priority, and hardware topology, and container
   orchestration tools (such as Kubernetes) start the corresponding
   service instances on each node.  When distributed cooperation is
   needed, frameworks such as NCCL or Horovod are used to complete
   inter-process communication.  Requests from upper business systems
   or users usually arrive at the API gateway or service gateway first
   and are then distributed to the target service instances according
   to load balancing or other routing policies.  If distributed
   inference is needed, multiple nodes cooperate to perform inference
   over the segmented model and aggregate the results, and the final
   inference result is returned to the requester.  In this process,
   real-time monitoring and elastic scaling mechanisms play an
   important role in ensuring system stability and optimizing resource
   utilization.  At the monitoring level, through a unified data
   acquisition and analysis platform, the system can track core
   indicators such as the GPU utilization, network traffic, and
   request latency of each service node, so as to raise timely alarms
   in case of failures, performance bottlenecks, or insufficient
   resources, and to perform automatic failover or take nodes offline.
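
   As a non-normative illustration of the request routing step, the
   Python sketch below shows how a gateway-style dispatcher might
   select the least-loaded service instance obtained from a discovery
   client; the instance fields, load metric, and endpoint path are
   illustrative assumptions only.

      # Non-normative sketch: load-aware routing of an inference request
      # at a gateway.  The instance record, load metric, and endpoint
      # path are illustrative assumptions, not part of this document.
      from dataclasses import dataclass

      @dataclass
      class Instance:
          host: str
          port: int
          load: float        # e.g., reported GPU utilization in [0, 1]

      def pick_instance(instances):
          # Least-loaded routing; a gateway could instead use round
          # robin, weighted, or hash-based policies.
          return min(instances, key=lambda inst: inst.load)

      def target_url(instances):
          inst = pick_instance(instances)
          # The gateway would forward the request payload to this URL
          # (for example over HTTP or gRPC).
          return "http://%s:%d/v1/infer" % (inst.host, inst.port)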

   In addition, the distributed micro-model business flow needs to be
   combined with a data backflow mechanism.  The large volume of logs,
   user feedback, and interaction information generated during
   inference can be further used to train new models or to optimize
   the performance of existing models, provided that it is returned to
   the data platform in a manner that satisfies privacy and compliance
   requirements.

5.  Key issues and challenges







Yang, et al.            Expires 4 September 2025                [Page 6]

Internet-Draft              DMSC Architecture                 March 2025


5.1.  Balancing Compute and Network Resources under Constraints

   With the continuous growth of AI model size and business demand,
   the computing power of a single node or a single cluster is often
   insufficient to support high-intensity training and inference
   tasks, leading to computing power shortages or sharply rising
   costs.  Coordinating computing resources across multiple nodes and
   regions through a distributed architecture can improve overall
   efficiency and fault tolerance to a certain extent.  However,
   distributed deployment also brings higher complexity: it must
   account not only for the differences among heterogeneous hardware
   (such as GPUs, CPUs, and FPGAs), but also for balancing the
   allocation of computing power under different network topologies
   and bandwidth conditions.

   When computing and network resources are scarce, computing power
   must be dynamically scheduled and allocated according to business
   priority, model scale, and real-time load, combining strategic
   queuing, elastic scaling, and cross-cluster resource collaboration
   to improve overall service efficiency.  In this process, the model
   partitioning/parallelism scheme plays a key role.  On the one hand,
   the model can be decomposed across multiple nodes by means of
   "tensor segmentation" or "computation pipelining", with each node
   responsible only for a specific submodule or slice.  On the other
   hand, for inference scenarios, the input data can flow through a
   series of model microservice nodes to form a pipelined processing
   mode, making full use of scattered computing resources.  By
   splitting the model for parallel execution in this way, the system
   not only avoids excessive computing pressure on a single server,
   but also maximizes the use of the GPU/CPU computing power of idle
   nodes when network resources permit, achieving balance and
   optimization between computing and network resources.
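
   The tensor-segmentation idea can be illustrated with a minimal,
   non-normative sketch: the weight matrix of a linear operator is
   split column-wise so that each node computes only its own slice,
   and the partial outputs are concatenated to reproduce the full
   result.  The use of NumPy and the specific shapes are illustrative
   assumptions.

      # Non-normative sketch: column-wise tensor segmentation of a
      # linear layer across several nodes.  NumPy stands in for the
      # per-node computation; shapes are illustrative.
      import numpy as np

      def split_weight(weight, parts):
          # Each shard would be held by a different node.
          return np.array_split(weight, parts, axis=1)

      def parallel_matmul(x, shards):
          # Every node computes its partial output x @ shard; the
          # concatenation equals the full product x @ weight.
          return np.concatenate([x @ w for w in shards], axis=1)

      x = np.random.rand(8, 128)          # a batch of activations
      weight = np.random.rand(128, 256)   # full weight matrix
      shards = split_weight(weight, parts=4)
      assert np.allclose(parallel_matmul(x, shards), x @ weight)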

5.2.  Data Collaboration Challenges under Block Isolation

   In many distributed systems, large-scale data is usually split into
   multiple data blocks that are stored and processed separately.
   Although this improves data security and processing efficiency, it
   also creates challenges for data coordination.  When multiple nodes
   or microservice modules need to share or exchange data, the
   interfaces and call sequences must be defined in advance, and
   consistency and concurrency control must be managed.  Especially
   when different data blocks have cross-node dependencies, how to
   schedule, load, and distribute data effectively becomes one of the
   key bottlenecks for system scalability and computational
   efficiency.





Yang, et al.            Expires 4 September 2025                [Page 7]

Internet-Draft              DMSC Architecture                 March 2025


   A key difficulty lies in synchronizing data across distributed nodes
   while minimizing latency and avoiding bottlenecks.  Cross-node
   dependencies require precise scheduling to ensure data arrives at the
   correct location and time without conflicts.  As the scale of data
   and the number of nodes grow, the management overhead for maintaining
   these dependencies can increase exponentially, particularly when
   network bandwidth or latency constraints exacerbate delays.
   Additionally, ensuring data consistency across multiple data blocks
   during concurrent access or updates adds another layer of complexity.
   High levels of concurrency can increase the risk of inconsistencies,
   data races, and synchronization issues, demanding advanced mechanisms
   to enforce data integrity.

   Traditional distributed communication strategies, such as AllReduce
   and All-to-All, are widely used and remain effective in addressing
   certain data collaboration needs in training and inference tasks.
   For example, AllReduce is well-suited for data parallel scenarios,
   where all nodes compute on the same model with different data splits,
   and gradients or weights are synchronized via aggregation and
   broadcast.  Similarly, All-to-All is valuable in more complex
   distributed tasks that require frequent intermediate data exchanges
   across nodes.  However, these methods are not without limitations.
   As data and system complexity grow, they can lead to increased
   communication overhead, especially in scenarios where synchronization
   is uneven or poorly timed.
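
   As a non-normative illustration of the All-to-All pattern, the
   sketch below uses torch.distributed.all_to_all to let every rank
   exchange one chunk of intermediate data with every other rank; the
   initialized process group and equal chunk sizes are assumptions
   made for brevity.

      # Non-normative sketch: All-to-All exchange of intermediate data.
      # Assumes a torch.distributed process group has been initialized
      # and that all chunks have the same shape.
      import torch
      import torch.distributed as dist

      def exchange(local_chunks):
          # Rank i sends local_chunks[j] to rank j and receives one
          # chunk from every rank in return.
          received = [torch.empty_like(c) for c in local_chunks]
          dist.all_to_all(received, local_chunks)
          return received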

   The effectiveness of traditional methods relies on fine tuning and
   precise execution.  Improper timing of data exchange can lead to long
   waiting times, underutilization of resources, and even data mismatch.
   Although approaches such as AllReduce and All-to-All provide reliable
   communication frameworks, their scalability and efficiency are often
   limited by challenges such as synchronization across nodes, network
   variations, and system heterogeneity.  Therefore, there is a need for
   continuous improvement and innovation in distributed communication
   and data collaboration strategies to overcome the challenges posed by
   block isolation.

6.  Distributed solution based on model segmentation

   Based on these key problems and challenges, a microservice
   communication resource scheduling architecture for distributed AI
   models is proposed.  It is divided into four layers: the business
   layer, the control layer, the computing layer, and the data layer.
   The hierarchical relationship is shown in Figure 1, and the overall
   architecture is shown in Figure 2.  The functional modules realize
   soft collaboration at the control layer and hard isolation at the
   data layer, as shown in Figure 3.



Yang, et al.            Expires 4 September 2025                [Page 8]

Internet-Draft              DMSC Architecture                 March 2025


    ---------------------------------
   |          Business layer         |
   |                 |               |
   |           Control layer         |
   |                 |               |
   |          Computing layer        |
   |                 |               |
   |             Data layer          |
    ---------------------------------










































Yang, et al.            Expires 4 September 2025                [Page 9]

Internet-Draft              DMSC Architecture                 March 2025


 -----------------------------------------------------------------------------------------------------------------------------------------------------------
|                       -----------      -----------                                      -----------      -----------                                      |
|                      |Service A/1|    |Service B/1|                                    |Service A/2|    |Service B/2|                                     |
|                       -----|-----      -----|-----                                      -----|-----      -----|-----                                      |
|                            |                |                                                |                |                                           |
|                            |                |                                                |                |                                           |
|                       -----------------------------                                    -----------------------------                                      |
|                      |  Microservices Gateway -1   |                                  |  Microservices Gateway -2   |                                     |
|                       ------------|----------------                                    -----------|-----------------                                      |
|                                   |                                                               |                                                       |
|                              -----|-----                                                     -----|-----                                                  |
|                             | Interface |                                                   | Interface |                                                 |
|                             | address 1 |- - - - - - - - - - - - - - - - - - - - - - - - - -| address 2 |----------------------------------               |
|                              -----\-----                                                     -----/-----            Address caching        |              |
|                                     \                                                            /                                         |              |
|                                       \                                                        /                                           |              |
|           --------------------        --\-------------                          -------------/--       --------------------                |              |
|          | Functional modules |------| Service Router |------------------------| Service Router |-----| Functional modules |               |              |
|           --------------------        -------\--------                          --------/-------       --------------------                |              |
|                                                \                                      /                                                    |              |
|                                                  \                                  /                                            ----------------------   |
|                                                    \                              /                                --------     | Service Registration |  |
|                                                      \                          /                                 |  Feign | ---| and Discovery Centre |  |
|                                                        \                      /                                    --------      ----------------------   |
|                                                          \                  /                                                              |              |
|                                                            \              /                                                                |              |
|                                --------------------        --\----------/--                                                                |              |
|                               | Functional modules |------| Service Router |                                                               |              |
|                                --------------------        --------|-------                                                                |              |
|                                                                    |                                                                       |              |
|                                                                    |                                                                       |              |
|                                                               -----|-----                                                                  |              |
|                                                              | Interface |                                           Address caching       |              |
|                                                              | address 3 |-----------------------------------------------------------------               |
|                                                               -----|-----                                                                                 |
|                                                                    |                                                                                      |
|                                                        ------------|----------------                                                                      |
|                                                       |  Microservices Gateway -3   |                                                                     |
|                                                        -----------------------------                                                                      |
|                                                             |                |                                                                            |
|                                                             |                |                                                                            |
|                                                        -----|-----      -----|-----                                                                       |
|                                                       |Service A/3|    |Service B/3|                                                                      |
|                                                        -----------      -----------                                                                       |
|                                                                                                                                                           |
|                                                                                                                                                           |
 -----------------------------------------------------------------------------------------------------------------------------------------------------------




Yang, et al.            Expires 4 September 2025               [Page 10]

Internet-Draft              DMSC Architecture                 March 2025


                                           RPC  | REST API
                                                |
 -----------------------------------------------|---------------------------------------
|                      -|-* * *-----------------|---------------------|-                |
|                     |        Task         management      module      |               |
|                      -|---|-------------------|-----------------------                |
|                       |   |                   |                                       |
|                 ------    |                   |                                       |
|                |          |                   |                                       |
|     ---|-* * *-|-|--      |         -|-* * *--|---|-----|---                          |
|    |  Asynchronous  |     |        |  AI Model Segmentation |                         |
|    |   task queue   |     |        |     and aggregation    |                         |
|    |     module     |     |        |          module        |                         |
|     ---|-* * *-|-|--      |         -|-* * *--|---|-----|---                          |
|                |          |          |        |   |     |                             |
|                |          |  --------         |   |     |                             |
|                |          | |                 |   |      -------------                |
|               -|-* * *----|-|-                |   |                   |               |
|              | Log management |               |  -|-* * *--|-|---    -|-* * *---|-|-  |
|              |    system      |               | | Fault-tolerant |  | Model storage | |
|               ----------------                | |    mechanism   |  |     module    | |
|                                               |  ----------------    -|-* * *---|-|-  |
|                                               |                                       |
|    Control layer                              |           (Soft collaboration)        |
------------------------------------------------|---------------------------------------
                                                |
 -----------------------------------------------|---------------------------------------
|                                               |                                       |
|                                               |                                       |
|                                     -|-* * *--|-----|-|--                             |
|                                    |     Distributed     |                            |
|                                    | unified cooperation |                            |
|                                    |       module        |                            |
|                                     -|-* * *---|----|-|--                             |
|                                                |                                      |
|                                     -|-* * *---|----|-|--                             |
|                                    |   Load balancing    |                            |
|                                    |    and resource     |                            |
|                                    |allocation mechanism |                            |
|                                     -|-* * *---|----|-|--                             |
|                                                |                                      |
|                                     -|-* * *---|----|-|--                             |
|                                    |   Computing power   |                            |
|                                    |  execution  module  |                            |
|                                     -|-* * *---|----|-|--                             |
|                                                |    |                                 |
|                                                |     -------------                    |
|                                                |                  |                   |



Yang, et al.            Expires 4 September 2025               [Page 11]

Internet-Draft              DMSC Architecture                 March 2025


|                                     -|-* * *---|----|-|--        -|-* * *----|-|-     |
|                                     |         Data       |      | Fault tolerance|    |
|                                     |  management module |      |  and recovery  |    |
|                                     -|-* * *---|----|-|--       |     module     |    |
|                                                |                 ----------------     |
|                                     -|-* * *---|----|-|--                             |
|                                    |   Computing power   |                            |
|                                    |    resource pool    |                            |
|                                     -|-* * *---|----|-|--                             |
|                                                |                                      |
|  Computing layer                               |                                      |
|                                                |                                      |
 ------------------------------------------------|--------------------------------------
 ------------------------------------------------|--------------------------------------
|                                                | Packing data                         |
|                                                |                                      |
|                                       -|-* * *-|-|-                                   |
|       Data layer                     |   Database  |                                  |
|                                       -------------               (Hard isolation)    |
 ---------------------------------------------------------------------------------------

6.1.  Business layer

   The business layer is the core of the whole system and hosts the
   main business logic and microservice components.  It interacts with
   the user-side front-end presentation layer, receives requests from
   various channels, processes them according to models or business
   rules, and returns the results to the upper layer or synchronizes
   them to other microservices.  Typically, the business layer is
   deployed on a microservice container platform (such as Kubernetes),
   with a microservice gateway or API gateway and a service
   registration and discovery center maintaining communication and
   load balancing between microservices.  Internal communication can
   use RPC, REST APIs, or Feign-based remote calls.

6.1.1.  Microservices and Micromodels

   Microservices and micro-models are realized by multiple services
   (e.g., "Service 1", "Service 2", "Service 3", up to "Service n")
   that invoke each other at the business and logical layers.  Each
   service encapsulates a separate model or a functional slice of a
   model, and when these services communicate with each other via RPC,
   REST APIs, or an internal event bus, the overall effect of
   distributed micro-model coordination is formed.  Through the
   service registration and discovery center, these micro-models can
   automatically discover each other's available instances when
   needed, so as to flexibly scale and balance computing power and
   network resources in large-scale concurrent scenarios.



Yang, et al.            Expires 4 September 2025               [Page 12]

Internet-Draft              DMSC Architecture                 March 2025


6.1.2.  Microservice Gateway and API Gateway

   The microservice gateway and the API gateway provide traffic
   scheduling and a unified entry point for the business layer.  The
   microservice gateway mainly serves internal service calls; through
   load balancing, routing rules, and security policy configuration,
   it makes the communication between business modules more efficient
   and stable.

   The API gateway faces external clients or the front-end layer,
   providing users with a consistent HTTP or gRPC interface.  At the
   same time, it is responsible for authentication, rate limiting,
   circuit breaking, and monitoring, ensuring that the impact of a
   surge of external requests on internal services remains
   controllable.

6.1.3.  Service Registration and Discovery Center

   The service registration and discovery center records the network
   address, version information, and health status of all available
   microservices in the system, so that other modules or gateways in
   the business layer can promptly find the correct target instance
   when they need to call a microservice.  For example, in a "real-
   time recommendation and user behavior analysis" business, when the
   "user portrait generation" microservice needs to be called, the
   system first queries the registration and discovery center for the
   service's load status and list of available instances, and then
   selects an appropriate node according to the load balancing
   strategy.  This not only prevents single points of failure, but
   also automatically updates routing information as microservice
   instances are added or removed.

   The service registration and discovery center frees business
   function modules from manually maintaining complex service
   addresses or dependencies.  Each microservice only needs to
   actively register its own information after startup, and when an
   instance goes offline or crashes, the registry updates its state
   accordingly.  Common implementations include Eureka, Consul, and
   Zookeeper.  These registration and discovery centers can be deeply
   integrated with microservice gateways or load balancing layers to
   achieve highly available governance in distributed environments.

   Each service registers its interface address with the registry, and
   a calling service looks up the interface address of the target
   service through the registry before initiating the call.  The
   interface calls themselves are made peer to peer; although a
   registry is present, it only plays the role of controlling the
   flow.
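
   The register/discover flow described above can be summarized with a
   minimal, non-normative sketch; real deployments would rely on
   Eureka, Consul, Zookeeper, or a comparable system, and the class
   and method names below are illustrative assumptions.

      # Non-normative sketch: a minimal in-memory service registry
      # illustrating registration, deregistration, and discovery.
      class ServiceRegistry:
          def __init__(self):
              self._instances = {}   # service name -> list of addresses

          def register(self, name, address):
              self._instances.setdefault(name, []).append(address)

          def deregister(self, name, address):
              if address in self._instances.get(name, []):
                  self._instances[name].remove(address)

          def discover(self, name):
              # The caller applies its own load-balancing policy.
              return list(self._instances.get(name, []))

      # A "user portrait generation" instance registers on startup and
      # is later discovered by a calling service.
      registry = ServiceRegistry()
      registry.register("user-portrait", "10.0.0.12:8080")
      candidates = registry.discover("user-portrait")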





Yang, et al.            Expires 4 September 2025               [Page 13]

Internet-Draft              DMSC Architecture                 March 2025


6.2.  Control layer

   The control layer is mainly responsible for scheduling and managing
   the various tasks and resources in the distributed AI system,
   including task creation, allocation, and exception handling, as
   well as key processes such as model segmentation, training, and
   aggregation.  Through careful design of the control layer, multiple
   models can run in parallel with high efficiency and high
   availability, and timely rescheduling and fault tolerance can be
   applied when computing power is insufficient.

6.2.1.  Task management module

   The task management module is the "hub" of the control layer.  It
   receives different types of task requests from the business layer
   or data layer, such as model training, model inference, or batch
   data processing, and allocates tasks to nodes for execution
   according to real-time load conditions and computing power resource
   information.  The task management module usually maintains a task
   queue or task priority queue and sorts tasks by an FCFS (first
   come, first served) or FIFO (first in, first out) policy or by a
   weight-based scheduling policy.  At the same time, the module
   interfaces internally with a service registration and discovery
   center or a resource orchestration system (e.g., Kubernetes) to
   dynamically obtain key metrics such as the health, bandwidth, and
   memory usage of the available nodes (GPU/CPU).  Some advanced
   implementations also use load balancing strategies or node affinity
   algorithms to choose the best location for tasks and trigger auto-
   scaling or resource recycling when the overall cluster load reaches
   a threshold.
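
   A non-normative sketch of the node-selection step is given below;
   the metrics mirror the ones mentioned above (health, bandwidth,
   memory usage), but the exact field names and thresholds are
   illustrative assumptions.

      # Non-normative sketch: selecting an execution node from metrics
      # that a resource orchestrator might report.  Field names are
      # illustrative assumptions.
      from dataclasses import dataclass

      @dataclass
      class NodeStatus:
          name: str
          healthy: bool
          free_gpu_mem_mib: int
          bandwidth_mbps: int

      def select_node(nodes, min_mem_mib):
          candidates = [n for n in nodes if n.healthy
                        and n.free_gpu_mem_mib >= min_mem_mib]
          if not candidates:
              raise RuntimeError("no node satisfies the requirements")
          # Prefer the node with the most free accelerator memory,
          # breaking ties by available bandwidth.
          best = max(candidates,
                     key=lambda n: (n.free_gpu_mem_mib, n.bandwidth_mbps))
          return best.name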

6.2.2.  Exception task queue module

   The exception task queue module plays the role of a "fault buffer",
   capturing and storing exceptions that occur during the execution of
   tasks.  In distributed AI systems, network jitter, node failure, or
   data exceptions often cause some tasks to fail or hang for a long
   time.  The exception task queue module is designed to collect and
   isolate these abnormal tasks so that they do not block the main
   task queue or affect overall performance.  The module continuously
   monitors error logs and timeouts during the training or inference
   process.  When an exception is found, the detailed information of
   the corresponding task (e.g., task ID, exception type, and
   execution log) is transferred to a separate exception queue and
   recorded in the fault tracking system.
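
   The quarantine behaviour can be sketched in a few non-normative
   lines; the record fields follow the examples in the text (task ID,
   exception type, execution log) and are otherwise illustrative
   assumptions.

      # Non-normative sketch: quarantining a failed task so that it
      # does not block the main task queue.
      import time
      from collections import deque

      exception_queue = deque()

      def quarantine(task_id, exception_type, log_excerpt):
          exception_queue.append({
              "task_id": task_id,
              "exception_type": exception_type,
              "log": log_excerpt,
              "recorded_at": time.time(),
          })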







Yang, et al.            Expires 4 September 2025               [Page 14]

Internet-Draft              DMSC Architecture                 March 2025


6.2.3.  Log management system

   The log management module is responsible for tracking all critical
   operations and events during distributed training, inference, and
   scheduling.  This module usually uses a centralized log storage and
   analysis framework so that log data can be efficiently retrieved
   and aggregated even when the system is large.  It not only records
   the timestamps and execution results of events such as model
   segmentation, computing power allocation, and communication
   synchronization, but also collects the hardware metrics (such as
   GPU utilization, memory usage, and I/O throughput) of each node
   during execution.  When failure symptoms or performance bottlenecks
   are detected in the logs, such as slow training or frequent node
   timeouts, the log management module pushes the information to the
   exception task queue module or an alert system, helping the
   operations and development teams to diagnose and troubleshoot in
   time.  Through centralized management and visual analysis of log
   data, it also provides a reliable data basis for subsequent model
   optimization, resource budgeting, and business decision-making.

6.2.4.  Model segmentation interface

   This interface is mainly used to receive configuration information
   related to segmentation strategy or algorithm.  Through this
   interface, the caller (e.g., a task management module, a business
   layer, or a scheduling system) can specify the splitting mode (per
   layer, per service, per block, etc.) and the corresponding parameter
   restrictions for each policy, such as the range of the number of
   layers to be split, the heuristic rules of the tabu search algorithm,
   the number of shared layers for multiple tasks, and the privacy
   protection requirements.  The interface is typically provided in the
   form of a REST API, gRPC, RPC, or messaging middleware, giving the
   upstream system the flexibility to send or update policies.
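
   As a non-normative example, a segmentation-policy request passed
   through this interface could resemble the following structure;
   every field name and value is an illustrative assumption rather
   than a normative schema.

      # Non-normative sketch of a segmentation-policy request body that
      # an upstream system might send over REST, gRPC, or a message bus.
      segmentation_policy = {
          "mode": "layer",            # "layer" | "business" | "block"
          "layer_range": [2, 8],      # candidate split points
          "search": {
              "algorithm": "tabu",    # heuristic for picking split points
              "max_iterations": 200,
          },
          "shared_layers": 3,         # layers shared by multiple tasks
          "privacy": {
              "differential_privacy": True,
              "epsilon": 1.0,
          },
      }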

6.2.5.  Model segmentation module

   Model segmentation is a key innovation in distributed AI
   architectures, offering a more efficient and flexible way to allocate
   computational resources and manage workloads.  Within the control
   layer, segmentation strategies are carefully selected based on
   specific objectives, such as improving parallelism, optimizing
   resource utilization, or meeting privacy requirements.  These
   strategies are tightly integrated into the system, with each
   segmented component packaged as a modular microservice to ensure
   seamless deployment and operation in distributed environments.
   Figure 4 shows the framework of the model segmentation and
   aggregation module.



Yang, et al.            Expires 4 September 2025               [Page 15]

Internet-Draft              DMSC Architecture                 March 2025


   Layer-based segmentation divides a model according to its structural
   hierarchy, segmenting the network layer by layer.  Each resulting
   sub-model, typically consisting of one or more layers, is assigned to
   different nodes for parallel execution.  This method is particularly
   effective for deep neural networks with significant depth and
   computational complexity.  For example, in a deep convolutional
   neural network (CNN) for image classification, the initial
   convolutional layers responsible for extracting features might be
   executed on Node A, the intermediate fully connected layers on Node
   B, and the output classification layer on Node C.  To enhance
   efficiency, heuristic or tabu search algorithms can determine optimal
   segmentation points by considering factors like computational load,
   inter-node communication overhead, and overall network latency.  This
   strategy is especially valuable in real-time inference scenarios,
   such as autonomous driving, where computational throughput and low
   latency are critical for decision-making.
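
   A minimal, non-normative sketch of layer-based segmentation is
   shown below; the example network, the split points, and the mapping
   of segments to Nodes A, B, and C are illustrative assumptions.

      # Non-normative sketch: layer-based segmentation of a sequential
      # model into three sub-models.  Layer sizes and cut points are
      # illustrative only (input assumed to be 3x32x32 images).
      import torch.nn as nn

      full_model = nn.Sequential(
          nn.Conv2d(3, 16, 3), nn.ReLU(),           # feature extraction
          nn.Flatten(), nn.Linear(16 * 30 * 30, 64), nn.ReLU(),
          nn.Linear(64, 10),                        # classification head
      )

      def split_by_layers(model, cut_points):
          # Each returned nn.Sequential can be packaged and deployed as
          # an independent micro-model on its own node.
          layers = list(model)
          bounds = [0, *cut_points, len(layers)]
          return [nn.Sequential(*layers[a:b])
                  for a, b in zip(bounds, bounds[1:])]

      node_a, node_b, node_c = split_by_layers(full_model,
                                               cut_points=[2, 5])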

   Business segmentation is usually applied to multi-task learning
   scenarios, where the same "backbone" model is derived into several
   sub-models (or sub-tasks) according to business requirements, and the
   co-training or inference of multiple tasks is realized by sharing
   part of the network structure or parameters.  For example, an
   e-commerce platform may care about recommendations, ad click
   prediction, and user personas at the same time; these requirements
   can be split into different "branches" on top of the "common part"
   of the same model, which share the feature extraction layers while
   each has its own task-specific output or fine-tuning layers.

   Block-based segmentation provides maximum flexibility by dividing the
   model into smaller, independent chunks of computation that can be
   executed on separate nodes.  Unlike layer-based or business-based
   segmentation, this approach does not adhere to the structural
   hierarchy or task boundaries of the model.  Instead, it focuses on
   resource adaptability and efficient computation in heterogeneous
   environments.  For example, in a federated learning system for
   healthcare, hospitals can train local model blocks on sensitive
   patient data.  These blocks perform their computations securely on
   site, and only encrypted intermediate results are aggregated
   globally.  Similarly, in high-density cloud environments, block-
   based segmentation can dynamically allocate computational tasks to
   the available hardware.

   In addition to the above common segmentation methods, for scenarios
   that need to take into account data privacy or compliance
   requirements, privacy protection logic can also be built into the
   segmentation strategy, such as putting sensitive data related
   calculations into a separate secure node, or performing differential
   privacy processing on gradient information and then aggregating.



Yang, et al.            Expires 4 September 2025               [Page 16]

Internet-Draft              DMSC Architecture                 March 2025


   Through the multi-level and multi-angle model segmentation scheme,
   the control layer can maximize the use of distributed computing
   power, and flexibly schedule AI tasks in a multi-business and multi-
   data source environment.

 ---------------------------------------------------------------------------
|                                -----------------------------------------  |
|    --|-* * *---------|-|--    | Task requests are collected and stored  | |
|   | AI Model Segmentation |   |                       |                 | |
|   |    and aggregation    | --|      The feature algorithm extracts     | |
|   |         module        |   |           the generated features        | |
|    --|-* * *---------|-|--    |                       |                 | |
|                               |          The data matching algorithm    | |
|                               |           performs the task grouping    | |
|    -----------------------    |                       |                 | |
|   | Layer segmentation    |   |                 Model training          | |
|   | Business segmentation |---|                       |                 | |
|   | Block segmentation    |   |         Model parameter aggregation     | |
|    -----------------------     -----------------------------------------  |
 ---------------------------------------------------------------------------

6.2.6.  Model segmentation scheduling

   After model segmentation, the control layer undertakes the key task
   of scheduling the execution of the segmented sub-models.  Scheduling
   is more than just assigning tasks to nodes; it must optimize
   collaboration efficiency, minimize resource idleness, and reduce data
   bias across distributed systems.  The scheduling process requires
   careful consideration of factors such as task timing, resource
   availability, data dependency, and system load to determine the
   optimal execution order and synchronization strategy for each
   submodel.

   To manage incoming requests effectively, the scheduling algorithm
   must decide how tasks are prioritized and allocated.  For instance,
   using a First Come, First Serve (FCFS) strategy ensures that tasks
   are executed in the order they arrive.  However, this approach may
   leave some nodes underutilized if tasks vary significantly in
   complexity or resource requirements.  To address such inefficiencies,
   advanced scheduling methods like priority queues or dynamic insertion
   algorithms can be employed.  These methods prioritize tasks based on
   urgency, computational cost, or value to the system, ensuring that
   high-priority or time-sensitive tasks are assigned computational
   resources more quickly.  For example, in a real-time fraud detection
   system, high-risk transactions can be processed immediately by
   prioritizing their execution, while lower-risk transactions are
   queued for later.
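
   The difference between FCFS and priority-based ordering can be
   sketched non-normatively with a small priority queue; the class and
   field names below are illustrative assumptions.

      # Non-normative sketch: priority scheduling of segmented
      # sub-model tasks.  With the default priority this degenerates
      # to FCFS order.
      import heapq
      import itertools

      class SubModelScheduler:
          def __init__(self):
              self._heap = []
              self._seq = itertools.count()  # arrival order on ties

          def submit(self, task_id, priority=0):
              # Lower priority values are scheduled first.
              heapq.heappush(self._heap,
                             (priority, next(self._seq), task_id))

          def next_task(self):
              _, _, task_id = heapq.heappop(self._heap)
              return task_id

      scheduler = SubModelScheduler()
      scheduler.submit("low-risk-txn")
      scheduler.submit("high-risk-txn", priority=-10)  # jumps the queue
      assert scheduler.next_task() == "high-risk-txn"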




Yang, et al.            Expires 4 September 2025               [Page 17]

Internet-Draft              DMSC Architecture                 March 2025


   At the same time, in order to ensure correctness and consistency in
   the distributed environment, appropriate communication points must
   be arranged after each step of fragment computation to avoid data
   disorder or excessive delay.  For scenarios where the training or
   inference process is highly time-sensitive, exclusive GPU/CPU nodes
   can also be reserved for critical tasks at the scheduling level, or
   timing synchronization mechanisms can be enabled to ensure that all
   sub-models complete their updates and feedback within the same
   iteration cycle.

6.2.7.  Model segmentation aggregation

   Once all calculations distributed across different nodes or sub-
   models are completed, the intermediate results or parameters must be
   aggregated to produce the final output, whether it is a model
   prediction result or updated model parameters.  The aggregation
   module plays a pivotal role in consolidating these outputs into a
   unified result, ensuring consistency and accuracy in distributed AI
   workflows.

   The aggregation process typically employs strategies such as voting,
   weighted averaging, or attention mechanisms to combine the outputs of
   sub-models.  For instance, in an ensemble-based recommendation
   system, each sub-model might provide a recommendation score, and the
   aggregation module could compute a weighted average based on the
   performance or confidence of each sub-model.  Similarly, in
   distributed neural networks, attention mechanisms can be used to
   assign different importance to outputs from various nodes, enabling
   more precise aggregation based on task-specific contexts.  These
   strategies ensure that the aggregated result reflects the strengths
   and contributions of individual sub-models while maintaining overall
   coherence.
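
   A non-normative sketch of the weighted-averaging strategy is given
   below; the scores and weights are illustrative values, and in
   practice the weights would reflect each sub-model's confidence or
   validation performance.

      # Non-normative sketch: weighted averaging of sub-model outputs.
      def aggregate(scores, weights):
          if not scores or len(scores) != len(weights):
              raise ValueError("inputs must be non-empty and aligned")
          total = sum(weights)
          return sum(s * w for s, w in zip(scores, weights)) / total

      # Three recommendation sub-models contribute different weights.
      final_score = aggregate([0.82, 0.75, 0.91], [0.5, 0.2, 0.3])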

   However, aggregation in distributed systems is inherently challenging
   due to the possibility of node failures or delays.  Network jitter,
   node outages, or computation delays can prevent certain nodes from
   returning their results in time, potentially disrupting the
   aggregation process.  To address this, the control layer incorporates
   fault-tolerant mechanisms such as timeout retries, data playback, or
   redundant computation strategies.  For example, if a node fails to
   provide its result within a specified time frame, the system might
   either retry the computation on the same node or reassign the task to
   a different node.  In scenarios where redundancy is feasible,
   multiple nodes can perform the same computation, ensuring that at
   least one result is available for aggregation.






Yang, et al.            Expires 4 September 2025               [Page 18]

Internet-Draft              DMSC Architecture                 March 2025


   The aggregation module also monitors system-wide performance to
   evaluate the trade-off between computational benefits and
   coordination overhead.  By refining fault-tolerant logic and
   aggregation strategies, the control layer ensures that the advantages
   of distributed computation, such as scalability and parallelism, are
   not offset by excessive synchronization or error-handling delays.
   For example, in large-scale model training, the aggregation process
   might include gradient averaging or parameter summation across nodes,
   with mechanisms to handle delayed or missing gradients, ensuring that
   the global model converges effectively despite intermittent node
   failures.

6.3.  Computing power layer

   The computing layer is the execution core of the distributed
   artificial intelligence system, which converts the strategies and
   decisions of the control layer into actual calculations.  This layer
   processes tasks, manages resources, and executes distributed models
   across nodes, ensuring that the computational benefits of model
   segmentation are fully realized.  By integrating advanced scheduling,
   resource allocation, and fault tolerance mechanisms, the computing
   power layer ensures the efficient execution of tasks while
   maintaining the stability of the system under dynamic loads.

   The model segmentation strategy at the control layer determines how
   the sub-models or operators are distributed over the nodes.  The
   computing layer, in turn, optimizes resource allocation and execution
   to align with the segmentation design, ensuring that data
   dependencies and computational workflows are effectively managed.
   Through dynamic orchestration, parallel processing, and feedback
   mechanisms, this layer provides high performance and scalability for
   large-scale distributed AI systems.

6.3.1.  Calculation of micro-model parameters

   In the phase of micro-model parameter calculation, the computing
   layer receives the scheduling instructions from the control layer
   and obtains the aggregated model information provided by the
   distributed unified collaboration module.  The input usually
   includes the structural description of the micro-model (for
   example, network topologies such as convolutional networks, DNNs,
   or Transformers) and the corresponding data fragments or data
   blocks.  In addition, the computing layer takes into account the
   requirements of the business layer, such as inference latency,
   training accuracy, and throughput, to pre-allocate and schedule
   resources before execution.
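
   The scheduling input consumed at this stage can be thought of as a
   structured task descriptor, as in the sketch below; the field names
   are illustrative assumptions, not a normative format.

      from dataclasses import dataclass

      @dataclass
      class MicroModelTask:
          model_id: str        # identifier of the micro-model
          topology: str        # e.g. "cnn", "dnn", "transformer"
          data_block: str      # reference to the assigned data block
          max_latency_ms: int  # business-layer latency bound
          min_accuracy: float  # business-layer accuracy requirement
          throughput_qps: int  # required queries per second

      task = MicroModelTask("mm-01", "transformer", "block-17",
                            max_latency_ms=50, min_accuracy=0.92,
                            throughput_qps=200)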






Yang, et al.            Expires 4 September 2025               [Page 19]

Internet-Draft              DMSC Architecture                 March 2025


   When the micro-model and data are ready, the computing power
   execution module loads the corresponding operators onto GPUs, CPUs,
   or other hardware acceleration units according to the pre-selected
   computing framework (such as TensorFlow, PyTorch, or a self-
   developed lightweight AI inference engine), and performs parallel
   computing according to the parallelism configuration provided by
   the distributed unified collaboration module.  For larger
   convolutional layers or attention mechanisms, the system may adopt
   communication modes such as AllReduce or All-to-All to distribute
   computing tasks, and performs synchronization or gradient updates
   after each iteration completes.  For lightweight AI models, the
   computing layer gives priority to nodes with fast response times to
   meet low-latency application scenarios.

   Throughout the process, the load balancing and resource allocation
   mechanism monitors the load of each resource pool (such as
   "computing power resource pool 1", "computing power resource pool
   2", etc.) in real time and makes dynamic adjustments when a node
   hits a performance bottleneck or has idle resources, so as to
   reduce calculation waiting time and improve overall throughput.
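
   As one possible realization, the following sketch uses the PyTorch
   torch.distributed API to average gradients across workers with an
   AllReduce operation after each iteration.  The process-group
   initialization parameters are placeholders that depend on the
   actual deployment.

      import torch
      import torch.distributed as dist

      def sync_gradients(model, world_size):
          # Average each parameter's gradient across all workers
          # with an AllReduce so every node holds identical updates.
          for param in model.parameters():
              if param.grad is not None:
                  dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
                  param.grad /= world_size

      # Placeholder initialization; backend, rank, and world size
      # come from the actual deployment environment:
      # dist.init_process_group(backend="nccl", rank=rank,
      #                         world_size=world_size)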

   When the calculation is finished, the computing layer summarizes
   the execution of each micro-model, generates records including
   calculation delay, model metrics (such as loss or accuracy), and
   hardware utilization, and archives these records through the data
   management module in preparation for the next distributed computing
   power parameter update.
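
   The archived execution record could, for instance, take the
   following form; the field names are purely illustrative.

      execution_record = {
          "model_id": "mm-01",
          "iteration": 128,
          "compute_latency_ms": 37.5,
          "metrics": {"loss": 0.084, "accuracy": 0.947},
          "hardware_utilization": {"gpu": 0.81, "memory": 0.63},
      }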

6.3.2.  Distributed computing power parameter update

   In the stage of distributed computing power parameter update, the
   computing layer needs to globally merge and synchronize the
   intermediate results or model gradients calculated in the previous
   step, and then feed back the updated model parameters to the control
   layer or data layer.  The input usually includes information such as
   training gradients uploaded by each node, model weight chunks, and
   node health status.  The distributed unified collaboration module
   combines fault tolerance and recovery mechanisms to ensure that
   parameters can be smoothly aggregated even when some nodes are
   delayed or fail.













Yang, et al.            Expires 4 September 2025               [Page 20]

Internet-Draft              DMSC Architecture                 March 2025


   According to the business requirements and model scale, the
   computing layer chooses the optimal parallel communication
   strategy, such as Ring AllReduce, Tree AllReduce, or gradient
   compression followed by aggregation, to reduce network bandwidth
   consumption and accelerate the synchronization of model parameters.
   For large models using Transformer or attention structures, the
   computing layer can allocate model parameters to different resource
   pools to be updated in parallel with the help of block or pipeline
   parallelism, and then consolidate the results on the master node or
   master process.
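
   A simple example of "gradient compression followed by aggregation"
   is top-k sparsification, sketched below with NumPy; the compression
   ratio and tensor size are assumptions chosen only for illustration.

      import numpy as np

      def topk_compress(grad, ratio=0.01):
          # Keep only the largest-magnitude entries (top-k) and
          # transmit their indices and values; the rest are zero.
          k = max(1, int(grad.size * ratio))
          idx = np.argpartition(np.abs(grad), -k)[-k:]
          return idx, grad[idx]

      def topk_decompress(idx, values, size):
          full = np.zeros(size, dtype=values.dtype)
          full[idx] = values
          return full

      grad = np.random.randn(10000)
      idx, vals = topk_compress(grad)
      restored = topk_decompress(idx, vals, grad.size)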

   After the distributed parameter update is completed, the computing
   layer will send the final model weights or inference engine image
   back to the control layer to be registered in the model warehouse as
   the "latest version of the model", and may also synchronize some
   intermediate features or labels to the data layer for subsequent
   analysis.  At the same time, the fault tolerance and recovery
   module evaluates the stability and performance of each node
   according to the monitoring data collected during the training and
   update process, and provides a decision basis for the next
   iteration cycle or new task scheduling.

6.3.3.  Distributed unified collaboration module

   The distributed unified collaboration module is located at the core
   of the entire computing power layer.  It is responsible for
   receiving and integrating task instructions (such as the model
   segmentation strategy and training or inference goals) from the
   control layer, and for interfacing with the underlying computing
   power resource pools.  Its inputs include information about the
   architecture of the individual micro-models or aggregated models,
   the type of computation to be performed (training or inference),
   and an overview of the hardware available in the current cluster.
   The output is a global orchestration instruction for computing
   resources and computing processes, which guides the computing power
   execution module and other functional modules to work together.  A
   distributed unified collaboration module will typically work with a
   service registry or cluster orchestration system (e.g., Kubernetes,
   YARN), or may have a built-in distributed communication framework
   (e.g., NCCL, Horovod) to manage and synchronize multiple GPUs or
   multiple nodes.  Its most prominent feature is that it can
   dynamically map different sub-models or operators to the most
   appropriate node according to the model block information and
   computing requirements, so that distributed computing can maintain
   high throughput and scalability in a multi-task, multi-model
   environment.






Yang, et al.            Expires 4 September 2025               [Page 21]

Internet-Draft              DMSC Architecture                 March 2025


6.3.4.  Load balancing and resource allocation mechanism

   The load balancing and resource allocation mechanism monitors the
   load of each computing resource pool (such as GPU cluster, CPU
   cluster, heterogeneous accelerator, etc.) in real time, and combines
   the task scheduling strategy given by the distributed unified
   collaboration module to decide how to distribute the computing load
   between nodes.  The input mainly includes the real-time status
   information of each node (idle level, free memory, computing power
   utilization) and the description of the hardware requirements of
   the task to be assigned (e.g., how many GPUs are needed, and
   whether mixed-precision training is supported).  The output is the
   specific node allocation scheme and task routing instructions,
   which guide the computing power execution module to deliver
   computing tasks to the optimal location.
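
   A greatly simplified version of such an allocation decision is
   sketched below.  The node status fields and task requirement fields
   mirror the description above but are otherwise hypothetical.

      def allocate(task, nodes):
          # task:  {"gpus": 2, "mixed_precision": True}
          # nodes: {node_id: {"idle": ..., "free_gpus": ...,
          #                   "fp16": ...}}
          candidates = [
              nid for nid, n in nodes.items()
              if n["free_gpus"] >= task["gpus"]
              and (not task["mixed_precision"] or n["fp16"])
          ]
          if not candidates:
              return None  # defer or queue the task
          # Route to the most idle qualifying node.
          return max(candidates, key=lambda nid: nodes[nid]["idle"])

      nodes = {"n1": {"idle": 0.2, "free_gpus": 1, "fp16": True},
               "n2": {"idle": 0.7, "free_gpus": 4, "fp16": True}}
      # Expected to route the task to "n2".
      print(allocate({"gpus": 2, "mixed_precision": True}, nodes))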

6.3.5.  Computing power execution module

   According to the instructions from the distributed unified
   collaboration module and the load balancing module, the computing
   power execution module loads the specific micro-model or operator
   onto the corresponding node to run.  The inputs include model
   parameters, network topology, and data blocks, and the outputs are
   computed inference results or intermediate training gradients.  The
   module can run on multiple servers through containerization (e.g.,
   Docker, Kubernetes Pods), and can be combined with AI frameworks
   (TensorFlow, PyTorch, etc.) or self-developed inference engines to
   flexibly switch the execution environment and underlying computing
   power.

6.3.6.  Data management module

   The data management module exchanges the necessary features, tags,
   and metadata with the control layer and the business layer.  Its
   input sources usually include already chunked or segmented data
   sets, as well as intermediate results generated during model
   execution (e.g., local gradients and temporary features).  Its
   outputs are updated snapshots of model parameters or preprocessed
   feature data for later use.  The data management module can support
   highly concurrent reads and writes with the help of distributed
   file systems (e.g., HDFS), object storage (e.g., S3), or message
   queues (e.g., Kafka, RabbitMQ).  It also serves small-scale, high-
   frequency data queries with database or cache systems.









Yang, et al.            Expires 4 September 2025               [Page 22]

Internet-Draft              DMSC Architecture                 March 2025


6.3.7.  Fault tolerance and recovery module

   The fault tolerance and recovery module continuously monitors the
   heartbeat, load and network status of each node while the system is
   running.  Once an anomaly is detected, the fault information is
   reported to the distributed unified collaboration module, and the
   automatic fault tolerance logic is triggered.  The inputs are real-
   time cluster health data, task execution logs, and node failure
   reports.  The output is a series of decision instructions,
   including restarting tasks, reallocating resources, or rolling back
   to the last stable snapshot.  This often includes self-healing
   through automation scripts (Ansible, SaltStack, etc.) or cluster
   orchestration (Kubernetes), or a checkpoint-and-resume training
   process that records the current iteration number and intermediate
   parameters when a crash occurs and waits for the node to recover
   before continuing execution.
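
   The checkpoint-and-resume behavior described above can be as simple
   as persisting the current iteration number and intermediate
   parameters at a fixed interval and reloading them when the node
   recovers.  A minimal sketch, assuming a local checkpoint file, is
   shown below.

      import json
      import os

      CKPT = "checkpoint.json"

      def save_checkpoint(iteration, params):
          # Persist the iteration number and intermediate parameters
          # so a restarted node can resume instead of starting over.
          with open(CKPT, "w") as f:
              json.dump({"iteration": iteration, "params": params}, f)

      def load_checkpoint():
          if not os.path.exists(CKPT):
              return 0, [0.0, 0.0]  # fresh start
          with open(CKPT) as f:
              state = json.load(f)
          return state["iteration"], state["params"]

      start, params = load_checkpoint()
      for it in range(start, 1000):
          # ... one training iteration that updates params ...
          if it % 100 == 0:
              save_checkpoint(it, params)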

6.3.8.  Computing resource pool

   A pool of computing resources represents a collection of underlying
   hardware that actually provides computing power.  Each pool may
   correspond to different types or specifications of hardware, such as
   GPU server farms, CPU clusters, FPGA/ASIC accelerator cards, or even
   hybrid computing power across cloud or local data centers.  Their
   inputs are usually task assignments and model execution
   requirements from the load balancing and resource allocation
   mechanism, and their outputs are the computed inference or training
   results.  Relevant performance indicators (such as temperature,
   power consumption, and throughput) are fed back to upper modules
   for analysis.

6.4.  Data layer

   The data layer is the backbone of distributed AI systems, enabling
   efficient data management while ensuring privacy protection,
   scalability, and seamless integration with other layers, including
   control, computing, and business layers.  It plays a pivotal role in
   storing, transmitting, and processing diverse datasets, supporting
   distributed training, inference, and model segmentation workflows.
   Through its robust design, the data layer balances security and
   performance while maintaining the flexibility required by dynamic,
   large-scale AI systems.










Yang, et al.            Expires 4 September 2025               [Page 23]

Internet-Draft              DMSC Architecture                 March 2025


6.4.1.  Privacy protection

   Privacy protection is at the core of the data layer, ensuring secure
   data handling across the entire AI workflow.  Multiple databases
   (e.g., DB1, DB2, ..., DBn) store datasets from various business
   domains or sensitivity levels, enabling the system to manage and
   segregate data efficiently.  For high-sensitivity scenarios, such as
   healthcare or financial applications, only encrypted or desensitized
   data fields are stored and transmitted.  For instance, patient
   medical records might be encrypted locally, and only aggregated
   gradients or anonymized insights are shared during federated learning
   tasks.

   When the system executes model training or inference, the control
   layer determines the appropriate data transmission strategy based on
   predefined privacy policies.  Federated learning ensures that raw
   data remains localized, sharing only intermediate model gradients or
   parameters, while differential privacy adds noise to data or
   computations to prevent individual information leakage.

   To further strengthen security, the data layer integrates advanced
   privacy-preserving technologies, such as homomorphic encryption,
   multi-party secure computation, and differential privacy injection.
   These techniques enable micro-models and segmented workflows to
   process data securely while complying with privacy regulations.  For
   instance, in a cross-database integration scenario, the data layer
   ensures that access control policies and metadata updates prevent
   unauthorized sharing of sensitive data, maintaining compliance
   without hindering system performance.
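
   As an illustration of differential privacy injection on shared
   gradients, the following sketch clips each gradient and adds
   Gaussian noise before it leaves the node.  The clipping norm and
   noise multiplier are hypothetical tuning parameters, not values
   mandated by this document.

      import numpy as np

      def privatize_gradient(grad, clip_norm=1.0, noise_mult=1.1):
          # Clip the gradient to bound one node's contribution, then
          # add Gaussian noise scaled to the clipping norm before the
          # gradient leaves the node.
          grad = np.asarray(grad, dtype=float)
          norm = np.linalg.norm(grad)
          if norm > clip_norm:
              grad = grad * (clip_norm / norm)
          noise = np.random.normal(0.0, noise_mult * clip_norm,
                                   size=grad.shape)
          return grad + noise

      shared = privatize_gradient([0.8, -0.3, 1.7])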

6.4.2.  Database maintenance and update

   The data layer's database infrastructure ensures reliable storage,
   high availability, and scalability, supporting the execution of
   micro-models and model segmentation workflows.  Distributed databases
   are deployed to manage datasets associated with various system
   segments, enabling parallel operations and efficient data
   provisioning for training and inference tasks.

   To handle high-concurrency environments, the data layer leverages
   distributed database architectures such as NoSQL, NewSQL, and
   relational databases, each selected based on the nature of the
   workload:

   NoSQL databases (e.g., HBase, Cassandra) are ideal for handling
   unstructured or semi-structured data, such as logs and user behavior
   data, offering high write throughput and horizontal scalability.




Yang, et al.            Expires 4 September 2025               [Page 24]

Internet-Draft              DMSC Architecture                 March 2025


   NewSQL systems (e.g., TiDB) provide a hybrid solution, balancing
   transactional consistency with scalability, making them suitable for
   workloads requiring real-time updates, such as model parameter
   synchronization.

   Relational databases (e.g., MySQL, PostgreSQL) handle structured
   datasets, such as model version histories or feature engineering
   outputs, ensuring strong consistency and query efficiency.

   The data layer ensures data consistency and fault tolerance through
   mechanisms such as master-slave replication, shard-based
   architectures, and automated failover.  For example, if a database
   shard responsible for storing training gradients becomes unavailable,
   the system redirects queries to backup replicas or initiates a
   failover process to restore service.  Regular incremental backups and
   disaster recovery protocols safeguard critical data against long-term
   loss due to network or hardware failures.

   Real-time monitoring tools, such as Prometheus and ELK Stack, track
   database performance metrics, including query latency,
   synchronization delays, and disk usage.  If anomalies are detected,
   automated alerts trigger recovery actions such as reallocating
   workloads, rerouting queries, or scaling database resources to
   prevent bottlenecks.  For instance, during a high-demand scenario
   like a shopping festival, the data layer may dynamically scale up
   storage resources to accommodate surging user activity logs, ensuring
   uninterrupted data availability for recommendation models.

7.  IANA Considerations

   TBD

8.  Acknowledgement

   TBD

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.




Yang, et al.            Expires 4 September 2025               [Page 25]

Internet-Draft              DMSC Architecture                 March 2025



Authors' Addresses

   Hui Yang
   Beijing University of Posts and Telecommunications
   10 Xitucheng Road, Haidian District
   Beijing
   Beijing, 100876
   China
   Email: yanghui@bupt.edu.cn


   Tiankuo Yu
   Beijing University of Posts and Telecommunications
   10 Xitucheng Road, Haidian District
   Beijing
   Beijing, 100876
   China
   Email: yutiankuo@bupt.edu.cn


   Qiuyan Yao
   Beijing University of Posts and Telecommunications
   10 Xitucheng Road, Haidian District
   Beijing
   Beijing, 100876
   China
   Email: yqy89716@bupt.edu.cn


   Zepeng Zhang
   Beijing University of Posts and Telecommunications
   10 Xitucheng Road, Haidian District
   Beijing
   Beijing, 100876
   China
   Email: 2024140574@bupt.cn












Yang, et al.            Expires 4 September 2025               [Page 26]