DMSC Working Group                                               H. Yang
Internet-Draft                                                     T. Yu
Intended status: Standards Track                                  Q. Yao
Expires: 4 September 2025                                       Z. Zhang
                      Beijing University of Posts and Telecommunications
                                                            3 March 2025


Microservice Communication Resource Scheduling for Distributed AI Model
                  draft-yang-dmsc-distributed-model-03

Abstract

   This document describes the architecture of microservice
   communication resource scheduling for distributed AI models.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 4 September 2025.

Copyright Notice

   Copyright (c) 2025 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.





Yang, et al.            Expires 4 September 2025                [Page 1]

Internet-Draft              DMSC Architecture                 March 2025


Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
   2.  Conventions used in this document . . . . . . . . . . . . . .   4
   3.  Terminology . . . . . . . . . . . . . . . . . . . . . . . . .   4
   4.  Scenarios and requirements  . . . . . . . . . . . . . . . . .   4
     4.1.  AI Microservice model scenario requirements . . . . . . .   5
     4.2.  Distributed Micro model Service Flow  . . . . . . . . . .   6
   5.  Key issues and challenges . . . . . . . . . . . . . . . . . .   6
     5.1.  Balancing Compute and Network Resources under
           Constraints . . . . . . . . . . . . . . . . . . . . . . .   7
     5.2.  Data Collaboration Challenges under Block Isolation . . .   7
   6.  Distributed solution based on model segmentation  . . . . . .   8
     6.1.  Business layer  . . . . . . . . . . . . . . . . . . . . .  12
       6.1.1.  Microservices and Micromodels . . . . . . . . . . . .  12
       6.1.2.  Microservice Gateway and API Gateway  . . . . . . . .  13
       6.1.3.  Service Registration and Discovery Center . . . . . .  13
     6.2.  Control layer . . . . . . . . . . . . . . . . . . . . . .  14
       6.2.1.  Task management module  . . . . . . . . . . . . . . .  14
       6.2.2.  Exception task queue module . . . . . . . . . . . . .  14
       6.2.3.  Log management system . . . . . . . . . . . . . . . .  15
       6.2.4.  Model segmentation interface  . . . . . . . . . . . .  15
       6.2.5.  Model segmentation module . . . . . . . . . . . . . .  15
       6.2.6.  Model segmentation scheduling . . . . . . . . . . . .  17
       6.2.7.  Model segmentation aggregation  . . . . . . . . . . .  18
     6.3.  Computing power layer . . . . . . . . . . . . . . . . . .  19
       6.3.1.  Calculation of micro-model parameters . . . . . . . .  19
       6.3.2.  Distributed computing power parameter update  . . . .  20
       6.3.3.  Distributed unified Collaboration module  . . . . . .  21
       6.3.4.  Load balancing and resource allocation mechanism  . .  22
       6.3.5.  Computing power execution module  . . . . . . . . . .  22
       6.3.6.  Data management module  . . . . . . . . . . . . . . .  22
       6.3.7.  Fault tolerance and recovery module . . . . . . . . .  23
       6.3.8.  Computing resource pool . . . . . . . . . . . . . . .  23
     6.4.  Data layer  . . . . . . . . . . . . . . . . . . . . . . .  23
       6.4.1.  Privacy protection  . . . . . . . . . . . . . . . . .  24
       6.4.2.  Database maintenance and update . . . . . . . . . . .  24
   7.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  25
   8.  Acknowledgement . . . . . . . . . . . . . . . . . . . . . . .  25
   9.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  25
     9.1.  Normative References  . . . . . . . . . . . . . . . . . .  25
     9.2.  Informative References  . . . . . . . . . . . . . . . . .  25
   Appendix A.  An Appendix  . . . . . . . . . . . . . . . . . . . .  26
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  26







Yang, et al.            Expires 4 September 2025                [Page 2]

Internet-Draft              DMSC Architecture                 March 2025


1.  Introduction

   The architecture of microservice communication resource scheduling
   for distributed AI models is a structured framework designed to
   address the challenges of scalability, flexibility, and efficiency in
   modern AI systems.  By integrating model segmentation, micro-model
   deployment, and microservice orchestration, this architecture enables
   the effective allocation and management of computing resources across
   distributed environments.  The primary focus lies in leveraging model
   segmentation to decompose large AI models into smaller, modular
   micro-models, which are executed collaboratively across distributed
   nodes.

   The architecture is organized into four tightly integrated layers,
   each with distinct roles and responsibilities that together ensure
   seamless functionality:

   Business Layer: This layer acts as the interface between the user-
   facing applications and the underlying system.  It encapsulates AI
   capabilities as microservices, enabling modular deployment, elastic
   scaling, and independent version control.  By routing user requests
   through service gateways, it ensures efficient interaction with back-
   end micro-models while balancing workloads.  The business layer also
   facilitates collaboration between multiple micro-models, allowing
   them to function as part of a cohesive distributed system.

   Control Layer: The control layer is the central coordination hub,
   responsible for task scheduling, resource allocation, and the
   implementation of model segmentation strategies.  It decomposes large
   AI models into smaller, manageable components, assigns tasks to
   specific nodes, and ensures synchronized execution across distributed
   environments.  This layer dynamically balances compute and network
   resources while adapting to system demands, ensuring high efficiency
   for training and inference workflows.

   Computing Layer: As the execution core, this layer translates the
   decisions made by the control layer into distributed computation.  It
   executes segmented micro-models on diverse hardware resources such as
   GPUs, CPUs, and accelerators, optimizing parallelism and fault
   tolerance.  By coordinating with the control layer, it ensures that
   tasks are executed efficiently while leveraging distributed
   orchestration frameworks to handle diverse workloads.









Yang, et al.            Expires 4 September 2025                [Page 3]

Internet-Draft              DMSC Architecture                 March 2025


   Data Layer: The data layer underpins the entire system by managing
   secure storage, access, and transmission of data.  It provides the
   necessary datasets, intermediate results, and metadata required for
   executing segmented micro-models.  Privacy protection mechanisms,
   such as federated learning and differential privacy, ensure data
   security and compliance, while distributed database operations
   guarantee consistent access and high availability across nodes.

   At the heart of this architecture is model segmentation, which serves
   as the foundation for effectively distributing computation and
   optimizing resource utilization.  The control layer breaks down
   models into smaller micro-models using strategies such as layer-
   based, business-specific, or block-based segmentation.  These micro-
   models are then deployed as independent services in the business
   layer, where they are dynamically scaled and orchestrated to meet
   real-time demands.  The computing layer executes these tasks using
   parallel processing techniques and advanced scheduling algorithms,
   while the data layer ensures secure and efficient data flow to
   support both training and inference tasks.

   By tightly integrating these layers, the architecture addresses
   critical challenges such as balancing compute and network resources,
   synchronizing distributed micro-models, and minimizing communication
   overhead.  This cohesive design enables AI systems to achieve high
   performance, scalability, and flexibility across dynamic and
   resource-intensive workloads.

   This document outlines the design principles, key components, and
   operational advantages of the microservice communication resource
   scheduling architecture for distributed AI models, emphasizing how
   model segmentation, micro-models, and microservices form the
   foundation for scalable and efficient distributed AI systems.

2.  Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

3.  Terminology

   TBD

4.  Scenarios and requirements







Yang, et al.            Expires 4 September 2025                [Page 4]

Internet-Draft              DMSC Architecture                 March 2025


4.1.  AI Microservice model scenario requirements

   As artificial intelligence technology evolves at an accelerated
   pace, the scale and intricacy of AI models continue to expand.
   Traditional monolithic applications and centralized inference and
   training deployments are progressively becoming inadequate to meet
   swiftly changing business demands.
   Encapsulating AI capabilities within a microservices architecture can
   confer substantial advantages in terms of system flexibility,
   scalability, and service governance.  By decoupling models through
   microservices, an independent AI model service can circumvent
   potential bottlenecks that arise from deep coupling with other
   business logic components, and it can also achieve elastic scaling
   during surges in requests or training loads.  Given the rapid
   iteration and upgrade cycles of AI models, a microservice
   architecture facilitates the coexistence of multiple model versions,
   enables gray-scale releases, and supports rapid rollbacks, thereby
   minimizing the impact on the overall system.

   The computing power requirements of AI microservice models are
   often extremely demanding.  On the one hand, the training or
   inference process usually involves massive data processing and
   high-density parallel computing, requiring the collaborative work
   of heterogeneous hardware resources such as GPUs, CPUs, FPGAs, and
   NPUs.  On the other hand, when the model is large or the request
   volume is high, the computing power of a single machine is often
   insufficient to meet business needs; parallel computing must then
   be performed across multiple nodes in a distributed mode, and
   resources should be released during idle periods to improve
   utilization.  Such distributed training or inference typically
   relies on efficient communication strategies to synchronize model
   parameters or gradients, and collectives such as AllReduce or
   All-to-All are often used to reduce communication overhead and
   ensure model consistency.
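
   The following non-normative Python sketch illustrates how such a
   gradient synchronization step could look with an AllReduce
   collective; the use of torch.distributed, the NCCL backend, and the
   tensor shapes are illustrative assumptions rather than requirements
   of this architecture.

      # Non-normative sketch: gradient synchronization via AllReduce.
      # Assumes a torch.distributed process group has already been
      # initialized (for example with the NCCL backend) and that every
      # worker holds a gradient tensor of identical shape.
      import torch
      import torch.distributed as dist

      def sync_gradients(grad: torch.Tensor, world_size: int):
          # Sum the local gradients of all workers in place ...
          dist.all_reduce(grad, op=dist.ReduceOp.SUM)
          # ... and average them so every worker applies the same update.
          grad /= world_size
          return grad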

   In a distributed system, the network plays a crucial role.  A large
   number of model parameters and gradients must be exchanged
   frequently during computation, which places high demands on network
   bandwidth and latency.  In large-scale cluster scenarios, the
   design of the network topology and the choice of communication
   framework cannot be ignored.  Only in a high-bandwidth, low-latency
   network environment, combined with an appropriate communication
   library (such as NCCL or MPI), can the cluster fully exploit its
   computing potential and prevent communication from becoming the
   bottleneck of global performance.







Yang, et al.            Expires 4 September 2025                [Page 5]

Internet-Draft              DMSC Architecture                 March 2025


4.2.  Distributed Micro model Service Flow

   In the microservice communication resource scheduling architecture
   for distributed AI models, the core of the business process is how
   to realize the multi-node placement and collaborative operation of
   the model so as to ensure efficient parameter synchronization and
   communication.  Typically, a model is trained and evaluated with a
   deep learning framework during development and is then
   containerized, packaging the model and its dependencies into an
   image that can be deployed as an independent service.  These
   encapsulated model services are then registered with the system's
   microservice management platform for subsequent unified scheduling
   and access.

   Micro-models are deployed to a distributed cluster, where computing
   power orchestration and resource scheduling allocate computing
   resources such as GPUs or CPUs according to real-time load,
   business priority, and hardware topology, and container
   orchestration tools (such as Kubernetes) start the corresponding
   service instances on each node.  When distributed cooperation is
   needed, frameworks such as NCCL or Horovod are used to complete
   inter-process communication.  Requests from upper business systems
   or users usually arrive at the API gateway or service gateway first
   and are then distributed to the target service instances according
   to load balancing or other routing policies.  If distributed
   inference is needed, multiple nodes cooperate to perform inference
   over the segmented model and aggregate the results, and the final
   inference result is returned to the requester.  In this process,
   real-time monitoring and elastic scaling mechanisms play an
   important role in ensuring system stability and optimizing resource
   utilization.  At the monitoring level, through a unified data
   acquisition and analysis platform, the system can track core
   indicators such as the GPU utilization, network traffic, and
   request latency of each service node, so as to raise timely alarms
   in case of failures, performance bottlenecks, or insufficient
   resources, and to perform automatic failover or take nodes offline.
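
   As a non-normative illustration of the request routing step, the
   Python sketch below shows how a gateway-style dispatcher might
   select the least-loaded service instance obtained from a discovery
   client; the instance fields, load metric, and endpoint path are
   illustrative assumptions only.

      # Non-normative sketch: load-aware routing of an inference request
      # at a gateway.  The instance record, load metric, and endpoint
      # path are illustrative assumptions, not part of this document.
      from dataclasses import dataclass

      @dataclass
      class Instance:
          host: str
          port: int
          load: float        # e.g., reported GPU utilization in [0, 1]

      def pick_instance(instances):
          # Least-loaded routing; a gateway could instead use round
          # robin, weighted, or hash-based policies.
          return min(instances, key=lambda inst: inst.load)

      def target_url(instances):
          inst = pick_instance(instances)
          # The gateway would forward the request payload to this URL
          # (for example over HTTP or gRPC).
          return "http://%s:%d/v1/infer" % (inst.host, inst.port)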

   In addition, the distributed micro-model business flow needs to be
   combined with a data backflow mechanism.  The large volume of logs,
   user feedback, and interaction information generated during
   inference can be further used to train new models or to optimize
   the performance of existing models, provided that it is returned to
   the data platform in a manner that satisfies privacy and compliance
   requirements.

5.  Key issues and challenges







Yang, et al.            Expires 4 September 2025                [Page 6]

Internet-Draft              DMSC Architecture                 March 2025


5.1.  Balancing Compute and Network Resources under Constraints

   With the continuous growth of AI model size and business demand,
   the computing power of a single node or a single cluster is often
   insufficient to support high-intensity training and inference
   tasks, leading to computing power shortages or sharply rising
   costs.  Coordinating computing resources across multiple nodes and
   regions through a distributed architecture can improve overall
   efficiency and fault tolerance to a certain extent.  However,
   distributed deployment also brings higher complexity: it must
   account not only for the differences among heterogeneous hardware
   (such as GPUs, CPUs, and FPGAs), but also for balancing the
   allocation of computing power under different network topologies
   and bandwidth conditions.

   When computing and network resources are scarce, computing power
   must be dynamically scheduled and allocated according to business
   priority, model scale, and real-time load, combining strategic
   queuing, elastic scaling, and cross-cluster resource collaboration
   to improve overall service efficiency.  In this process, the model
   partitioning/parallelism scheme plays a key role.  On the one hand,
   the model can be decomposed across multiple nodes by means of
   "tensor segmentation" or "computation pipelining", with each node
   responsible only for a specific submodule or slice.  On the other
   hand, for inference scenarios, the input data can flow through a
   series of model microservice nodes to form a pipelined processing
   mode, making full use of scattered computing resources.  By
   splitting the model for parallel execution in this way, the system
   not only avoids excessive computing pressure on a single server,
   but also maximizes the use of the GPU/CPU computing power of idle
   nodes when network resources permit, achieving balance and
   optimization between computing and network resources.
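
   The tensor-segmentation idea can be illustrated with a minimal,
   non-normative sketch: the weight matrix of a linear operator is
   split column-wise so that each node computes only its own slice,
   and the partial outputs are concatenated to reproduce the full
   result.  The use of NumPy and the specific shapes are illustrative
   assumptions.

      # Non-normative sketch: column-wise tensor segmentation of a
      # linear layer across several nodes.  NumPy stands in for the
      # per-node computation; shapes are illustrative.
      import numpy as np

      def split_weight(weight, parts):
          # Each shard would be held by a different node.
          return np.array_split(weight, parts, axis=1)

      def parallel_matmul(x, shards):
          # Every node computes its partial output x @ shard; the
          # concatenation equals the full product x @ weight.
          return np.concatenate([x @ w for w in shards], axis=1)

      x = np.random.rand(8, 128)          # a batch of activations
      weight = np.random.rand(128, 256)   # full weight matrix
      shards = split_weight(weight, parts=4)
      assert np.allclose(parallel_matmul(x, shards), x @ weight)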

5.2.  Data Collaboration Challenges under Block Isolation

   In many distributed systems, large-scale data is usually split into
   multiple data blocks that are stored and processed separately.
   Although this improves data security and processing efficiency, it
   also creates challenges for data coordination.  When multiple nodes
   or microservice modules need to share or exchange data, the
   interfaces and call sequences must be defined in advance, and
   consistency and concurrency control must be managed.  Especially
   when different data blocks have cross-node dependencies, how to
   schedule, load, and distribute data effectively becomes one of the
   key bottlenecks for system scalability and computational
   efficiency.





Yang, et al.            Expires 4 September 2025                [Page 7]

Internet-Draft              DMSC Architecture                 March 2025


   A key difficulty lies in synchronizing data across distributed nodes
   while minimizing latency and avoiding bottlenecks.  Cross-node
   dependencies require precise scheduling to ensure data arrives at the
   correct location and time without conflicts.  As the scale of data
   and the number of nodes grow, the management overhead for maintaining
   these dependencies can increase exponentially, particularly when
   network bandwidth or latency constraints exacerbate delays.
   Additionally, ensuring data consistency across multiple data blocks
   during concurrent access or updates adds another layer of complexity.
   High levels of concurrency can increase the risk of inconsistencies,
   data races, and synchronization issues, demanding advanced mechanisms
   to enforce data integrity.

   Traditional distributed communication strategies, such as AllReduce
   and All-to-All, are widely used and remain effective in addressing
   certain data collaboration needs in training and inference tasks.
   For example, AllReduce is well-suited for data parallel scenarios,
   where all nodes compute on the same model with different data splits,
   and gradients or weights are synchronized via aggregation and
   broadcast.  Similarly, All-to-All is valuable in more complex
   distributed tasks that require frequent intermediate data exchanges
   across nodes.  However, these methods are not without limitations.
   As data and system complexity grow, they can lead to increased
   communication overhead, especially in scenarios where synchronization
   is uneven or poorly timed.
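
   As a non-normative illustration of the All-to-All pattern, the
   sketch below uses torch.distributed.all_to_all to let every rank
   exchange one chunk of intermediate data with every other rank; the
   initialized process group and equal chunk sizes are assumptions
   made for brevity.

      # Non-normative sketch: All-to-All exchange of intermediate data.
      # Assumes a torch.distributed process group has been initialized
      # and that all chunks have the same shape.
      import torch
      import torch.distributed as dist

      def exchange(local_chunks):
          # Rank i sends local_chunks[j] to rank j and receives one
          # chunk from every rank in return.
          received = [torch.empty_like(c) for c in local_chunks]
          dist.all_to_all(received, local_chunks)
          return received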

   The effectiveness of traditional methods relies on fine tuning and
   precise execution.  Improper timing of data exchange can lead to long
   waiting times, underutilization of resources, and even data mismatch.
   Although approaches such as AllReduce and All-to-All provide reliable
   communication frameworks, their scalability and efficiency are often
   limited by challenges such as synchronization across nodes, network
   variations, and system heterogeneity.  Therefore, there is a need for
   continuous improvement and innovation in distributed communication
   and data collaboration strategies to overcome the challenges posed by
   block isolation.

6.  Distributed solution based on model segmentation

   Based on these key problems and challenges, a microservice
   communication resource scheduling architecture for distributed AI
   models is proposed.  It is divided into four layers: the business
   layer, the control layer, the computing layer, and the data layer.
   The hierarchical relationship is shown in Figure 1, and the overall
   architecture is shown in Figure 2.  The functional modules realize
   soft collaboration at the control layer and hard isolation at the
   data layer, as shown in Figure 3.



Yang, et al.            Expires 4 September 2025                [Page 8]

Internet-Draft              DMSC Architecture                 March 2025


    ---------------------------------
   |          Business layer         |
   |                 |               |
   |           Control layer         |
   |                 |               |
   |          Computing layer        |
   |                 |               |
   |             Data layer          |
    ---------------------------------










































Yang, et al.            Expires 4 September 2025                [Page 9]

Internet-Draft              DMSC Architecture                 March 2025


 -----------------------------------------------------------------------------------------------------------------------------------------------------------
|                       -----------      -----------                                      -----------      -----------                                      |
|                      |Service A/1|    |Service B/1|                                    |Service A/2|    |Service B/2|                                     |
|                       -----|-----      -----|-----                                      -----|-----      -----|-----                                      |
|                            |                |                                                |                |                                           |
|                            |                |                                                |                |                                           |
|                       -----------------------------                                    -----------------------------                                      |
|                      |  Microservices Gateway -1   |                                  |  Microservices Gateway -2   |                                     |
|                       ------------|----------------                                    -----------|-----------------                                      |
|                                   |                                                               |                                                       |
|                              -----|-----                                                     -----|-----                                                  |
|                             | Interface |                                                   | Interface |                                                 |
|                             | address 1 |- - - - - - - - - - - - - - - - - - - - - - - - - -| address 2 |----------------------------------               |
|                              -----\-----                                                     -----/-----            Address caching        |              |
|                                     \                                                            /                                         |              |
|                                       \                                                        /                                           |              |
|           --------------------        --\-------------                          -------------/--       --------------------                |              |
|          | Functional modules |------| Service Router |------------------------| Service Router |-----| Functional modules |               |              |
|           --------------------        -------\--------                          --------/-------       --------------------                |              |
|                                                \                                      /                                                    |              |
|                                                  \                                  /                                            ----------------------   |
|                                                    \                              /                                --------     | Service Registration |  |
|                                                      \                          /                                 |  Feign | ---| and Discovery Centre |  |
|                                                        \                      /                                    --------      ----------------------   |
|                                                          \                  /                                                              |              |
|                                                            \              /                                                                |              |
|                                --------------------        --\----------/--                                                                |              |
|                               | Functional modules |------| Service Router |                                                               |              |
|                                --------------------        --------|-------                                                                |              |
|                                                                    |                                                                       |              |
|                                                                    |                                                                       |              |
|                                                               -----|-----                                                                  |              |
|                                                              | Interface |                                           Address caching       |              |
|                                                              | address 3 |-----------------------------------------------------------------               |
|                                                               -----|-----                                                                                 |
|                                                                    |                                                                                      |
|                                                        ------------|----------------                                                                      |
|                                                       |  Microservices Gateway -3   |                                                                     |
|                                                        -----------------------------                                                                      |
|                                                             |                |                                                                            |
|                                                             |                |                                                                            |
|                                                        -----|-----      -----|-----                                                                       |
|                                                       |Service A/3|    |Service B/3|                                                                      |
|                                                        -----------      -----------                                                                       |
|                                                                                                                                                           |
|                                                                                                                                                           |
 -----------------------------------------------------------------------------------------------------------------------------------------------------------




Yang, et al.            Expires 4 September 2025               [Page 10]

Internet-Draft              DMSC Architecture                 March 2025


                                           RPC  | REST API
                                                |
 -----------------------------------------------|---------------------------------------
|                      -|-* * *-----------------|---------------------|-                |
|                     |        Task         management      module      |               |
|                      -|---|-------------------|-----------------------                |
|                       |   |                   |                                       |
|                 ------    |                   |                                       |
|                |          |                   |                                       |
|     ---|-* * *-|-|--      |         -|-* * *--|---|-----|---                          |
|    |  Asynchronous  |     |        |  AI Model Segmentation |                         |
|    |   task queue   |     |        |     and aggregation    |                         |
|    |     module     |     |        |          module        |                         |
|     ---|-* * *-|-|--      |         -|-* * *--|---|-----|---                          |
|                |          |          |        |   |     |                             |
|                |          |  --------         |   |     |                             |
|                |          | |                 |   |      -------------                |
|               -|-* * *----|-|-                |   |                   |               |
|              | Log management |               |  -|-* * *--|-|---    -|-* * *---|-|-  |
|              |    system      |               | | Fault-tolerant |  | Model storage | |
|               ----------------                | |    mechanism   |  |     module    | |
|                                               |  ----------------    -|-* * *---|-|-  |
|                                               |                                       |
|    Control layer                              |           (Soft collaboration)        |
------------------------------------------------|---------------------------------------
                                                |
 -----------------------------------------------|---------------------------------------
|                                               |                                       |
|                                               |                                       |
|                                     -|-* * *--|-----|-|--                             |
|                                    |     Distributed     |                            |
|                                    | unified cooperation |                            |
|                                    |       module        |                            |
|                                     -|-* * *---|----|-|--                             |
|                                                |                                      |
|                                     -|-* * *---|----|-|--                             |
|                                    |   Load balancing    |                            |
|                                    |    and resource     |                            |
|                                    |allocation mechanism |                            |
|                                     -|-* * *---|----|-|--                             |
|                                                |                                      |
|                                     -|-* * *---|----|-|--                             |
|                                    |   Computing power   |                            |
|                                    |  execution  module  |                            |
|                                     -|-* * *---|----|-|--                             |
|                                                |    |                                 |
|                                                |     -------------                    |
|                                                |                  |                   |



Yang, et al.            Expires 4 September 2025               [Page 11]

Internet-Draft              DMSC Architecture                 March 2025


|                                     -|-* * *---|----|-|--        -|-* * *----|-|-     |
|                                     |         Data       |      | Fault tolerance|    |
|                                     |  management module |      |  and recovery  |    |
|                                     -|-* * *---|----|-|--       |     module     |    |
|                                                |                 ----------------     |
|                                     -|-* * *---|----|-|--                             |
|                                    |   Computing power   |                            |
|                                    |    resource pool    |                            |
|                                     -|-* * *---|----|-|--                             |
|                                                |                                      |
|  Computing layer                               |                                      |
|                                                |                                      |
 ------------------------------------------------|--------------------------------------
 ------------------------------------------------|--------------------------------------
|                                                | Packing data                         |
|                                                |                                      |
|                                       -|-* * *-|-|-                                   |
|       Data layer                     |   Database  |                                  |
|                                       -------------               (Hard isolation)    |
 ---------------------------------------------------------------------------------------

6.1.  Business layer

   The business layer is the core of the whole system and hosts the
   main business logic and microservice components.  It interacts with
   the user-side front-end presentation layer, receives requests from
   various channels, processes them according to models or business
   rules, and returns the results to the upper layer or synchronizes
   them to other microservices.  Typically, the business layer is
   deployed on a microservice container platform (such as Kubernetes),
   with a microservice gateway or API gateway and a service
   registration and discovery center maintaining communication and
   load balancing between microservices.  Internal communication can
   use RPC, REST APIs, or Feign-based remote calls.

6.1.1.  Microservices and Micromodels

   Microservices and micro-models are realized by multiple services
   (e.g., "Service 1", "Service 2", "Service 3", up to "Service n")
   that invoke each other at the business and logical layers.  Each
   service encapsulates a separate model or a functional slice of a
   model, and when these services communicate with each other via RPC,
   REST APIs, or an internal event bus, the overall effect of
   distributed micro-model coordination is formed.  Through the
   service registration and discovery center, these micro-models can
   automatically discover each other's available instances when
   needed, so as to flexibly scale and balance computing power and
   network resources in large-scale concurrent scenarios.



Yang, et al.            Expires 4 September 2025               [Page 12]

Internet-Draft              DMSC Architecture                 March 2025


6.1.2.  Microservice Gateway and API Gateway

   The microservice gateway and the API gateway provide traffic
   scheduling and a unified entry point for the business layer.  The
   microservice gateway mainly serves internal service calls; through
   load balancing, routing rules, and security policy configuration,
   it makes the communication between business modules more efficient
   and stable.

   The API gateway faces external clients or the front-end layer,
   providing users with a consistent HTTP or gRPC interface.  At the
   same time, it is responsible for authentication, rate limiting,
   circuit breaking, and monitoring, ensuring that the impact of a
   surge of external requests on internal services remains
   controllable.

6.1.3.  Service Registration and Discovery Center

   The service registration and discovery center records the network
   address, version information, and health status of all available
   microservices in the system, so that other modules or gateways in
   the business layer can promptly find the correct target instance
   when they need to call a microservice.  For example, in a "real-
   time recommendation and user behavior analysis" business, when the
   "user portrait generation" microservice needs to be called, the
   system first queries the registration and discovery center for the
   service's load status and list of available instances, and then
   selects an appropriate node according to the load balancing
   strategy.  This not only prevents single points of failure, but
   also automatically updates routing information as microservice
   instances are added or removed.

   The service registration and discovery center frees business
   function modules from manually maintaining complex service
   addresses or dependencies.  Each microservice only needs to
   actively register its own information after startup, and when an
   instance goes offline or crashes, the registry updates its state
   accordingly.  Common implementations include Eureka, Consul, and
   Zookeeper.  These registration and discovery centers can be deeply
   integrated with microservice gateways or load balancing layers to
   achieve highly available governance in distributed environments.

   Each service registers its interface address with the registry, and
   a calling service looks up the interface address of the target
   service through the registry before initiating the call.  The
   interface calls themselves are made peer to peer; although a
   registry is present, it only plays the role of controlling the
   flow.
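
   The register/discover flow described above can be summarized with a
   minimal, non-normative sketch; real deployments would rely on
   Eureka, Consul, Zookeeper, or a comparable system, and the class
   and method names below are illustrative assumptions.

      # Non-normative sketch: a minimal in-memory service registry
      # illustrating registration, deregistration, and discovery.
      class ServiceRegistry:
          def __init__(self):
              self._instances = {}   # service name -> list of addresses

          def register(self, name, address):
              self._instances.setdefault(name, []).append(address)

          def deregister(self, name, address):
              if address in self._instances.get(name, []):
                  self._instances[name].remove(address)

          def discover(self, name):
              # The caller applies its own load-balancing policy.
              return list(self._instances.get(name, []))

      # A "user portrait generation" instance registers on startup and
      # is later discovered by a calling service.
      registry = ServiceRegistry()
      registry.register("user-portrait", "10.0.0.12:8080")
      candidates = registry.discover("user-portrait")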





Yang, et al.            Expires 4 September 2025               [Page 13]

Internet-Draft              DMSC Architecture                 March 2025


6.2.  Control layer

   The control layer is mainly responsible for scheduling and managing
   the various tasks and resources in the distributed AI system,
   including task creation, allocation, and exception handling, as
   well as key processes such as model segmentation, training, and
   aggregation.  Through careful design of the control layer, multiple
   models can run in parallel with high efficiency and high
   availability, and timely rescheduling and fault tolerance can be
   applied when computing power is insufficient.

6.2.1.  Task management module

   The task management module is the "hub" of the control layer.  It
   receives different types of task requests from the business layer
   or data layer, such as model training, model inference, or batch
   data processing, and allocates tasks to nodes for execution
   according to real-time load conditions and computing power resource
   information.  The task management module usually maintains a task
   queue or task priority queue and sorts tasks by an FCFS (first
   come, first served) or FIFO (first in, first out) policy or by a
   weight-based scheduling policy.  At the same time, the module
   interfaces internally with a service registration and discovery
   center or a resource orchestration system (e.g., Kubernetes) to
   dynamically obtain key metrics such as the health, bandwidth, and
   memory usage of the available nodes (GPU/CPU).  Some advanced
   implementations also use load balancing strategies or node affinity
   algorithms to choose the best location for tasks and trigger auto-
   scaling or resource recycling when the overall cluster load reaches
   a threshold.
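
   A non-normative sketch of the node-selection step is given below;
   the metrics mirror the ones mentioned above (health, bandwidth,
   memory usage), but the exact field names and thresholds are
   illustrative assumptions.

      # Non-normative sketch: selecting an execution node from metrics
      # that a resource orchestrator might report.  Field names are
      # illustrative assumptions.
      from dataclasses import dataclass

      @dataclass
      class NodeStatus:
          name: str
          healthy: bool
          free_gpu_mem_mib: int
          bandwidth_mbps: int

      def select_node(nodes, min_mem_mib):
          candidates = [n for n in nodes if n.healthy
                        and n.free_gpu_mem_mib >= min_mem_mib]
          if not candidates:
              raise RuntimeError("no node satisfies the requirements")
          # Prefer the node with the most free accelerator memory,
          # breaking ties by available bandwidth.
          best = max(candidates,
                     key=lambda n: (n.free_gpu_mem_mib, n.bandwidth_mbps))
          return best.name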

6.2.2.  Exception task queue module

   The exception task queue module plays the role of a "fault buffer",
   capturing and storing exceptions that occur during the execution of
   tasks.  In distributed AI systems, network jitter, node failure, or
   data exceptions often cause some tasks to fail or hang for a long
   time.  The exception task queue module is designed to collect and
   isolate these abnormal tasks so that they do not block the main
   task queue or affect overall performance.  The module continuously
   monitors error logs and timeouts during the training or inference
   process.  When an exception is found, the detailed information of
   the corresponding task (e.g., task ID, exception type, and
   execution log) is transferred to a separate exception queue and
   recorded in the fault tracking system.
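
   The quarantine behaviour can be sketched in a few non-normative
   lines; the record fields follow the examples in the text (task ID,
   exception type, execution log) and are otherwise illustrative
   assumptions.

      # Non-normative sketch: quarantining a failed task so that it
      # does not block the main task queue.
      import time
      from collections import deque

      exception_queue = deque()

      def quarantine(task_id, exception_type, log_excerpt):
          exception_queue.append({
              "task_id": task_id,
              "exception_type": exception_type,
              "log": log_excerpt,
              "recorded_at": time.time(),
          })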







Yang, et al.            Expires 4 September 2025               [Page 14]

Internet-Draft              DMSC Architecture                 March 2025


6.2.3.  Log management system

   The log management module is responsible for tracking all critical
   operations and events during distributed training, inference, and
   scheduling.  This module usually uses a centralized log storage and
   analysis framework so that log data can be efficiently retrieved
   and aggregated even when the system is large.  It not only records
   the timestamps and execution results of events such as model
   segmentation, computing power allocation, and communication
   synchronization, but also collects the hardware metrics (such as
   GPU utilization, memory usage, and I/O throughput) of each node
   during execution.  When failure symptoms or performance bottlenecks
   are detected in the logs, such as slow training or frequent node
   timeouts, the log management module pushes the information to the
   exception task queue module or an alert system, helping the
   operations and development teams to diagnose and troubleshoot in
   time.  Through centralized management and visual analysis of log
   data, it also provides a reliable data basis for subsequent model
   optimization, resource budgeting, and business decision-making.

6.2.4.  Model segmentation interface

   This interface is mainly used to receive configuration information
   related to segmentation strategy or algorithm.  Through this
   interface, the caller (e.g., a task management module, a business
   layer, or a scheduling system) can specify the splitting mode (per
   layer, per service, per block, etc.) and the corresponding parameter
   restrictions for each policy, such as the range of the number of
   layers to be split, the heuristic rules of the tabu search algorithm,
   the number of shared layers for multiple tasks, and the privacy
   protection requirements.  The interface is typically provided in the
   form of a REST API, gRPC, RPC, or messaging middleware, giving the
   upstream system the flexibility to send or update policies.
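
   As a non-normative example, a segmentation-policy request passed
   through this interface could resemble the following structure;
   every field name and value is an illustrative assumption rather
   than a normative schema.

      # Non-normative sketch of a segmentation-policy request body that
      # an upstream system might send over REST, gRPC, or a message bus.
      segmentation_policy = {
          "mode": "layer",            # "layer" | "business" | "block"
          "layer_range": [2, 8],      # candidate split points
          "search": {
              "algorithm": "tabu",    # heuristic for picking split points
              "max_iterations": 200,
          },
          "shared_layers": 3,         # layers shared by multiple tasks
          "privacy": {
              "differential_privacy": True,
              "epsilon": 1.0,
          },
      }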

6.2.5.  Model segmentation module

   Model segmentation is a key innovation in distributed AI
   architectures, offering a more efficient and flexible way to allocate
   computational resources and manage workloads.  Within the control
   layer, segmentation strategies are carefully selected based on
   specific objectives, such as improving parallelism, optimizing
   resource utilization, or meeting privacy requirements.  These
   strategies are tightly integrated into the system, with each
   segmented component packaged as a modular microservice to ensure
   seamless deployment and operation in distributed environments.
   Figure 4 shows the framework of the model segmentation and
   aggregation module.



Yang, et al.            Expires 4 September 2025               [Page 15]

Internet-Draft              DMSC Architecture                 March 2025


   Layer-based segmentation divides a model according to its structural
   hierarchy, segmenting the network layer by layer.  Each resulting
   sub-model, typically consisting of one or more layers, is assigned to
   different nodes for parallel execution.  This method is particularly
   effective for deep neural networks with significant depth and
   computational complexity.  For example, in a deep convolutional
   neural network (CNN) for image classification, the initial
   convolutional layers responsible for extracting features might be
   executed on Node A, the intermediate fully connected layers on Node
   B, and the output classification layer on Node C.  To enhance
   efficiency, heuristic or tabu search algorithms can determine optimal
   segmentation points by considering factors like computational load,
   inter-node communication overhead, and overall network latency.  This
   strategy is especially valuable in real-time inference scenarios,
   such as autonomous driving, where computational throughput and low
   latency are critical for decision-making.
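
   A minimal, non-normative sketch of layer-based segmentation is
   shown below; the example network, the split points, and the mapping
   of segments to Nodes A, B, and C are illustrative assumptions.

      # Non-normative sketch: layer-based segmentation of a sequential
      # model into three sub-models.  Layer sizes and cut points are
      # illustrative only (input assumed to be 3x32x32 images).
      import torch.nn as nn

      full_model = nn.Sequential(
          nn.Conv2d(3, 16, 3), nn.ReLU(),           # feature extraction
          nn.Flatten(), nn.Linear(16 * 30 * 30, 64), nn.ReLU(),
          nn.Linear(64, 10),                        # classification head
      )

      def split_by_layers(model, cut_points):
          # Each returned nn.Sequential can be packaged and deployed as
          # an independent micro-model on its own node.
          layers = list(model)
          bounds = [0, *cut_points, len(layers)]
          return [nn.Sequential(*layers[a:b])
                  for a, b in zip(bounds, bounds[1:])]

      node_a, node_b, node_c = split_by_layers(full_model,
                                               cut_points=[2, 5])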

   Business segmentation is usually applied to multi-task learning
   scenarios, where the same "backbone" model is derived into several
   sub-models (or sub-tasks) according to business requirements, and the
   co-training or inference of multiple tasks is realized by sharing
   part of the network structure or parameters.  For example, an
   e-commerce platform may care about recommendations, ad click
   prediction, and user personas at the same time; these requirements
   can be split into different "branches" on top of the "common part"
   of the same model, which share the feature extraction layers while
   each has its own task-specific output or fine-tuning layers.

   Block-based segmentation provides maximum flexibility by dividing the
   model into smaller, independent chunks of computation that can be
   executed on separate nodes.  Unlike layer-based or business-based
   segmentation, this approach does not adhere to the structural
   hierarchy or task boundaries of the model.  Instead, it focuses on
   resource adaptability and efficient computation in heterogeneous
   environments.  For example, in a federated learning system for
   healthcare, hospitals can train local model blocks on sensitive
   patient data.  These blocks perform their computations securely on
   site, and only encrypted intermediate results are aggregated
   globally.  Similarly, in high-density cloud environments, block-
   based segmentation can dynamically allocate computational tasks to
   the available hardware.

   In addition to the above common segmentation methods, for scenarios
   that need to take into account data privacy or compliance
   requirements, privacy protection logic can also be built into the
   segmentation strategy, such as putting sensitive data related
   calculations into a separate secure node, or performing differential
   privacy processing on gradient information and then aggregating.



Yang, et al.            Expires 4 September 2025               [Page 16]

Internet-Draft              DMSC Architecture                 March 2025


   Through the multi-level and multi-angle model segmentation scheme,
   the control layer can maximize the use of distributed computing
   power, and flexibly schedule AI tasks in a multi-business and multi-
   data source environment.

 ---------------------------------------------------------------------------
|                                -----------------------------------------  |
|    --|-* * *---------|-|--    | Task requests are collected and stored  | |
|   | AI Model Segmentation |   |                       |                 | |
|   |    and aggregation    | --|      The feature algorithm extracts     | |
|   |         module        |   |           the generated features        | |
|    --|-* * *---------|-|--    |                       |                 | |
|                               |          The data matching algorithm    | |
|                               |           performs the task grouping    | |
|    -----------------------    |                       |                 | |
|   | Layer segmentation    |   |                 Model training          | |
|   | Business segmentation |---|                       |                 | |
|   | Block segmentation    |   |         Model parameter aggregation     | |
|    -----------------------     -----------------------------------------  |
 ---------------------------------------------------------------------------

6.2.6.  Model segmentation scheduling

   After model segmentation, the control layer undertakes the key task
   of scheduling the execution of the segmented sub-models.  Scheduling
   is more than just assigning tasks to nodes; it must optimize
   collaboration efficiency, minimize resource idleness, and reduce data
   bias across distributed systems.  The scheduling process requires
   careful consideration of factors such as task timing, resource
   availability, data dependency, and system load to determine the
   optimal execution order and synchronization strategy for each
   submodel.

   To manage incoming requests effectively, the scheduling algorithm
   must decide how tasks are prioritized and allocated.  For instance,
   using a First Come, First Serve (FCFS) strategy ensures that tasks
   are executed in the order they arrive.  However, this approach may
   leave some nodes underutilized if tasks vary significantly in
   complexity or resource requirements.  To address such inefficiencies,
   advanced scheduling methods like priority queues or dynamic insertion
   algorithms can be employed.  These methods prioritize tasks based on
   urgency, computational cost, or value to the system, ensuring that
   high-priority or time-sensitive tasks are assigned computational
   resources more quickly.  For example, in a real-time fraud detection
   system, high-risk transactions can be processed immediately by
   prioritizing their execution, while lower-risk transactions are
   queued for later.
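
   The difference between FCFS and priority-based ordering can be
   sketched non-normatively with a small priority queue; the class and
   field names below are illustrative assumptions.

      # Non-normative sketch: priority scheduling of segmented
      # sub-model tasks.  With the default priority this degenerates
      # to FCFS order.
      import heapq
      import itertools

      class SubModelScheduler:
          def __init__(self):
              self._heap = []
              self._seq = itertools.count()  # arrival order on ties

          def submit(self, task_id, priority=0):
              # Lower priority values are scheduled first.
              heapq.heappush(self._heap,
                             (priority, next(self._seq), task_id))

          def next_task(self):
              _, _, task_id = heapq.heappop(self._heap)
              return task_id

      scheduler = SubModelScheduler()
      scheduler.submit("low-risk-txn")
      scheduler.submit("high-risk-txn", priority=-10)  # jumps the queue
      assert scheduler.next_task() == "high-risk-txn"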




Yang, et al.            Expires 4 September 2025               [Page 17]

Internet-Draft              DMSC Architecture                 March 2025


   At the same time, in order to ensure correctness and consistency in
   the distributed environment, appropriate communication points must
   be arranged after each step of fragment computation to avoid data
   disorder or excessive delay.  For scenarios where the training or
   inference process is highly time-sensitive, exclusive GPU/CPU nodes
   can also be reserved for critical tasks at the scheduling level, or
   timing synchronization mechanisms can be enabled to ensure that all
   sub-models complete their updates and feedback within the same
   iteration cycle.

6.2.7.  Model segmentation aggregation

   Once all calculations distributed across different nodes or sub-
   models are completed, the intermediate results or parameters must be
   aggregated to produce the final output, whether it is a model
   prediction result or updated model parameters.  The aggregation
   module plays a pivotal role in consolidating these outputs into a
   unified result, ensuring consistency and accuracy in distributed AI
   workflows.

   The aggregation process typically employs strategies such as voting,
   weighted averaging, or attention mechanisms to combine the outputs of
   sub-models.  For instance, in an ensemble-based recommendation
   system, each sub-model might provide a recommendation score, and the
   aggregation module could compute a weighted average based on the
   performance or confidence of each sub-model.  Similarly, in
   distributed neural networks, attention mechanisms can be used to
   assign different importance to outputs from various nodes, enabling
   more precise aggregation based on task-specific contexts.  These
   strategies ensure that the aggregated result reflects the strengths
   and contributions of individual sub-models while maintaining overall
   coherence.
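
   A non-normative sketch of the weighted-averaging strategy is given
   below; the scores and weights are illustrative values, and in
   practice the weights would reflect each sub-model's confidence or
   validation performance.

      # Non-normative sketch: weighted averaging of sub-model outputs.
      def aggregate(scores, weights):
          if not scores or len(scores) != len(weights):
              raise ValueError("inputs must be non-empty and aligned")
          total = sum(weights)
          return sum(s * w for s, w in zip(scores, weights)) / total

      # Three recommendation sub-models contribute different weights.
      final_score = aggregate([0.82, 0.75, 0.91], [0.5, 0.2, 0.3])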

   However, aggregation in distributed systems is inherently challenging
   due to the possibility of node failures or delays.  Network jitter,
   node outages, or computation delays can prevent certain nodes from
   returning their results in time, potentially disrupting the
   aggregation process.  To address this, the control layer incorporates
   fault-tolerant mechanisms such as timeout retries, data playback, or
   redundant computation strategies.  For example, if a node fails to
   provide its result within a specified time frame, the system might
   either retry the computation on the same node or reassign the task to
   a different node.  In scenarios where redundancy is feasible,
   multiple nodes can perform the same computation, ensuring that at
   least one result is available for aggregation.






Yang, et al.            Expires 4 September 2025               [Page 18]

Internet-Draft              DMSC Architecture                 March 2025


   The aggregation module also monitors system-wide performance to
   evaluate the trade-off between computational benefits and
   coordination overhead.  By refining fault-tolerant logic and
   aggregation strategies, the control layer ensures that the advantages
   of distributed computation, such as scalability and parallelism, are
   not offset by excessive synchronization or error-handling delays.
   For example, in large-scale model training, the aggregation process
   might include gradient averaging or parameter summation across nodes,
   with mechanisms to handle delayed or missing gradients, ensuring that
   the global model converges effectively despite intermittent node
   failures.

6.3.  Computing power layer

   The computing layer is the execution core of the distributed
   artificial intelligence system, which converts the strategies and
   decisions of the control layer into actual calculations.  This layer
   processes tasks, manages resources, and executes distributed models
   across nodes, ensuring that the computational benefits of model
   segmentation are fully realized.  By integrating advanced scheduling,
   resource allocation, and fault tolerance mechanisms, the computing
   power layer ensures the efficient execution of tasks while
   maintaining the stability of the system under dynamic loads.

   The model segmentation strategy at the control layer determines how
   the sub-models or operators are distributed over the nodes.  The
   computing layer, in turn, optimizes resource allocation and execution
   to align with the segmentation design, ensuring that data
   dependencies and computational workflows are effectively managed.
   Through dynamic orchestration, parallel processing, and feedback
   mechanisms, this layer provides high performance and scalability for
   large-scale distributed AI systems.

6.3.1.  Calculation of micro-model parameters

   In the phase of micro-model parameter calculation, the computing
   layer receives the scheduling instructions from the control layer
   and obtains the aggregated model information provided by the
   distributed unified collaboration module.  The input usually
   includes the structural description of the micro-model (for
   example, network topologies such as convolutional networks, DNNs,
   or Transformers) and the corresponding data fragments or data
   blocks.  In addition, the computing layer takes into account the
   requirements of the business layer, such as inference latency,
   training accuracy, and throughput, to pre-allocate and schedule
   resources before execution.
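
   The scheduling input consumed at this stage can be thought of as a
   structured task descriptor, as in the sketch below; the field names
   are illustrative assumptions, not a normative format.

      from dataclasses import dataclass

      @dataclass
      class MicroModelTask:
          model_id: str        # identifier of the micro-model
          topology: str        # e.g. "cnn", "dnn", "transformer"
          data_block: str      # reference to the assigned data block
          max_latency_ms: int  # business-layer latency bound
          min_accuracy: float  # business-layer accuracy requirement
          throughput_qps: int  # required queries per second

      task = MicroModelTask("mm-01", "transformer", "block-17",
                            max_latency_ms=50, min_accuracy=0.92,
                            throughput_qps=200)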






Yang, et al.            Expires 4 September 2025               [Page 19]

Internet-Draft              DMSC Architecture                 March 2025


   When the micro-model and data are ready, the computing power
   execution module loads the corresponding operators onto GPUs, CPUs,
   or other hardware acceleration units according to the pre-selected
   computing framework (such as TensorFlow, PyTorch, or a self-
   developed lightweight AI inference engine), and performs parallel
   computing according to the parallelism configuration provided by
   the distributed unified collaboration module.  For larger
   convolutional layers or attention mechanisms, the system may adopt
   communication modes such as AllReduce or All-to-All to distribute
   computing tasks, and performs synchronization or gradient updates
   after each iteration completes.  For lightweight AI models, the
   computing layer gives priority to nodes with fast response times to
   meet low-latency application scenarios.

   Throughout the process, the load balancing and resource allocation
   mechanism monitors the load of each resource pool (such as
   "computing power resource pool 1", "computing power resource pool
   2", etc.) in real time and makes dynamic adjustments when a node
   hits a performance bottleneck or has idle resources, so as to
   reduce calculation waiting time and improve overall throughput.
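
   As one possible realization, the following sketch uses the PyTorch
   torch.distributed API to average gradients across workers with an
   AllReduce operation after each iteration.  The process-group
   initialization parameters are placeholders that depend on the
   actual deployment.

      import torch
      import torch.distributed as dist

      def sync_gradients(model, world_size):
          # Average each parameter's gradient across all workers
          # with an AllReduce so every node holds identical updates.
          for param in model.parameters():
              if param.grad is not None:
                  dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
                  param.grad /= world_size

      # Placeholder initialization; backend, rank, and world size
      # come from the actual deployment environment:
      # dist.init_process_group(backend="nccl", rank=rank,
      #                         world_size=world_size)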

   When the calculation is finished, the computing layer summarizes
   the execution of each micro-model, generates records including
   calculation delay, model metrics (such as loss or accuracy), and
   hardware utilization, and archives these records through the data
   management module in preparation for the next distributed computing
   power parameter update.
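
   The archived execution record could, for instance, take the
   following form; the field names are purely illustrative.

      execution_record = {
          "model_id": "mm-01",
          "iteration": 128,
          "compute_latency_ms": 37.5,
          "metrics": {"loss": 0.084, "accuracy": 0.947},
          "hardware_utilization": {"gpu": 0.81, "memory": 0.63},
      }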

6.3.2.  Distributed computing power parameter update

   In the stage of distributed computing power parameter update, the
   computing layer needs to globally merge and synchronize the
   intermediate results or model gradients calculated in the previous
   step, and then feed back the updated model parameters to the control
   layer or data layer.  The input usually includes information such as
   training gradients uploaded by each node, model weight chunks, and
   node health status.  The distributed unified collaboration module
   combines fault tolerance and recovery mechanisms to ensure that
   parameters can be smoothly aggregated even when some nodes are
   delayed or fail.













Yang, et al.            Expires 4 September 2025               [Page 20]

Internet-Draft              DMSC Architecture                 March 2025


   According to the business requirements and model scale, the
   computing layer chooses the optimal parallel communication
   strategy, such as Ring AllReduce, Tree AllReduce, or gradient
   compression followed by aggregation, to reduce network bandwidth
   consumption and accelerate the synchronization of model parameters.
   For large models using Transformer or attention structures, the
   computing layer can allocate model parameters to different resource
   pools to be updated in parallel with the help of block or pipeline
   parallelism, and then consolidate the results on the master node or
   master process.
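
   A simple example of "gradient compression followed by aggregation"
   is top-k sparsification, sketched below with NumPy; the compression
   ratio and tensor size are assumptions chosen only for illustration.

      import numpy as np

      def topk_compress(grad, ratio=0.01):
          # Keep only the largest-magnitude entries (top-k) and
          # transmit their indices and values; the rest are zero.
          k = max(1, int(grad.size * ratio))
          idx = np.argpartition(np.abs(grad), -k)[-k:]
          return idx, grad[idx]

      def topk_decompress(idx, values, size):
          full = np.zeros(size, dtype=values.dtype)
          full[idx] = values
          return full

      grad = np.random.randn(10000)
      idx, vals = topk_compress(grad)
      restored = topk_decompress(idx, vals, grad.size)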

   After the distributed parameter update is completed, the computing
   layer will send the final model weights or inference engine image
   back to the control layer to be registered in the model warehouse as
   the "latest version of the model", and may also synchronize some
   intermediate features or labels to the data layer for subsequent
   analysis.  At the same time, the fault tolerance and recovery
   module evaluates the stability and performance of each node
   according to the monitoring data collected during the training and
   update process, and provides a decision basis for the next
   iteration cycle or new task scheduling.

6.3.3.  Distributed unified collaboration module

   The distributed unified collaboration module is located at the core
   of the entire computing power layer.  It is responsible for
   receiving and integrating task instructions (such as the model
   segmentation strategy and training or inference goals) from the
   control layer, and for interfacing with the underlying computing
   power resource pools.  Its inputs include information about the
   architecture of the individual micro-models or aggregated models,
   the type of computation to be performed (training or inference),
   and an overview of the hardware available in the current cluster.
   The output is a global orchestration instruction for computing
   resources and computing processes, which guides the computing power
   execution module and other functional modules to work together.  A
   distributed unified collaboration module will typically work with a
   service registry or cluster orchestration system (e.g., Kubernetes,
   YARN), or may have a built-in distributed communication framework
   (e.g., NCCL, Horovod) to manage and synchronize multiple GPUs or
   multiple nodes.  Its most prominent feature is that it can
   dynamically map different sub-models or operators to the most
   appropriate node according to the model block information and
   computing requirements, so that distributed computing can maintain
   high throughput and scalability in a multi-task, multi-model
   environment.






Yang, et al.            Expires 4 September 2025               [Page 21]

Internet-Draft              DMSC Architecture                 March 2025


6.3.4.  Load balancing and resource allocation mechanism

   The load balancing and resource allocation mechanism monitors the
   load of each computing resource pool (such as GPU cluster, CPU
   cluster, heterogeneous accelerator, etc.) in real time, and combines
   the task scheduling strategy given by the distributed unified
   collaboration module to decide how to distribute the computing load
   between nodes.  The input mainly includes the real-time status
   information of each node (idle level, free memory, computing power
   utilization) and the description of the hardware requirements of
   the task to be assigned (e.g., how many GPUs are needed, and
   whether mixed-precision training is supported).  The output is the
   specific node allocation scheme and task routing instructions,
   which guide the computing power execution module to deliver
   computing tasks to the optimal location.
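
   A greatly simplified version of such an allocation decision is
   sketched below.  The node status fields and task requirement fields
   mirror the description above but are otherwise hypothetical.

      def allocate(task, nodes):
          # task:  {"gpus": 2, "mixed_precision": True}
          # nodes: {node_id: {"idle": ..., "free_gpus": ...,
          #                   "fp16": ...}}
          candidates = [
              nid for nid, n in nodes.items()
              if n["free_gpus"] >= task["gpus"]
              and (not task["mixed_precision"] or n["fp16"])
          ]
          if not candidates:
              return None  # defer or queue the task
          # Route to the most idle qualifying node.
          return max(candidates, key=lambda nid: nodes[nid]["idle"])

      nodes = {"n1": {"idle": 0.2, "free_gpus": 1, "fp16": True},
               "n2": {"idle": 0.7, "free_gpus": 4, "fp16": True}}
      # Expected to route the task to "n2".
      print(allocate({"gpus": 2, "mixed_precision": True}, nodes))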

6.3.5.  Computing power execution module

   According to the instructions from the distributed unified
   collaboration module and the load balancing module, the computing
   power execution module loads the specific micro-model or operator
   onto the corresponding node to run.  The inputs include model
   parameters, network topology, and data blocks, and the outputs are
   computed inference results or intermediate training gradients.  The
   module can run on multiple servers through containerization (e.g.,
   Docker, Kubernetes Pods), and can be combined with AI frameworks
   (TensorFlow, PyTorch, etc.) or self-developed inference engines to
   flexibly switch the execution environment and underlying computing
   power.

6.3.6.  Data management module

   The data management module exchanges the necessary features, tags,
   and metadata with the control layer and the business layer.  Its
   input sources usually include already chunked or segmented data
   sets, as well as intermediate results generated during model
   execution (e.g., local gradients and temporary features).  Its
   outputs are updated snapshots of model parameters or preprocessed
   feature data for later use.  The data management module can support
   highly concurrent reads and writes with the help of distributed
   file systems (e.g., HDFS), object storage (e.g., S3), or message
   queues (e.g., Kafka, RabbitMQ).  It also serves small-scale, high-
   frequency data queries with database or cache systems.









Yang, et al.            Expires 4 September 2025               [Page 22]

Internet-Draft              DMSC Architecture                 March 2025


6.3.7.  Fault tolerance and recovery module

   The fault tolerance and recovery module continuously monitors the
   heartbeat, load and network status of each node while the system is
   running.  Once an anomaly is detected, the fault information is
   reported to the distributed unified collaboration module, and the
   automatic fault tolerance logic is triggered.  The inputs are real-
   time cluster health data, task execution logs, and node failure
   reports.  The output is a series of decision instructions,
   including restarting tasks, reallocating resources, or rolling back
   to the last stable snapshot.  This often includes self-healing
   through automation scripts (Ansible, SaltStack, etc.) or cluster
   orchestration (Kubernetes), or a checkpoint-and-resume training
   process that records the current iteration number and intermediate
   parameters when a crash occurs and waits for the node to recover
   before continuing execution.
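
   The checkpoint-and-resume behavior described above can be as simple
   as persisting the current iteration number and intermediate
   parameters at a fixed interval and reloading them when the node
   recovers.  A minimal sketch, assuming a local checkpoint file, is
   shown below.

      import json
      import os

      CKPT = "checkpoint.json"

      def save_checkpoint(iteration, params):
          # Persist the iteration number and intermediate parameters
          # so a restarted node can resume instead of starting over.
          with open(CKPT, "w") as f:
              json.dump({"iteration": iteration, "params": params}, f)

      def load_checkpoint():
          if not os.path.exists(CKPT):
              return 0, [0.0, 0.0]  # fresh start
          with open(CKPT) as f:
              state = json.load(f)
          return state["iteration"], state["params"]

      start, params = load_checkpoint()
      for it in range(start, 1000):
          # ... one training iteration that updates params ...
          if it % 100 == 0:
              save_checkpoint(it, params)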

6.3.8.  Computing resource pool

   A pool of computing resources represents a collection of underlying
   hardware that actually provides computing power.  Each pool may
   correspond to different types or specifications of hardware, such as
   GPU server farms, CPU clusters, FPGA/ASIC accelerator cards, or even
   hybrid computing power across cloud or local data centers.  Their
   inputs are usually task assignments and model execution
   requirements from the load balancing and resource allocation
   mechanism, and their outputs are the computed inference or training
   results.  Relevant performance indicators (such as temperature,
   power consumption, and throughput) are fed back to upper modules
   for analysis.

6.4.  Data layer

   The data layer is the backbone of distributed AI systems, enabling
   efficient data management while ensuring privacy protection,
   scalability, and seamless integration with other layers, including
   control, computing, and business layers.  It plays a pivotal role in
   storing, transmitting, and processing diverse datasets, supporting
   distributed training, inference, and model segmentation workflows.
   Through its robust design, the data layer balances security and
   performance while maintaining the flexibility required by dynamic,
   large-scale AI systems.










Yang, et al.            Expires 4 September 2025               [Page 23]

Internet-Draft              DMSC Architecture                 March 2025


6.4.1.  Privacy protection

   Privacy protection is at the core of the data layer, ensuring secure
   data handling across the entire AI workflow.  Multiple databases
   (e.g., DB1, DB2, ..., DBn) store datasets from various business
   domains or sensitivity levels, enabling the system to manage and
   segregate data efficiently.  For high-sensitivity scenarios, such as
   healthcare or financial applications, only encrypted or desensitized
   data fields are stored and transmitted.  For instance, patient
   medical records might be encrypted locally, and only aggregated
   gradients or anonymized insights are shared during federated learning
   tasks.

   When the system executes model training or inference, the control
   layer determines the appropriate data transmission strategy based on
   predefined privacy policies.  Federated learning ensures that raw
   data remains localized, sharing only intermediate model gradients or
   parameters, while differential privacy adds noise to data or
   computations to prevent individual information leakage.

   To further strengthen security, the data layer integrates advanced
   privacy-preserving technologies, such as homomorphic encryption,
   multi-party secure computation, and differential privacy injection.
   These techniques enable micro-models and segmented workflows to
   process data securely while complying with privacy regulations.  For
   instance, in a cross-database integration scenario, the data layer
   ensures that access control policies and metadata updates prevent
   unauthorized sharing of sensitive data, maintaining compliance
   without hindering system performance.
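
   As an illustration of differential privacy injection on shared
   gradients, the following sketch clips each gradient and adds
   Gaussian noise before it leaves the node.  The clipping norm and
   noise multiplier are hypothetical tuning parameters, not values
   mandated by this document.

      import numpy as np

      def privatize_gradient(grad, clip_norm=1.0, noise_mult=1.1):
          # Clip the gradient to bound one node's contribution, then
          # add Gaussian noise scaled to the clipping norm before the
          # gradient leaves the node.
          grad = np.asarray(grad, dtype=float)
          norm = np.linalg.norm(grad)
          if norm > clip_norm:
              grad = grad * (clip_norm / norm)
          noise = np.random.normal(0.0, noise_mult * clip_norm,
                                   size=grad.shape)
          return grad + noise

      shared = privatize_gradient([0.8, -0.3, 1.7])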

6.4.2.  Database maintenance and update

   The data layer's database infrastructure ensures reliable storage,
   high availability, and scalability, supporting the execution of
   micro-models and model segmentation workflows.  Distributed databases
   are deployed to manage datasets associated with various system
   segments, enabling parallel operations and efficient data
   provisioning for training and inference tasks.

   To handle high-concurrency environments, the data layer leverages
   distributed database architectures such as NoSQL, NewSQL, and
   relational databases, each selected based on the nature of the
   workload:

   NoSQL databases (e.g., HBase, Cassandra) are ideal for handling
   unstructured or semi-structured data, such as logs and user behavior
   data, offering high write throughput and horizontal scalability.




Yang, et al.            Expires 4 September 2025               [Page 24]

Internet-Draft              DMSC Architecture                 March 2025


   NewSQL systems (e.g., TiDB) provide a hybrid solution, balancing
   transactional consistency with scalability, making them suitable for
   workloads requiring real-time updates, such as model parameter
   synchronization.

   Relational databases (e.g., MySQL, PostgreSQL) handle structured
   datasets, such as model version histories or feature engineering
   outputs, ensuring strong consistency and query efficiency.

   The data layer ensures data consistency and fault tolerance through
   mechanisms such as master-slave replication, shard-based
   architectures, and automated failover.  For example, if a database
   shard responsible for storing training gradients becomes unavailable,
   the system redirects queries to backup replicas or initiates a
   failover process to restore service.  Regular incremental backups and
   disaster recovery protocols safeguard critical data against long-term
   loss due to network or hardware failures.

   Real-time monitoring tools, such as Prometheus and ELK Stack, track
   database performance metrics, including query latency,
   synchronization delays, and disk usage.  If anomalies are detected,
   automated alerts trigger recovery actions such as reallocating
   workloads, rerouting queries, or scaling database resources to
   prevent bottlenecks.  For instance, during a high-demand scenario
   like a shopping festival, the data layer may dynamically scale up
   storage resources to accommodate surging user activity logs, ensuring
   uninterrupted data availability for recommendation models.

7.  IANA Considerations

   TBD

8.  Acknowledgement

   TBD

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.




Yang, et al.            Expires 4 September 2025               [Page 25]

Internet-Draft              DMSC Architecture                 March 2025



Authors' Addresses

   Hui Yang
   Beijing University of Posts and Telecommunications
   10 Xitucheng Road, Haidian District
   Beijing
   Beijing, 100876
   China
   Email: yanghui@bupt.edu.cn


   Tiankuo Yu
   Beijing University of Posts and Telecommunications
   10 Xitucheng Road, Haidian District
   Beijing
   Beijing, 100876
   China
   Email: yutiankuo@bupt.edu.cn


   Qiuyan Yao
   Beijing University of Posts and Telecommunications
   10 Xitucheng Road, Haidian District
   Beijing
   Beijing, 100876
   China
   Email: yqy89716@bupt.edu.cn


   Zepeng Zhang
   Beijing University of Posts and Telecommunications
   10 Xitucheng Road, Haidian District
   Beijing
   Beijing, 100876
   China
   Email: 2024140574@bupt.cn












Yang, et al.            Expires 4 September 2025               [Page 26]