DeepSeek Open Source Week: A carnival of technology inclusion, the magic of improving the efficiency of large models
Key takeaway
DeepSeek Open Source Week: A Carnival of Technology Inclusion, The Magic of Improving Large Model Efficiency highlights the groundbreaking impact of DeepSeek’s 2025 open-source initiative. Launched in late February 2025, DeepSeek unveiled transformative innovations like FlashMLA, DeepEP, DeepGEMM, DualPipe, EPLB, and 3FS during its Open Source Week, significantly reducing generative AI costs and enhancing efficiency. Building on the Transformer model, DeepSeek optimizes matrix operations, slashes computing demands, and enables deployment on modest hardware like the Nvidia H800 GPU. Compared to ChatGPT’s resource-heavy approach, DeepSeek offers superior cost-effectiveness and openness, democratizing AI access. While speculation about computers gaining emotions persists, the article clarifies that DeepSeek’s advancements are rooted in advanced algorithms, not sentience. This carnival of technology inclusion promises to boost societal productivity and reshape the AI landscape.
DeepSeek Open Source Week: A Carnival of Technology Inclusion
At the dawn of 2025, the release of DeepSeek has sent shockwaves throughout the tech world. This isn’t just another generative AI product—it’s a disruptive force that has drastically lowered the R&D and operational costs of AI models. Suddenly, generative AI isn’t reserved for a select few tech giants; it’s within reach for households, enterprises, and even government bodies looking to turbocharge efficiency. In short, DeepSeek is not just improving work efficiency—it’s reshaping the landscape of AI innovation.
The Hype: From ChatGPT to DeepSeek
In 2022, OpenAI’s ChatGPT was the poster child of generative AI, touted as a revolutionary tool for boosting productivity. Fast forward to 2025, and DeepSeek—a domestic innovation—claims superiority in cost, capability, and openness. Some even go so far as to suggest that DeepSeek might imbue computers with “emotions” or personality, foreshadowing a future where machines could dominate or even replace human roles.
But let’s get one thing straight: whether it’s DeepSeek or any other generative AI, at their core, these systems are sophisticated algorithms converting human language into vectors—essentially strings of numbers—then transforming those back into output through matrix multiplications. No magic, just complex math.
Breaking Down the Innovations
DeepSeek’s Open Source Week, held in the last week of February 2025, unveiled a suite of innovations aimed squarely at addressing the bottlenecks in large model training and inference. Here’s a detailed look at what they brought to the table:
1. FlashMLA: The Language Parsing Accelerator
Generative AI begins with language. The Transformer model, the backbone of today’s AI, uses the “Attention” mechanism—encoding words into 512-number vectors, then processing them through Q, K, and V matrices. As text length increases, the computation grows quadratically, and GPU memory strains under the weight of storing these matrices.
FlashMLA tackles this head-on by compressing the K and V matrices. By pruning insignificant numbers and optimizing data for Nvidia’s H800 GPUs, FlashMLA cuts down on memory use and data transfer delays. With its help, a single H800 card’s FP8 performance has surged from 300 TFLOPS to an astounding 580 TFLOPS, pushing GPU memory bandwidth usage to 90% of the theoretical limit.
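To make the attention bottleneck and the compression idea concrete, here is a toy numpy sketch of my own, not DeepSeek's kernel (the real FlashMLA is hand-tuned CUDA for Hopper GPUs, and production MLA compression is far more sophisticated than simple pruning): plain scaled dot-product attention, plus a naive pass that zeroes near-zero entries of K and V.

```python
import numpy as np

def attention(Q, K, V):
    """Plain scaled dot-product attention (no batching or heads)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, n): quadratic in sequence length
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V

def prune_small(M, threshold=0.1):
    """Toy 'compression': zero out entries whose magnitude is below the threshold."""
    return np.where(np.abs(M) < threshold, 0.0, M)

rng = np.random.default_rng(0)
n, d = 128, 512                                      # 128 tokens, 512-number vectors as above
Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d))
out_full = attention(Q, K, V)
out_pruned = attention(Q, prune_small(K), prune_small(V))
print("max output deviation after pruning:", np.abs(out_full - out_pruned).max())
```

Even this crude pruning barely moves the output while making K and V far easier to store and move; the memory-versus-accuracy trade it exploits is the same one FlashMLA pushes to the limit.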
2. DeepGEMM: The Cornerstone of Efficiency
Matrix multiplication is the lifeblood of AI—every Transformer model leans on it. Traditional GEMM operations, implemented via Nvidia’s cuBLAS or NVBLAS, work well until you hit the performance ceiling of 8-bit computations.
DeepGEMM revolutionizes this process by leveraging Nvidia Hopper’s tensor cores. It temporarily stores intermediate values as 32-bit floats to preserve accuracy while maintaining speeds akin to 8-bit operations. In bypassing standard libraries and directly writing machine instructions, DeepGEMM challenges Nvidia’s proprietary CUDA ecosystem. Think of it as the evolution from steam engines to internal combustion engines, now propelling us into an era of universally accessible AI.
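The numeric trick is easy to illustrate. The sketch below is my own approximation of the idea, using int8 in place of FP8 (numpy has no FP8 type) and a 32-bit accumulator standing in for DeepGEMM's FP32 accumulation; the real DeepGEMM is hand-written for Hopper tensor cores, so treat this as a model of the numerics only.

```python
import numpy as np

def quantize_8bit(M):
    """Symmetric per-matrix quantization to int8, returning values plus a scale."""
    scale = np.abs(M).max() / 127.0
    return np.round(M / scale).astype(np.int8), scale

def gemm_8bit_32bit_accum(A, B):
    qA, sA = quantize_8bit(A)
    qB, sB = quantize_8bit(B)
    # inputs are 8-bit, but products are summed in a 32-bit accumulator,
    # mimicking DeepGEMM's high-precision intermediate storage
    acc = qA.astype(np.int32) @ qB.astype(np.int32)
    return acc.astype(np.float32) * (sA * sB)

rng = np.random.default_rng(1)
A, B = rng.normal(size=(64, 512)), rng.normal(size=(512, 64))
exact = A @ B
approx = gemm_8bit_32bit_accum(A, B)
print("relative error:", np.abs(exact - approx).max() / np.abs(exact).max())
```

The inputs carry only 8 bits of information, yet because the summation itself never rounds to 8 bits, the result stays close to the full-precision product.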
3. EPLB and DualPipe: Efficiency Masters
Remember how Taylor’s scientific management and Ford’s production lines redefined industrial efficiency? DeepSeek channels that spirit with EPLB and DualPipe.
- EPLB organizes expert models (MoE) on the same GPU to reduce inter-card communication. When some experts are in high demand, additional copies are deployed to handle the load, much like having more doctors on call in a busy hospital department (a minimal sketch of this balancing idea follows this list).
- DualPipe minimizes pipeline “bubbles” during training. By interlacing forward computation with backward verification, it ensures that while one part of the process waits, another is hard at work. This clever scheduling keeps GPUs consistently productive, echoing the efficiency of a finely tuned assembly line.
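As promised above, here is a minimal sketch of the load-balancing idea behind EPLB. This is my own greedy toy, not DeepSeek's actual algorithm: given how often each expert is called, hand the hottest experts extra replicas until the available GPU slots run out.

```python
import heapq

def replicate_hot_experts(loads, total_slots):
    """loads[i] = requests hitting expert i; total_slots = GPU slots available.
    Greedily gives an extra replica to whichever expert currently carries the
    highest load per replica. Returns the replica count for each expert."""
    replicas = [1] * len(loads)
    heap = [(-load, i) for i, load in enumerate(loads)]   # max-heap via negation
    heapq.heapify(heap)
    for _ in range(total_slots - len(loads)):             # hand out the spare slots
        _, i = heapq.heappop(heap)                        # most overloaded expert
        replicas[i] += 1
        heapq.heappush(heap, (-loads[i] / replicas[i], i))
    return replicas

# Four experts, one far busier than the rest, and 8 GPU slots to fill.
print(replicate_hot_experts([900, 100, 80, 60], total_slots=8))  # -> [5, 1, 1, 1]
```

The busy expert absorbs most of the extra slots, evening out the per-replica load: exactly the "more doctors for the busy department" intuition.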
4. DeepEP: Shattering Communication Barriers
DeepEP reimagines the traditional AllReduce operation (a critical component for aggregating distributed computations). On H800 GPUs, limited by communication bandwidth, DeepEP reassigns tasks dynamically within GPU cores to mitigate data transfer delays. By acting as a bridge across NVLink and RDMA networks, it shatters the barriers imposed by legacy communication libraries like NCCL—essentially democratizing high-performance computing.
5. 3FS: The Parallel File System Renaissance
In the era of large-scale AI, storage isn’t just an afterthought—it’s a critical pillar. 3FS (Fire-Flyer File System), DeepSeek’s answer to the storage conundrum, leverages RDMA networks and SSDs to create a high-throughput, scalable distributed file system. Imagine a system that can reach 6.6 TiB/s on a cluster of 180 nodes—this is 3FS, pushing storage performance to over 80% of the theoretical network throughput.
Here’s a quick summary table of the core innovations:
| Innovation | Key Function | Impact |
|---|---|---|
| FlashMLA | Compresses K/V matrices for language parsing | Boosts GPU FP8 performance, reduces memory usage |
| DeepGEMM | Optimizes matrix multiplication | Challenges CUDA, enhances computation speed |
| EPLB | Reduces inter-GPU communication in MoE models | Enhances efficiency of expert model collaboration |
| DualPipe | Minimizes pipeline bubbles during training | Increases overall GPU utilization |
| DeepEP | Customizes AllReduce for limited-bandwidth GPUs | Breaks communication barriers |
| 3FS | High-performance parallel file system | Maximizes storage throughput |
Why is DeepSeek the Only Game-Changer?
The answer lies in its relentless pursuit of efficiency. DeepSeek runs on Nvidia H800 GPUs, which have limited communication bandwidth and a relatively modest count (2048 units), compared to industry giants. By meticulously optimizing every layer—from the Transformer model down to the hardware-level operations—DeepSeek has managed to scale performance to levels comparable to setups with 10,000 GPU cards. This isn’t about having infinite resources; it’s about smarter resource allocation.
DeepSeek’s open-source ethos further amplifies its impact. By releasing these innovations to the public, DeepSeek not only challenges the status quo (and Nvidia’s “moat” around CUDA) but also democratizes generative AI. This is not just a technological feat—it’s a social statement, echoing the timeless sentiment: true progress is measured by how much it uplifts society as a whole.
Final Thoughts: A Catalyst for Universal AI
DeepSeek Open Source Week is more than a tech release—it’s a carnival of technological inclusion. By slicing through the inefficiencies of large model training and inference, DeepSeek is making generative AI accessible to a wider audience. And while some naysayers may dismiss this as hype, the numbers and innovations speak for themselves.
If history has taught us anything, it’s that groundbreaking innovation isn’t about guarding secrets—it’s about unleashing potential for the common good. DeepSeek’s bold step to open-source its core technologies is set to redefine the future of AI, driving not just industrial productivity but also societal progress.
As we stand on the brink of a new era in AI, one thing is clear: the future belongs to those who innovate relentlessly and share their breakthroughs. DeepSeek is not merely a tool; it’s a movement—one that promises to turn the tide in favor of efficiency, inclusion, and transformative progress.
DeepSeek Open Source Week: A carnival of technology inclusion, the magic of improving the efficiency of large models
At the beginning of 2025, the release of DeepSeek caused a sensation across society. This is because DeepSeek, through a series of technological innovations, has greatly reduced the development and usage costs of generative AI, making it possible for generative AI to enter thousands of households in the near future and help society as a whole improve its work efficiency.
In 2022, ChatGPT, developed by OpenAI, was considered a revolutionary generative AI tool that could help users improve work efficiency. In 2025, DeepSeek, a domestically developed generative AI, is considered better than ChatGPT in cost, capability, and openness, and has affected financial markets to a certain extent. Some have even suggested that DeepSeek can give computers emotions and other elements of personality, and concluded that computers will soon dominate or even replace humans.
In fact, readers with a basic knowledge of computers and mathematics can easily understand that for DeepSeek and every other generative AI, "understanding" and "generating" human language really just means converting it into strings of numbers (called "vectors" in computer science) through one algorithm and then converting those numbers into output through a series of further algorithms. In this process, the computer has none of the "emotions" or "personality" unique to humans; it is merely solving a somewhat complex mathematical problem.
Obviously, converting human language into "vectors" and then generating output requires very complex algorithms, and it is unrealistic for humans to write them entirely by hand. Engineers therefore reduce these algorithms to a series of matrix multiplications (the so-called "models" and "parameters") and let the computer search for the values of these matrices by brute force (that is, "tuning parameters", or "training"), finally obtaining the published "model". Once the model exists, the computer produces output from the model and the user's input, which is the so-called "inference".
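A deliberately tiny Python example (my own illustration; real large models differ in scale and architecture, not in kind) makes these three terms concrete: the "model" is a matrix W, "training" nudges its values to fit examples, and "inference" is one more matrix multiplication on new input.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 8))          # 100 inputs, already encoded as vectors
W_true = rng.normal(size=(8, 4))
Y = X @ W_true                         # the outputs we want the model to reproduce

W = np.zeros((8, 4))                   # the "model": a single matrix of parameters
for step in range(300):                # "training": nudge W to shrink the error
    error = X @ W - Y
    W -= 0.1 * X.T @ error / len(X)    # gradient step on the mean squared error

new_input = rng.normal(size=(1, 8))
prediction = new_input @ W             # "inference": just one more matrix multiply
print("largest parameter error:", np.abs(W - W_true).max())
```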
At present, essentially all generative AIs are based on the Transformer model introduced in Google's 2017 paper "Attention Is All You Need". The Transformer has incubated a series of generative AIs represented by ChatGPT, but the massive computing resources its training and inference require are difficult for individual users and ordinary corporate users to afford. Even large government and enterprise users with strong financial resources may, given a complex external environment, have difficulty purchasing the hardware for high-performance clusters. This has become a major obstacle to using generative AI to improve productivity across society.
The emergence of DeepSeek changed all that. In the past week, DeepSeek publicly released a series of very valuable innovations for developers. So what key breakthroughs did it achieve? To interpret the contributions of DeepSeek Open Source Week systematically, I will try to present the value of these innovations, and their impact on the industry, in language as plain as possible.
1. DeepSeek Open Source Week: what exactly was open-sourced?
DeepSeek chose the last week of February 2025 as its "Open Source Week" and released FlashMLA (an optimized sentence-parsing algorithm), DeepEP (an optimized multi-machine collaboration mechanism), DeepGEMM (more efficient matrix multiplication), DualPipe (a means of squeezing out computing resources), EPLB (load balancing across the expert models that generate content for different fields), and 3FS (high-performance storage). DeepSeek also disclosed some analytical data from its research and development process.
As I mentioned at the beginning of this article, DeepSeek, ChatGPT, and every other generative AI essentially have the computer perform a series of matrix operations. To improve the execution efficiency of a generative AI algorithm, then, one should start from three directions: reduce the size of the matrices, improve the efficiency of each calculation, and reduce waiting time. The core technologies DeepSeek announced in this exciting week all focus on these three directions.
1. FlashMLA: Language parsing accelerator
The input to generative AI is generally natural human language. In the Transformer model, the mechanism for encoding and analyzing natural language is the so-called "Attention" mechanism: first, each word is encoded into a "vector" of 512 numbers; then three matrices, Q, K, and V, are used to analyze the association between each word and every other word in the text. Obviously, as the input grows longer, the total amount of computation grows quadratically, and storing the K and V matrices for every word in the sentence also consumes precious GPU memory.
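A back-of-the-envelope count shows how quickly this blows up. Counting only the Q·Kᵀ multiply-adds for 512-number vectors, and ignoring layers and attention heads:

```python
for n_tokens in (1_000, 2_000, 4_000):
    ops = n_tokens * n_tokens * 512        # every token is compared with every token
    kv_floats = 2 * n_tokens * 512         # K and V rows that must stay in GPU memory
    print(f"{n_tokens} tokens: {ops:.2e} multiply-adds, {kv_floats:,} cached K/V numbers")
```

Doubling the input from 2,000 to 4,000 tokens quadruples the multiply-adds; the K/V cache grows only linearly, but at real model sizes it still devours GPU memory.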
FlashMLA's solution is to compress the K and V matrices, for example by removing zeros and relatively small numbers, to save memory and reduce the workload. FlashMLA is also optimized specifically for Nvidia's H800 GPU: given the limited communication bandwidth between H800 cards, it reduces reads and writes of data held on other cards, so that computing performance is not capped by inter-card bandwidth.
With the support of FlashMLA, the FP8 computing performance of a single H800 card rises from 300 TFLOPS to 580 TFLOPS, and memory bandwidth is squeezed to 90% of the theoretical limit. So how can computing performance be pushed further?
2. DeepGEMM: The cornerstone of AI for a new era
Almost all AI computation depends on matrix multiplication. Because matrix multiplication decomposes into many groups of repeated calculations with no interdependence, engineers defined the GEMM (GEneral Matrix Multiply) operator, and Nvidia implemented parallel versions of it for its own GPUs in the cuBLAS and NVBLAS math libraries. GEMM is the cornerstone of all AI algorithms, the Transformer included; its importance is comparable to that of power machinery in a mechanized, industrialized system.
DeepSeek has made revolutionary optimizations to GEMM. The Tensor Cores (dedicated matrix-operation circuits) inside Nvidia's Hopper-series GPUs support 8-bit floating-point calculation, but with lower accuracy than 16-bit or 32-bit floats. DeepGEMM therefore stores intermediate results as 32-bit floating-point numbers to preserve accuracy, while keeping the calculation speed almost the same as pure 8-bit.
It is worth noting that achieving this requires bypassing all existing development libraries and writing machine instructions directly. Open-sourcing it thus poses a direct challenge to Nvidia's "moat", the CUDA ecosystem. One can even argue that DeepGEMM, like the progression from steam engine to internal combustion engine to electric motor that drove the industrial revolutions, will drive generative AI technology into an era of universal benefit, becoming the cornerstone of a new AI era.
3. EPLB and DualPipe: Efficiency Masters Driving the Industrial Revolution
When analyzing the rise of the United States, social scientists often mention Taylor's scientific management and the Ford production line, both of which appeared there in the early 20th century. In industrial production, Taylorism lets each worker give full play to his or her expertise, while the Ford production line keeps workers from wasting time waiting. EPLB and DualPipe play similar roles within DeepSeek.
One of the core technologies of generative AI is the so-called "expert model". It works by feeding the computer's representation of the natural-language input into the matrices that describe an expert model and obtaining the generated answer after a series of matrix multiplications. To make expert models perform well on an H800 cluster with limited communication capability, DeepSeek adopts the MoE (Mixture of Experts) architecture, in which many small expert models, each focused on a specific field, generate the content together. This is like a hospital with different departments: once the patient's problem is identified, the most suitable experts handle the diagnosis and treatment.
In a hospital, departments divide the work and cooperate, and how busy each department is can vary greatly. If closely cooperating departments are placed on the same floor, and the busy departments hire more doctors, patients' waiting times drop. EPLB is designed on exactly this idea: it places expert models that interact frequently on the same GPU to reduce inter-card communication, and when some experts are called far more often than others, it creates additional copies of them to handle the concurrent load.
DualPipe borrows the improvement ideas of the Ford production line to minimize the waiting time at each stage of the pipeline when training these expert models (the so-called "pipeline bubbles"). The idea is to interleave two kinds of tasks: while one computing task is waiting for a communication task to finish, the computer works on something else. Specifically, the equation-solving stage of training (the "forward" computation) and the verification-and-feedback stage (the "backward" computation) share one pipeline. While the forward computation waits for its communication to complete, the GPU performs backward-stage calculations, and vice versa.
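The scheduling idea fits in a few lines. This is a toy timeline of my own, not DeepSeek's scheduler: two streams of micro-batch work, forward and backward, are interleaved so that whenever one stream is waiting on communication, the other keeps the GPU busy.

```python
def interleave(forward, backward):
    """Toy DualPipe-style schedule: alternate between the two work streams so
    the GPU always has something to run while the other stream communicates."""
    timeline, streams = [], [list(forward), list(backward)]
    turn = 0
    while any(streams):
        if streams[turn]:
            timeline.append(streams[turn].pop(0))
        turn = 1 - turn                 # switch streams: overlap compute with comm
    return timeline

print(interleave([f"F{i}" for i in range(4)], [f"B{i}" for i in range(4)]))
# ['F0', 'B0', 'F1', 'B1', 'F2', 'B2', 'F3', 'B3'] -- while F_i's results are in
# flight over the network, B_i keeps the GPU busy, and vice versa.
```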
By introducing EPLB and DualPipe into the engineering of large-model training and inference, DeepSeek has contributed to the industry much as the Taylor system and the Ford production line contributed to industrial production: it further liberates and develops productivity.
4. DeepEP: A breaker of barriers
Readers who have seen the film "Out of the Blue" will remember the storyline of "building the atomic bomb with abacuses". Constrained by the lack of large computers, China's scientists and engineers broke complex nuclear-physics simulations down into distributed parallel computing tasks, mobilized thousands of workers, and completed multiple rounds of simulation with abacuses, ultimately supporting the successful development of the atomic bomb.
In those rounds of simulation, one important task was collecting and summarizing everyone's results as input for the next round. In AI model training, this task is called "AllReduce". Before DeepSeek open-sourced DeepEP, it had to rely on NCCL (Nvidia Collective Communications Library), developed by Nvidia.
DeepEP is in effect a deep customization and optimization of traditional AllReduce. First, because DeepSeek's GPUs are bandwidth-limited H800s, DeepEP tries to limit the consumption of inter-card communication resources: some GPUs act as relay nodes that merge partial results and then transmit the merged result to the other GPUs, avoiding unnecessary communication overhead.
Second, because a GPU that switches from equation solving to an AllReduce task must reload instructions and data into its caches, DeepEP adds a mechanism that dedicates some of the GPU's processing cores (SMs, Streaming Multiprocessors) to this task and dynamically adjusts the number of cores assigned to AllReduce.
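A hedged numpy sketch of the relay idea (my own simplification, nothing like DeepEP's real kernels): rather than every GPU exchanging data with every other GPU, one GPU per server first merges its server's partial sums over the fast intra-server link, and only the merged results cross the slower inter-server link.

```python
import numpy as np

def hierarchical_allreduce(partials_per_server):
    """partials_per_server: one list per server, each holding per-GPU arrays."""
    # Step 1: intra-server merge on a relay GPU (cheap, NVLink-class bandwidth)
    relay_sums = [np.sum(gpus, axis=0) for gpus in partials_per_server]
    # Step 2: only the merged results cross the inter-server RDMA link
    total = np.sum(relay_sums, axis=0)
    # Step 3: the total is broadcast back; every GPU ends up with the same sum
    return total

# 2 servers x 8 GPUs, each GPU holding a small partial result
servers = [[np.full(4, gpu + 10 * srv) for gpu in range(8)] for srv in range(2)]
print(hierarchical_allreduce(servers))   # the sum over all 16 GPUs' partials
```

With 2 servers of 8 GPUs each, only 2 merged messages cross the inter-server network instead of 16, which is precisely the bandwidth an H800 cluster needs to conserve.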
DeepEP thus breaks three barriers: the barrier between the NVLink network (inside a server) and the RDMA network (between servers) in GPU communication; the barrier of SM division of labor within a GPU; and, most importantly, the technical barrier of Nvidia's proprietary collective communication library, which DeepEP bypasses by programming the hardware directly.
In this sense, DeepEP is a breaker of barriers, just like the scientists and engineers who once built an atomic bomb with abacuses in the Gobi Desert.
5. 3FS: "Fearing overflow, think of the rivers and seas below a hundred streams"
Computing, networking, and storage are the three basic pillars of computer systems. Open-sourcing 3FS fills in the last piece of the puzzle of DeepSeek's large-scale distributed system.
Machine learning algorithms, generative AI included, are essentially massive matrix operations, and during those operations drafts (known as "checkpoints") must be saved frequently. When thousands of GPU cards save checkpoint data in parallel, the storage subsystem is severely tested. Hence the industry's "parallel high-performance file systems", which spread the work of storing data across multiple servers; in other words, distributed storage.
One of the most important problems a distributed system must solve is ensuring that its key performance metrics grow in direct proportion to the number of servers, and in particular that many parallel tasks never block on a single point. Moreover, to guarantee that key data is never lost, each piece of data must be written to multiple redundant storage media while remaining consistent across them.
Open-source parallel file systems such as Lustre already exist, but they leave considerable room for improvement in performance, high availability, and consistency. 3FS (Fire-Flyer File System), developed by DeepSeek, follows the idea of hardware-software co-design, using RDMA networks and SSDs to build its own high-performance parallel file system. Because SSDs can be driven efficiently through the NVMe protocol, and RDMA can read data in a remote SSD or remote memory directly, bypassing the remote CPU's interrupt handling, 3FS achieves 6.6 TiB/s of throughput on a cluster of 180 storage nodes, squeezing the parallel file system's throughput to more than 80% of the network's theoretical value. That is a remarkable achievement, and it recalls a line the early Tang statesman Wei Zheng wrote in "Ten Thoughts Submitted to Emperor Taizong": "Fearing overflow, think of the rivers and seas that sit below a hundred streams." If you worry that the storage system will become a bottleneck, let it take in water from many rivers, as the rivers and seas do.
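The 80% figure survives a quick sanity check. Assuming each of the 180 storage nodes carries two 200 Gbps RDMA NICs (a configuration DeepSeek has described for its storage nodes; treat it here as an assumption):

```python
nodes = 180
bits_per_node = 2 * 200e9                  # assumed: two 200 Gbps RDMA NICs per node
total_bytes = nodes * bits_per_node / 8    # aggregate theoretical bandwidth, bytes/s
total_tib = total_bytes / 2**40            # bytes/s -> TiB/s
print(f"theoretical: {total_tib:.2f} TiB/s, achieved: 6.6 TiB/s ({6.6 / total_tib:.0%})")
# theoretical: 8.19 TiB/s, achieved: 6.6 TiB/s (81%)
```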
2. Why is DeepSeek the only one that can do this?
Readers with some knowledge of machine learning and mathematics will see that the technologies DeepSeek open-sourced this week are not, individually, terribly difficult to implement. So why was DeepSeek the only one to do it?
From public information we know that DeepSeek trained on H800 GPUs, which have limited inter-card communication bandwidth, and on only 2,048 of them, an order of magnitude fewer than the industry's leading companies. The GPU resources needed to deploy DeepSeek's inference model can be cut down far enough that a single consumer-grade PC can run it. What DeepSeek did was optimize the Transformer model and eliminate waste so that it runs smoothly on limited hardware.
Because DeepSeek's GPUs are the bandwidth-limited version, FlashMLA, EPLB, and DeepEP are its answers to that constraint; with the support of DeepGEMM and DualPipe, DeepSeek makes 2,048 cards deliver the training effect of a 10,000-card cluster; and 3FS further reduces the storage overhead of training.
Some may ask: what motivated the DeepSeek team to share the fruits of its hard work with all of society, without reservation? For possible answers, we might turn to the humanities and social sciences.
In his "Theses on Feuerbach", Marx pointed out that "the essence of man is the sum of social relations." More than a hundred years later, the American psychologist Maslow proposed that the highest level of human need is self-actualization, followed by the need for esteem. "In Memory of Norman Bethune", written at almost the same time as Maslow's work, reaches a kindred conclusion: "We all need to learn from his spirit of selflessness. Starting from this point, we can become people who are of great benefit to the people."
I therefore have reason to believe that DeepSeek's decision to contribute its work to society free of charge shows a team detached from short-term gains and losses and other narrow interests, a team striving for self-actualization and esteem, and a team that actively gives back to society.
DeepSeek stands on the shoulders of the Transformer to make generative AI technology accessible to the general public, and it open-sources its own technology to give back to society. If this virtuous cycle continues, society will benefit not only from AI-driven productivity gains; as these values spread, people will also be better able to unite and strive toward common goals, making the world's future brighter.