Publications

2025

SSDTrain: An Activation Offloading Framework to SSDs for Faster Large Language Model Training

Kun Wu*, Jeongmin Brian Park*, Xiaofan Zhang*, Mert Hidayetoğlu, Vikram Sharma Mailthody, Sitao Huang, Steven Sam Lumetta, and Wen-mei Hwu

To Appear in Design Automation Conference 2025 2025

Abs arXiv Code Poster Slides

The growth rate of the GPU memory capacity has not been able to keep up with that of the size of large language models (LLMs), hindering the model training process. In particular, activations—the intermediate tensors produced during forward propagation and reused in backward propagation—dominate the GPU memory use. This leads to high training overhead such as high weight update cost due to the small micro-batch size. To address this challenge, we propose SSDTrain, an adaptive activation offloading framework to high-capacity NVMe SSDs. SSDTrain reduces GPU memory usage without impacting performance by fully overlapping data transfers with computation. SSDTrain is compatible with popular deep learning frameworks like PyTorch, Megatron, and DeepSpeed, and it employs techniques such as tensor deduplication and forwarding to further enhance efficiency. We extensively experimented with popular LLMs like GPT, BERT, and T5. Results demonstrate that SSDTrain reduces 47% of the activation peak memory usage. \rvMeanwhile, SSDTrain perfectly overlaps the I/O with the computation and incurs negligible overhead. Compared with keeping activations in GPU memory and layerwise full recomputation, SSDTrain achieves the best memory savings with negligible throughput loss. We further analyze how the reduced activation memory use may be leveraged to increase throughput by increasing micro-batch size and reducing pipeline parallelism bubbles.

2024

Code Generation and Runtime Techniques for Enabling Data Efficient Deep Learning Training on GPUs

Kun Wu

Ph.D. Dissertation University of Illinois at Urbana-Champaign 2024

Abs Paper Slides

As deep learning models scale, their training cost has surged significantly. Due to both hardware advancements and limitations in current software stacks, the need for data efficiency has risen. Data efficiency refers to the effective hiding of data access latency and the avoidance of unnecessary data movements. Major challenges arise from the growing disparity between GPU memory bandwidth and computational throughput, imminent GPU memory capacity limitations, and inefficiencies in the PyTorch software stack, including a lack of device-specific PCIe transfer optimizations and high-level domain-specific abstractions. To effectively mitigate these data inefficiencies for deep learning training, this dissertation analyzes data inefficiency in representative deep training tasks, specifically in graph neural networks (GNNs) and large language models (LLMs). It then proposes novel runtime and code generation techniques to mitigate these challenges and implements these optimizations seamlessly within the PyTorch stack while maintaining strong programmability and interoperability. First, PyTorch-Direct is devised to incorporate the GPU-centric PCIe data transfer paradigm in PyTorch for GNN training. PyTorch-Direct significantly reduces CPU utilization, resulting in higher end-to-end training performance. For the input datasets and GNN architectures evaluated, PyTorch-Direct decreases the overall training time by up to 38.2%. Next, Hector intermediate representation (IR) and its code generator are proposed to introduce domain-specific high-level abstraction and systematically address memory-intensive performance challenges for relational graph neural networks (RGNNs). The performance challenges stem from RGNN’s inherent memory intensiveness, the gap between the programming interface and the kernel APIs, and the high kernel optimization cost due to the kernels’ coupling with layout and heterogeneity. Using a general matrix multiply (GEMM) template and a traversal template, Hector achieves up to a 43.7× speed-up in training and inference compared to the state-of-the-art systems. Linear operator reordering and compact tensor materialization further achieve up to 3.8× speed-up compared to the Hector unoptimized code. Finally, in LLM training, the throughput has been increasingly constrained by GPU memory capacity. To mitigate this, the SSDTrain offloading framework is designed and implemented. Since activations take most of the GPU memory, SSDTrain offloads activations to Non-Volatile Memory Express (NVMe) SSDs with a direct GPU–SSD data path and good interoperability. The evaluation shows that SSDTrain reduces activations peak memory use by up to 47% with negligible overhead. We further analyze how the reduced activation memory use may be leveraged to increase throughput by increasing micro-batch size and reducing pipeline parallelism bubbles. Together, these contributions demonstrate that code generation and runtime techniques can systematically mitigate the data management bottlenecks in deep learning training, which stem from the data-intensive nature of workloads and the oversimplification inherent in the deep learning training software stack.
Large-scale Storage-based Multi-GPU GNN Training by Optimizing Data Transfer Scheme

Jeongmin Brian Park, Kun Wu, Vikram Sharma Mailthody, Zaid Qureshi, Scott Mahlke, and Wen-mei Hwu

2024

Abs arXiv

Graph Neural Networks (GNNs) are widely used today in recommendation systems, fraud detection, and node/link classification tasks. Real world GNNs continue to scale in size and require a large memory footprint for storing graphs and embeddings that often exceed the memory capacities of the target GPUs used for training. To address limited memory capacities, traditional GNN training approaches use graph partitioning and sharding techniques to scale up across multiple GPUs within a node and/or scale out across multiple nodes. However, this approach suffers from the high computational costs of graph partitioning algorithms and inefficient communication across GPUs. To address these overheads, we propose Large-scale Storage-based Multi-GPU GNN framework (LSM-GNN), a storagebased approach to train GNN models that utilizes a novel communication layer enabling GPU software caches to function as a system-wide shared cache with low overheads.LSM-GNN incorporates a hybrid eviction policy that intelligently manages cache space by using both static and dynamic node information to significantly enhance cache performance. Furthermore, we introduce the Preemptive Victim-buffer Prefetcher (PVP), a mechanism for prefetching node feature data from a Victim Buffer located in CPU pinned-memory to further reduce the pressure on the storage devices. Experimental results show that despite the lower compute capabilities and memory capacities, LSM-GNN in a single node with two GPUs offers superior performance over two-node-four-GPU Dist-DGL baseline and provides up to 3.75x speed up on end-to-end epoch time while running large-scale GNN training
Hector: An Efficient Programming and Compilation Framework for Implementing Relational Graph Neural Networks in GPU Architectures

Kun Wu, Mert Hidayetoğlu, Xiang Song, Sitao Huang, Da Zheng, Israt Nisa, and Wen-mei Hwu

Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 2024

Abs Paper

Relational graph neural networks (RGNNs) are graph neural networks with dedicated structures for modeling the different types of nodes and edges in heterogeneous graphs. While RGNNs have been increasingly adopted in many real-world applications due to their versatility and accuracy, they pose performance and system design challenges: inherent memory-intensive computation patterns, the gap between the programming interface and kernel APIs, and heavy programming effort in optimizing kernels caused by their coupling with data layout and heterogeneity. To systematically address these challenges, we propose Hector, a novel two-level intermediate representation and its code generator framework, that (a) captures the key properties of RGNN models, and opportunities to reduce memory accesses in inter-operator scheduling and materialization, (b) generates code with flexible data access scheme to eliminate redundant data copies, (c) decouples model semantics, data layout, and operators-specific optimization from each other to reduce programming effort. By building on one general matrix multiply (GEMM) template and a node/edge traversal template, Hector achieves up to 9.9× speed-up in inference and 43.7× speed-up in training compared with the state-of-the-art public systems on select models, i.e., RGCN, RGAT and HGT, when running heterogeneous graphs provided by Deep Graph Library (DGL) and Open Graph Benchmark (OGB). In addition, Hector does not trigger any out-of-memory (OOM) exception in these tests. We also propose the linear operator reorder and compact materialization to further accelerate the system by up to 3.8×. As an indicator of programming effort reduction, Hector takes in 51 lines of code expressing the three models and generates a total of 8K lines of CUDA and C++ code. Through profiling, we found that higher memory efficiency allows Hector to accomodate larger input and, therefore, attains higher throughput in forward propagation, while backward propagation is bound by latency introduced by atomic updates and outer products.

2022

Graph Neural Network Training with Data Tiering

Seung Won Min, Kun Wu, Mert Hidayetoğlu, Jinjun Xiong, Xiang Song, and Wen-mei Hwu

SIGKDD International Conference on Knowledge Discovery and Data Mining 2022

Abs Paper

Graph Neural Networks (GNNs) have shown success in learning from graph-structured data, with applications to fraud detection, recommendation, and knowledge graph reasoning. However, training GNN efficiently is challenging because: 1) GPU memory capacity is limited and can be insufficient for large datasets, and 2) the graph-based data structure causes irregular data access patterns. In this work, we provide a method to statistically analyze and identify more frequently accessed data ahead of GNN training. Our data tiering method not only utilizes the structure of input graph, but also an insight gained from actual GNN training process to achieve a higher prediction result. With our data tiering method, we additionally provide a new data placement and access strategy to further minimize the CPU-GPU communication overhead. We also take into account of multi-GPU GNN training as well and we demonstrate the effectiveness of our strategy in a multi-GPU system. The evaluation results show that our work reduces CPU-GPU traffic by 87–95% and improves the training speed of GNN over the existing solutions by 1.6–2.1× on graphs with hundreds of millions of nodes and billions of edges.
SaintSN: Streamlined and Intelligent Storage Node System-on-a-Chip for Exascale Cluster

Acceptance rate: 17.6%. Pending US patent.

Kun Wu*, Dario Korolija*, Wen-mei Hwu, Gustavo Alonso, Sai Rahul Chalamalasetti, Dejan Milojicic, and Lance Evans

Proceedings of the Hewlett Packard Enterprise Technical Conference 2022

2021

PyLog: An Algorithm-Centric Python-Based FPGA Programming and Synthesis Flow

Sitao Huang, Kun Wu, Hyunmin Jeong, Chengyue Wang, Deming Chen, and Wen-Mei Hwu

IEEE Transactions on Computers 2021

Abs Paper Code

The exploding complexity and computation efficiency requirements of applications are stimulating a strong demand for hardware acceleration with heterogeneous platforms such as FPGAs. However, a high-quality FPGA design is very hard to create and optimize as it requires FPGA expertise and a long design iteration time. In contrast, software applications are typically developed in a short development cycle, with high-level languages like Python, which have much higher levels of abstraction than all existing hardware design flows. To close this gap between hardware design flows and software applications, and simplify FPGA programming, we create PyLog, a high-level, algorithm-centric Python-based programming and synthesis flow for FPGA. PyLog is powered by a set of compiler optimization passes and a type inference system to generate high-quality hardware design. It abstracts away the implementation details, and allows designers to focus on algorithm specification. PyLog takes in Python functions, generates PyLog intermediate representation (PyLog IR), performs several optimization passes, including pragma insertion, design space exploration, and memory customization, etc., and creates complete FPGA system designs. PyLog also has a runtime that allows users to run the PyLog code directly on the target FPGA platform without any extra code development. The whole design flow is automated. Evaluation shows that PyLog significantly improves FPGA design productivity and generates highly efficient FPGA designs that outperform highly optimized CPU implementation and state-of-the-art FPGA implementation by 3.17× and 1.24× on average.
Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture

Upstreamed to DGL.

Seung Won Min, Kun Wu, Sitao Huang, Mert Hidayetoğlu, Jinjun Xiong, Eiman Ebrahimi, Deming Chen, and Wen-mei Hwu

Proceedings of the VLDB Endowment 2021

Abs Paper Code

Graph Convolutional Networks (GCNs) are increasingly adopted in large-scale graph-based recommender systems. Training GCN requires the minibatch generator traversing graphs and sampling the sparsely located neighboring nodes to obtain their features. Since real-world graphs often exceed the capacity of GPU memory, current GCN training systems keep the feature table in host memory and rely on the CPU to collect sparse features before sending them to the GPUs. This approach, however, puts tremendous pressure on host memory bandwidth and the CPU. This is because the CPU needs to (1) read sparse features from memory, (2) write features into memory as a dense format, and (3) transfer the features from memory to the GPUs.In this work, we propose a novel GPU-oriented data communication approach for GCN training, where GPU threads directly access sparse features in host memory through zero-copy accesses without much CPU help. By removing the CPU gathering stage, our method significantly reduces the consumption of the host resources and data access latency. We further present two important techniques to achieve high host memory access efficiency by the GPU: (1) automatic data access address alignment to maximize PCIe packet efficiency, and (2) asynchronous zero-copy access and kernel execution to fully overlap data transfer with training. We incorporate our method into PyTorch and evaluate its effectiveness using several graphs with sizes up to 111 million nodes and 1.6 billion edges. In a multi-GPU training setup, our method is 65–92% faster than the conventional data transfer method, and can even match the performance of all-in-GPU-memory training for some graphs that fit in GPU memory.
PyTorch-Direct: Enabling GPU Centric Data Access for Very Large Graph Neural Network Training with Irregular Accesses

Seung Won Min, Kun Wu, Sitao Huang, Mert Hidayetoğlu, Jinjun Xiong, Eiman Ebrahimi, Deming Chen, and Wen-mei Hwu

arXiv preprint 2021

Abs arXiv Code

With the increasing adoption of graph neural networks (GNNs) in the machine learning community, GPUs have become an essential tool to accelerate GNN training. However, training GNNs on very large graphs that do not fit in GPU memory is still a challenging task. Unlike conventional neural networks, mini-batching input samples in GNNs requires complicated tasks such as traversing neighboring nodes and gathering their feature values. While this process accounts for a significant portion of the training time, we find existing GNN implementations using popular deep neural network (DNN) libraries such as PyTorch are limited to a CPU-centric approach for the entire data preparation step. This "all-in-CPU" approach has negative impact on the overall GNN training performance as it over-utilizes CPU resources and hinders GPU acceleration of GNN training. To overcome such limitations, we introduce PyTorch-Direct, which enables a GPU-centric data accessing paradigm for GNN training. In PyTorch-Direct, GPUs are capable of efficiently accessing complicated data structures in host memory directly without CPU intervention. Our microbenchmark and end-to-end GNN training results show that PyTorch-Direct reduces data transfer time by 47.1% on average and speeds up GNN training by up to 1.6x. Furthermore, by reducing CPU utilization, PyTorch-Direct also saves system power by 12.4% to 17.5% during training. To minimize programmer effort, we introduce a new "unified tensor" type along with necessary changes to the PyTorch memory allocator, dispatch logic, and placement rules. As a result, users need to change at most two lines of their PyTorch GNN training code for each tensor object to take advantage of PyTorch-Direct.
A Python-based High-Level Programming Flow for CPU-FPGA Heterogeneous Systems : (Invited Paper)

Sitao Huang, Kun Wu, Sai Rahul Chalamalasetti, Izzat El Hajj, Cong Xu, Paolo Faraboschi, and Deming Chen

2021 IEEE/ACM Programming Environments for Heterogeneous Computing (PEHC) 2021

Paper Code
TEMPI: An Interposed MPI Library with a Canonical Representation of CUDA-Aware Datatypes

Carl Pearson, Kun Wu, I-Hsin Chung, Jinjun Xiong, and Wen-Mei Hwu

Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing 2021

Abs Paper Code

MPI derived datatypes are an abstraction that simplifies handling of non-contiguous data in MPI applications. These datatypes are recursively constructed at runtime from primitive Named Types defined in the MPI standard. More recently, the development and deployment of CUDA-aware MPI implementations has encouraged the transition of distributed high-performance MPI codes to use GPUs. Such implementations allow MPI functions to directly operate on GPU buffers, easing integration of GPU compute into MPI codes. This work first presents a novel datatype handling strategy for nested strided datatypes, which finds a middle ground between the specialized or generic handling in prior work. This work also shows that the performance characteristics of non-contiguous data handling can be modeled with empirical system measurements, and used to transparently improve MPI_Send/Recv latency. Finally, despite substantial attention to non-contiguous GPU data and CUDA-aware MPI implementations, good performance cannot be taken for granted. This work demonstrates its contributions through an MPI interposer library, TEMPI. TEMPI can be used with existing MPI deployments without system or application changes. Ultimately, the interposed-library model of this work demonstrates MPI_Pack speedup of up to 242000x and MPI_Send speedup of up to 59000x compared to the MPI implementation deployed on a leadership-class supercomputer. This yields speedup of more than 917x in a 3D halo exchange with 3072 processes.

2019

Memory-Bound Proof-of-Work Acceleration for Blockchain Applications

Kun Wu, Guohao Dai, Xing Hu, Shuangchen Li, Xinfeng Xie, Yu Wang, and Yuan Xie

Proceedings of the 56th Annual Design Automation Conference 2019

Abs

Blockchain applications have shown huge potential in various domains. Proof of Work (PoW) is the key procedure in blockchain applications, which exhibits the memory-bound characteristic and hinders the performance improvement of blockchain accelerators. In order to mitigate the "memory wall" and improve the performance of memory-hard PoW accelerators, using Ethash as an example, we optimize the memory architecture from two perspectives: 1) Hiding memory latency. We propose specialized context switch design to overcome the uncertain cycles of repetitive memory requests. 2) Increasing memory bandwidth utilization. We introduce on-chip memory that stores a portion of the Ethash directed acyclic graph (DAG) for larger effective memory bandwidth, and further propose adopting embedded NOR flash to fulfill the role. Then, we conduct extensive experiments to explore the design space of our optimized memory architecture for Ethash, including number of hash cores, on-chip/off-chip memory technologies and specifications. Based on the design space exploration, we finally provide the guidance for designing the memory-bound PoW accelerator. The experiment results show that our optimized designs achieve 8.7% – 55% higher hash rate and 17% – 120% higher hash rate per Joule compared with the baseline design in different configurations.