Alibaba Innovative Research (AIR) > System Software and Operation
Machine Learning for Memory Anti-Fragmentation

Theme

System Software and Ops Management

Topic

Machine Learning for Memory Anti-Fragmentation

Background

Machine learning (ML) techniques are rapidly expanding their adoption in computer vision, natural language processing, and recommender systems, and it is believed that ML will continue to benefit an ever broader scope of real-world applications. Computer systems are a new frontier for enjoying such benefits. Since 2018, ML for systems has motivated a series of high-impact research results, covering data structures [1, 2], memory allocation [3], prefetching [4], etc.

 

It is not surprising that the memory hierarchy is among the hottest directions explored with ML techniques. Various forms of data-driven methods, heuristics, and hand-tuned parameters have long been used in different memory subsystems. This indicates that applying ML techniques to systems does not represent a fundamental departure from system research, but rather provides a new set of tools. For example, an ASPLOS'20 work [3] targets reducing fragmentation in C++ workloads with huge pages by adopting ML models that forecast object lifetimes and allocating memory based on the predicted lifetimes. Another recent OSDI'20 work [5] applied ML techniques to infer SSD performance at per-IO granularity and improve SSD/NVMe latency. Moreover, an ASPLOS'21 work [4] proposes a specially designed neural model to capture address correlations, which is crucial for prefetching irregular sequences of memory accesses. These prior studies indicate a growing realization that a range of new subsystems up and down the memory hierarchy is needed to translate the academic promise of ML into real-world practice.

 

In this proposal, we target reducing fragmentation when adopting Transparent Huge Pages (THP), using practical ML techniques. This problem is critical because, as memory sizes keep growing, adopting THP brings multiple benefits, including reducing the overhead of TLB misses, improving virtualization performance, and simplifying page-table (PTE) operations. However, memory fragmentation caused by both applications and the system prevents THP from being fully utilized. Alleviating this fragmentation can effectively lower cost while maintaining operational stability.
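For concreteness, fragmentation with respect to huge pages is commonly quantified by the unusable free space index over the buddy allocator's per-order free block counts (the statistics exposed by /proc/buddyinfo on Linux): the fraction of free memory that cannot satisfy an allocation of a given order. A minimal sketch in Python, where the function name and the input format are illustrative rather than part of any existing tool:

```python
def unusable_free_index(free_counts, target_order=9):
    """Fraction of free memory unusable for an allocation of
    2**target_order base pages (order 9 = 2 MB huge pages with
    4 KB base pages on x86-64).

    free_counts[i] is the number of free blocks of order i, as
    reported per zone by /proc/buddyinfo.
    """
    # Total free base pages: each order-i block holds 2**i pages.
    total_free = sum(count << order for order, count in enumerate(free_counts))
    if total_free == 0:
        return 0.0
    # Pages sitting in blocks large enough for the target order.
    usable = sum(count << order
                 for order, count in enumerate(free_counts)
                 if order >= target_order)
    return (total_free - usable) / total_free
```

An index of 0.0 means every free page already sits in a huge-page-sized (or larger) block; an index of 1.0 means no free block can back a 2 MB page, however much memory is free.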

Target

  • A practical machine learning model

This model should be able to accurately identify memory fragmentation produced by both applications and the system in operation. By utilizing this model, the reduction of memory fragmentation for 2 MB huge pages should be competitive with state-of-the-art results.

To successfully deploy ML models in a production environment, the latency and resource consumption (CPU, memory, etc.) introduced by running model inference must be limited. The model should be optimized to the point that it can be deployed in real systems.

  • A standardized library that implements the prototype model.

The implementation needs to cover data collection, data preprocessing, analysis, model training, validation, and inference.

The identification results produced by the model will be further used for performance diagnosis and memory allocation, so the implementation needs to facilitate such future use.
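As a purely hypothetical illustration of how those stages could fit together, the sketch below wires data collection records, preprocessing, training, validation, and inference around a deliberately trivial one-feature threshold model; every name, feature, and the model itself are placeholders for illustration, not a proposed design:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Sample:
    """One collected observation window of allocator statistics."""
    unusable_free_index: float   # fraction of free memory unusable for 2 MB pages
    slab_pages: int              # pages held by slab caches (system-side memory)
    fragmented: bool             # label: did a huge-page allocation stall or fail?

def preprocess(samples):
    """Drop malformed windows and extract (feature, label) pairs."""
    return [(s.unusable_free_index, s.fragmented)
            for s in samples if s.slab_pages >= 0]

def train(pairs):
    """Fit a single decision threshold: midpoint between the class means."""
    pos = [x for x, y in pairs if y]
    neg = [x for x, y in pairs if not y]
    return (mean(pos) + mean(neg)) / 2

def infer(threshold, unusable_free_index):
    """Predict whether the current window is fragmented."""
    return unusable_free_index >= threshold

def validate(threshold, pairs):
    """Accuracy on held-out (feature, label) pairs."""
    hits = sum(infer(threshold, x) == y for x, y in pairs)
    return hits / len(pairs)
```

A real model would of course use many more features and a learned classifier, but keeping these stage boundaries explicit is what lets the later diagnosis and allocation consumers plug into the identification results.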

  • A thorough report evaluating the accuracy and the time and space overhead of the implemented model.

 

Related Research Topics

  • Effective and general memory fragmentation identification. The considered memory includes both memory allocated to the target application and memory allocated for system operation (e.g., slab).
  • Accelerating model inference to achieve ultra-low latency and a limited memory footprint while maintaining accurate identification.
  • Applying the ML model in real systems without affecting the implementation of target applications.
  • Dynamic memory management. The ML model can be taken as a plug-in module that serves a broad range of needs for dynamic memory management, including memory allocation, anomaly detection, and performance diagnosis.

References

[1] Kraska, Tim, et al. "The case for learned index structures." Proceedings of the 2018 International Conference on Management of Data. 2018.

[2] Kraska, Tim, et al. "Sagedb: A learned database system." CIDR. 2019.

[3] Maas, Martin, et al. "Learning-based memory allocation for C++ server workloads." Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 2020.

[4] Shi, Zhan, et al. "A Hierarchical Neural Model of Data Prefetching." Proceedings of the Twenty-Sixth International Conference on Architectural Support for Programming Languages and Operating Systems. 2021.

[5] Hao, Mingzhe, et al. "LinnOS: Predictability on Unpredictable Flash Storage with a Light Neural Network." 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 2020.

Suggested Collaboration Method

AIR (Alibaba Innovative Research): a one-year collaboration project.
