Alibaba Innovative Research (AIR) > Big Data & Data Mining
Model and Schedule Large Scale Fine-Grained CPU/GPU Hybrid DAG

Background

Mars is an open-source, tensor-based, distributed scientific compute engine developed by Alibaba MaxCompute team, which is designed to handle super large-scale scientific compute problems. In short, Mars is an engine for both big-data and scientific computation.

Mars uses tensors and operands to construct its DAG in expression layer, and tile tensor into chunks in runtime layer for the purpose of 1) distributing computation 2) fitting in memory. Therefore, the DAG in runtime is a much more fine-grained graph with tens of thousands chunks and operands than traditional distributed compute engines.

To schedule such a large DAG effectively is quite different from other big data systems such as Spark or Flink. Since a Mars DAG might have no stage barriers (comparing to Map-Reduce) at all, a great proportion of network IO, memory copy as well as serializing/deserializing CPU cost could be eliminated by clustering as many chunks and operands as possible into one runtime execution node. However, heterogeneous hardware is inevitable in the real world of big data system. A slow node, no matter due to its hardware specification or its health status, could lead the execution of DAG to long-tail symptom. The DAG grows bigger, the long-tail symptom becomes more serious. What’s more, execution strategy plays a crucial role when running a Mars DAG. With the absence of stage barriers, an optimal execution order could possibly reduce amount of data stored in the cluster and thus minimize amount of disk IO.

We believe that modeling and optimizing execution of such find-grained DAGs is one of the most challenging work nowadays. Once we achieve a good model of optimization on heterogeneous system, we can evolve Mars to handle CPU/GPU hybrid scenario, which is a key advantage for a computation massive system like Mars.

 

Target

  • A model to estimate cost of executing a DAG node on heterogeneous hardware
  • To test and verify how much GPU could accelerates a CPU cluster in computation intensive workloads provided by Mars

Related Research Topics

  • Yu, Jia, Rajkumar Buyya, and Kotagiri Ramamohanarao. "Workflow scheduling algorithms for grid computing." In Metaheuristics for scheduling in distributed computing environments, pp. 173-214. Springer, Berlin, Heidelberg, 2008.
  • Qian, Zhengping, Xiuwei Chen, Nanxi Kang, Mingcheng Chen, Yuan Yu, Thomas Moscibroda, and Zheng Zhang. "MadLINQ: large-scale distributed matrix computation for the cloud." In Proceedings of the 7th ACM european conference on Computer Systems, pp. 197-210. ACM, 2012.
  • Ousterhout, Kay, Patrick Wendell, Matei Zaharia, and Ion Stoica. "Sparrow: distributed, low latency scheduling." In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pp. 69-84. ACM, 2013.
  • Benson, Austin R., David F. Gleich, and James Demmel. "Direct QR factorizations for tall-and-skinny matrices in MapReduce architectures." In 2013 IEEE international conference on big data, pp. 264-272. IEEE, 2013.
  • Boutin, Eric, Jaliya Ekanayake, Wei Lin, Bing Shi, Jingren Zhou, Zhengping Qian, Ming Wu, and Lidong Zhou. "Apollo: scalable and coordinated scheduling for cloud-scale computing." In 11th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 14), pp. 285-300. 2014.
  • Huang, Chien-Chin, Qi Chen, Zhaoguo Wang, Russell Power, Jorge Ortiz, Jinyang Li, and Zhen Xiao. "Spartan: A distributed array framework with smart tiling." In 2015 {USENIX} Annual Technical Conference ({USENIX}{ATC} 15), pp. 1-15. 2015.
  • Peng, Boyang, Mohammad Hosseini, Zhihao Hong, Reza Farivar, and Roy Campbell. "R-storm: Resource-aware scheduling in storm." In Proceedings of the 16th Annual Middleware Conference, pp. 149-161. ACM, 2015.1.    
  • Zhang, Mingxing, Yongwei Wu, Kang Chen, Teng Ma, and Weimin Zheng. "Measuring and optimizing distributed array programs." Proceedings of the VLDB Endowment 9, no. 12 (2016): 912-923.
  • Moritz, Philipp, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol et al. "Ray: A distributed framework for emerging {AI} applications." In 13th {USENIX} Symposium on2.       
  • Operating Systems Design and Implementation ({OSDI} 18), pp. 561-577. 2018.

Scan QR code
关注Ali TechnologyWechat Account