Dates: 10/01/23 – 09/30/26
Award Amount: $240,000.00
Award #: 2327509
PI: Emina Soljanin
Artificial intelligence and machine learning algorithms rely on parallel, distributed computing systems to efficiently carry out intricate, data-heavy tasks. A significant challenge in designing large-scale distributed computing systems is addressing the unpredictable variations in service times across multiple servers. Computing redundancy, such as task replication, is a promising tool for curtailing the overall variability in service time. This project focuses on the intelligent management of redundancy in distributed computing, which affects the execution efficiency of data-intensive algorithms in large-scale systems. The project will quantify the benefits of redundancy, a step pivotal to developing and ultimately deploying efficient redundancy schemes for executing artificial intelligence and machine learning workloads. The educational goals of the project include stimulating students’ interest in applied probability and mathematical modeling and developing hands-on labs on cloud computing infrastructure. The project will contribute to the Research Experiences for Undergraduate and High School students and will recruit and mentor women and members of underrepresented groups.
This project considers distributed computing systems that use replication and erasure coding to reduce job execution times. The project aims to maximize the gain of using computing redundancy (coding gain) in practical scenarios. It complements recent work on redundancy in distributed systems, which has focused primarily on designing redundancy schemes using erasure codes. The project will use statistical analysis, queueing theory, and coding theory to make the following contributions: (i) characterization of the crucial effects of using redundancy in distributed computing, including analysis of the benefits and costs of redundancy; (ii) new mathematical models that capture the performance of distributed computing systems with stragglers; (iii) new analysis tools for computing coding gain in coded computing systems; (iv) development of redundancy management algorithms; (v) characterization of the diversity vs. parallelism trade-off; and (vi) treatment of other critical issues in coded computing that do not arise in the better-understood replication solutions.
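As a rough illustration of the diversity vs. parallelism trade-off named in item (v), the Python sketch below uses a simple model that is an assumption made here, not a model stated by the project. A job is split into k tasks and encoded into n tasks, one per server, and it completes when any k of the n tasks finish. Each task is assumed to take a deterministic time delta / k plus an independent exponential delay with rate mu. Under that assumption, the expected completion time is delta / k + (H_n - H_(n-k)) / mu, where H_m is the m-th harmonic number. The function names and parameter values are hypothetical.

```python
def harmonic(m):
    """Return the m-th harmonic number H_m = 1 + 1/2 + ... + 1/m (H_0 = 0)."""
    return sum(1.0 / i for i in range(1, m + 1))


def expected_completion_time(n, k, delta, mu):
    """Expected job time when any k of n tasks must finish.

    Assumed model (illustrative only): each task takes delta / k plus an
    independent Exp(mu) delay. The k-th smallest of n i.i.d. Exp(mu)
    variables has mean (H_n - H_{n-k}) / mu.
    """
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    return delta / k + (harmonic(n) - harmonic(n - k)) / mu


def best_k(n, delta, mu):
    """Return the k in 1..n that minimizes the expected completion time."""
    return min(range(1, n + 1),
               key=lambda k: expected_completion_time(n, k, delta, mu))


if __name__ == "__main__":
    n, mu = 10, 1.0
    for delta in (0.0, 10.0, 1000.0):
        k = best_k(n, delta, mu)
        t = expected_completion_time(n, k, delta, mu)
        print(f"delta={delta:>7}: best k={k}, expected time={t:.4f}")
```

In this model, k = 1 corresponds to full replication (maximum diversity) and k = n to plain splitting with no redundancy (maximum parallelism). When the deterministic part is negligible, replication minimizes the expected time; when it dominates, splitting does; in between, an intermediate k can be best.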
This award reflects NSF’s statutory mission and has been deemed worthy of support through evaluation using the Foundation’s intellectual merit and broader impacts review criteria.