Environment-Aware Failure Propagation Simulator Using Reinforcement Agents
Keywords:
failure propagation, reinforcement learning, chaos engineering, misconfiguration detection, root cause analysis
Abstract
This paper presents an environment-aware failure propagation simulator that uses reinforcement learning agents to simulate chaotic development and QA clusters. The system explores pipeline misconfigurations that randomly pass promotion gates. Agents may choose high-risk setups by adjusting failure distributions and learning from errors.
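The agent loop described above can be illustrated with a minimal sketch. It is not the paper's implementation: it swaps in a plain epsilon-greedy bandit as a stand-in for the reinforcement learning agents, and the setup names, probabilities, and the functions `simulate` and `train` are all hypothetical, chosen only for illustration. Each setup has an assumed chance of slipping past a promotion gate and an assumed chance of causing a downstream failure once promoted; the agent is rewarded when both happen, so it gradually favors the high-risk setups.

```python
import random

# Hypothetical misconfiguration setups (illustrative values, not from the paper):
# (probability of slipping past the promotion gate,
#  probability of causing a downstream failure once promoted)
SETUPS = {
    "stale_env_var": (0.30, 0.20),
    "wrong_replica_count": (0.50, 0.40),
    "missing_secret": (0.80, 0.90),
}


def simulate(setup, rng):
    """Return 1.0 if the misconfiguration passes the gate and then fails."""
    p_pass, p_fail = SETUPS[setup]
    passed_gate = rng.random() < p_pass
    return 1.0 if passed_gate and rng.random() < p_fail else 0.0


def train(episodes=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent that learns which setup most often propagates a failure."""
    rng = random.Random(seed)
    names = list(SETUPS)
    value = {name: 0.0 for name in names}  # running mean reward per setup
    count = {name: 0 for name in names}    # times each setup was tried
    for _ in range(episodes):
        if rng.random() < epsilon:
            choice = rng.choice(names)          # explore a random setup
        else:
            choice = max(names, key=value.get)  # exploit the best estimate
        reward = simulate(choice, rng)
        count[choice] += 1
        # Incremental mean update: the agent "learns from errors" it provokes.
        value[choice] += (reward - value[choice]) / count[choice]
    return value, count


values, counts = train()
riskiest = max(values, key=values.get)
print("riskiest setup:", riskiest)
print("trial counts:", counts)
```

Under these assumed probabilities the agent ends up concentrating its trials on the setup with the highest combined chance of passing the gate and failing afterwards. A fuller simulator would presumably replace the fixed probabilities with environment-dependent failure distributions, which this sketch does not attempt.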
License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.