A fault is a component of a system deviating from its spec. A failure is when the system as a whole stops providing the required service to the user. Systems that anticipate faults and cope with them, so that a fault does not turn into a failure, are called fault-tolerant or resilient. No system can tolerate every possible fault, so it only makes sense to talk about tolerating specific types of faults.
Carefully thinking about the assumptions and interactions in the software can help prevent faults.
Some tools deliberately trigger faults to reveal where the weak points in the application are.
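To make the idea of deliberately triggering faults concrete, here is a minimal sketch in Python. It is not how any particular tool works; the names `inject_faults`, `with_retries`, `lookup_price` and `InjectedFault` are all made up for illustration, and retrying is just one assumed way of tolerating a transient fault.

```python
import random


class InjectedFault(Exception):
    """Raised by the wrapper to simulate a component fault (hypothetical)."""


def inject_faults(func, fault_rate, rng):
    """Wrap func so each call raises InjectedFault with probability fault_rate."""
    def wrapper(*args, **kwargs):
        if rng.random() < fault_rate:
            raise InjectedFault(f"injected fault in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def with_retries(func, attempts):
    """Tolerate transient faults by calling func up to `attempts` times."""
    def wrapper(*args, **kwargs):
        last_error = None
        for _ in range(attempts):
            try:
                return func(*args, **kwargs)
            except InjectedFault as error:
                last_error = error
        # Every attempt hit a fault: the fault becomes a failure for the caller.
        raise last_error
    return wrapper


def lookup_price(item):
    """Stand-in for a real component, such as a call to another service."""
    return {"apple": 3, "pear": 5}[item]


def count_failures(service, calls):
    """Count how many calls the user of `service` sees fail."""
    failures = 0
    for _ in range(calls):
        try:
            service("apple")
        except InjectedFault:
            failures += 1
    return failures


rng = random.Random(42)  # fixed seed so runs are repeatable
flaky_lookup = inject_faults(lookup_price, fault_rate=0.3, rng=rng)
resilient_lookup = with_retries(flaky_lookup, attempts=5)

print("failures without retries:", count_failures(flaky_lookup, 1000))
print("failures with retries:", count_failures(resilient_lookup, 1000))
```

Running both versions under the same injected fault rate shows the difference between a fault and a failure: the component still misbehaves just as often, but the retrying version hides most of those faults from the caller. Wherever the failure count stays high is a weak point worth looking at.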
This post was based on the book Designing Data-Intensive Applications by Martin Kleppmann.
Authored by Anthony Fox on 2020-05-09