What is Fault Tolerance?
In distributed systems, workers can fail unexpectedly due to crashes, OOM kills, network issues, or deployments. PyWorkflow’s fault tolerance ensures your workflows survive these failures and automatically resume from where they left off.

- Automatic Detection: Worker crashes are detected automatically when Celery requeues tasks.
- Event Replay: Completed steps are restored from the event log without re-execution.
- Checkpoint Resume: Workflows continue from the last successful checkpoint, not from the beginning.
- Configurable Limits: Control recovery attempts and behavior per workflow or globally.
How Auto Recovery Works
When a worker crashes mid-workflow, PyWorkflow detects the lost worker, replays the event log, and resumes execution from the last successful checkpoint.

Configuration

Recovery can be configured in three places: the @workflow decorator, a config file, or programmatically via pyworkflow.configure(). To configure recovery per workflow, use the @workflow decorator.
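A minimal sketch, assuming the decorator is imported from the top-level pyworkflow package and accepts the options from the table below as keyword arguments; the workflow itself is a hypothetical example:

```python
from pyworkflow import workflow  # import path assumed for illustration

@workflow(
    recover_on_worker_loss=True,   # re-run automatically if the worker dies
    max_recovery_attempts=3,       # after 3 recoveries, mark the workflow as failed
)
def process_order(order_id: str):
    ...
```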
Configuration Options

| Option | Type | Default | Description |
|---|---|---|---|
| recover_on_worker_loss | bool | True (durable) / False (transient) | Enable automatic recovery on worker crash |
| max_recovery_attempts | int | 3 | Maximum number of recovery attempts before marking the workflow as failed |
Configuration Priority
When resolving recovery settings, PyWorkflow uses this priority order:

| Priority | Source | Example |
|---|---|---|
| 1 (highest) | @workflow() decorator | @workflow(recover_on_worker_loss=True) |
| 2 | pyworkflow.configure() | configure(default_recover_on_worker_loss=True) |
| 3 | Config file | recovery.recover_on_worker_loss: true |
| 4 (lowest) | Built-in defaults | True for durable, False for transient |
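For the middle layers, a process-wide default can be set programmatically. The keyword below comes from the priority table; calling it once at application startup is an assumption about how the library is typically wired up:

```python
import pyworkflow

# Priority 2: process-wide default. It is overridden by any per-workflow
# @workflow(recover_on_worker_loss=...) setting (priority 1) and itself
# overrides the config file key recovery.recover_on_worker_loss (priority 3).
pyworkflow.configure(default_recover_on_worker_loss=True)
```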
Durable vs Transient Workflows
Recovery behavior differs based on workflow durability. Durable workflows resume from the last checkpoint; transient workflows do not recover by default (their built-in default for recover_on_worker_loss is False).

Recovery process for durable workflows (a conceptual sketch follows the list):

1. Load the event log from storage
2. Replay step_completed events (restore cached results)
3. Complete pending sleep_started events
4. Continue execution from the next step
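Purely as an illustration of the replay idea, not PyWorkflow’s actual internals, the sketch below restores results for steps that already have a step_completed event and executes only the remaining ones:

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative types only; PyWorkflow's real event and step objects differ.
@dataclass
class Event:
    type: str           # e.g. "step_completed"
    step_id: str
    result: object = None

@dataclass
class Step:
    id: str
    run: Callable[[], object]

def replay(event_log: list[Event], steps: list[Step]) -> dict[str, object]:
    """Restore cached results for completed steps, then run the rest."""
    cached = {e.step_id: e.result for e in event_log if e.type == "step_completed"}
    results: dict[str, object] = {}
    for step in steps:
        if step.id in cached:
            results[step.id] = cached[step.id]   # restored from the event log
        else:
            results[step.id] = step.run()        # executed for the first time
    return results
```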
Workflow States
Auto recovery introduces a new workflow state:

| Status | Description |
|---|---|
| INTERRUPTED | Worker crashed; workflow is awaiting recovery |
| RUNNING | Workflow is executing (or has been recovered) |
Monitoring Recovery
Use the CLI to monitor workflows that have been interrupted or recovered.

When to Disable Recovery
Non-idempotent external calls
If your workflow makes calls that can’t be safely repeated (e.g., charging a credit card without idempotency keys), disable recovery or implement compensation logic.
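For instance, recovery can be switched off for a single workflow with the decorator option described above; the payment workflow here is hypothetical:

```python
from pyworkflow import workflow  # import path assumed for illustration

# Fail loudly instead of re-running a non-idempotent charge after a crash.
@workflow(recover_on_worker_loss=False)
def charge_customer(order_id: str, amount_cents: int):
    ...
```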
Critical workflows requiring human review
Some workflows should fail loudly and require human intervention rather than automatic recovery.
External systems without rollback
If your workflow interacts with systems that don’t support rollback or compensation, partial re-execution could leave inconsistent state.
Best Practices
Make steps idempotent
Design steps to produce the same result when called multiple times with the same input. Use idempotency keys for external API calls.
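One way to do this, sketched with a hypothetical @step decorator and a payment API that deduplicates requests on an Idempotency-Key header (neither is part of PyWorkflow’s documented API):

```python
import requests

from pyworkflow import step  # decorator name assumed for illustration

@step
def charge_card(order_id: str, amount_cents: int) -> dict:
    # Derive the idempotency key from the workflow's own input so that a
    # re-executed step sends the same key and the payment provider
    # deduplicates the charge.
    response = requests.post(
        "https://payments.example.com/v1/charges",   # hypothetical endpoint
        json={"order_id": order_id, "amount": amount_cents},
        headers={"Idempotency-Key": f"charge-{order_id}"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()
```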
Set appropriate recovery limits
Don’t allow unlimited recovery attempts. Set max_recovery_attempts based on your tolerance for repeated failures.

Monitor interrupted workflows
Set up alerts for workflows that reach INTERRUPTED status frequently; this may indicate infrastructure issues.

Use durable mode for critical workflows
Critical business workflows should always use durable mode to ensure proper recovery.