Error Handling in AWS Step Functions: Retry and Catch Patterns for Resilient Workflows
Learn how to handle errors gracefully in AWS Step Functions workflows. We'll explore retry configurations for transient failures and catch blocks for handling exceptions that shouldn't stop your workflow. Build more resilient serverless applications on AWS.
Workflows fail. Networks time out, services go down temporarily, and sometimes the data you receive is just invalid.
The question isn't if something will go wrong — it's how you'll recover from it.
In this post, let's explore two key techniques for handling errors in AWS Step Functions:
- How to retry steps that fail temporarily
- How to catch errors and prevent workflow termination
Thanks to AWS for sponsoring this article in my StepFunction Series.
The User Onboarding Workflow
Let's start with the user onboarding workflow we built in a previous blog post, where we integrated with multiple AWS services.
The workflow includes:
- Lambda function invocation for processing user data
- DynamoDB writes for storing user information
- SQS message sending for downstream processing
In this demo, we'll focus on error handling in the Lambda function step, but these same patterns apply to any state in your Step Functions workflow.
Simulating Errors in Lambda
To demonstrate error handling, let's modify the Lambda function to simulate different types of failures.
First, update the Lambda state to pass additional parameters:
    {
      "user": {
        "userId": "user-12345",
        "email": "john.doe@example.com",
        "name": "John Doe",
        "signupDate": "2025-10-20T10:30:00Z",
        "source": "transient",
        "metadata": {
          "referralCode": "FRIEND2025",
          "country": "US"
        }
      },
      "retryCount.$": "$$.State.RetryCount"
    }
The retryCount value comes from the Step Functions context object, which tracks how many times a state has been retried.
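For context, this is roughly where that payload sits in the state definition. It is a minimal sketch, assuming the state uses the optimized Lambda integration in JSONPath mode. The state names, the function ARN, and the "user.$": "$.user" mapping are placeholders for illustration, not values from the original workflow.

```json
{
  "Process User Data": {
    "Type": "Task",
    "Comment": "Sketch only: state names and function ARN are hypothetical placeholders",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {
      "FunctionName": "arn:aws:lambda:us-east-1:123456789012:function:ProcessUser",
      "Payload": {
        "user.$": "$.user",
        "retryCount.$": "$$.State.RetryCount"
      }
    },
    "Next": "Store User"
  }
}
```

Keys ending in .$ are evaluated as paths rather than literal strings. A single $ reads from the state input, while $$ reads from the context object.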
Now modify the Lambda function to throw different exception types:
    using System;
    using Amazon.Lambda.Core;

    public class Function
    {
        // UserRequest (defined elsewhere) carries the User payload and the
        // RetryCount value passed in by the Step Functions state.
        public string FunctionHandler(UserRequest request, ILambdaContext context)
        {
            var source = request.User.Source;

            if (source == "invalid")
            {
                // Permanent error: retrying will not fix bad input
                throw new InvalidUserDataException("User data is invalid");
            }

            if (source == "transient")
            {
                // Simulate a transient error that resolves after retries
                if (request.RetryCount < 2)
                {
                    throw new LambdaServiceException("Temporary service error");
                }
            }

            // Process user normally
            return $"Successfully processed user {request.User.UserId}";
        }
    }

    public class InvalidUserDataException : Exception
    {
        public InvalidUserDataException(string message) : base(message) { }
    }

    public class LambdaServiceException : Exception
    {
        public LambdaServiceException(string message) : base(message) { }
    }
The LambdaServiceException simulates a transient error — the first two attempts will fail, but the third will succeed. This mimics real-world scenarios like network timeouts or temporary service unavailability.
The InvalidUserDataException represents a permanent error that won't be fixed by retrying.
Retry: Handling Transient Failures in AWS Step Functions
Let's execute the workflow without any retry configuration and see what happens when an error occurs.
Start an execution with the following input:
    {
      "user": {
        "userId": "user-12345",
        "email": "john.doe@example.com",
        "name": "John Doe",
        "signupDate": "2025-10-20T10:30:00Z",
        "source": "transient",
        "metadata": {
          "referralCode": "FRIEND2025",
          "country": "US"
        }
      }
    }
When the Lambda function throws the exception, the entire workflow stops and terminates with a failure.
Looking at the execution history, you'll see the Task failed with error type: LambdaServiceException
This happens even though the error is transient and might succeed on retry. When a state fails and has no error handling configured, the execution terminates immediately and subsequent states don't run.
Whether this is acceptable depends on your use case. Sometimes you want the workflow to stop completely. Other times, you want to handle the error gracefully and continue processing.
Configuring Retry Logic
Step Functions allows you to configure automatic retries for failed states.
Navigate to your Lambda Invoke state in the Step Functions visual editor and select Error handling. You'll see that by default, Step Functions includes a retry configuration for common Lambda errors.
Click Edit on the retry configuration to add your custom exception, LambdaServiceException.

The retry configuration includes several important parameters:
- Interval Seconds: How long to wait before the first retry (default: 1 second)
- Max Attempts: Maximum number of retry attempts (default: 3)
- Backoff Rate: Multiplier applied to the interval for each subsequent retry (default: 2)
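With an interval of 1 second, max attempts of 3, and a backoff rate of 2, Step Functions waits 1 second before the first retry, 2 seconds before the second, and 4 seconds before the third. This exponential backoff avoids overwhelming the downstream service and gives temporary issues time to resolve.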
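In the state machine definition, a retrier like this appears as a Retry array on the state. The fragment below is a sketch rather than the exact definition from this workflow. It assumes the default retrier lists the four Lambda error names shown and that the custom exception is appended to the same list. The state names are hypothetical, and Parameters is omitted for brevity.

```json
{
  "Process User Data": {
    "Type": "Task",
    "Comment": "Sketch only: state names are hypothetical, Parameters omitted for brevity",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Retry": [
      {
        "ErrorEquals": [
          "Lambda.ServiceException",
          "Lambda.AWSLambdaException",
          "Lambda.SdkClientException",
          "Lambda.TooManyRequestsException",
          "LambdaServiceException"
        ],
        "IntervalSeconds": 1,
        "MaxAttempts": 3,
        "BackoffRate": 2
      }
    ],
    "Next": "Store User"
  }
}
```

A state can hold several retriers, each with its own error names and timing, so a custom exception could also get a separate entry if it needs a different schedule.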
Learn more about retry configuration in the AWS documentation.
Save the updated state machine and start a new execution with the same transient error input.
This time, the Lambda function will fail on the first two attempts and succeed on the third.
Looking at the execution history, you'll see multiple TaskScheduled events with increasing retry counts. The workflow eventually completes successfully without manual intervention.
By adding retry logic, we've made the workflow resilient to transient failures like network errors, temporary service outages, or rate limiting — scenarios where a simple retry is likely to succeed.
Catch: Handling Errors Gracefully in AWS Step Functions
Some errors can't be fixed by retrying. Invalid input data, business rule violations, or authorization failures won't resolve on their own.
For these scenarios, use catch blocks to intercept errors and define alternative paths.
Adding a Catch Block
Select your Lambda Invoke state and navigate to Error handling. Click Add new catcher.
Configure the catcher by specifying the error name it handles (InvalidUserDataException) and the fallback state the workflow should transition to when that error occurs. In this example, the fallback is a Pass state.

The Pass state is a simple state that passes its input to output without performing work. It's useful for adding comments or placeholders in your workflow.
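The resulting definition might look roughly like the fragment below. It is a sketch under assumptions: the state names are hypothetical, Parameters is omitted, and the ResultPath setting is an optional addition (not part of the original walkthrough) that keeps the original input and attaches the error details under an error key.

```json
{
  "Process User Data": {
    "Type": "Task",
    "Comment": "Sketch only: state names are hypothetical, Parameters omitted for brevity",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Catch": [
      {
        "ErrorEquals": ["InvalidUserDataException"],
        "ResultPath": "$.error",
        "Next": "Handle Invalid User"
      }
    ],
    "Next": "Store User"
  },
  "Handle Invalid User": {
    "Type": "Pass",
    "Comment": "Hypothetical fallback state that simply ends the workflow",
    "End": true
  }
}
```

Without ResultPath, the fallback state receives only the error output (an object with Error and Cause fields) in place of the original input.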
Now your workflow has an alternative path:
- Lambda succeeds → Continue to next state
- Lambda throws InvalidUserDataException → Go to Pass state → End
When the workflow executes with input data that triggers an error path (e.g., "source": "invalid"), the catch block intercepts the exception and transitions to the Pass state. The workflow completes successfully rather than terminating with a failure — it handles the error gracefully.
Depending on your business requirements, you might want to continue processing even when an error occurs.
Retry vs Catch: When to Use Each
Understanding when to use retry versus catch is critical for building resilient workflows.
Use Retry For:
- Network errors: Temporary connection issues
- Service throttling: Rate limit errors (HTTP 429 or 503)
- Timeout errors: Requests that occasionally take too long
- Transient service failures: Services that are temporarily unavailable
These are errors where the same operation is likely to succeed if you try again.
Use Catch For:
- Invalid input: Data that doesn't meet validation rules
- Business rule violations: Operations that shouldn't be retried
- Authorization errors: Insufficient permissions (unlikely to change on retry)
- Resource not found errors: Missing records or resources
These are errors where retrying won't help — you need to handle them differently.
You can use both patterns together. This gives you resilience for temporary issues while still handling permanent failures appropriately.
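When a state defines both, Step Functions applies the matching retrier first and only falls through to a catcher once the retries are exhausted or no retrier matches. A sketch of a combined configuration follows. The state names and the extra States.ALL catch-all are assumptions added for illustration, and Parameters is omitted.

```json
{
  "Process User Data": {
    "Type": "Task",
    "Comment": "Sketch only: state names are hypothetical, Parameters omitted for brevity",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Retry": [
      {
        "ErrorEquals": ["LambdaServiceException"],
        "IntervalSeconds": 1,
        "MaxAttempts": 3,
        "BackoffRate": 2
      }
    ],
    "Catch": [
      {
        "ErrorEquals": ["InvalidUserDataException"],
        "Next": "Handle Invalid User"
      },
      {
        "ErrorEquals": ["States.ALL"],
        "Next": "Handle Unexpected Failure"
      }
    ],
    "Next": "Store User"
  }
}
```

Here a LambdaServiceException that still fails after three retries lands in the States.ALL catcher, while an InvalidUserDataException skips retrying and goes straight to its own fallback. States.ALL has to appear alone in its ErrorEquals list and in the last catcher.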
The patterns we've explored aren't limited to Lambda functions. Every AWS service integration in Step Functions supports error handling.
Navigate to any state in your workflow, go to Error handling, and configure retries and catchers specific to that service's error types.
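As an illustration, the DynamoDB write step could be guarded in the same way. This is a sketch only: the table name, state names, and condition expression are hypothetical, and the service-prefixed error names are assumptions that should be checked against the error shown in your own execution history.

```json
{
  "Store User": {
    "Type": "Task",
    "Comment": "Sketch only: table, state names, and error names are assumptions to verify",
    "Resource": "arn:aws:states:::dynamodb:putItem",
    "Parameters": {
      "TableName": "Users",
      "Item": {
        "userId": { "S.$": "$.user.userId" },
        "email": { "S.$": "$.user.email" }
      },
      "ConditionExpression": "attribute_not_exists(userId)"
    },
    "Retry": [
      {
        "ErrorEquals": ["DynamoDB.ProvisionedThroughputExceededException"],
        "IntervalSeconds": 1,
        "MaxAttempts": 3,
        "BackoffRate": 2
      }
    ],
    "Catch": [
      {
        "ErrorEquals": ["DynamoDB.ConditionalCheckFailedException"],
        "Next": "Handle Existing User"
      }
    ],
    "Next": "Send Message"
  }
}
```

The split mirrors the Lambda example: throttling is retried because it is likely to clear, while a failed condition check (the user already exists) is caught because retrying cannot change the outcome.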