S3 Data Processing at Scale: The Power of Manifests


February 27, 2026


Learn how to use Amazon S3 manifests to safely reprocess millions of S3 objects at scale without data ambiguity or production incidents.

As consultants, we operate in the realm of clients with mature systems, and one of the most common homes for those systems is AWS. Amazon S3 (Simple Storage Service) remains one of the most popular enterprise storage solutions.

Let’s say your team’s system runs into a problem. You talk it over and you realize the answer is this:

“We need to reprocess everything in this S3 bucket.”

Maybe you’re correcting a bug in how objects were written. Maybe you need to backfill an entirely new data pipeline. Whatever the reason, the task sounds simple:

“Loop over the bucket.”

But at scale, processing S3 data is not just a scripting task. It’s a distributed systems problem. The lack of a control layer can be the difference between a smooth backfill and a production incident. Manifests are one approach to provide control.

In this post, we’ll discuss the use of manifests as a launchpad for large-scale inventory processing workflows. We’ll go over the problems of naive iteration at scale, what manifests are, why they are useful, and options for creating them.

Why Naive S3 Bucket Iteration Fails at Scale

Let’s review a couple of facts about S3 buckets, then look at how one might start processing their data.

  1. S3’s listing API uses cursor-based pagination
  2. A bucket is a moving target; it may receive updates while you are listing through it

The most common starting point looks something like this:

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        process(obj["Key"])

This works fine for thousands of objects. It becomes fragile for millions. When you list and process in the same loop, you tightly couple:

  • Discovery of work
  • Execution of work

This creates several risks. Conflating the two tasks makes scaling harder: S3’s cursor-based listing makes the discovery portion of the work difficult to parallelize, there is no granular failure state for jobs, and the coupling of discovery and execution means every failed job costs more.

Most importantly, this approach means:

You cannot scope the work

If someone asks:

“Exactly which objects did we process?”

Can you answer?

Without a manifest, the answer is usually:

“Everything under that prefix… as of whenever the job ran.”

That’s not a contract. That’s a description.

At scale, you want:

“Here is the exact list of 12,482,391 keys we intended to process.”

The result of naive processing isn’t usually catastrophic data corruption. It’s a lack of information. It’s a loss of confidence in data consistency and auditability. In short, it’s ambiguity.

Ambiguity at scale becomes operational stress.

How S3 Manifests Fix Large-Scale Data Processing

To perform the necessary decoupling and take back control in our architecture, we start by creating a manifest. Then we use that manifest to perform the processing.

A manifest gives you:

  • An immutable snapshot of scope
  • Clear separation between selection and execution
  • The ability to shard work safely. Once a manifest is created, you can split your manifest into chunks and fan out
  • Reproducibility (you can re-run the same manifest, or parts of that manifest)
  • Granularity. You can create a much clearer picture of what was intended to be processed, and how much was processed
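The sharding point above can be sketched in a few lines of Python. This is illustrative only — the key names and `shard_size` are made up — but it shows how trivially a fixed manifest splits into units of fan-out:

```python
def shard_manifest(keys, shard_size):
    """Split a manifest's key list into fixed-size shards for fan-out."""
    return [keys[i:i + shard_size] for i in range(0, len(keys), shard_size)]

# Illustrative keys; a real manifest would hold millions of entries.
keys = [f"data/2023/01/file{i}.json" for i in range(10)]
shards = shard_manifest(keys, shard_size=4)
# Three shards: two of 4 keys and one of 2.
```

Each shard can then go to an independent worker, and a failed shard can be retried without touching the rest of the manifest.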

Who S3 Manifest-Driven Processing Is For

It’s important to note this approach is geared toward teams that need to reprocess a lot of data; reacting to live updates in S3 is better handled with mechanisms like S3 Event Notifications. However, if you or your team are:

  • Planning a large backfill
  • Correcting or adjusting data on many objects
  • Building an ETL pipeline backed by existing S3 data

Manifests act as an excellent starting point for workflows where scale introduces significant complexity. The next step might be plugging that manifest into tools like AWS Glue or Apache Airflow, which let you programmatically orchestrate said workflows. Some of these tools may even handle manifest generation, although scoping features may be limited. Unfortunately, use of data orchestration tools is out of scope for this article.

What a Manifest Really Is

A manifest is simply a file with an explicit list of objects you intend to process. Optionally, a manifest can include some metadata, such as storage class.

Example:

Bucket,Key,StorageClass
my-bucket,data/2023/01/file1.json,STANDARD
my-bucket,data/2023/01/file2.json,EXPRESS_ONEZONE

So, instead of discovering work while executing it, you:

  • Generate a fixed list of keys
  • Store it as a file
  • Treat it as the contract for the reprocessing
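Reading such a contract back into a work list is straightforward. A minimal sketch, assuming the header and column layout of the example above (a real inventory file may differ):

```python
import csv
import io

# Inline sample matching the manifest layout shown above.
manifest_csv = """\
Bucket,Key,StorageClass
my-bucket,data/2023/01/file1.json,STANDARD
my-bucket,data/2023/01/file2.json,EXPRESS_ONEZONE
"""

def load_manifest(fileobj):
    """Yield (bucket, key) pairs from a CSV manifest with a header row."""
    for row in csv.DictReader(fileobj):
        yield row["Bucket"], row["Key"]

work = list(load_manifest(io.StringIO(manifest_csv)))
```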

This manifest can be stored in a variety of formats, but the most common AWS-supported options are CSV, ORC, and Parquet. In fact, manifests are a cornerstone of Amazon’s own batch processing offering, S3 Batch Operations.

How to Generate an S3 Manifest: S3 Inventory vs DIY Scans

We’ll discuss two ways to generate a manifest for Amazon S3.

Option 1: S3 Inventory Report

Amazon S3 Inventory generates object listings for an entire bucket (or a single prefix) on a fixed schedule. It is configured on a per-bucket basis, and it can take up to 48 hours for the first inventory report to be delivered.

Output formats

  • CSV
  • ORC
  • Parquet

Scheduling options

  • Daily
  • Weekly

The simplicity of this approach is valuable: you can set up S3 Inventory on a bucket in a matter of minutes. However, this simplicity comes with some tradeoffs.

Reports cannot be generated on demand.

You have restricted options for emitting additional metadata.

You are very limited in how you scope your manifest: you can only choose between latest version and all versions, and filter by a single prefix path.

Inventory files are generated from S3’s internal metadata index and reflect the bucket’s state at a single scheduled snapshot timestamp chosen by S3. All objects listed in a given report correspond to that same point in time.

You do not control when that snapshot is taken, and the report may be up to 24 hours (or 7 days) old when delivered.

S3 Inventory is well suited for workloads where:

  • A coherent point-in-time view of the bucket is required (atomicity)
  • Operational simplicity is preferred
  • Snapshot delay is acceptable

Inventory provides simplicity and consistency of scope, but not immediacy or flexibility.

Option 2: DIY Scans

You can write your own manifests directly using paginated ListObjectsV2 or ListObjectVersions calls. This is similar to our naive approach, with the key difference that we are just generating the listing of objects. We’re not doing any processing yet.

This approach gives you full control over when enumeration begins and how objects are selected.
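A minimal sketch of such a scan, structured so the enumeration source is pluggable. Here `pages` would come from boto3’s `list_objects_v2` paginator, and the column layout mirrors the manifest example earlier; names are illustrative:

```python
import csv

def write_manifest(pages, bucket, out_path):
    """Write a CSV manifest (Bucket,Key,StorageClass) from ListObjectsV2 pages.

    `pages` is any iterable of ListObjectsV2-shaped responses, e.g.
    s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix).
    Returns the number of rows written.
    """
    count = 0
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Bucket", "Key", "StorageClass"])
        for page in pages:
            for obj in page.get("Contents", []):
                writer.writerow([bucket, obj["Key"],
                                 obj.get("StorageClass", "STANDARD")])
                count += 1
    return count
```

Because the function only consumes an iterable of pages, you can filter objects (by suffix, size, last-modified, and so on) before a row is written — exactly the scoping flexibility that Inventory lacks.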

Custom scans are appropriate when you need:

  • On-demand manifest generation
  • Complex filtering logic to scope your work
  • Precise control over selection criteria (e.g. more than one bucket)

It’s important to note this approach is not atomic: you are paginating through a live listing API, not a snapshot of the bucket at a given point in time. The contents of your bucket(s) may change while you paginate.

You also assume responsibility for:

  • Pagination correctness
  • API cost management
  • Failure handling

Custom scans provide immediacy and flexibility, but require more operational discipline.

S3 Manifest Generation: S3 Inventory vs DIY Scans

Use this comparison to decide whether S3 Inventory or a custom scan is a better fit for your manifest generation needs.

S3 Inventory Report
  • Best for: point-in-time snapshots, simplicity, automation by AWS
  • Pros: atomic; set up in minutes
  • Cons: no control over scheduling; limited scoping

DIY Scans
  • Best for: on-demand or complexly filtered manifests
  • Pros: flexible; immediate; scope control
  • Cons: developer overhead; demands more cost control; operates against a moving target

When to Use S3 Manifests for Reprocessing at Scale

Your requirements for manifest generation may vary, but this isn’t an unsolved problem, and there are open source options.

By creating a manifest, we make scope explicit before execution begins. A manifest turns an implicit operation into a defined contract: a concrete set of objects that represent the intended work. Once that contract exists, the rest of the system becomes easier to reason about. Optimization becomes easier. Progress is measurable. Auditing becomes straightforward. The operational surface area shrinks because ambiguity has been removed.
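As an illustration of that measurability, here is a hypothetical driver that runs each manifest entry and records outcomes (`process` is a stand-in for your real per-object work):

```python
def run_manifest(entries, process):
    """Process each (bucket, key) manifest entry, recording outcomes.

    Returns (succeeded, failed) lists so a re-run can target only failures.
    """
    succeeded, failed = [], []
    for bucket, key in entries:
        try:
            process(bucket, key)
            succeeded.append((bucket, key))
        except Exception as exc:
            failed.append((bucket, key, str(exc)))
    return succeeded, failed
```

Notice that the `failed` list is itself a manifest: re-running just those keys, rather than the whole job, is the reproducibility benefit in action.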

At scale, S3 manifest–driven processing is one of those small architectural decisions that quietly pays dividends—especially when you need safe, repeatable reprocessing instead of one-off scripts.

Hopefully this post has helped illustrate some of the benefits of using a manifest to drive data processing. Thanks for spending the time to explore this topic. I hope it proves useful in your own systems.

If your team is dealing with large S3 reprocessing or ambiguous data pipelines, our consultants can help design manifest-driven workflows tailored to your environment.

About The Author

More From Ryan Cross

