S3 Data Processing at Scale: The Power of Manifests


February 27, 2026


Learn how to use Amazon S3 manifests to safely reprocess millions of S3 objects at scale without data ambiguity or production incidents.

As consultants, we operate in the realm of clients with mature systems, and one of the most common homes for those systems is AWS. Amazon S3 (Simple Storage Service) remains one of the most popular enterprise storage solutions.

Let’s say your team’s system runs into a problem. You talk it over and you realize the answer is this:

“We need to reprocess everything in this S3 bucket.”

Maybe you’re correcting a bug in how objects were written. Maybe you need to backfill an entirely new data pipeline. Whatever the reason, the task sounds simple:

“Loop over the bucket.”

But at scale, processing S3 data is not just a scripting task. It’s a distributed systems problem. The lack of a control layer can be the difference between a smooth backfill and a production incident. Manifests are one approach to provide control.

In this post, we’ll discuss the use of manifests as a launchpad for large-scale inventory processing workflows. We’ll go over the problems of naive iteration at scale, what manifests are, why they are useful, and options for creating them.

Why Naive S3 Bucket Iteration Fails at Scale

Let’s review a couple of facts about S3 buckets, then look at how one might start processing their data.

  1. S3’s listing API uses cursor-based pagination
  2. A bucket is a moving target; it may receive updates while you are listing through it

The most common starting point looks something like this:

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        process(obj["Key"])

This works fine for thousands of objects. It becomes fragile for millions. When you list and process in the same loop, you tightly couple:

  • Discovery of work
  • Execution of work

This creates several risks. Conflating the two tasks makes scaling harder: S3’s cursor-based listing makes the discovery portion of the work difficult to parallelize, there is no granular failure state for jobs, and the coupling of discovery and execution means every failed job costs more.

Most importantly, this approach means:

You cannot scope the work

If someone asks:

“Exactly which objects did we process?”

Can you answer?

Without a manifest, the answer is usually:

“Everything under that prefix… as of whenever the job ran.”

That’s not a contract. That’s a description.

At scale, you want:

“Here is the exact list of 12,482,391 keys we intended to process.”

The result of naive processing isn’t usually catastrophic data corruption. It’s a lack of information. It’s a loss of confidence in data consistency and auditability. In short, it’s ambiguity.

Ambiguity at scale becomes operational stress.

How S3 Manifests Fix Large-Scale Data Processing

To perform the necessary decoupling and take back control in our architecture, we start by creating a manifest. Then we use that manifest to perform the processing.

A manifest gives you:

  • An immutable snapshot of scope
  • Clear separation between selection and execution
  • The ability to shard work safely. Once a manifest is created, you can split your manifest into chunks and fan out
  • Reproducibility (you can re-run the same manifest, or parts of that manifest)
  • Granularity. You can create a much clearer picture of what was intended to be processed, and how much was processed
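The sharding point above can be sketched in a few lines of Python. This is illustrative only — the key names and `shard_size` are made up — but it shows how trivially a fixed manifest splits into units of fan-out:

```python
def shard_manifest(keys, shard_size):
    """Split a manifest's key list into fixed-size shards for fan-out."""
    return [keys[i:i + shard_size] for i in range(0, len(keys), shard_size)]

# Illustrative keys; a real manifest would hold millions of entries.
keys = [f"data/2023/01/file{i}.json" for i in range(10)]
shards = shard_manifest(keys, shard_size=4)
# Three shards: two of 4 keys and one of 2.
```

Each shard can then go to an independent worker, and a failed shard can be retried without touching the rest of the manifest.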

Who S3 Manifest-Driven Processing Is For

It’s important to note this approach is geared toward teams that need to reprocess a lot of data; reacting to live updates in S3 is better handled with mechanisms like S3 Event Notifications. However, if you or your team are:

  • Planning a large backfill
  • Correcting or adjusting data on many objects
  • Building an ETL pipeline backed by existing S3 data

Manifests act as an excellent starting point for workflows where scale introduces significant complexity. The next step might be plugging that manifest into tools like AWS Glue or Apache Airflow, which let you programmatically orchestrate said workflows. Some of these tools may even handle manifest generation, although scoping features may be limited. Unfortunately, use of data orchestration tools is out of scope for this article.

What a Manifest Really Is

A manifest is simply a file with an explicit list of objects you intend to process. Optionally, a manifest can include some metadata, such as storage class.

Example:

Bucket,Key,StorageClass
my-bucket,data/2023/01/file1.json,STANDARD
my-bucket,data/2023/01/file2.json,EXPRESS_ONEZONE

So, instead of discovering work while executing it, you:

  • Generate a fixed list of keys
  • Store it as a file
  • Treat it as the contract for the reprocessing
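Reading such a contract back into a work list is straightforward. A minimal sketch, assuming the header and column layout of the example above (a real inventory file may differ):

```python
import csv
import io

# Inline sample matching the manifest layout shown above.
manifest_csv = """\
Bucket,Key,StorageClass
my-bucket,data/2023/01/file1.json,STANDARD
my-bucket,data/2023/01/file2.json,EXPRESS_ONEZONE
"""

def load_manifest(fileobj):
    """Yield (bucket, key) pairs from a CSV manifest with a header row."""
    for row in csv.DictReader(fileobj):
        yield row["Bucket"], row["Key"]

work = list(load_manifest(io.StringIO(manifest_csv)))
```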

This manifest can be stored in a variety of formats, but the most common AWS-supported options are CSV, ORC, and Parquet. In fact, manifests are a cornerstone of Amazon’s own batch processing offering, S3 Batch Operations.

How to Generate an S3 Manifest: S3 Inventory vs DIY Scans

We’ll discuss two ways to generate a manifest for Amazon S3.

Option 1: S3 Inventory Report

Amazon S3 Inventory generates object listings for an entire bucket (or a single prefix) on a fixed schedule. It is configured on a per-bucket basis, and it can take up to 48 hours for the first inventory report to be delivered.

Output formats

  • CSV
  • ORC
  • Parquet

Scheduling options

  • Daily
  • Weekly

The simplicity of this approach is valuable: you can set up S3 Inventory on a bucket in a matter of minutes. However, this simplicity comes with some tradeoffs.

Reports cannot be generated on demand.

You have restricted options for emitting additional metadata.

You are very limited in how you scope your manifest: you can only choose between latest version and all versions, and filter by a single prefix path.

Inventory files are generated from S3’s internal metadata index and reflect the bucket’s state at a single scheduled snapshot timestamp chosen by S3. All objects listed in a given report correspond to that same point in time.

You do not control when that snapshot is taken, and the report may be up to 24 hours (or 7 days) old when delivered.

S3 Inventory is well suited for workloads where:

  • A coherent point-in-time view of the bucket is required (atomicity)
  • Operational simplicity is preferred
  • Snapshot delay is acceptable

Inventory provides simplicity and consistency of scope, but not immediacy or flexibility.

Option 2: DIY Scans

You can write your own manifests directly using paginated ListObjectsV2 or ListObjectVersions calls. This is similar to our naive approach, with the key difference that we are just generating the listing of objects. We’re not doing any processing yet.

This approach gives you full control over when enumeration begins and how objects are selected.
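A minimal sketch of such a scan, structured so the enumeration source is pluggable. Here `pages` would come from boto3’s `list_objects_v2` paginator, and the column layout mirrors the manifest example earlier; names are illustrative:

```python
import csv

def write_manifest(pages, bucket, out_path):
    """Write a CSV manifest (Bucket,Key,StorageClass) from ListObjectsV2 pages.

    `pages` is any iterable of ListObjectsV2-shaped responses, e.g.
    s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix).
    Returns the number of rows written.
    """
    count = 0
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Bucket", "Key", "StorageClass"])
        for page in pages:
            for obj in page.get("Contents", []):
                writer.writerow([bucket, obj["Key"],
                                 obj.get("StorageClass", "STANDARD")])
                count += 1
    return count
```

Because the function only consumes an iterable of pages, you can filter objects (by suffix, size, last-modified, and so on) before a row is written — exactly the scoping flexibility that Inventory lacks.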

Custom scans are appropriate when you need:

  • On-demand manifest generation
  • Complex filtering logic to scope your work
  • Precise control over selection criteria (e.g. more than one bucket)

It’s important to note this approach is not atomic: you are paginating through a live listing API, not a snapshot of the bucket at a given point in time. The contents of your bucket(s) may change while you paginate.

You also assume responsibility for:

  • Pagination correctness
  • API cost management
  • Failure handling

Custom scans provide immediacy and flexibility, but require more operational discipline.

S3 Manifest Generation: S3 Inventory vs DIY Scans

Use this comparison to decide whether S3 Inventory or a custom scan is a better fit for your manifest generation needs.

S3 Inventory Report
  • Best for: point-in-time snapshots, simplicity, automation by AWS
  • Pros: atomic; set up in minutes
  • Cons: no control over scheduling; limited scoping

DIY Scans
  • Best for: on-demand or complexly filtered manifests
  • Pros: flexible; immediate; scope control
  • Cons: developer overhead; demands more cost control; operates against a moving target

When to Use S3 Manifests for Reprocessing at Scale

Your requirements for manifest generation may vary, but this isn’t an unsolved problem, and there are open source options.

By creating a manifest, we make scope explicit before execution begins. A manifest turns an implicit operation into a defined contract: a concrete set of objects that represent the intended work. Once that contract exists, the rest of the system becomes easier to reason about. Optimization becomes easier. Progress is measurable. Auditing becomes straightforward. The operational surface area shrinks because ambiguity has been removed.
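As an illustration of that measurability, here is a hypothetical driver that runs each manifest entry and records outcomes (`process` is a stand-in for your real per-object work):

```python
def run_manifest(entries, process):
    """Process each (bucket, key) manifest entry, recording outcomes.

    Returns (succeeded, failed) lists so a re-run can target only failures.
    """
    succeeded, failed = [], []
    for bucket, key in entries:
        try:
            process(bucket, key)
            succeeded.append((bucket, key))
        except Exception as exc:
            failed.append((bucket, key, str(exc)))
    return succeeded, failed
```

Notice that the `failed` list is itself a manifest: re-running just those keys, rather than the whole job, is the reproducibility benefit in action.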

At scale, S3 manifest–driven processing is one of those small architectural decisions that quietly pays dividends—especially when you need safe, repeatable reprocessing instead of one-off scripts.

Hopefully this post has helped illustrate some of the benefits of using a manifest to drive data processing. Thanks for spending the time to explore this topic. I hope it proves useful in your own systems.

If your team is dealing with large S3 reprocessing or ambiguous data pipelines, our consultants can help design manifest-driven workflows tailored to your environment.

About The Author

More From Ryan Cross

