Knowledge Hub

Practical field notes for cloud database delivery.

A focused technical library covering AWS databases, PostgreSQL, Oracle modernisation, migration, HA/DR, FinOps and production reliability. Use it as a fast signal of how I assess risk, plan cutovers and communicate production decisions, without overloading the homepage.

How to use this page

Start with the field guide, then browse by topic.

The articles are written as short delivery briefs for engineering managers, recruiters and partner teams: what matters, what can go wrong, and what a practical next step looks like in production.

★ Start here · AWS · Migration

Oracle to Aurora PostgreSQL: A Field Guide for ANZ Enterprises

The real decision framework for Oracle-to-AWS migration — not the marketing version. If you only read one note, read this: it covers PL/SQL survival rates, the realistic project-length signals, and where SCT and DMS actually help.

Every Oracle-to-PostgreSQL migration starts with the same question: how much of our PL/SQL will survive? The answer determines whether you're looking at a 3-month project or a 12-month programme.

In my experience across enterprise migrations in New Zealand and Australia, the pattern is consistent. AWS Schema Conversion Tool (SCT) handles roughly 60–80% of Oracle PL/SQL automatically. The remaining 20–40% (usually DBMS_SCHEDULER, UTL_FILE, Oracle-specific analytic functions, and complex cursor logic) needs manual rewriting. This is where most project timelines go wrong: teams underestimate the manual conversion effort by 2–3×.
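
The ranges above translate into a quick back-of-envelope calculation. This is a minimal sketch, assuming you already have an object count from the SCT report and a first-pass effort estimate; the function names and the sample figures (1,200 objects, 40 days) are hypothetical.

```python
def manual_conversion_range(total_objects, auto_low=0.60, auto_high=0.80):
    """Return (fewest, most) PL/SQL objects likely to need manual rewriting."""
    fewest = round(total_objects * (1 - auto_high))
    most = round(total_objects * (1 - auto_low))
    return fewest, most


def padded_effort_days(estimated_days, factor_low=2, factor_high=3):
    """Widen a first-pass estimate by the 2-3x underestimation pattern."""
    return estimated_days * factor_low, estimated_days * factor_high


# Hypothetical estate: 1,200 PL/SQL objects, first-pass estimate of 40 days.
fewest, most = manual_conversion_range(1200)
print(f"Manual rewrites: {fewest} to {most} objects")
print("Padded manual effort (days):", padded_effort_days(40))
```

Treat the output as a planning range for the first conversation, not a quote; the real number comes from reading the SCT report item by item.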

The migration sequence that works

1. Run SCT and get an honest conversion report; read the red items, don't skim them.
2. Manually fix every "red" item up front; they do not sort themselves out later.
3. Set up AWS DMS with full-load + CDC so the target stays in sync while you validate.
4. Run validation passes: row counts, checksums, and application-level smoke tests.
5. Cut over in a short maintenance window once CDC lag is near zero.
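
The validation and cutover steps above can be sketched as a simple gate. This is an illustration only: it assumes rows have already been fetched from both sides and normalised to comparable Python values (Oracle and PostgreSQL drivers often return different types for the same data), and the function names and the 5-second lag threshold are hypothetical.

```python
import hashlib


def table_fingerprint(rows):
    """Return (row_count, checksum) for an iterable of row tuples.

    Rows are sorted first so the checksum does not depend on fetch order.
    """
    digest = hashlib.sha256()
    count = 0
    for row in sorted(rows, key=repr):
        digest.update(repr(row).encode("utf-8"))
        count += 1
    return count, digest.hexdigest()


def validate_tables(source, target):
    """Compare per-table fingerprints; return a dict of table -> problem."""
    problems = {}
    for table, src_rows in source.items():
        if table not in target:
            problems[table] = "missing on target"
            continue
        src_count, src_sum = table_fingerprint(src_rows)
        tgt_count, tgt_sum = table_fingerprint(target[table])
        if src_count != tgt_count:
            problems[table] = f"row count {src_count} != {tgt_count}"
        elif src_sum != tgt_sum:
            problems[table] = "checksum mismatch"
    return problems


def ready_to_cut_over(problems, cdc_lag_seconds, max_lag_seconds=5):
    """Gate the cutover: no validation problems and CDC lag near zero."""
    return not problems and cdc_lag_seconds <= max_lag_seconds


source = {"orders": [(1, "a"), (2, "b")]}
target = {"orders": [(2, "b"), (1, "a")]}
print(ready_to_cut_over(validate_tables(source, target), cdc_lag_seconds=2))
```

For multi-TB tables you would compute counts and checksums inside each database rather than pulling rows out; the gate logic stays the same.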

The trap Oracle shops don't see coming

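PostgreSQL's MVCC behaves differently. Oracle keeps old row versions in undo tablespaces; PostgreSQL keeps them in the table itself as dead tuples and relies on VACUUM to reclaim the space. A long-running transaction stops VACUUM from removing any dead tuple it might still need to see, so if your application has long-running transactions, plan for bloat management from day one, not as an afterthought.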
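
A first bloat check can be as small as the sketch below. The two queries use the standard pg_stat_user_tables and pg_stat_activity views; the helper names and the 20% threshold are assumptions for illustration, and a dead-tuple ratio is only a rough proxy for bloat.

```python
# Dead tuples per table, from the standard statistics view.
DEAD_TUPLE_SQL = """
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
"""

# Oldest open transactions, which hold back VACUUM.
LONG_TXN_SQL = """
SELECT pid, now() - xact_start AS xact_age, state
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start;
"""


def dead_tuple_ratio(n_live_tup, n_dead_tup):
    """Share of a table's tuples that are dead (0.0 for an empty table)."""
    total = n_live_tup + n_dead_tup
    return n_dead_tup / total if total else 0.0


def tables_needing_attention(stats, threshold=0.20):
    """stats: iterable of (relname, n_live_tup, n_dead_tup) rows.

    The 20% threshold is an assumed starting point, not a PostgreSQL default.
    """
    return [name for name, live, dead in stats
            if dead_tuple_ratio(live, dead) >= threshold]


# Hypothetical rows as they might come back from DEAD_TUPLE_SQL.
sample = [("orders", 800, 200), ("users", 990, 10)]
print(tables_needing_attention(sample))
```

Run both queries on a schedule during the parallel-run period so you know the steady-state numbers before cutover.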

For ANZ enterprises running Oracle 12c–19c on-premises with 500 GB–5 TB databases, the typical migration to Aurora PostgreSQL takes 4–8 months including application testing. The cost savings are real — typically 40–60% on licensing alone — but only if the migration is planned as an engineering project, not a checkbox exercise.

Bottom line: Budget for the 20–40% of PL/SQL that SCT can't convert, design for VACUUM and bloat before cutover, and treat the migration as an engineering programme. Do that and 40–60% licence savings are realistic; skip it and the timeline doubles.

DMS · Best Practices

AWS DMS: 10 Lessons from Enterprise Database Migrations

Hard-won lessons from enterprise DMS migration work — the gotchas teams often miss.

1. The replication instance is your bottleneck, not the source. DMS task throughput is bounded by the replication instance class. For multi-TB loads, start with dms.r5.4xlarge or larger. Undersizing here wastes days.

2. Supplemental logging is the #1 Oracle gotcha. For Oracle sources with CDC, enable supplemental logging at the database level AND on each table's primary key. Miss the table-level setting and CDC silently drops rows on UPDATE.

3. LOB mode matters enormously. Full LOB mode is correct but slow (one round trip per LOB). Limited LOB truncates above your configured size. Inline LOB (DMS 3.4.7+) gives best performance for LOBs under 8 KB. Choose deliberately — don't accept the default.

4. Table mapping is more powerful than people think. You can filter by schema, exclude audit tables, transform case (UPPER → lower), and add calculated columns — all in the JSON mapping rules. Read the documentation before writing application-side ETL.

5. Parallel load by column ranges beats parallel tables. For tables over 100 GB, add a table-settings rule to the table mapping with parallel-load type ranges and explicit column boundaries. This is faster than running multiple tables in parallel because the single largest table, which sets the total load time, is read in several segments at once.

6. Validation is optional but invaluable. EnableValidation runs row-level comparisons after migration. It adds 30–50% to migration time but catches silent data drift. Run it on a representative sample for huge tables.

7. CDC lag has three places to look. Source latency (reading redo/WAL), target latency (writing to destination), and network. Check CDCLatencySource and CDCLatencyTarget CloudWatch metrics separately — they need different fixes.

8. Large transactions stall everything. DMS holds a complete transaction in memory until COMMIT. One 50 GB transaction stalls the entire task. Batch your large operations, or use the BatchApplyEnabled setting.

9. Schema conversion comes first, always. DMS moves data. AWS SCT converts schema and code. Run SCT first, fix conversion issues, create the target schema, then start DMS. The other order creates chaos.

10. Keep a reverse-CDC option for 48 hours post-cutover. A DMS task running in reverse (new target → old source) gives you rollback capability during the first critical hours. It's cheap insurance.
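
Lessons 4 and 5 can be sketched together as one table-mapping document built in Python. The rule keys follow the DMS table-mapping JSON format as I understand it, so check them against the current DMS documentation before use; the schema HR, the AUDIT_% pattern, the ORDERS table and the ORDER_ID column are hypothetical.

```python
import json


def selection_rule(rule_id, schema, table="%", action="include"):
    """Include or exclude tables that match a schema/table pattern."""
    return {
        "rule-type": "selection",
        "rule-id": str(rule_id),
        "rule-name": str(rule_id),
        "object-locator": {"schema-name": schema, "table-name": table},
        "rule-action": action,
    }


def lowercase_tables_rule(rule_id, schema):
    """Transform table names to lower case on the target."""
    return {
        "rule-type": "transformation",
        "rule-id": str(rule_id),
        "rule-name": str(rule_id),
        "rule-target": "table",
        "object-locator": {"schema-name": schema, "table-name": "%"},
        "rule-action": "convert-lowercase",
    }


def range_boundaries(min_id, max_id, segments):
    """Upper bounds that split min_id..max_id into roughly equal segments."""
    step = (max_id - min_id + 1) // segments
    return [[str(min_id - 1 + step * i)] for i in range(1, segments)]


def parallel_load_rule(rule_id, schema, table, column, boundaries):
    """Load one large table in parallel segments split by column ranges."""
    return {
        "rule-type": "table-settings",
        "rule-id": str(rule_id),
        "rule-name": str(rule_id),
        "object-locator": {"schema-name": schema, "table-name": table},
        "parallel-load": {
            "type": "ranges",
            "columns": [column],
            "boundaries": boundaries,
        },
    }


def build_table_mapping():
    """Assemble the hypothetical mapping: include HR, skip audit tables."""
    return {"rules": [
        selection_rule(1, "HR"),
        selection_rule(2, "HR", "AUDIT_%", "exclude"),
        lowercase_tables_rule(3, "HR"),
        parallel_load_rule(4, "HR", "ORDERS", "ORDER_ID",
                           range_boundaries(1, 4_000_000, 4)),
    ]}


print(json.dumps(build_table_mapping(), indent=2))
```

Generating the JSON from code keeps rule IDs unique and makes the boundaries easy to recompute when the key range grows. Pick the boundary column from an indexed key so each segment is a cheap range scan on the source.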

Bottom line: Size the replication instance for the load, enable supplemental logging before CDC, choose LOB mode deliberately, and run SCT first. Most "DMS is slow/dropping data" incidents trace back to one of these four, not to DMS itself.

PostgreSQL · Operations

PostgreSQL Slot Bloat and Logical Replication Lag: The Complete Fix

Why replication slots silently fill your disk — and the exact steps to fix it.

Logical replication slots in PostgreSQL are the most common cause of silent disk exhaustion I've seen in production RDS environments. The mechanism is simple: a replication slot tells PostgreSQL to retain WAL (Write-Ahead Log) files until the consumer has confirmed receipt. If the consumer stops, disconnects, or falls behind, WAL files accumulate indefinitely — until pg_wal fills the disk and PostgreSQL halts with a fatal error.

How to detect it early

Query pg_replication_slots and compare confirmed_flush_lsn against pg_current_wal_lsn(). Any slot where active = false for more than an hour is a ticking bomb. On RDS, monitor the TransactionLogsDiskUsage CloudWatch metric.
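
The comparison described above can be sketched as follows. The query uses the standard pg_replication_slots view and pg_current_wal_lsn(); the Python helpers are hypothetical names that convert the textual LSN format (two hexadecimal numbers separated by a slash) into a byte offset. On the server, pg_wal_lsn_diff() does the same subtraction.

```python
SLOT_LAG_SQL = """
SELECT slot_name, active, confirmed_flush_lsn,
       pg_current_wal_lsn() AS current_lsn
FROM pg_replication_slots
WHERE slot_type = 'logical';
"""


def lsn_to_int(lsn):
    """Convert an LSN such as '16/B374D848' into a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def slot_lag_bytes(confirmed_flush_lsn, current_lsn):
    """WAL bytes the slot's consumer has not yet confirmed."""
    return lsn_to_int(current_lsn) - lsn_to_int(confirmed_flush_lsn)


# Hypothetical values as they might come back from SLOT_LAG_SQL.
lag = slot_lag_bytes("1/0", "1/40000000")
print(f"slot lag: {lag / 1024 ** 3:.1f} GiB")
```

Feeding this number into your alerting gives an earlier signal than disk usage alone, because it is measured per slot.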

The three culprits behind replication lag

Slow consumer — the subscriber can't apply changes fast enough, often due to missing indexes on the target or FK constraint checks.
Large transactions — PostgreSQL ships logical changes at COMMIT, so a single 10 GB transaction arrives as one burst.
TOAST expansion — toasted columns (large text/JSON) expand the decoded payload dramatically, sometimes 10× the on-disk size.

Fixing it

For slow consumers, add indexes on the subscriber's replicated tables and consider disabling FK checks during bulk catch-up. For large transactions, batch your writes. For TOAST overhead, filter unnecessary large columns from the publication using ALTER PUBLICATION ... SET TABLE ... (column_list) (PostgreSQL 15+). For emergency WAL pressure, the nuclear option is pg_drop_replication_slot() — but this means you lose your replication position and need to re-initialise.
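
"Batch your writes" usually means splitting one large statement into key ranges and committing after each. A minimal sketch, assuming a numeric primary key; the events table, the processed column and the batch size are hypothetical.

```python
def id_batches(min_id, max_id, batch_size):
    """Yield inclusive (start, end) id ranges that cover min_id..max_id."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    start = min_id
    while start <= max_id:
        end = min(start + batch_size - 1, max_id)
        yield start, end
        start = end + 1


# Hypothetical statement: execute and COMMIT once per batch, so each
# transaction reaches the subscriber as a small unit, not one large burst.
UPDATE_SQL = "UPDATE events SET processed = true WHERE id BETWEEN %s AND %s"

for start, end in id_batches(1, 10, 4):
    print(UPDATE_SQL, (start, end))
```

Smaller transactions also shorten the time a slot has to hold WAL for any single change set.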

Prevention

Set max_slot_wal_keep_size (PostgreSQL 13+) to cap WAL retention per slot. Monitor slot lag with alerting at 1 GB and critical at 5 GB.
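
The alert thresholds above can be expressed as a small classifier. This is a sketch under assumptions: the function name is hypothetical, the inactivity duration is something your monitoring has to track and pass in, and the one-hour cut-off is borrowed from the detection rule earlier in this note.

```python
GIB = 1024 ** 3


def classify_slot(lag_bytes, inactive_for_seconds=0,
                  warn_bytes=1 * GIB, critical_bytes=5 * GIB,
                  max_inactive_seconds=3600):
    """Return 'ok', 'warning' or 'critical' for one replication slot."""
    if lag_bytes >= critical_bytes:
        return "critical"
    if inactive_for_seconds > max_inactive_seconds:
        # An abandoned slot retains WAL until someone intervenes.
        return "critical"
    if lag_bytes >= warn_bytes:
        return "warning"
    return "ok"


print(classify_slot(2 * GIB))
print(classify_slot(0, inactive_for_seconds=7200))
```

Keep the critical threshold comfortably below your max_slot_wal_keep_size value so the alert fires before PostgreSQL invalidates the slot.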

Bottom line: Treat inactive replication slots as incidents, not warnings. Set max_slot_wal_keep_size and alert on slot lag before pg_wal ever fills. The fix is almost always cheap; the outage from ignoring it is not.

AWS · Architecture

Aurora vs RDS PostgreSQL: When to Choose Which

The decision framework that AWS doesn't put on the marketing page.

The headline difference is storage architecture. Aurora PostgreSQL uses a distributed, shared-storage layer — 6 copies across 3 Availability Zones, with replicas reading from the same storage as the writer. RDS PostgreSQL is vanilla PostgreSQL on EBS volumes, with read replicas streaming WAL from the primary.

Choose Aurora when…

You need fast failover — typically under 30 seconds, vs RDS's 60–120 seconds.
You need multiple read replicas with minimal lag — Aurora replicas see sub-20 ms lag since they share storage; RDS replicas lag by seconds.
You want auto-scaling storage without pre-provisioning.
Seconds of downtime matter financially.

Choose RDS PostgreSQL when…

You need specific extensions not supported on Aurora (the compatibility matrix is shorter than you'd expect).
You want exact version control — RDS tracks community PostgreSQL releases more closely.
Your workload is smaller and cost-sensitive — RDS's base price is lower on smaller instance classes.
You need pg_cron or other extensions with restrictions on Aurora.

The cost trap

Aurora's I/O-Optimized configuration removes per-request I/O charges but raises the instance and storage rates. For write-heavy workloads (high WAL generation), Aurora I/O-Optimized often wins. For read-heavy workloads with modest storage, RDS gp3 with provisioned IOPS can be cheaper. Model your specific workload and don't assume Aurora is always more expensive.
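
Modelling the workload can start with a few lines of arithmetic. The sketch below compares Aurora Standard with I/O-Optimized only, and every rate in it is a placeholder chosen for illustration, not a quoted AWS price; substitute the current figures for your region.

```python
def monthly_cost(instance_cost, storage_gb, io_millions,
                 storage_rate, io_rate, instance_multiplier=1.0):
    """Total monthly cost for one pricing configuration."""
    return (instance_cost * instance_multiplier
            + storage_gb * storage_rate
            + io_millions * io_rate)


def compare_configurations(instance_cost, storage_gb, io_millions, rates):
    """Cost both configurations and report which is cheaper."""
    standard = monthly_cost(instance_cost, storage_gb, io_millions,
                            rates["standard_storage"], rates["standard_io"])
    optimized = monthly_cost(instance_cost, storage_gb, io_millions,
                             rates["optimized_storage"], 0.0,
                             rates["optimized_instance_multiplier"])
    io_cost = io_millions * rates["standard_io"]
    return {
        "standard": standard,
        "io_optimized": optimized,
        "io_share": io_cost / standard if standard else 0.0,
        "cheaper": "io-optimized" if optimized < standard else "standard",
    }


# Placeholder rates (per GB-month, per million I/Os); NOT real AWS prices.
PLACEHOLDER_RATES = {
    "standard_storage": 0.10,
    "standard_io": 0.20,
    "optimized_storage": 0.225,
    "optimized_instance_multiplier": 1.30,
}

result = compare_configurations(1000, 500, 2000, PLACEHOLDER_RATES)
print(result["cheaper"], f"(I/O share {result['io_share']:.0%})")
```

Run it with a month of real I/O counts from your bill; the answer tends to flip as the I/O share of spend grows.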

The migration reality

Moving from RDS PostgreSQL to Aurora is straightforward (snapshot restore). Moving back is harder — once on Aurora, you're committed to its storage model.

Bottom line: Pick Aurora for HA and low-lag reads, RDS for extension flexibility and cost-sensitive smaller workloads. Make the call before you build your HA architecture around Aurora's endpoint model, because the move back is the hard direction.

Oracle · DR

Oracle Data Guard to AWS Multi-AZ: A Decision Framework

Mapping Oracle HA/DR patterns to AWS equivalents without losing resilience.

Enterprises moving from Oracle Data Guard to AWS face a conceptual shift. Data Guard provides a physical standby (block-for-block copy via redo apply) or a logical standby (SQL-level apply). Both are flexible — you control switchover timing, protection modes, and can open standbys for read-only queries (Active Data Guard).

On AWS, the equivalent depends on the target engine. RDS Multi-AZ gives you a synchronous standby in a different AZ, but that standby is not readable; readable standbys only arrive with the newer Multi-AZ DB cluster option. Aurora gives you up to 15 read replicas sharing the same storage layer, with automatic failover.

What Oracle shops miss most

Control over switchover timing — AWS failover is automatic and opinionated, not a manual decision.
Observer / FSFO equivalent — Data Guard's Fast-Start Failover with Observer has no direct AWS analogue; you get CloudWatch alarms and automatic failover instead.
Cross-region DR — Data Guard replicates to any connected site; AWS needs Aurora Global Database or RDS cross-region replicas, each with different lag characteristics.

The mapping table

Oracle Physical Standby → RDS Multi-AZ (automatic failover)
Oracle Active Data Guard → Aurora Read Replica (readable, sub-20 ms lag)
Oracle Data Guard Far Sync → Aurora Global Database (cross-region, typically under 1 s lag)
Oracle GoldenGate active-active → not directly available; consider DMS or application-level conflict resolution

The honest gap

Oracle Data Guard gives DBAs more control. AWS managed services give less control but more reliability for the common cases. The real question: do you trust Amazon's automation, or do you need to own the failover decision?

Bottom line: Map each Data Guard role to its AWS equivalent before you migrate, and accept the trade: you give up manual switchover control in exchange for automation that's more reliable for the common cases. For most enterprises, trusting the automation frees real engineering time.

FinOps · Cost

Database FinOps on AWS: 7 Cost Levers Most Teams Miss

Practical cost optimisation from someone who did it inside AWS — not theory.

1. gp2 → gp3 storage migration. gp3 gives independent IOPS and throughput scaling at lower base cost. For any RDS workload under 16 TiB, this is a one-click modify with no downtime. Most teams save 15–20% on storage costs immediately.

2. Reserved Instance size flexibility. A db.r6g.xlarge RI can be split across two db.r6g.large instances within the same family. Plan RIs by family, not by exact size; this is flexibility most finance teams don't know exists.

3. Stop paying for idle dev/test databases. RDS instances can be stopped for up to 7 days (auto-restart after 7 days). For dev/test environments used only during business hours, schedule stop/start with Lambda or AWS Instance Scheduler. Savings: ~65% on instance hours.

4. Right-size before reserving. Performance Insights (free for 7-day retention) shows actual CPU and memory utilisation. If your db.r6g.2xlarge averages 15% CPU, downsize to db.r6g.large before committing to a 1-year RI on the wrong size.

5. Aurora I/O-Optimized for write-heavy workloads. If your Aurora cluster's I/O costs exceed 25% of total database spend, I/O-Optimized pricing (higher instance and storage rates, zero I/O charges) usually saves money. Run the calculation; it takes 5 minutes.

6. Cross-region read replicas vs Global Database. If you only need cross-region reads (not failover), a cross-region read replica is cheaper than Aurora Global Database. Match the architecture to the actual DR requirement, not the aspirational one.

7. Snapshot and backup lifecycle. Automated backups are free up to the provisioned storage size. Manual snapshots are not. Audit your manual snapshots quarterly — stale snapshots from old projects accumulate silently and cost real money.
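
The arithmetic behind levers 2, 3 and 4 can be sketched in a few functions. The normalisation units are the size-flexibility factors AWS publishes for instance sizes; the function names are hypothetical, and the CPU projection assumes load scales linearly with vCPU count, which is a simplification.

```python
# Size-flexibility normalisation units per instance size (subset).
NORMALISATION_UNITS = {
    "medium": 2, "large": 4, "xlarge": 8, "2xlarge": 16, "4xlarge": 32,
}


def instances_covered(ri_size, running_size):
    """How many running instances of one size a single RI can cover."""
    return NORMALISATION_UNITS[ri_size] / NORMALISATION_UNITS[running_size]


def stop_schedule_savings(hours_per_day, days_per_week):
    """Fraction of weekly instance hours saved by a stop/start schedule."""
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / (24 * 7)


def projected_cpu_percent(avg_cpu_percent, current_vcpus, target_vcpus):
    """Rough average CPU after resizing, assuming linear scaling."""
    return avg_cpu_percent * current_vcpus / target_vcpus


# One xlarge RI applied to large instances in the same family.
print(instances_covered("xlarge", "large"))
# Dev/test running 12 hours a day, 5 days a week.
print(f"{stop_schedule_savings(12, 5):.0%} of instance hours saved")
# 15% average on 8 vCPUs (2xlarge) moved to 2 vCPUs (large).
print(projected_cpu_percent(15, 8, 2))
```

Check the projected figure against peak load, not just the average, before downsizing; storage charges also continue while an instance is stopped, so the schedule saving applies to instance hours only.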

Bottom line: The fastest wins need no architecture change: move gp2→gp3, stop idle dev/test instances, and right-size before you reserve. Those three alone typically cut a database bill by a quarter; the rest is matching architecture to the real requirement, not the aspirational one.

Production help / hiring signal

Turn field notes into a delivery plan.

These notes show the practical thinking behind my database work. If you are hiring a senior cloud database engineer, need AWS partner/subcontract support, or want an independent review before migration, cutover, performance rescue or HA/DR change, this page gives you evidence of how I structure production decisions.

Assess: Review the current database estate, risks, dependencies and production constraints.

Plan: Convert the right note into a migration, tuning, FinOps or HA/DR action plan.

Deliver: Support execution with runbooks, validation steps, stakeholder updates and clean handover material.
