Notes on re:Invent 2022 talks I attended
Here’s a rough dump of all the notes I took while attending AWS’ re:Invent this year. I’m focusing on Chalk Talks, which are not recorded and usually have a smaller live audience, but I also included links to the Breakout Sessions if you’re interested in watching them. I hope this is useful to someone!
Serverless stream processing with AWS Lambda, Amazon Kinesis & Kafka (SVS402)
Kinesis
The talk began with a discussion on a car data collection architecture using Kinesis Data Streams/Lambda/Firehose/DynamoDB/S3
Custom metric: Records/Batch
- Tune parameter: batch size
At the shard level, λ polls each shard once per second and invokes the function synchronously with a batch of messages. If the invocation succeeds it moves on to the next batch; if it errors, what happens next depends on how the event source mapping is configured:
- Default: invoke with same set of records until success or records age out of stream
- Report batch item failures: the function reports which records failed, the checkpoint advances to the first failed record, and from there the other error handling settings apply (see the handler sketch below)
- Bisect: split the batch in half and try to process the oldest half; if it still errors, keep splitting until the failing record is found, then handle it according to max retries and age
- Retries that are part of the bisecting process do not count towards max
- Max retries/age: send to failure destination or discard
(Same for DDB streams 👆)
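To make the "report batch item failures" option concrete, here's a minimal Python handler sketch (mine, not from the talk). The `process` function and the payload shape are made up, and it assumes the event source mapping was created with FunctionResponseTypes=["ReportBatchItemFailures"]:

```python
import base64
import json


def process(payload: dict) -> None:
    """Placeholder for the actual business logic."""


def handler(event, context):
    # Only has an effect if the event source mapping was configured with
    # FunctionResponseTypes=["ReportBatchItemFailures"].
    failures = []
    for record in event["Records"]:
        try:
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            process(payload)
        except Exception:
            # Report the failed sequence number; the ESM checkpoints to the
            # first reported failure and retries from that record onward.
            failures.append({"itemIdentifier": record["kinesis"]["sequenceNumber"]})
    return {"batchItemFailures": failures}
```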
Some other comments made:
- There’s a 1:1 mapping between Kinesis shards and Lambda concurrency
- You (probably) don’t want the default event source mapping error handling
- Lambda metrics return success even though retry logic is invoked
- The async invocation failure destination in the Lambda config is separate from the one in the event source mapping config (same goes for retries)
New stuff: event filtering
- Batch size and window are applied before filtering
- Reduces costs (filter is applied before λ)
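A rough boto3 sketch of what attaching a filter to a Kinesis event source mapping looks like; the stream ARN, function name and filter pattern are placeholders I made up:

```python
import json

import boto3

lambda_client = boto3.client("lambda")

# Only deliver records whose JSON payload has vehicle.type == "ev";
# everything else is dropped before the function is invoked.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:eu-west-1:123456789012:stream/car-telemetry",
    FunctionName="process-telemetry",
    StartingPosition="LATEST",
    BatchSize=100,
    MaximumBatchingWindowInSeconds=5,
    FilterCriteria={
        "Filters": [
            {"Pattern": json.dumps({"data": {"vehicle": {"type": ["ev"]}}})}
        ]
    },
)
```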
Keep Lambda concurrency limits in mind when using Kinesis on demand
Some words on parallelization factor (note: keeps order per partition key, “shard of shards”, won’t help if all/most records share the same partition key)
A performance step-by-step:
- Ensure working system
- Measure performance: time to process 1 message, average batch
- Tune producer
- Tune λ: code, powertune with batch
- Tune event source mapping: batch size/window, enhanced fan-out, parallelization factor, error handling (see the sketch below)
For max performance: skip the error handling config in the ESM and the retries, and log everything to CloudWatch instead
Tune λ memory size if slow
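As a reference for the ESM knobs mentioned in the list above, here's a hedged boto3 sketch (mine, not from the talk); the UUID, the concrete values and the failure destination are placeholders:

```python
import boto3

lambda_client = boto3.client("lambda")

lambda_client.update_event_source_mapping(
    UUID="esm-uuid-placeholder",
    BatchSize=500,                       # records per invocation
    MaximumBatchingWindowInSeconds=10,   # how long to wait to fill the batch
    ParallelizationFactor=4,             # concurrent batches per shard
    BisectBatchOnFunctionError=True,     # split failing batches in half
    MaximumRetryAttempts=3,
    MaximumRecordAgeInSeconds=3600,      # give up on old records
    FunctionResponseTypes=["ReportBatchItemFailures"],
    DestinationConfig={
        "OnFailure": {
            "Destination": "arn:aws:sqs:eu-west-1:123456789012:stream-dlq"
        }
    },
)
```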
The Kinesis Producer Library is not recommended for Lambda in the speaker’s opinion (at least for Python)
MSK
Scenario: fraud detection (transactions stream)
- Using Step Functions to process each batch: a Map state that stores each transaction in DDB, runs detection with Fraud Detector, updates DDB with the prediction and, if suspicious, publishes to SNS
Use an Express workflow in Step Functions so each batch resolves quickly
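Since Express workflows can be invoked synchronously, the batch-processing Lambda could call the state machine roughly like this (my own sketch; the ARN and input shape are made up):

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# start_sync_execution only works for Express workflows, which is what
# makes the quick, synchronous "resolve per batch" pattern possible.
response = sfn.start_sync_execution(
    stateMachineArn="arn:aws:states:eu-west-1:123456789012:stateMachine:fraud-check",
    input=json.dumps({"transactions": [{"id": "tx-1", "amount": 42.0}]}),
)
print(response["status"], response.get("output"))
```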
Error handling is not as complete as with Kinesis (yet)
Scaling mechanism is different: no parallelization factor
Has event filtering as well
AWS configures a “poller”, which is the thing that pulls messages and sends them to the function. It starts with 1 poller and 1 λ and then scales pollers vertically and horizontally as needed (up to the number of partitions)
Metrics to monitor
- Offset lag
- λ errors
- λ invocations
- λ duration
- λ concurrency
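For example (my sketch, not from the talk), an alarm on the first of these could look like this with boto3; function name, threshold and SNS topic are placeholders, and it assumes OffsetLag is emitted under the AWS/Lambda namespace with a FunctionName dimension:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="fraud-detector-offset-lag",
    Namespace="AWS/Lambda",
    MetricName="OffsetLag",
    Dimensions=[{"Name": "FunctionName", "Value": "process-transactions"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=10000,  # records behind the latest offset
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:stream-alerts"],
)
```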
Multi-region: Kinesis (the producer writes to 2 streams), Kafka (has replication)
Migrate from a relational database to Amazon DynamoDB (DAT306)
Relational DB challenges
- Hard to scale
- Accepts unbounded work
- Monolith: shared everything
Related to above
- Large scale joins are taxing
- Unpredictable performance: buffer cache vs data
- App code in the DB (stored procedures and the like) takes up space in the DB and has to be evaluated every time a query runs
Should you migrate?
- Yes: K/V lookups, 1:N queries, serverless apps, “internet-scale” apps, JSON documents
- No: ERP apps, data warehouses, data marts
Steps
- Document needed access patterns: name, input args (might be PKs), single or set, returned columns, frequency per hour (priority for each access)
- Relational to single table: data shapes (note: the speaker is a proponent of the single-table philosophy)
- Understanding table data: item size, item count, velocity (cool/hot), type (dimension/fact)
- Deciding how to combine many entities into single table
- Denormalized join: join tables w/inner join
- Data shape: wide with dupe column values across tables
- DDB usage pattern: Get-Item (single read, low latency)
- Stacked with union: UNION ALL queries, filling in NULL for columns not present in each table
- Data shape: different entities on same table, storage efficient
- DDB usage pattern: Query (item collection returned, may need multiple reads)
- PK and SK
- Good idea to add SK even though not needed yet
- Possible PK choices: high cardinality, table PK, parent id, collection id, uuid
- Possible SK choices: same as PK, static 0, line item id, iso or epoch date, date.now(), uuid
- Decorate keys and items for readability: add prefixes (e.g. pr- for product, ver- for version)
- Duplicate data for queries and future use (product id as the pk and again as a product attribute); see the sketch after this list
- Date formats iso8601, unix (for TTL - think about it on day one, even if you don’t need it now)
- Concatenated columns for index SK (hierarchy, similar to iso date)
- Sparse columns for sparse GSIs (SELECT CASE WHEN product = 'car' AND price > 70000 THEN price ELSE NULL END AS luxury)
- Align PK attributes for item collections: benefit of single table design (single query)
- Align other attributes for GSIs
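To tie the key-design bullets together, here's a sketch I put together afterwards (not from the talk): decorated pr-/ver- prefixes, a duplicated product attribute, a sparse attribute for a "luxury" GSI, and a single Query over the item collection. The table and attribute names are made up:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("catalog")  # table name is made up
product_id = "pr-12345"

# Parent item: decorated PK, "same as PK" sort key, duplicated product
# attribute for future GSIs, and a sparse luxury_price attribute so that
# only expensive items land in a GSI keyed on it.
table.put_item(
    Item={
        "pk": product_id,
        "sk": product_id,
        "product": product_id,
        "type": "car",
        "price": 85000,
        "luxury_price": 85000,
    }
)

# Child item in the same collection: a version, sorted by its ISO 8601 date.
table.put_item(
    Item={
        "pk": product_id,
        "sk": "ver-2022-11-28T10:00:00Z",
        "price": 82000,
    }
)

# The single-table payoff: one Query returns the whole item collection.
items = table.query(KeyConditionExpression=Key("pk").eq(product_id))["Items"]
```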
Migration strategies
- Use DMS, Glue or ETL tools
- Write and run scripts to gain experience with bulk loads
- Archive legacy data and migrate only dimension data, starting fresh
- Write to both DBs for some time for verification
- DMS Limitations: does not combine tables, can be used after preparing data with a VIEW, each thread writes at 200 WCU (12MB/min)
- Avoiding throttles
- DDB sheds traffic for two reasons: throughput beyond provisioned limits, hot key limits (>1000 WCU/s for any one PK)
- Avoid writing to any one PK faster than 1000/sec, regardless of table capacity
- Use a round-robin shuffle to fan heavy writes out across multiple suffixed PKs (see the sketch after this list)
- Staging data to S3 -> DDB
- Custom code (Lambda/Container/EC2)
- New: DDB import from S3
- Free GSI WCUs when loading, no WCU usage (!)
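My reading of the "round robin shuffle" advice, as a sketch (table name and shard count are made up): suffix the hot PK with a shard number on write, and fan reads out across all suffixes:

```python
import random

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("events")  # table name is made up
WRITE_SHARDS = 10  # enough to keep any single PK well under ~1000 WCU/s


def put_event(device_id: str, item: dict) -> None:
    # A random (or round-robin) suffix spreads a hot key across several PKs;
    # `item` is assumed to carry the remaining attributes, incl. the SK.
    shard = random.randrange(WRITE_SHARDS)
    table.put_item(Item={"pk": f"{device_id}#{shard}", **item})


def get_events(device_id: str) -> list:
    # The price: reads have to fan out over every suffix and merge results.
    items = []
    for shard in range(WRITE_SHARDS):
        resp = table.query(
            KeyConditionExpression=Key("pk").eq(f"{device_id}#{shard}")
        )
        items.extend(resp["Items"])
    return items
```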
Architecting for resiliency and high availability (ENU303)
IIRC most of this talk’s content is already covered by preparing for the Solutions Architect exam (setting up things across AZs and regions, that kind of thing). My takeaway from it was this set of tools:
Running containerized workloads on AWS Graviton-based instances (CMP408)
Question on how to convince devs to use Graviton-based instances
- Top down: costs
- Bottom up: modernization (app tech stack i.e. upgrade Java SDK, libs)
Even if it’s, say, 10% slower, if the cost is 20% lower and the SLO is still met, that’s a win
Discussion on how to build multi-arch containers with pipelines that look like this:
- Native compile: install docker on Graviton/Apple silicon, build and push as usual, buildx/CodeBuild/CI runner can control Graviton instance remotely
- Cross compilation: compile an arm64 target on an x86 host; C/C++ (cross-compilers available) vs Go/Rust (built-in support); foreign binaries can be copied but not run during the build
- Docker buildx: Use own existing x86 HW, emulates arm64 in SW, best for containers requiring minimal processing to build
Having multi arch containers is good when combined with spot -> can use any arch instance for lower price
Can define ECS Fargate task as Linux ARM64 to use Graviton
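A minimal boto3 sketch of that Fargate/ARM64 task definition idea; the family, image and role ARN are placeholders:

```python
import boto3

ecs = boto3.client("ecs")

# The relevant bit is runtimePlatform, which puts the Fargate task on
# Graviton (ARM64); the image must be arm64 or multi-arch.
ecs.register_task_definition(
    family="telemetry-api",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="512",
    memory="1024",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    runtimePlatform={
        "cpuArchitecture": "ARM64",
        "operatingSystemFamily": "LINUX",
    },
    containerDefinitions=[
        {
            "name": "api",
            "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/telemetry-api:latest",
            "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
            "essential": True,
        }
    ],
)
```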
Is there a tool to check for Graviton readiness? No, nothing automatic
When not to use Graviton? (I wonder if most of this still holds with the new instance types; the Windows Server point does, of course)
- CPU bound (frequency) monoliths that only scale vertically
- GPU intensive (ML)
- Windows Server
Build resilient applications using Amazon RDS and Aurora PostgreSQL (DAT316)
Some more tools/configs!
Timeouts
- Timeout layers
- Watch out for connection storms -> set the lowest timeout at the DB level and increase it as you move up the stack
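A small sketch of that timeout layering with psycopg2 (my own example, with placeholder connection details): the DB-level statement_timeout is the tightest limit, and the timeouts in the layers above this code should be progressively larger:

```python
import psycopg2

# Endpoint, credentials and values are placeholders.
conn = psycopg2.connect(
    host="mydb.cluster-xxxx.eu-west-1.rds.amazonaws.com",
    dbname="app",
    user="app_user",
    password="placeholder",
    connect_timeout=5,                    # seconds, enforced by the driver
    options="-c statement_timeout=2000",  # milliseconds, enforced by Postgres
)
```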
Global resiliency
- More manual (intentionally so), async replication, possible data loss
- Managed RTO (Aurora)
- Aurora write forwarding, storage replication
The JDBC driver is topology-aware -> handles DB connection failover scenarios better
- Nothing equivalent in other languages yet
Postgres connections: think in the hundreds of connections, not thousands
Future: RDS blue-green deployments
Online talks
Lots of cool things in these ones.
Observability the open-source way (COP301)
Deploy modern and effective data models with Amazon DynamoDB (DAT320)
Best practices for advanced serverless developers (SVS401)
From RDBMS to NoSQL (sponsored by MongoDB) (PRT314)
Building observable applications with OpenTelemetry (BOA310)
Building containers for AWS (CON325)
Most of it I already knew, but I’m taking this with me: ECR pull-through cache
Introducing Trusted Language Extensions for PostgreSQL (DAT221)
Build scalable multi-tenant databases with Amazon Aurora (DAT318)
I asked the speaker why they didn’t opt for DMS for the migration process; the answer was that their DBAs had (presumably much more) experience with standard MySQL tooling