Notes on re:Invent 2022 talks I attended
Here’s a rough dump of all the notes I took while attending AWS’ re:Invent this year. I’m focusing on Chalk Talks, which are not recorded and usually have a smaller live audience, but I also included links to the Breakout Sessions if you’re interested in watching them. I hope this is useful to someone!
Serverless stream processing with AWS Lambda, Amazon Kinesis & Kafka (SVS402)
Kinesis
The talk began with a discussion on a car data collection architecture using Kinesis Data Streams/Lambda/Firehose/DynamoDB/S3
Custom metric: Records/Batch
- Tune parameter: batch size
At the shard level, λ polls each shard once per second and invokes the function synchronously with a batch of messages. If the invocation succeeds it moves on to the next batch; if it errors, what happens next depends on how the event source mapping is configured:
- Default: invoke with same set of records until success or records age out of stream
- Report batch item failures: the function reports which records failed, the checkpoint advances to the first failed record, and from there the other error handling settings apply (see the handler sketch below)
- Bisect: split the batch in half and try to process the oldest half; if it still errors, keep splitting until the failing record is found, then handle it according to max retries and age
- Retries that are part of the bisecting process do not count towards max
- Max retries/age: send to failure destination or discard
(Same for DDB streams 👆)
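To make the "report batch item failures" option concrete, here's a minimal Python handler sketch (mine, not from the talk). The `process` function and the payload shape are made up, and it assumes the event source mapping was created with FunctionResponseTypes=["ReportBatchItemFailures"]:

```python
import base64
import json


def process(payload: dict) -> None:
    """Placeholder for the actual business logic."""


def handler(event, context):
    # Only has an effect if the event source mapping was configured with
    # FunctionResponseTypes=["ReportBatchItemFailures"].
    failures = []
    for record in event["Records"]:
        try:
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            process(payload)
        except Exception:
            # Report the failed sequence number; the ESM checkpoints to the
            # first reported failure and retries from that record onward.
            failures.append({"itemIdentifier": record["kinesis"]["sequenceNumber"]})
    return {"batchItemFailures": failures}
```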
Some other comments made:
- There’s a 1:1 mapping between Kinesis shards and Lambda concurrency
- You (probably) don’t want the default event source mapping error handling
- Lambda metrics return success even though retry logic is invoked
- The async invocation failure destination in the Lambda config is separate from the one in the event source mapping config (same goes for retries)
New stuff: event filtering
- Batch size and window are applied before filtering
- Reduces costs (filter is applied before λ)
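A rough boto3 sketch of what attaching a filter to a Kinesis event source mapping looks like; the stream ARN, function name and filter pattern are placeholders I made up:

```python
import json

import boto3

lambda_client = boto3.client("lambda")

# Only deliver records whose JSON payload has vehicle.type == "ev";
# everything else is dropped before the function is invoked.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:eu-west-1:123456789012:stream/car-telemetry",
    FunctionName="process-telemetry",
    StartingPosition="LATEST",
    BatchSize=100,
    MaximumBatchingWindowInSeconds=5,
    FilterCriteria={
        "Filters": [
            {"Pattern": json.dumps({"data": {"vehicle": {"type": ["ev"]}}})}
        ]
    },
)
```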
Keep Lambda concurrency limits in mind when using Kinesis on demand
Some words on parallelization factor (note: keeps order per partition key, “shard of shards”, won’t help if all/most records share the same partition key)
A performance step-by-step:
- Ensure working system
- Measure performance: time to process 1 message, average batch
- Tune producer
- Tune λ: code, powertune with batch
- Tune event source mapping: batch size/window, enhanced fan-out, parallelization factor, error handling (see the sketch below)
For max performance: skip the error handling config in the ESM and the retries, and log everything to CloudWatch instead
Tune λ memory size if slow
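As a reference for the ESM knobs mentioned in the list above, here's a hedged boto3 sketch (mine, not from the talk); the UUID, the concrete values and the failure destination are placeholders:

```python
import boto3

lambda_client = boto3.client("lambda")

lambda_client.update_event_source_mapping(
    UUID="esm-uuid-placeholder",
    BatchSize=500,                       # records per invocation
    MaximumBatchingWindowInSeconds=10,   # how long to wait to fill the batch
    ParallelizationFactor=4,             # concurrent batches per shard
    BisectBatchOnFunctionError=True,     # split failing batches in half
    MaximumRetryAttempts=3,
    MaximumRecordAgeInSeconds=3600,      # give up on old records
    FunctionResponseTypes=["ReportBatchItemFailures"],
    DestinationConfig={
        "OnFailure": {
            "Destination": "arn:aws:sqs:eu-west-1:123456789012:stream-dlq"
        }
    },
)
```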
The Kinesis Producer Library is not recommended for Lambda in the speaker’s opinion (at least for Python)
MSK
Scenario: fraud detection (transactions stream)
- Using Step Functions to process each batch: a Map state that stores each transaction in DDB, runs detection with Fraud Detector, updates DDB with the prediction and, if suspicious, publishes to SNS
Use an Express workflow in Step Functions so each batch resolves quickly
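Since Express workflows can be invoked synchronously, the batch-processing Lambda could call the state machine roughly like this (my own sketch; the ARN and input shape are made up):

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# start_sync_execution only works for Express workflows, which is what
# makes the quick, synchronous "resolve per batch" pattern possible.
response = sfn.start_sync_execution(
    stateMachineArn="arn:aws:states:eu-west-1:123456789012:stateMachine:fraud-check",
    input=json.dumps({"transactions": [{"id": "tx-1", "amount": 42.0}]}),
)
print(response["status"], response.get("output"))
```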
Error handling is not as complete as with Kinesis (yet)
Scaling mechanism is different: no parallelization factor
Has event filtering as well
AWS configures a “poller”, which is the thing that pulls messages and sends them to the function. It starts with 1 poller and 1 λ and then scales pollers vertically and horizontally as needed (up to the number of partitions)
Metrics to monitor
- Offset lag
- λ errors
- λ invocations
- λ duration
- λ concurrency
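For example (my sketch, not from the talk), an alarm on the first of these could look like this with boto3; function name, threshold and SNS topic are placeholders, and it assumes OffsetLag is emitted under the AWS/Lambda namespace with a FunctionName dimension:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="fraud-detector-offset-lag",
    Namespace="AWS/Lambda",
    MetricName="OffsetLag",
    Dimensions=[{"Name": "FunctionName", "Value": "process-transactions"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=10000,  # records behind the latest offset
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:stream-alerts"],
)
```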
Multi-region: Kinesis (the producer writes to 2 streams), Kafka (has replication)
Migrate from a relational database to Amazon DynamoDB (DAT306)
Relational DB challenges
- Hard to scale
- Accepts unbounded work
- Monolith: shared everything
Related to above
- Large scale joins are taxing
- Unpredictable performance: buffer cache vs data
- App code in the DB (stored procedures and the like) takes up space in the DB and has to be evaluated every time a query runs
Should you migrate?
- Yes: K/V lookups, 1:N queries, serverless apps, “internet-scale” apps, JSON documents
- No: ERP apps, data warehouses, data marts
Steps
- Document needed access patterns: name, input args (might be PKs), single or set, returned columns, frequency per hour (priority for each access)
- Relational to single table: data shapes (note: the speaker is a proponent of the single-table philosophy)
- Understanding table data: item size, item count, velocity (cool/hot), type (dimension/fact)
- Deciding how to combine many entities into single table
- Denormalized join: join tables w/inner join
- Data shape: wide with dupe column values across tables
- DDB usage pattern: Get-Item (single read, low latency)
- Stacked with union: UNION ALL queries, filling in NULL for columns not present in each table
- Data shape: different entities on same table, storage efficient
- DDB usage pattern: Query (item collection returned, may need multiple reads)
- PK and SK
- Good idea to add SK even though not needed yet
- Possible PK choices: high cardinality, table PK, parent id, collection id, uuid
- Possible SK choices: same as PK, static 0, line item id, iso or epoch date, date.now(), uuid
- Decorate keys and items for readability: add prefixes (e.g. pr- for product, ver- for version)
- Duplicate data for queries and future use (product id as the pk and again as a product attribute); see the sketch after this list
- Date formats iso8601, unix (for TTL - think about it on day one, even if you don’t need it now)
- Concatenated columns for index SK (hierarchy, similar to iso date)
- Sparse columns for sparse GSIs (SELECT CASE WHEN product = 'car' AND price > 70000 THEN price ELSE NULL END AS luxury)
- Align PK attributes for item collections: benefit of single table design (single query)
- Align other attributes for GSIs
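To tie the key-design bullets together, here's a sketch I put together afterwards (not from the talk): decorated pr-/ver- prefixes, a duplicated product attribute, a sparse attribute for a "luxury" GSI, and a single Query over the item collection. The table and attribute names are made up:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("catalog")  # table name is made up
product_id = "pr-12345"

# Parent item: decorated PK, "same as PK" sort key, duplicated product
# attribute for future GSIs, and a sparse luxury_price attribute so that
# only expensive items land in a GSI keyed on it.
table.put_item(
    Item={
        "pk": product_id,
        "sk": product_id,
        "product": product_id,
        "type": "car",
        "price": 85000,
        "luxury_price": 85000,
    }
)

# Child item in the same collection: a version, sorted by its ISO 8601 date.
table.put_item(
    Item={
        "pk": product_id,
        "sk": "ver-2022-11-28T10:00:00Z",
        "price": 82000,
    }
)

# The single-table payoff: one Query returns the whole item collection.
items = table.query(KeyConditionExpression=Key("pk").eq(product_id))["Items"]
```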
Migration strategies
- Use DMS, Glue or ETL tools
- Write and run scripts to gain experience with bulk loads
- Archive legacy data and migrate only dimension data, starting fresh
- Write to both DBs for some time for verification
- DMS Limitations: does not combine tables, can be used after preparing data with a VIEW, each thread writes at 200 WCU (12MB/min)
- Avoiding throttles
- DDB sheds traffic for two reasons: throughput beyond provisioned limits, hot key limits (>1000 WCU/s for any one PK)
- Avoid writing to any one PK faster than 1000/sec, regardless of table capacity
- Use a round-robin shuffle to fan heavy writes out across multiple suffixed PKs (see the sketch after this list)
- Staging data to S3 -> DDB
- Custom code (Lambda/Container/EC2)
- New: DDB import from S3
- Free GSI WCUs when loading, no WCU usage (!)
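My reading of the "round robin shuffle" advice, as a sketch (table name and shard count are made up): suffix the hot PK with a shard number on write, and fan reads out across all suffixes:

```python
import random

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("events")  # table name is made up
WRITE_SHARDS = 10  # enough to keep any single PK well under ~1000 WCU/s


def put_event(device_id: str, item: dict) -> None:
    # A random (or round-robin) suffix spreads a hot key across several PKs;
    # `item` is assumed to carry the remaining attributes, incl. the SK.
    shard = random.randrange(WRITE_SHARDS)
    table.put_item(Item={"pk": f"{device_id}#{shard}", **item})


def get_events(device_id: str) -> list:
    # The price: reads have to fan out over every suffix and merge results.
    items = []
    for shard in range(WRITE_SHARDS):
        resp = table.query(
            KeyConditionExpression=Key("pk").eq(f"{device_id}#{shard}")
        )
        items.extend(resp["Items"])
    return items
```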
Architecting for resiliency and high availability (ENU303)
IIRC most of this talk’s content is already covered by preparing for the Solutions Architect exam (setting up things across AZs and regions, that kind of thing). My takeaway from it was this set of tools:
Running containerized workloads on AWS Graviton-based instances (CMP408)
Question on how to convince devs to use Graviton-based instances
- Top down: costs
- Bottom up: modernization (app tech stack i.e. upgrade Java SDK, libs)
Even if it’s, say, 10% slower, if the cost is 20% lower and the SLO is still met, that’s a win
Discussion on how to build multi-arch containers with pipelines that look like this:
- Native compile: install docker on Graviton/Apple silicon, build and push as usual, buildx/CodeBuild/CI runner can control Graviton instance remotely
- Cross compilation: compile an arm64 target on an x86 host; C/C++ (cross-compilers available) vs Go/Rust (built-in support); foreign binaries can be copied but not run during the build
- Docker buildx: Use own existing x86 HW, emulates arm64 in SW, best for containers requiring minimal processing to build
Having multi arch containers is good when combined with spot -> can use any arch instance for lower price
Can define ECS Fargate task as Linux ARM64 to use Graviton
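A minimal boto3 sketch of that Fargate/ARM64 task definition idea; the family, image and role ARN are placeholders:

```python
import boto3

ecs = boto3.client("ecs")

# The relevant bit is runtimePlatform, which puts the Fargate task on
# Graviton (ARM64); the image must be arm64 or multi-arch.
ecs.register_task_definition(
    family="telemetry-api",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="512",
    memory="1024",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    runtimePlatform={
        "cpuArchitecture": "ARM64",
        "operatingSystemFamily": "LINUX",
    },
    containerDefinitions=[
        {
            "name": "api",
            "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/telemetry-api:latest",
            "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
            "essential": True,
        }
    ],
)
```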
Is there a tool to check for Graviton readiness? No, nothing automatic
When not to use Graviton? (I wonder if most of this still holds with the new instance types; the Windows Server point does, of course)
- CPU bound (frequency) monoliths that only scale vertically
- GPU intensive (ML)
- Windows Server
Build resilient applications using Amazon RDS and Aurora PostgreSQL (DAT316)
Some more tools/configs!
Timeouts
- Timeout layers
- Watch out for connection storms -> set the lowest timeout at the DB level and increase it as you move up the stack
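A small sketch of that timeout layering with psycopg2 (my own example, with placeholder connection details): the DB-level statement_timeout is the tightest limit, and the timeouts in the layers above this code should be progressively larger:

```python
import psycopg2

# Endpoint, credentials and values are placeholders.
conn = psycopg2.connect(
    host="mydb.cluster-xxxx.eu-west-1.rds.amazonaws.com",
    dbname="app",
    user="app_user",
    password="placeholder",
    connect_timeout=5,                    # seconds, enforced by the driver
    options="-c statement_timeout=2000",  # milliseconds, enforced by Postgres
)
```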
Global resiliency
- More manual (intentionally so), async replication, possible data loss
- Managed RTO (Aurora)
- Aurora write forwarding, storage replication
The JDBC driver is topology-aware -> handles DB connection failover scenarios better
- Nothing equivalent in other languages yet
Postgres connections: think in the hundreds of connections, not thousands
Future: RDS blue-green deployments
Online talks
Lots of cool things in these ones.
Observability the open-source way (COP301)
Deploy modern and effective data models with Amazon DynamoDB (DAT320)
Best practices for advanced serverless developers (SVS401)
From RDBMS to NoSQL (sponsored by MongoDB) (PRT314)
Building observable applications with OpenTelemetry (BOA310)
Building containers for AWS (CON325)
Most of it I already knew, but I’m taking this with me: ECR pull-through cache
Introducing Trusted Language Extensions for PostgreSQL (DAT221)
Build scalable multi-tenant databases with Amazon Aurora (DAT318)
I asked the speaker why they didn’t opt for DMS for the migration process; the answer was that their DBAs had (presumably much more) experience with standard MySQL tooling