You are designing a service that aggregates clickstream data in batch and delivers reports to subscribers by email only
once per week. The data is extremely spiky, geographically distributed, high-scale, and unpredictable. How should you
design this system?
Use a large Redshift cluster to perform the analysis, and a fleet of Lambda functions to insert records into the Redshift tables. Lambda
will scale rapidly enough for the traffic spikes.
Use a CloudFront distribution with access log delivery to S3. Clicks should be recorded as querystring GETs to the distribution.
Reports are built and sent by periodically running EMR jobs over the access logs in S3.
Use API Gateway invoking Lambda functions that call PutRecords on Kinesis, and EMR running Spark that calls GetRecords on Kinesis to
scale with spikes. Spark on EMR writes the analysis to S3, and the results are sent out via email.
Use Amazon Elasticsearch Service and EC2 Auto Scaling groups. The Auto Scaling groups scale based on click throughput and stream
into the Elasticsearch domain, which is also scalable. Use Kibana to generate reports periodically.
Because you only need batch analysis, anything using streaming is a waste of money. CloudFront is a gigabit-scale
HTTP(S) global request distribution service, so it can handle the scale, geographic spread, spikes, and unpredictability. The access
logs will contain the GET data and work just fine for batch analysis and email using EMR. The CloudFront FAQ addresses
the capacity question directly: "Can I use Amazon CloudFront if I expect usage peaks higher than 10 Gbps or 15,000 RPS?
Yes. Complete our request for higher limits here, and we will add more capacity to your account within two business days."
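
To make the batch step concrete, here is a minimal sketch of the kind of aggregation a periodic job could run over the access logs. It is plain Python rather than an EMR job, and it assumes the tab-separated CloudFront standard log layout with a `#Fields:` header line. The function name `count_clicks`, the `page` parameter, and the sample rows (trimmed to a handful of columns) are hypothetical and only illustrate the idea.

```python
from collections import Counter
from urllib.parse import parse_qs


def count_clicks(log_lines, param="page"):
    """Count clicks per value of one query-string parameter.

    Column positions are read from the '#Fields:' header instead of being
    hard-coded, so the sketch does not depend on a fixed column order.
    """
    fields = []
    counts = Counter()
    for line in log_lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            # Header names are space-separated; data rows are tab-separated.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue  # skip other comment lines and rows before the header
        row = dict(zip(fields, line.split("\t")))
        if row.get("cs-method") != "GET":
            continue  # clicks are recorded as query-string GETs
        query = row.get("cs-uri-query", "-")
        if query == "-":
            continue  # "-" marks an empty query string in the log
        for value in parse_qs(query).get(param, []):
            counts[value] += 1
    return counts


# Hypothetical sample, trimmed to a few columns for readability.
SAMPLE_LOG = [
    "#Version: 1.0",
    "#Fields: date time cs-method cs-uri-stem sc-status cs-uri-query",
    "2024-01-01\t00:00:01\tGET\t/click.gif\t200\tpage=home&user=a",
    "2024-01-01\t00:00:02\tGET\t/click.gif\t200\tpage=pricing&user=b",
    "2024-01-01\t00:00:03\tGET\t/click.gif\t200\tpage=home&user=c",
    "2024-01-01\t00:00:04\tGET\t/click.gif\t200\t-",
    "2024-01-01\t00:00:05\tPOST\t/click.gif\t405\tpage=home",
]

if __name__ == "__main__":
    for page, clicks in count_clicks(SAMPLE_LOG).most_common():
        print(page, clicks)
```

In the design described above, the same counting logic would run as a distributed job on EMR over the log objects in S3, with the weekly totals feeding the emailed report.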