Consume streaming data from AWS Kinesis, build a data lake in S3, and run SQL queries from Athena.
The goal is not just to show you our architecture but to provide the actual CloudFormation to create the entire data lake in a matter of minutes.
If you understand the above architecture and want to jump right into the actual code, here is the link to the CloudFormation template. Simply choose any stack name and create the stack from the CloudFormation console or using the AWS CLI. This will create all the necessary resources, including a test Lambda function that publishes events to Kinesis. Run the test Lambda from the console using the sample JSON, observe the events in the S3 buckets (within 60 seconds), and query them from Athena.
- Running the test Lambda publishes events into Kinesis.
- Firehose consumes the Kinesis events, backs up each source event into a raw-events bucket (30-day retention), converts it to Parquet format based on the Glue schema, determines the partition from the current time, and writes to the final S3 bucket, our data lake.
- The Glue table holds all the partition information for the S3 data.
- A Glue crawler job runs every 30 minutes, looks for any new documents in the S3 bucket, and creates/updates/deletes partition metadata.
- Run SQL queries in Athena, which uses the Glue table's partition metadata to query S3 efficiently.
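To make the time-based partitioning in the steps above concrete, here is a small sketch (in Python, purely illustrative) of the kind of Hive-style prefix Firehose can write events under. The exact `year=/month=/day=/hour=` layout is an assumption about the template's prefix configuration; the idea is that the object key encodes the event's arrival time, which is what Athena later prunes on.

```python
from datetime import datetime, timezone

def partition_prefix(ts: datetime) -> str:
    """Build a Hive-style S3 key prefix for an event arriving at time `ts`."""
    return f"year={ts.year:04d}/month={ts.month:02d}/day={ts.day:02d}/hour={ts.hour:02d}/"

# An event arriving at 2021-06-05 09:15 UTC would land under:
print(partition_prefix(datetime(2021, 6, 5, 9, 15, tzinfo=timezone.utc)))
# year=2021/month=06/day=05/hour=09/
```

Because the prefix is derived only from the timestamp, every object written in the same hour shares a prefix, and the Glue crawler can register each prefix as one partition.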
Now let's go step by step and understand each resource in the CloudFormation template. Clicking on a heading will lead to the actual resource in the CloudFormation.
Resources related to storing data in S3
- KMS key to encrypt data written to the buckets
- Allows Firehose to use this key
- S3 bucket, our actual data lake
- Data stored in the bucket is encrypted with the KMS key
- S3 bucket to store the raw JSON events
- Documents are deleted after 30 days.
- Kinesis stream with 1 shard (1 partition in Kafka terms)
- Data in Kinesis is encrypted with KMS.
- Test Lambda, written in Node.js, to publish events into Kinesis.
- Any code under 4096 characters can be embedded right into CloudFormation, which is exactly what I did.
All the necessary policies for the Lambda function have been set here.
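As a sketch of the inline-code trick mentioned above, a Lambda function whose source fits under the 4096-character limit can be embedded directly via the `ZipFile` property. The resource names, role reference, and environment variable below are illustrative placeholders, not the template's actual values:

```yaml
TestPublishLambda:
  Type: AWS::Lambda::Function
  Properties:
    Runtime: nodejs14.x
    Handler: index.handler
    Role: !GetAtt TestLambdaRole.Arn   # role must allow kinesis:PutRecord and use of the KMS key
    Environment:
      Variables:
        STREAM_NAME: !Ref EventStream
    Code:
      ZipFile: |
        const AWS = require('aws-sdk');
        const kinesis = new AWS.Kinesis();
        exports.handler = async (event) => {
          // Publish the incoming test event to the Kinesis stream
          return kinesis.putRecord({
            StreamName: process.env.STREAM_NAME,
            PartitionKey: String(Date.now()),
            Data: JSON.stringify(event)
          }).promise();
        };
```

Keeping the code inline means the whole stack ships as a single template, with no separate deployment bucket for the Lambda artifact.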
- Key needed to encrypt data written to the Kinesis stream.
- The key policy includes usage access for the Lambda role.
- The role used by Firehose needs several access policies.
- These include access to the Glue table, permission to upload to the S3 buckets, read from the stream, decrypt with the KMS keys, etc.
- Documents are partitioned when written to S3. We can then give a partition range in the WHERE conditions of Athena queries to avoid a full bucket scan, saving a lot of money and time.
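For example, a query shaped like the following restricts Athena to a single day's partitions instead of scanning the whole bucket. The database, table, and column names here are placeholders, not the template's actual schema:

```sql
-- year/month/day are partition columns, so Athena only reads
-- the S3 prefixes for 2021-06-05 rather than the full data lake
SELECT event_type, COUNT(*) AS events
FROM my_datalake.events
WHERE year = '2021' AND month = '06' AND day = '05'
GROUP BY event_type;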
Once the CloudFormation stack is created, you should see all of the above resources.
Run the Lambda from the console using the sample JSON event and you should see it in the S3 bucket within a minute.