AWS Data Analytics Specialty Exam Experience and Study Notes

Introduction

Last week I passed the AWS Data Analytics Specialty exam and since many of my colleagues and LinkedIn connections asked me to share my experience, I decided to write a small blog post.

I hope that this will give you enough information to flatten the learning curve and decrease the time needed for exam preparation.

If you have any additional questions and suggestions, please don't hesitate to reach out.

I wish you all the best on the exam and I'm sure you'll do great! 🥳 🎉

Questions

The following are some questions that I had on the exam and insights I gathered:

I focused heavily on EMR and on how Apache projects integrate with AWS services. However, for some reason, I only had about 5 EMR questions, covering HDFS/EMRFS, Hive, Pig scripts, and Apache Hudi. Honestly, I expected more focus on EMR
On the other hand, I had a lot of questions related to QuickSight, which I didn't expect. They ranged from how to refresh SPICE using the API, managing user space, and cross-account setups to embedded dashboards, data sources, and even graphs. Two questions asked for the best visualization/graph type for a particular case
As expected, Redshift was the center of attention:
- Encryption types. HSM trick question
- All kinds of optimizations: Short Query Acceleration, WLM etc
- Distribution styles
- VACUUM (full, sort only, delete only...) and ANALYZE
- Table/Column-level permissions: Achieve finer-grained data security with column-level access control in Amazon Redshift
- Snapshots
Emphasis was on the difference between Real-Time, Near Real-Time, and Batch processing
Trick questions regarding Kinesis. I also had these on practice exams. They usually revolve around:
- Duplicated records on Producer/Consumer side: Handling Duplicate Records
- Size of records for Kinesis Data Stream, Kinesis Firehose and Lambda
- Kinesis Producer Library buffering, retries, and rate limiting: KPL Retries and Rate Limiting
- Common problems when working with producers/consumers: Troubleshooting Kinesis Data Streams Consumers, Troubleshooting Amazon Kinesis Data Streams Producers
- Records out of order
- Kinesis Producer Library batching and aggregation: KPL Key Concepts
- PutRecords vs PutRecord when it comes to failed records: PutRecords
Kinesis Analytics SQL vs Flink apps: checkpointing, fault tolerance, parallel execution
Athena workgroups, cost usage limits - per-query control limits, workgroup-wide control limit
Glue cross-account crawlers and Data Catalogs: Granting cross-account access
Some random questions: DataSync, DMS, on-prem transfers, Direct Connect etc

Resources

Udemy Course - AWS Certified Data Analytics Specialty - Great course! It gives a really nice overview of all topics needed for the exam. Unfortunately, it doesn't go into details
Book - AWS Certified Data Analytics Study Guide with Online Labs: Specialty DAS-C01 Exam - When it comes to studying, books are our best friends. Unfortunately, that's not always the case with AWS due to rapid changes in technology. I've used this book, it's great, but be aware of outdated information
Practice Exams - Whizlabs: AWS Certified Data Analytics - Couldn't recommend it more. Be sure to go through example tests before the actual exam
Practice Exams - Udemy: AWS Certified Data Analytics Specialty - Nice practice exam
Udemy Course - AWS Certified Data Analytics Specialty Practice Exams - Someone recommended this one to me, but I haven't had the time to check it out

Of course, the AWS documentation should be your primary source of information, but these courses and practice exams can help you pinpoint topics to focus on, since the AWS documentation can be quite overwhelming.

Tips

Read answers first - Questions can be extremely long and deliberately confusing. Sometimes, the best thing to do is to read the answers first. By doing so, you'll get the idea of what you should focus on and easily discard the noise in the question
Just guess it - The exam lasts 3 hours and there are 65 questions. If you do the quick maths, that's ~2.8 min per question (180 / 65 ≈ 2.77). Since questions are quite long, understanding every detail and re-reading the question takes time. If the question is unclear, goes into details which you cannot remember, or you just get confused - don't get frustrated, just guess it, mark the question for review and move on
Take breaks - Reading an À-la-recherche-du-temps-perdu type of question requires a lot of attention and focus. At some point your mind will start wandering. When that happens, just take a break. Look up at the ceiling or close your eyes, think about something else for 5 minutes and then continue

Study Notes

The following are some of the study notes that I gathered. Everything can be found in the official AWS documentation, and for some bullets I included a * which links to the appropriate AWS documentation. These notes are just things that I found interesting and worth remembering. They in no way represent everything that needs to be covered for the exam.

Please note that I won't keep this constantly up-to-date and if you find some mistakes or outdated information, please inform me and I'll do my best to correct it ASAP.

Kafka MSK

Best way to size an MSK cluster?
- Use your on-prem cluster as a guideline
- MSK calculator for pricing and sizing
Stores events as a continuous series of records and preserves the order in which the records were produced. Data consumers process data from Apache Kafka topics on a first-in-first-out basis, preserving the order data was produced *

Kinesis Data Stream

A partition key is used to group data by shard within a stream. It segregates the data records belonging to a stream into multiple shards. It uses the partition key that is associated with each data record to determine which shard a given record belongs to
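To make the partition key note concrete: Kinesis hashes each partition key with MD5 into a 128-bit integer, and each shard owns a contiguous range of that hash key space. The sketch below is only an illustration of that lookup. The two shard ranges and the function names are assumptions made up for the example, not values read from a real stream.

```python
import hashlib

# Hypothetical two-shard stream: each shard owns a contiguous slice of the
# 128-bit hash key space. These ranges are illustrative only.
MAX_HASH_KEY = 2**128 - 1
SHARDS = {
    "shardId-000000000000": (0, MAX_HASH_KEY // 2),
    "shardId-000000000001": (MAX_HASH_KEY // 2 + 1, MAX_HASH_KEY),
}


def hash_key_for(partition_key: str) -> int:
    """Map a partition key to a 128-bit integer using MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, byteorder="big")


def shard_for(partition_key: str) -> str:
    """Return the shard whose hash key range contains the key's hash."""
    hash_key = hash_key_for(partition_key)
    for shard_id, (start, end) in SHARDS.items():
        if start <= hash_key <= end:
            return shard_id
    raise ValueError("hash key outside all shard ranges")
```

Because the mapping depends only on the key, every record with the same partition key lands on the same shard, which is also why a single very popular key can create a hot shard.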
Latency can increase if there is an increase in record count or record size for each GET request
Spark Streaming can read and write to Kinesis Data Streams
For PutRecords API a failed record is skipped and all subsequent records are processed. Therefore, the PutRecords API call does not guarantee data record ordering. PutRecord API guarantees record ordering when writing to the same shard *
The IncomingBytes and IncomingRecords metrics show you the rate at which your shard is ingesting data. These metrics will alert you when you have a hot shard *
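The PutRecords behaviour noted above (a failed record is skipped while later ones still succeed) means the caller has to inspect the response and resend only the failed entries. Here is a minimal sketch of that pattern. It assumes `client` exposes boto3's Kinesis `put_records` method, and the function name, retry count, and backoff values are hypothetical choices for illustration.

```python
import time


def put_records_with_retry(client, stream_name, records,
                           max_attempts=3, backoff_seconds=0.1):
    """Send records with PutRecords and resend only the entries that failed.

    Returns the records that still failed after all attempts (empty on success).
    """
    pending = list(records)
    if not pending:
        return []
    for attempt in range(max_attempts):
        response = client.put_records(StreamName=stream_name, Records=pending)
        if response["FailedRecordCount"] == 0:
            return []
        # Response entries come back in request order; failed entries carry
        # an ErrorCode instead of a SequenceNumber.
        pending = [
            record
            for record, result in zip(pending, response["Records"])
            if "ErrorCode" in result
        ]
        if attempt < max_attempts - 1:
            # Simple exponential backoff before the next attempt.
            time.sleep(backoff_seconds * 2 ** attempt)
    return pending
```

Note that a retried record is written after records that were originally behind it in the batch, which is exactly why PutRecords cannot guarantee ordering.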

Kinesis Client Library

Instantiates a record processor for each shard
Kinesis Data Streams shards support up to 1,000 Kinesis Data Streams records per second, or 1 MB per second of write throughput. The records-per-second limit binds customers with records smaller than 1 KB. Record aggregation allows customers to combine multiple records into a single Kinesis Data Streams record, which improves their per-shard throughput *
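A quick back-of-the-envelope calculation shows why aggregation helps with small records. The sketch below treats 1 MB as 10**6 bytes and ignores the framing overhead that aggregation adds; both are simplifying assumptions, and the function name is made up for this example.

```python
RECORDS_PER_SECOND_LIMIT = 1_000    # per-shard limit on Kinesis records
BYTES_PER_SECOND_LIMIT = 1_000_000  # per-shard 1 MB/s, taken as 10**6 bytes here


def max_user_records_per_second(user_record_bytes: int,
                                records_per_aggregate: int = 1) -> int:
    """Upper bound on user records one shard can ingest per second."""
    aggregate_bytes = user_record_bytes * records_per_aggregate
    # Whichever limit is hit first caps the number of Kinesis records.
    kinesis_records = min(RECORDS_PER_SECOND_LIMIT,
                          BYTES_PER_SECOND_LIMIT // aggregate_bytes)
    return kinesis_records * records_per_aggregate
```

With 100-byte records, `max_user_records_per_second(100)` is capped at 1,000 by the record limit (about 100 KB/s), while packing 10 user records into each Kinesis record raises the bound to 10,000, at which point the byte limit becomes the constraint.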
After de-aggregating the KDS record use KPL user record sequence number as your unique identifier. The KCL subsequence number is used primarily for checkpointing *
If we have KDS with 4 shards and one KCL app, it will process all 4 shards. If we add another KCL app, it will balance out with the first KCL app. So each KCL app will process 2 shards *

Kinesis Producer Library

Rate limiting is only possible through the KPL and is implemented using a token bucket algorithm, with separate buckets for Kinesis Data Streams records and bytes
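For readers who have not met the idea before, here is a generic token bucket in a few lines. This is a textbook sketch and not the KPL's actual implementation. The class name, parameters, and injectable clock are assumptions chosen to keep the example testable.

```python
import time


class TokenBucket:
    """Generic token bucket: `rate` tokens are added per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start with a full bucket
        self.clock = clock
        self.last_refill = clock()

    def try_consume(self, amount: float = 1.0) -> bool:
        """Take `amount` tokens if available; return False when rate limited."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill proportionally to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if amount <= self.tokens:
            self.tokens -= amount
            return True
        return False
```

A producer would keep one bucket for records and one for bytes per shard, and only send when both allow it. The capacity controls how large a burst is tolerated, while the rate sets the sustained throughput.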
The KPL automatically adds any records that failed in a PutRecords call back into the KPL buffer so they can be retried
Changing RecordMaxBufferedTime to a higher value will increase your aggregate package size. You must restart the KPL app for changes to take effect
The KPL is written in C++ and runs as a child process to the main process. Precompiled native binaries are bundled with Java release and are managed by the Java wrapper. KPL requires you to write your producer code in Java. If you want to write it using Python use KPL Aggregation and Deaggregation modules for AWS Lambda * **

Kinesis Firehose

Limits:
- Max record size sent to Firehose: 1MB
- Buffer size: 1MB to 128MB
- Flush interval: 60 to 900 seconds
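Firehose delivers a batch as soon as either the buffer size or the flush interval is reached, whichever comes first. The sketch below models only that rule, using the limits listed above. The class name is hypothetical, and the defaults of 5 MB and 300 seconds are assumptions for illustration.

```python
class FlushPolicy:
    """Flush when either the size or the interval threshold is reached."""

    def __init__(self, size_mb: int = 5, interval_seconds: int = 300):
        # Validate against the limits from the notes above.
        if not 1 <= size_mb <= 128:
            raise ValueError("buffer size must be between 1 and 128 MB")
        if not 60 <= interval_seconds <= 900:
            raise ValueError("flush interval must be between 60 and 900 seconds")
        self.size_bytes = size_mb * 1024 * 1024
        self.interval_seconds = interval_seconds

    def should_flush(self, buffered_bytes: int,
                     seconds_since_last_flush: float) -> bool:
        """True as soon as one of the two conditions is satisfied."""
        return (
            buffered_bytes >= self.size_bytes
            or seconds_since_last_flush >= self.interval_seconds
        )
```

This is why Firehose counts as near real-time rather than real-time: with a low-volume source, records can sit in the buffer for up to the full flush interval (at least 60 seconds) before delivery.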
AVRO is not supported
When a Kinesis data stream is configured as the source of a Firehose delivery stream, the Firehose PutRecord and PutRecordBatch operations will be disabled *
You can change the delivery stream destination without interrupting the flow of data through the delivery stream by using the UpdateDestination API call *
Can be configured to write the original source data records to another S3 bucket
The SucceedProcessing metric data in CloudWatch tells you how many records were successfully processed over a period of time when using Lambda for transformation *
The TagDeliveryStream API operation allows you to apply tags to an existing delivery stream *

Kinesis Data Analytics

Flink apps can be written in Java or Scala
Using Flink you can leverage checkpointing for fault tolerance while also leveraging parallel execution of tasks and allocating resources to implement scaling of your app *

DynamoDB

You can create on-demand backups and enable point-in-time recovery (PITR) for your DynamoDB tables
Row-level security for IAM users is possible with DynamoDB
ACID transactions are replicated from the source region to the replica regions only after the source region change is committed. This is the intended design of DynamoDB global tables *

Glue

To trigger a job after a crawler, use:
- Lambda function and CloudWatch Event rule
- AWS Glue workflow
Glue jobs can be scheduled at a minimum interval of 5 minutes
For schemas to be considered similar the following conditions must be true *:
- The partition threshold is higher than 70%
- The maximum number of different schemas does not exceed 5
We can run Glue DataBrew on a schedule to check data quality, schema integrity *
Glue crawler - for data stored in Redshift and RDS you need to use JDBC connector. DynamoDB has the native DynamoDB interface for crawler
The Glue DynamicFrame does not require schema. It determines the schema in real-time while automatically resolving potential schema issues *
Glue crawler - for RDS, glue crawlers need all TCP ports open on the security group where the data source resides. To protect the database security group from outside access via a TCP port you also configure a self-referencing inbound rule for all TCP ports *
Glue worker node types *:
- Standard
- G.1X - Good for memory intensive jobs, uses 1DPU per worker (1DPU = 4vCPU, 16GB memory, 64GB disk)
- G.2X - 2DPU per worker. We recommend this worker type for memory-intensive jobs and jobs that run machine learning transforms
- G.025X - 0.25 DPU per worker. We recommend this worker type for low volume streaming jobs
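Since Glue capacity is expressed in DPUs, it can help to translate a worker configuration into DPU-hours. The sketch below uses only the per-worker DPU figures listed above. The Standard worker type is left out because the notes give no DPU figure for it, and the function name is made up for this example.

```python
# DPUs per worker, taken from the worker types listed above.
DPU_PER_WORKER = {"G.025X": 0.25, "G.1X": 1.0, "G.2X": 2.0}


def dpu_hours(worker_type: str, number_of_workers: int,
              runtime_minutes: float) -> float:
    """DPU-hours consumed by a job run with the given worker configuration."""
    if worker_type not in DPU_PER_WORKER:
        raise ValueError(f"unknown worker type: {worker_type}")
    total_dpus = DPU_PER_WORKER[worker_type] * number_of_workers
    return total_dpus * runtime_minutes / 60
```

For example, ten G.2X workers running for 30 minutes consume the same number of DPU-hours as twenty G.1X workers over the same period, so the choice between them is about memory per worker rather than total capacity.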
When you enable job metrics in your Glue job definition, the job initializes a GlueContext class which is then used to initialize SparkContext *
The Glue Unbox built-in transform reformats string fields, like a JSON field, into distinct fields representing the types of the composites *
When you have multiple concurrent jobs with job bookmarks and the maximum job concurrency is not set to 1, the job bookmark does not work correctly *
For crawler, you have to point it to a bucket/prefix. If you point it to a specific file (for example, .csv), it will create a table with correct column names but it won't populate the table with data
The FindMatches transform will find duplicate records even when the records do not have a common unique identifier and no fields match exactly *

Aurora

Cannot scale past 64TB

Athena

Cost control *:
- Athena allows you to set two types of cost controls: per-query limit and per-workgroup limit
- For each workgroup, you can set only one per-query limit and multiple per-workgroup limits
- The workgroup-wide data usage control limit specifies the total amount of data scanned for all queries that run in this workgroup during the specified time period. You can create multiple limits per workgroup. The workgroup-wide query limit allows you to set multiple thresholds on hourly or daily aggregates on data scanned by queries running in the workgroup
To make sure all of your Athena query data is encrypted, you have to encrypt the entire Glue data catalog and encrypt the results of your Athena queries which Athena stores in S3 result location
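As a rough illustration of the per-query limit and result encryption described above, the function below builds the `Configuration` structure that boto3's Athena `create_work_group` call accepts. The function name, the choice of SSE_S3, and the example values are assumptions. The sketch covers only the per-query limit, not the workgroup-wide limits.

```python
def build_workgroup_configuration(output_location: str,
                                  per_query_limit_bytes: int) -> dict:
    """Build an Athena workgroup Configuration with a per-query scan limit
    and encrypted query results."""
    return {
        "ResultConfiguration": {
            "OutputLocation": output_location,
            # SSE_S3 is an arbitrary choice here; SSE_KMS and CSE_KMS also exist.
            "EncryptionConfiguration": {"EncryptionOption": "SSE_S3"},
        },
        # Force every query in the workgroup to use these settings.
        "EnforceWorkGroupConfiguration": True,
        "PublishCloudWatchMetricsEnabled": True,
        # Queries scanning more than this many bytes are cancelled.
        "BytesScannedCutoffPerQuery": per_query_limit_bytes,
    }


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   config = build_workgroup_configuration("s3://my-results-bucket/athena/", 10 * 1024**3)
#   boto3.client("athena").create_work_group(Name="analysts", Configuration=config)
```

Enforcing the workgroup configuration matters for the encryption requirement, since otherwise client-side settings can override the result location and encryption option.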

Redshift

Short Query Acceleration (SQA) can be used in place of WLM as a simple way to ensure short queries are not scheduled behind longer ones
To use Redshift Spectrum with data in an S3 bucket in different account: add a policy to the S3 bucket allowing S3 GET and LIST for an IAM role for Spectrum on the Redshift account
To maintain a real-time replica of a Redshift cluster across multiple AZs: spin up a separate cluster in a different AZ and use Kinesis to write data into both clusters simultaneously. Use Route53 to direct queries to the nearest cluster.
Is not Multi AZ
Automatically snapshots data to S3
Can automatically load in parallel from multiple compressed data files. Multiple concurrent COPY commands are much slower since it forces Redshift to perform a serialized load and requires a VACUUM. If you want to load data in parallel it's better to split the data into separate files no more than 1GB and use a single COPY command *
Has much better performance than Athena for complex analytical queries
Enhanced VPC Routing forces Redshift to use the VPC for all COPY and UNLOAD commands, which can be seen in VPC Flow logs
Currently, you can only use Amazon S3-managed keys (SSE-S3) encryption (AES-256) for audit logging *
You can apply compression encodings to columns in tables manually, based on your own evaluation of the data. Or you can use the COPY command with COMPUPDATE set to ON to analyze and apply compression automatically based on sample data *
You can't modify the destination AWS Region after cross-Region snapshot copy is configured. If you want to copy snapshots to a different AWS Region, first disable cross-Region snapshot copy. Then re-enable it with a new destination AWS Region and retention period *
COPY command requires three parameters *:
- Table name
- Data source
- Authorization to access data using an IAM role
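The three required parameters map directly onto the shape of a COPY statement. The helper below assembles one as a string. The function name, table, bucket, and role ARN are all hypothetical, and the values are interpolated without any escaping, so it is only meant for trusted, illustrative input.

```python
def build_copy_statement(table: str, s3_prefix: str, iam_role_arn: str,
                         options=("GZIP",)) -> str:
    """Assemble a Redshift COPY statement from its three required parts
    plus optional format/compression options."""
    parts = [
        f"COPY {table}",                 # 1. table name
        f"FROM '{s3_prefix}'",           # 2. data source
        f"IAM_ROLE '{iam_role_arn}'",    # 3. authorization
    ]
    parts.extend(options)
    return "\n".join(parts) + ";"


statement = build_copy_statement(
    "sales",
    "s3://my-bucket/sales/part-",
    "arn:aws:iam::123456789012:role/MyRedshiftRole",
)
```

Pointing FROM at a key prefix rather than a single object is what lets one COPY command pick up all the split files and load them in parallel.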
Node types *:
- RA3 - if you expect rapid data growth
- DC2 - if you have less than 10TB, without rapid growth
- DS2 - legacy nodes, no longer in use
If we start with a small table but expect rapid growth it's recommended to use AUTO distribution style
Using a stored procedure in Redshift you can limit data access to users. When you create a stored procedure, you can set the SECURITY attribute to either DEFINER or INVOKER. If you specify SECURITY INVOKER, the procedure uses the privileges of the user invoking the procedure. If you specify SECURITY DEFINER, the procedure uses the privileges of the owner of the procedure. INVOKER is the default *
Automatic VACUUM operations can pause if the cluster experiences a period of high load
HSM encryption is the most secure encryption you can use on Redshift cluster. You cannot modify existing cluster to use HSM. You have to create a new cluster with HSM and migrate the data *

EMR

Use S3DistCp to copy data from S3 into HDFS and process it locally, upon completion use S3DistCp to push the final results back to S3
Apache Hue and Apache Ambari are graphical front-ends for interacting with a cluster
Chunks of 64MB are ideal for HDFS
Encryption options: LUKS encryption, SSE-KMS, SSE-S3, EBS encryption
Pig integration with S3 *:
- Directly write to HCatalog tables in S3
- Submit Pig scripts stored in S3 using EMR console
- Loading custom JAR files from S3 with the REGISTER command
HBase is designed to be an OLTP engine, allowing an architecture of high-volume transactional operations
HBase integration with S3 *:
- Snapshots of HBase data to S3
- Storage of HBase StoreFiles and metadata on S3
- HBase read-replicas on S3
To scale cluster based on YARN memory usage use the metric YARNMemoryAvailablePercentage *
To perform actions on data stored in DynamoDB from EMR use Apache Hive *
Bootstrap actions to install additional software *:
- Upload the required installation scripts to S3 and execute them using custom bootstrap actions
- Provision an EC2 instance with Amazon Linux and install the required libs. Create an AMI of it and use it to launch EMR cluster
To copy data from DynamoDB table into HDFS as csv files, create an external Hive table *:

CREATE EXTERNAL TABLE hdfs_features_csv(...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 'hdfs:///user/hadoop/hive-test';
INSERT OVERWRITE TABLE hdfs_features_csv SELECT * FROM ddb_features;

For ML apps use Cluster Compute Instance types *
We can run multiple steps in parallel to improve cluster utilization and save cost. The default value for the concurrency level is 10. You can choose between 2 and 256 steps that can run in parallel *
To add additional steps to a cluster we can use the aws emr add-steps CLI command *
The valid actions on failure for Hive scripts are *:
- Terminate cluster: If the step fails, terminate the cluster. If the cluster has termination protection enabled AND keep alive enabled, it will not terminate
- Cancel and wait: If the step fails, cancel the remaining steps. If the cluster has keep alive enabled, the cluster will not terminate
- Continue: If the step fails, continue to the next step
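The three failure actions above correspond to the `ActionOnFailure` field of a step definition. As a sketch, the function below builds a Hive step in the shape that boto3's EMR `add_job_flow_steps` call expects. The function name and the script path are hypothetical, and the sketch assumes the step runs through `command-runner.jar`.

```python
# The three actions from the list above, in their API spelling.
VALID_ACTIONS_ON_FAILURE = {"TERMINATE_CLUSTER", "CANCEL_AND_WAIT", "CONTINUE"}


def build_hive_step(name: str, script_s3_path: str,
                    action_on_failure: str = "CONTINUE") -> dict:
    """Build an EMR step definition that runs a Hive script stored in S3."""
    if action_on_failure not in VALID_ACTIONS_ON_FAILURE:
        raise ValueError(f"invalid ActionOnFailure: {action_on_failure}")
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script",
                     "--args", "-f", script_s3_path],
        },
    }


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   step = build_hive_step("nightly-report", "s3://my-bucket/scripts/report.q", "CANCEL_AND_WAIT")
#   boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```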
If you have a cluster with multiple users who need different levels of access to data in Amazon S3 through EMRFS, you can set up a security configuration with IAM roles for EMRFS. EMRFS can assume a different service role for cluster EC2 instances based on the user or group making the request, or based on the location of data in Amazon S3 *
You can use AWS Service Catalog to centrally manage commonly deployed EMR cluster configurations
Kerberos without EC2 private key file *:
- Cross-realm trust
- External KDC - cluster KDC on a different cluster with Active Directory cross-realm trust

S3

Glacier Select allows you to perform filtering directly against Glacier objects using standard SQL
With S3 Select you can scan a subset of an object by specifying a range of bytes to query using the ScanRange parameter *
How to check integrity of an object uploaded to S3: To ensure that data is not corrupted traversing the network, use the Content-MD5 header. When you use this header, Amazon S3 checks the object against the provided MD5 value and, if they do not match, returns an error. Additionally, you can calculate the MD5 while putting an object to Amazon S3 and compare the returned ETag to the calculated MD5 value *
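The integrity check described above comes down to two small computations: the base64-encoded MD5 digest for the Content-MD5 header, and the hex MD5 digest to compare against the returned ETag. Here is a minimal sketch using only the standard library. The function names are made up, and the ETag comparison assumes a single-part upload that is not encrypted with SSE-KMS or SSE-C, since otherwise the ETag is not a plain MD5.

```python
import base64
import hashlib


def content_md5_header(body: bytes) -> str:
    """Value for the Content-MD5 header: base64 of the raw MD5 digest."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


def etag_matches(body: bytes, etag: str) -> bool:
    """Compare a returned ETag (often wrapped in quotes) with the local MD5."""
    return etag.strip('"') == hashlib.md5(body).hexdigest()
```

With boto3, the header value would typically be passed as the `ContentMD5` argument of `put_object`, and the `ETag` field of the response can then be checked with `etag_matches`.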

DMS

You can use DMS data validation to ensure that your data has migrated accurately. DMS compares the source and target records and then reports any mismatches *
When a batch fails, DMS breaks the batch down and switches to one-by-one mode to apply transactions. After the failed batch has been applied one by one, DMS switches back to batch mode *
One of the ways your migration can slow down is because your source latency or target latency is high. To discover the problem monitor CloudWatch entries for CDCLatencySource and CDCLatencyTarget
If you start a DMS task with CDC you will not migrate views. The only way to migrate both tables and views is to start a full-load-only DMS task *
When migrating CDC we use CDC recovery checkpoint in the source endpoint to start the CDC from specific time/point *

OpenSearch

To connect securely to Kibana:
- Set up a reverse proxy server between your browser and Amazon OpenSearch service
- Set up an SSH tunnel with port forwarding to allow access on port 5601
To move a Kibana dashboard from one OpenSearch domain to another, simply export the dashboard and then import it into the target domain

QuickSight

Can read Excel files directly
Does not support Parquet format while reading the data from S3
4 ways to refresh SPICE data *:
- UI
- Refresh dataset by editing the dataset
- Schedule refresh
- Use CreateIngestion API
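For the API option, a SPICE refresh is started by creating a new ingestion for the dataset. The sketch below assumes `client` exposes boto3's QuickSight `create_ingestion` method. The function name and the use of a random UUID as the ingestion ID are choices made for this example.

```python
import uuid


def refresh_spice_dataset(client, aws_account_id: str, data_set_id: str):
    """Start a SPICE refresh and return (ingestion_id, initial_status)."""
    # Each ingestion needs a caller-chosen ID that is unique per dataset.
    ingestion_id = str(uuid.uuid4())
    response = client.create_ingestion(
        AwsAccountId=aws_account_id,
        DataSetId=data_set_id,
        IngestionId=ingestion_id,
    )
    return ingestion_id, response.get("IngestionStatus")


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   refresh_spice_dataset(boto3.client("quicksight"), "123456789012", "my-dataset-id")
```

The call only starts the refresh. The returned ingestion ID can be kept to look up the ingestion's progress later.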
Using the Manage QuickSight option in QS console, you can whitelist the domains where you wish to have your dashboards embedded *
Has ML-powered anomaly detection insight *
Handles compressed files in gzip format automatically
Can use Presto

Data Pipeline

PigActivity provides native support for Pig scripts in AWS Data Pipeline without requirement to use ShellCommandActivity or EmrActivity *
HiveActivity makes it easier to set up an EMR activity and automatically creates Hive tables to run HiveQL *