Last week I passed the AWS Data Analytics Specialty exam and since many of my colleagues and LinkedIn connections asked me to share my experience, I decided to write a small blog post.

I hope that this will give you enough information to flatten the learning curve and decrease the time needed for exam preparation.

If you have any additional questions and suggestions, please don’t hesitate to reach out.

I wish you all the best on the exam and I’m sure you’ll do great! 🥳 🎉

Questions

The following are some questions that I had on the exam and insights I gathered:

  • I focused a lot on EMR and on how Apache projects integrate with AWS services. However, I only had ~5 EMR questions, covering HDFS/EMRFS, Hive, Pig scripts, and Apache Hudi. Honestly, I expected more focus on EMR
  • On the other hand, I had a lot of questions related to QuickSight, which I didn’t expect. They ranged from how to refresh SPICE using the API, managing user space, and cross-account setups to embedded dashboards, data sources, and even graphs. Two questions asked for the best visualization/graph type for a particular case
  • As expected, Redshift was the center of attention:
  • Emphasis was on the difference between Real-Time, Near Real-Time, and Batch processing
  • Trick questions regarding Kinesis. I also had these on practice exams. They usually revolve around:
  • Kinesis Analytics SQL vs Flink apps: checkpointing, fault tolerance, parallel execution
  • Athena workgroups, cost usage limits - per-query control limits, workgroup-wide control limit
  • Glue cross-account crawlers and Data Catalogs: Granting cross-account access
  • Some random questions: DataSync, DMS, on-prem transfers, Direct Connect etc

Resources

Of course, the AWS documentation should be your primary source of information, but these courses and practice exams can help you pinpoint topics to focus on, since the AWS documentation can be quite overwhelming.

Tips

  • Read answers first - Questions can be extremely long and deliberately confusing. Sometimes, the best thing to do is to read the answers first. By doing so, you’ll get the idea of what you should focus on and easily discard the noise in the question

  • Just guess it - The exam lasts 3 hours and has 65 questions. If you do the quick maths, that’s ~2.8 min per question (180 / 65). Since questions are quite long, it’ll take time if you want to understand every detail and re-read the question. If the question is unclear, goes into details which you cannot remember, or you just get confused - don’t get frustrated, just guess it, mark the question for review and move on

  • Take breaks - Reading an À-la-recherche-du-temps-perdu type of question requires a lot of attention and focus. At some point your mind will start wandering. When that happens, just take a break. Look up at the ceiling or close your eyes, think about something else for 5 minutes and then continue

Study Notes

The following are some of the study notes that I gathered. Everything can be found in the official AWS documentation, and for some bullets I included a * which links to the appropriate AWS documentation. These notes are just things that I found interesting and worth remembering; they in no way represent everything that needs to be covered for the exam.

Please note that I won’t keep this constantly up-to-date and if you find some mistakes or outdated information, please inform me and I’ll do my best to correct it ASAP.

Kafka MSK

  • Best way to size Kafka MSK cluster?
    • Use your on-prem cluster as a guideline
    • MSK calculator for pricing and sizing
  • Stores events as a continuous series of records and preserves the order in which the records were produced. Data consumers process data from Apache Kafka topics on a first-in-first-out basis, preserving the order data was produced *

Kinesis Data Streams

  • A partition key is used to group data by shard within a stream. It segregates the data records belonging to a stream into multiple shards. It uses the partition key that is associated with each data record to determine which shard a given record belongs to
  • Latency can increase if there is an increase in record count or record size for each GET request
  • Spark Streaming can read and write to Kinesis Data Streams
  • For PutRecords API a failed record is skipped and all subsequent records are processed. Therefore, the PutRecords API call does not guarantee data record ordering. PutRecord API guarantees record ordering when writing to the same shard *
  • The IncomingBytes and IncomingRecords metrics show you the rate at which your shard is ingesting data. These metrics will alert you when you have a hot shard *
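
To make the partition key bullet more concrete: Kinesis MD5-hashes the partition key into a 128-bit integer and routes the record to the shard whose hash key range contains that value. Below is a minimal sketch of that mapping. The two shard IDs and their ranges are made up for illustration; on a real stream you would read the ranges from the shard descriptions.

```python
import hashlib

# Hypothetical two-shard stream: each shard owns a contiguous slice of the
# 128-bit hash key space (these ranges are invented for the example).
MAX_HASH_KEY = 2**128 - 1
SHARDS = {
    "shardId-000000000000": (0, MAX_HASH_KEY // 2),
    "shardId-000000000001": (MAX_HASH_KEY // 2 + 1, MAX_HASH_KEY),
}


def hash_key_for(partition_key: str) -> int:
    """MD5-hash the partition key into a 128-bit integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, byteorder="big")


def shard_for(partition_key: str) -> str:
    """Return the shard whose hash key range contains the key's hash."""
    hash_key = hash_key_for(partition_key)
    for shard_id, (start, end) in SHARDS.items():
        if start <= hash_key <= end:
            return shard_id
    raise ValueError("hash key outside every shard range")
```

The same partition key always lands on the same shard, which is why a single very popular key can produce a hot shard.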

Kinesis Client Library

  • Instantiates a record processor for each shard
  • Kinesis Data Streams shards support up to 1,000 Kinesis Data Streams records per second, or 1 MB per second of throughput. The records-per-second limit binds customers with records smaller than 1 KB. Record aggregation allows customers to combine multiple records into a single Kinesis Data Streams record. This allows customers to improve their per shard throughput *
  • After de-aggregating the KDS record use KPL user record sequence number as your unique identifier. The KCL subsequence number is used primarily for checkpointing *
  • If we have KDS with 4 shards and one KCL app, it will process all 4 shards. If we add another KCL app, it will balance out with the first KCL app. So each KCL app will process 2 shards *
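
The aggregation bullet is easier to remember with numbers. Here is a rough back-of-the-envelope sketch using the per-shard limits quoted above. It treats 1 MB as 10**6 bytes and ignores the overhead of the aggregation format, so take the results as approximations.

```python
SHARD_MAX_RECORDS_PER_SEC = 1_000    # Kinesis records per shard per second
SHARD_MAX_BYTES_PER_SEC = 1_000_000  # assumption: 1 MB treated as 10**6 bytes


def max_user_records_per_sec(user_record_bytes: int, aggregated: bool) -> int:
    """Approximate user records per second that fit into one shard."""
    by_bytes = SHARD_MAX_BYTES_PER_SEC // user_record_bytes
    if aggregated:
        # Many user records share one Kinesis record, so only the byte
        # limit binds (aggregation overhead is ignored here).
        return by_bytes
    # Without aggregation every user record counts against both limits.
    return min(SHARD_MAX_RECORDS_PER_SEC, by_bytes)
```

With 100-byte records the function gives 1,000 records per second without aggregation and 10,000 with it, while for 2,000-byte records aggregation makes no difference because the byte limit already binds.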

Kinesis Producer Library

  • Rate limiting is only possible through KPL and is implemented using tokens and buckets within Amazon Kinesis
  • The KPL automatically adds any records that failed in a PutRecords call back into the KPL buffer so they can be retried
  • Changing RecordMaxBufferedTime to a higher value will increase your aggregate package size. You must restart the KPL app for changes to take effect
  • The KPL is written in C++ and runs as a child process to the main process. Precompiled native binaries are bundled with Java release and are managed by the Java wrapper. KPL requires you to write your producer code in Java. If you want to write it using Python use KPL Aggregation and Deaggregation modules for AWS Lambda * **
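
If you call PutRecords yourself instead of going through the KPL, you have to handle the partial failures on your own. The sketch below is not how the KPL is implemented; it only shows the idea, assuming the documented PutRecords response shape, where the result entries come back in the same order as the request entries and failed ones carry an ErrorCode.

```python
def failed_entries(request_records: list, response: dict) -> list:
    """Return the request entries that PutRecords reported as failed."""
    if response.get("FailedRecordCount", 0) == 0:
        return []
    # Result entries are positionally aligned with the request entries;
    # a failed entry has ErrorCode/ErrorMessage instead of SequenceNumber.
    return [
        record
        for record, result in zip(request_records, response["Records"])
        if "ErrorCode" in result
    ]
```

The returned list can be sent again in a follow-up PutRecords call, ideally with some backoff. Keep in mind that retried records arrive after the ones that succeeded, which is exactly why ordering is not guaranteed.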

Kinesis Firehose

  • Limits:
    • Max record size sent to Firehose: 1MB
    • Buffer size: 1MB to 128MB
    • Flush interval: 60 to 900 seconds
  • AVRO is not supported
  • When a Kinesis data stream is configured as the source of a Firehose delivery stream, the Firehose PutRecord and PutRecordBatch operations will be disabled *
  • You can change the delivery stream destination without interrupting the flow of data through the delivery stream by using the UpdateDestination API call *
  • Can be configured to write the original source data records to another S3 bucket
  • The SucceedProcessing metric data in CloudWatch tells you how many records were successfully processed over a period of time when using Lambda for transformation *
  • The TagDeliveryStream API operation allows you to apply tags to an existing delivery stream *
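
The buffer limits above interact in a simple way: Firehose delivers a batch as soon as either the buffer size or the flush interval is reached, whichever happens first. Here is a small sketch of that rule using the limits from my notes. The helper names are my own; only SizeInMBs and IntervalInSeconds mirror the field names of the BufferingHints setting.

```python
def buffering_hints(size_in_mbs: int, interval_in_seconds: int) -> dict:
    """Build buffering hints, rejecting values outside the limits noted above."""
    if not 1 <= size_in_mbs <= 128:
        raise ValueError("buffer size must be between 1 and 128 MB")
    if not 60 <= interval_in_seconds <= 900:
        raise ValueError("flush interval must be between 60 and 900 seconds")
    return {"SizeInMBs": size_in_mbs, "IntervalInSeconds": interval_in_seconds}


def should_flush(buffered_mb: float, seconds_since_flush: float, hints: dict) -> bool:
    """A batch is delivered when either threshold is reached first."""
    return (
        buffered_mb >= hints["SizeInMBs"]
        or seconds_since_flush >= hints["IntervalInSeconds"]
    )
```

This is also why Firehose counts as near real-time rather than real-time: with the smallest interval in these notes, a quiet stream can still wait 60 seconds before anything is delivered.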

Kinesis Data Analytics

  • Flink apps can be written in Java or Scala
  • Using Flink you can leverage checkpointing for fault tolerance while also leveraging parallel execution of tasks and allocating resources to implement scaling of your app *

DynamoDB

  • You can create on-demand backups and enable point-in-time recovery (PITR) for your DynamoDB tables
  • Row-level security for IAM users is possible with DynamoDB
  • ACID transactions are replicated from the source region to the replica regions only after the source region change is committed. This is the intended design of DynamoDB global tables *
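
Row-level security in DynamoDB is done with an IAM policy condition on dynamodb:LeadingKeys, which restricts a caller to items whose partition key matches a value tied to their identity. The sketch below builds such a policy. It assumes the table's partition key holds the IAM user name; the table ARN and the list of actions are placeholders you would adapt.

```python
import json


def row_level_policy(table_arn: str) -> dict:
    """IAM policy limiting a user to items keyed by their own user name."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "dynamodb:GetItem",
                    "dynamodb:Query",
                    "dynamodb:PutItem",
                    "dynamodb:UpdateItem",
                ],
                "Resource": table_arn,
                "Condition": {
                    # Partition key values must equal the caller's user name.
                    "ForAllValues:StringEquals": {
                        "dynamodb:LeadingKeys": ["${aws:username}"]
                    }
                },
            }
        ],
    }


# Hypothetical table ARN, for illustration only.
policy_json = json.dumps(
    row_level_policy("arn:aws:dynamodb:eu-west-1:111122223333:table/ExampleTable"),
    indent=2,
)
```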

Glue

  • To trigger a job after a crawler, use:
    • Lambda function and CloudWatch Event rule
    • AWS Glue workflow
  • Glue jobs can be scheduled at a minimum interval of 5 minutes
  • For schemas to be considered similar the following conditions must be true *:
    • The partition threshold is higher than 70%
    • The maximum number of different schemas does not exceed 5
  • We can run Glue DataBrew on a schedule to check data quality, schema integrity *
  • Glue crawler - for data stored in Redshift and RDS you need to use JDBC connector. DynamoDB has the native DynamoDB interface for crawler
  • The Glue DynamicFrame does not require schema. It determines the schema in real-time while automatically resolving potential schema issues *
  • Glue crawler - for RDS, glue crawlers need all TCP ports open on the security group where the data source resides. To protect the database security group from outside access via a TCP port you also configure a self-referencing inbound rule for all TCP ports *
  • Glue worker node types *:
    • Standard
    • G.1X - Good for memory intensive jobs, uses 1DPU per worker (1DPU = 4vCPU, 16GB memory, 64GB disk)
    • G.2X - 2DPU per worker. We recommend this worker type for memory-intensive jobs and jobs that run machine learning transforms
    • G.025X - 0.25 DPU per worker. We recommend this worker type for low volume streaming jobs
  • When you enable job metrics in your Glue job def, the job initializes a GlueContext class which is then used to init SparkContext *
  • The Glue Unbox built-in transform reformats string fields, like a JSON field, into distinct fields representing the types of the composites *
  • When you have multiple concurrent jobs with job bookmarks and the maximum job concurrency is not set to 1, the job bookmark does not work correctly *
  • For crawler, you have to point it to a bucket/prefix. If you point it to a specific file (for example, .csv), it will create a table with correct column names but it won’t populate the table with data
  • The FindMatches transform will find duplicate records even when the records do not have a common unique identifier and no fields match exactly *

Aurora

  • Cannot scale past 128TB (64TB on older engine versions)

Athena

  • Cost control *:
    • Athena allows you to set two types of cost controls: per-query limit and per-workgroup limit
    • For each workgroup, you can set only one per-query limit and multiple per-workgroup limits
    • The workgroup-wide data usage control limit specifies the total amount of data scanned for all queries that run in this workgroup during the specified time period. You can create multiple limits per workgroup. The workgroup-wide query limit allows you to set multiple thresholds on hourly or daily aggregates on data scanned by queries running in the workgroup
  • To make sure all of your Athena query data is encrypted, you have to encrypt the entire Glue data catalog and encrypt the results of your Athena queries which Athena stores in S3 result location
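
For the per-query limit, this is roughly what the workgroup configuration looks like if you create the workgroup through the API. It is a sketch only: the function name is my own, and the returned dict is meant to be passed as the Configuration argument of the Athena CreateWorkGroup call. As far as I know, the workgroup-wide limits are not part of this structure and are configured separately.

```python
def workgroup_configuration(per_query_limit_bytes: int) -> dict:
    """Workgroup settings with a single per-query data scan limit."""
    if per_query_limit_bytes <= 0:
        raise ValueError("the per-query limit must be a positive number of bytes")
    return {
        # Queries that scan more than this many bytes are cancelled.
        "BytesScannedCutoffPerQuery": per_query_limit_bytes,
        # Prevent clients from overriding the workgroup settings.
        "EnforceWorkGroupConfiguration": True,
        # Publish metrics so data usage can be monitored in CloudWatch.
        "PublishCloudWatchMetricsEnabled": True,
    }


# Example: cap every query in the workgroup at 1 GB scanned.
config = workgroup_configuration(1_000_000_000)
```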

Redshift

  • Short Query Acceleration (SQA) can be used in place of WLM as a simple way to ensure short queries are not scheduled behind longer ones
  • To use Redshift Spectrum with data in an S3 bucket in a different account: add a policy to the S3 bucket allowing S3 GET and LIST for an IAM role for Spectrum on the Redshift account
  • To maintain a real-time replica of a Redshift cluster across multiple AZs: spin up a separate cluster in a different AZ and use Kinesis to write data into both clusters simultaneously. Use Route53 to direct queries to the nearest cluster.
  • Is not Multi AZ
  • Automatically snapshots data to S3
  • Can automatically load in parallel from multiple compressed data files. Multiple concurrent COPY commands are much slower since it forces Redshift to perform a serialized load and requires a VACUUM. If you want to load data in parallel it’s better to split the data into separate files no more than 1GB and use a single COPY command *
  • Has much better performance than Athena for complex analytical queries
  • Enhanced VPC Routing forces Redshift to use the VPC for all COPY and UNLOAD commands, which can be seen in VPC Flow logs
  • Currently, you can only use Amazon S3-managed keys (SSE-S3) encryption (AES-256) for audit logging *
  • You can apply compression encodings to columns in tables manually, based on your own evaluation of the data. Or you can use the COPY command with COMPUPDATE set to ON to analyze and apply compression automatically based on sample data *
  • You can’t modify the destination AWS Region after cross-Region snapshot copy is configured. If you want to copy snapshots to a different AWS Region, first disable cross-Region snapshot copy. Then re-enable it with a new destination AWS Region and retention period *
  • COPY command requires three parameters *:
    • Table name
    • Data source
    • Authorization to access data using an IAM role
  • Node types *:
    • RA3 - if you expect rapid data growth
    • DC2 - if you have less than 10TB, without rapid growth
    • DS2 - legacy nodes, no longer in use
  • If we start with a small table but expect rapid growth it’s recommended to use AUTO distribution style
  • Using a stored procedure in Redshift you can limit data access to users. When you create a stored procedure, you can set the SECURITY attribute to either DEFINER or INVOKER. If you specify SECURITY INVOKER, the procedure uses the privileges of the user invoking the procedure. If you specify SECURITY DEFINER, the procedure uses the privileges of the owner of the procedure. INVOKER is the default *
  • Automatic VACUUM operations can pause if the cluster experiences a period of high load
  • HSM encryption is the most secure encryption you can use on Redshift cluster. You cannot modify existing cluster to use HSM. You have to create a new cluster with HSM and migrate the data *
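
To tie the COPY bullets together: a single COPY statement pointed at a common S3 prefix picks up all the split files under it and loads them in parallel. The helper below just assembles such a statement from the three required parameters. The table name, bucket, and role ARN in the example are placeholders, and the function does no escaping, so treat it as an illustration rather than something to paste into production.

```python
def build_copy_statement(
    table: str, s3_prefix: str, iam_role_arn: str, gzip: bool = True
) -> str:
    """Assemble one COPY statement that loads every file under an S3 prefix."""
    options = " GZIP" if gzip else ""
    return f"COPY {table} FROM '{s3_prefix}' IAM_ROLE '{iam_role_arn}'{options};"


# Hypothetical names: files were split as part-0000.gz, part-0001.gz, ...
statement = build_copy_statement(
    "sales",
    "s3://example-bucket/sales/part-",
    "arn:aws:iam::111122223333:role/ExampleRedshiftCopyRole",
)
```

Running one statement like this, instead of one COPY per file, is what lets Redshift spread the files across the slices of the cluster.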

EMR

  • Use S3DistCp to copy data from S3 into HDFS and process it locally, upon completion use S3DistCp to push the final results back to S3
  • Apache Hue and Apache Ambari are graphical front-ends for interacting with a cluster
  • Chunks of 64MB are ideal for HDFS
  • Encryption options: LUKS encryption, SSE-KMS, SSE-S3, EBS encryption
  • Pig integration with S3 *:
    • Directly write to HCatalog tables in S3
    • Submit Pig scripts stored in S3 using EMR console
    • Loading custom JAR files from S3 with the REGISTER command
  • HBase is designed to be an OLTP engine, allowing an architecture of high-volume transactional operations
  • HBase integration with S3 *:
    • Snapshots of HBase data to S3
    • Storage of HBase StoreFiles and metadata on S3
    • HBase read-replicas on S3
  • To scale cluster based on YARN memory usage use the metric YARNMemoryAvailablePercentage *
  • To perform actions on data stored in DynamoDB from EMR use Apache Hive *
  • Bootstrap actions to install additional software *:
    • Upload the required installation scripts to S3 and execute them using custom bootstrap actions
    • Provision an EC2 instance with Amazon Linux and install the required libs. Create an AMI of it and use it to launch EMR cluster
  • To copy data from DynamoDB table into HDFS as csv files, create an external Hive table *:
    CREATE EXTERNAL TABLE hdfs_features_csv(...)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LOCATION 'hdfs:///user/hadoop/hive-test';
    INSERT OVERWRITE TABLE hdfs_features_csv SELECT * FROM ddb_features;
    
  • For ML apps use Cluster Compute Instance types *
  • We can run multiple steps in parallel to improve cluster utilization and save cost. The default value for the concurrency level is 10. You can choose between 2 and 256 steps that can run in parallel *
  • To add additional steps to a cluster we can use aws emr add-steps cli command *
  • The valid actions on failure for Hive scripts are *:
    • Terminate cluster: If the step fails, terminate the cluster. If the cluster has termination protection enabled AND keep alive enabled, it will not terminate
    • Cancel and wait: If the step fails, cancel the remaining steps. If the cluster has keep alive enabled, the cluster will not terminate
    • Continue: If the step fails, continue to the next step
  • If you have a cluster with multiple users who need different levels of access to data in Amazon S3 through EMRFS, you can set up a security configuration with IAM roles for EMRFS. EMRFS can assume a different service role for cluster EC2 instances based on the user or group making the request, or based on the location of data in Amazon S3 *
  • You can use AWS Service Catalog to centrally manage commonly deployed EMR cluster configurations
  • Kerberos without EC2 private key file *:
    • Cross-realm trust
    • External KDC - cluster KDC on a different cluster with Active Directory cross-realm trust
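
Going back to the add-steps and action-on-failure bullets above, here is a sketch of what a Hive step definition looks like when you add it through the API instead of the CLI. The dict follows the step structure accepted by the EMR AddJobFlowSteps call; the step name and script location are placeholders, and the helper function is my own.

```python
VALID_ACTIONS = {"TERMINATE_CLUSTER", "CANCEL_AND_WAIT", "CONTINUE"}


def hive_step(name: str, script_s3_uri: str, action_on_failure: str = "CONTINUE") -> dict:
    """Build an EMR step that runs a Hive script stored in S3."""
    if action_on_failure not in VALID_ACTIONS:
        raise ValueError(f"unsupported action on failure: {action_on_failure}")
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            # command-runner.jar launches the Hive script on the cluster.
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script", "--args", "-f", script_s3_uri],
        },
    }


# Hypothetical script location, for illustration only.
step = hive_step("nightly-report", "s3://example-bucket/scripts/report.q", "CANCEL_AND_WAIT")
```

With boto3 you would pass a list of such dicts as the Steps argument of add_job_flow_steps, together with the cluster ID.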

S3

  • Glacier Select allows you to perform filtering directly against Glacier objects using standard SQL
  • With S3 Select you can scan a subset of an object by specifying a range of bytes to query using the ScanRange parameter *
  • How to check integrity of an object uploaded to S3: To ensure that data is not corrupted traversing the network, use the Content-MD5 header. When you use this header, Amazon S3 checks the object against the provided MD5 value and, if they do not match, returns an error. Additionally, you can calculate the MD5 while putting an object to Amazon S3 and compare the returned ETag to the calculated MD5 value *
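
The integrity check boils down to two encodings of the same MD5 digest: base64 for the Content-MD5 request header and hex for the comparison with the ETag. A minimal sketch with the standard library is shown below. Note that the ETag only equals the MD5 of the object for single-part uploads that are not encrypted with KMS, so the comparison does not work for multipart uploads.

```python
import base64
import hashlib


def content_md5(body: bytes) -> str:
    """Value for the Content-MD5 header: base64 of the raw MD5 digest."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


def expected_etag(body: bytes) -> str:
    """Hex MD5 digest, which is what a single-part upload returns as ETag."""
    return hashlib.md5(body).hexdigest()


def etag_matches(body: bytes, returned_etag: str) -> bool:
    """Compare the locally computed MD5 with the ETag returned by S3."""
    # S3 returns the ETag wrapped in double quotes.
    return returned_etag.strip('"') == expected_etag(body)
```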

DMS

  • You can use DMS data validation to ensure that your data has migrated accurately. DMS compares the source and target records and then reports any mismatches *
  • When a batch fails, DMS breaks the batch down and switches to one-by-one mode to apply transactions. Once the failed batch has been applied one by one, it switches back to batch mode *
  • One of the ways your migration can slow down is because your source latency or target latency is high. To discover the problem monitor CloudWatch entries for CDCLatencySource and CDCLatencyTarget
  • If you start a DMS task with CDC, views will not be migrated. The only way to migrate both tables and views is to start a full-load-only DMS task *
  • When migrating with CDC, we use a CDC recovery checkpoint in the source endpoint to start the CDC from a specific time/point *

OpenSearch

  • To connect securely to Kibana:
    • Set up a reverse proxy server between your browser and Amazon OpenSearch service
    • Set up an SSH tunnel with port forwarding to allow access on port 5601
  • To move a Kibana dashboard from one OpenSearch domain to another, simply export the dashboard and then import it into the target domain

QuickSight

  • Can read Excel files directly
  • Does not support Parquet format while reading the data from S3
  • 4 ways to refresh SPICE data *:
    • UI
    • Refresh dataset by editing the dataset
    • Schedule refresh
    • Use CreateIngestion API
  • Using the Manage QuickSight option in QS console, you can whitelist the domains where you wish to have your dashboards embedded *
  • Has ML-powered anomaly detection insight *
  • Handles compressed files in gzip format automatically
  • Can use Presto
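
Since the API-based SPICE refresh came up on my exam, here is a small sketch of it. The function expects a QuickSight client (for example the one returned by boto3.client("quicksight")) and calls CreateIngestion with a freshly generated ingestion ID. The function name is my own, and the account and dataset IDs are whatever applies to your setup.

```python
import uuid


def refresh_spice(client, account_id: str, dataset_id: str) -> str:
    """Start a SPICE refresh for a dataset and return the ingestion ID."""
    # Each ingestion needs its own ID; a random UUID is a simple choice.
    ingestion_id = str(uuid.uuid4())
    client.create_ingestion(
        AwsAccountId=account_id,
        DataSetId=dataset_id,
        IngestionId=ingestion_id,
    )
    return ingestion_id
```

Passing the client in as a parameter keeps the function easy to test with a stub, and the returned ID can be used later to check on the status of the refresh.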

Data Pipeline

  • PigActivity provides native support for Pig scripts in AWS Data Pipeline without requirement to use ShellCommandActivity or EmrActivity *
  • HiveActivity makes it easier to set up an EMR activity and automatically creates Hive tables to run HiveQL *
