A telecommunications company is looking for an anomaly-detection solution to identify fraudulent calls. The company currently uses Amazon Kinesis to stream voice call records in JSON format from its on-premises database to Amazon S3. The existing dataset contains voice call records with 200 columns. To detect fraudulent calls, the solution needs to look at only 5 of these columns. The company is interested in a cost-effective solution using AWS that requires minimal effort and experience in anomaly-detection algorithms. Which solution meets these requirements?


A) Use an AWS Glue job to transform the data from JSON to Apache Parquet. Use AWS Glue crawlers to discover the schema and build the AWS Glue Data Catalog. Use Amazon Athena to create a table with a subset of columns. Use Amazon QuickSight to visualize the data and then use Amazon QuickSight machine learning-powered anomaly detection.
B) Use Kinesis Data Firehose to detect anomalies on a data stream from Kinesis by running SQL queries, which compute an anomaly score for all calls and store the output in Amazon RDS. Use Amazon Athena to build a dataset and Amazon QuickSight to visualize the results.
C) Use an AWS Glue job to transform the data from JSON to Apache Parquet. Use AWS Glue crawlers to discover the schema and build the AWS Glue Data Catalog. Use Amazon SageMaker to build an anomaly detection model that can detect fraudulent calls by ingesting data from Amazon S3.
D) Use Kinesis Data Analytics to detect anomalies on a data stream from Kinesis by running SQL queries, which compute an anomaly score for all calls. Connect Amazon QuickSight to Kinesis Data Analytics to visualize the anomaly scores.
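
For context, a minimal PySpark sketch of the JSON-to-Parquet conversion and column pruning that options A and C describe; the bucket and column names below are hypothetical placeholders, not from the question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("calls-json-to-parquet").getOrCreate()

# Read the raw call records that Kinesis delivered to S3 as JSON (hypothetical bucket).
calls = spark.read.json("s3://example-raw-call-records/")

# Keep only the handful of columns the anomaly-detection step needs (hypothetical names).
subset = calls.select("caller_id", "callee_id", "call_duration",
                      "call_start_time", "destination_country")

# Write a columnar copy that Athena, QuickSight, or SageMaker can scan cheaply.
subset.write.mode("overwrite").parquet("s3://example-curated-call-records/")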


A company is sending historical datasets to Amazon S3 for storage. A data engineer at the company wants to make these datasets available for analysis using Amazon Athena. The engineer also wants to encrypt the Athena query results in an S3 results location by using AWS solutions for encryption. The requirements for encrypting the query results are as follows: Use custom keys for encryption of the primary dataset query results. Use generic encryption for all other query results. Provide an audit trail for the primary dataset queries that shows when the keys were used and by whom. Which solution meets these requirements?


A) Use server-side encryption with S3 managed encryption keys (SSE-S3) for the primary dataset. Use SSE-S3 for the other datasets.
B) Use server-side encryption with customer-provided encryption keys (SSE-C) for the primary dataset. Use server-side encryption with S3 managed encryption keys (SSE-S3) for the other datasets.
C) Use server-side encryption with AWS KMS managed customer master keys (SSE-KMS CMKs) for the primary dataset. Use server-side encryption with S3 managed encryption keys (SSE-S3) for the other datasets.
D) Use client-side encryption with AWS Key Management Service (AWS KMS) customer managed keys for the primary dataset. Use S3 client-side encryption with client-side keys for the other datasets.
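
As a reference for how Athena applies these encryption choices to query results, here is a minimal boto3 sketch of a query whose results are encrypted with a KMS customer managed key (the mechanism in option C); the database, bucket, and key ARN are hypothetical placeholders.

import boto3

athena = boto3.client("athena")

athena.start_query_execution(
    QueryString="SELECT * FROM primary_dataset LIMIT 10",
    QueryExecutionContext={"Database": "analytics_db"},       # hypothetical database
    ResultConfiguration={
        "OutputLocation": "s3://example-athena-results/primary/",
        "EncryptionConfiguration": {
            "EncryptionOption": "SSE_KMS",   # KMS key usage is recorded in CloudTrail
            "KmsKey": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
        },
    },
)
# For the other datasets, the same call with EncryptionOption "SSE_S3" (and no
# KmsKey) gives generic S3-managed encryption of the results.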


A hospital uses wearable medical sensor devices to collect data from patients. The hospital is architecting a near-real-time solution that can ingest the data securely at scale. The solution should also be able to remove the patient's protected health information (PHI) from the streaming data and store the data in durable storage. Which solution meets these requirements with the least operational overhead?


A) Ingest the data using Amazon Kinesis Data Streams, which invokes an AWS Lambda function using Kinesis Client Library (KCL) to remove all PHI. Write the data in Amazon S3.
B) Ingest the data using Amazon Kinesis Data Firehose to write the data to Amazon S3. Have Amazon S3 trigger an AWS Lambda function that parses the sensor data to remove all PHI in Amazon S3.
C) Ingest the data using Amazon Kinesis Data Streams to write the data to Amazon S3. Have the data stream launch an AWS Lambda function that parses the sensor data and removes all PHI in Amazon S3.
D) Ingest the data using Amazon Kinesis Data Firehose to write the data to Amazon S3. Implement a transformation AWS Lambda function that parses the sensor data to remove all PHI.
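
For context, a minimal sketch of the Kinesis Data Firehose transformation Lambda described in option D; the PHI field names are hypothetical placeholders.

import base64
import json

PHI_FIELDS = {"patient_name", "ssn", "address", "date_of_birth"}  # hypothetical PHI fields

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Drop the protected health information before Firehose delivers to S3.
        cleaned = {k: v for k, v in payload.items() if k not in PHI_FIELDS}
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(cleaned).encode()).decode(),
        })
    return {"records": output}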


A large company has a central data lake to run analytics across different departments. Each department uses a separate AWS account and stores its data in an Amazon S3 bucket in that account. Each AWS account uses the AWS Glue Data Catalog as its data catalog. There are different data lake access requirements based on roles. Associate analysts should only have read access to their departmental data. Senior data analysts can have access to data in multiple departments, including their own, but only for a subset of columns. Which solution achieves these required access patterns while minimizing costs and administrative tasks?


A) Consolidate all AWS accounts into one account. Create different S3 buckets for each department and move all the data from every account to the central data lake account. Migrate the individual data catalogs into a central data catalog and apply fine-grained permissions to give to each user the required access to tables and databases in AWS Glue and Amazon S3.
B) Keep the account structure and the individual AWS Glue catalogs on each account. Add a central data lake account and use AWS Glue to catalog data from various accounts. Configure cross-account access for AWS Glue crawlers to scan the data in each departmental S3 bucket to identify the schema and populate the catalog. Add the senior data analysts into the central account and apply highly detailed access controls in the Data Catalog and Amazon S3.
C) Set up an individual AWS account for the central data lake. Use AWS Lake Formation to catalog the cross-account locations. On each individual S3 bucket, modify the bucket policy to grant S3 permissions to the Lake Formation service-linked role. Use Lake Formation permissions to add fine-grained access controls to allow senior analysts to view specific tables and columns.
D) Set up an individual AWS account for the central data lake and configure a central S3 bucket. Use an AWS Lake Formation blueprint to move the data from the various buckets into the central S3 bucket. On each individual bucket, modify the bucket policy to grant S3 permissions to the Lake Formation service-linked role. Use Lake Formation permissions to add fine-grained access controls for both associate and senior analysts to view specific tables and columns.
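
For reference, a minimal boto3 sketch of the column-level Lake Formation grant that options C and D rely on; the role ARN, database, table, and column names are hypothetical placeholders.

import boto3

lakeformation = boto3.client("lakeformation")

lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier":
               "arn:aws:iam::111122223333:role/SeniorDataAnalyst"},  # hypothetical role
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_dept",
            "Name": "orders",
            "ColumnNames": ["order_id", "order_date", "region", "amount"],
        }
    },
    Permissions=["SELECT"],   # read-only, limited to the listed columns
)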


A university intends to use Amazon Kinesis Data Firehose to collect JSON-formatted batches of water quality readings in Amazon S3. The readings are from 50 sensors scattered across a local lake. Students will query the stored data using Amazon Athena to observe changes in a captured metric over time, such as water temperature or acidity. Interest has grown in the study, prompting the university to reconsider how data will be stored. Which data format and partitioning choices will MOST significantly reduce costs? (Choose two.)


A) Store the data in Apache Avro format using Snappy compression.
B) Partition the data by year, month, and day.
C) Store the data in Apache ORC format using no compression.
D) Store the data in Apache Parquet format using Snappy compression.
E) Partition the data by sensor, year, month, and day.
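
For context, a minimal PySpark sketch of the storage layout the Parquet and date-partitioning options describe; the bucket names are hypothetical placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sensor-readings-to-parquet").getOrCreate()

readings = spark.read.json("s3://example-raw-sensor-readings/")  # hypothetical bucket

# Snappy-compressed Parquet plus coarse date partitions keeps each Athena scan small.
(readings.write
    .partitionBy("year", "month", "day")
    .option("compression", "snappy")
    .mode("append")
    .parquet("s3://example-curated-sensor-readings/"))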


Three teams of data analysts use Apache Hive on an Amazon EMR cluster with the EMR File System (EMRFS) to query data stored within each team's Amazon S3 bucket. The EMR cluster has Kerberos enabled and is configured to authenticate users from the corporate Active Directory. The data is highly sensitive, so access must be limited to the members of each team. Which steps will satisfy the security requirements?


A) For the EMR cluster Amazon EC2 instances, create a service role that grants no access to Amazon S3. Create three additional IAM roles, each granting access to each team's specific bucket. Add the additional IAM roles to the cluster's EMR role for the EC2 trust policy. Create a security configuration mapping for the additional IAM roles to Active Directory user groups for each team.
B) For the EMR cluster Amazon EC2 instances, create a service role that grants no access to Amazon S3. Create three additional IAM roles, each granting access to each team's specific bucket. Add the service role for the EMR cluster EC2 instances to the trust policies for the additional IAM roles. Create a security configuration mapping for the additional IAM roles to Active Directory user groups for each team.
C) For the EMR cluster Amazon EC2 instances, create a service role that grants full access to Amazon S3. Create three additional IAM roles, each granting access to each team's specific bucket. Add the service role for the EMR cluster EC2 instances to the trust policies for the additional IAM roles. Create a security configuration mapping for the additional IAM roles to Active Directory user groups for each team.
D) For the EMR cluster Amazon EC2 instances, create a service role that grants full access to Amazon S3. Create three additional IAM roles, each granting access to each team's specific bucket. Add the service role for the EMR cluster EC2 instances to the trust policies for the base IAM roles. Create a security configuration mapping for the additional IAM roles to Active Directory user groups for each team.
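
As a rough illustration of the "security configuration mapping" these options mention, the sketch below (an assumption about how EMRFS IAM role mappings are expressed, not taken from the question) creates an EMR security configuration that maps per-team IAM roles to Active Directory groups with boto3; the role ARNs and group names are hypothetical.

import boto3
import json

emr = boto3.client("emr")

security_config = {
    "AuthorizationConfiguration": {
        "EmrFsConfiguration": {
            "RoleMappings": [
                {   # hypothetical role and AD group for team A
                    "Role": "arn:aws:iam::111122223333:role/TeamABucketAccess",
                    "IdentifierType": "Group",
                    "Identifiers": ["team-a-analysts"],
                },
                {   # hypothetical role and AD group for team B
                    "Role": "arn:aws:iam::111122223333:role/TeamBBucketAccess",
                    "IdentifierType": "Group",
                    "Identifiers": ["team-b-analysts"],
                },
            ]
        }
    }
}

emr.create_security_configuration(
    Name="per-team-emrfs-roles",
    SecurityConfiguration=json.dumps(security_config),
)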


An operations team notices that a few AWS Glue jobs for a given ETL application are failing. The AWS Glue jobs read a large number of small JSON files from an Amazon S3 bucket and write the data to a different S3 bucket in Apache Parquet format with no major transformations. Upon initial investigation, a data engineer notices the following error message in the History tab on the AWS Glue console: "Command Failed with Exit Code 1." Upon further investigation, the data engineer notices that the driver memory profile of the failed jobs crosses the safe threshold of 50% usage quickly and reaches 90-95% soon after. The average memory usage across all executors continues to be less than 4%. The data engineer also notices the following error while examining the related Amazon CloudWatch Logs. What should the data engineer do to solve the failure in the MOST cost-effective way?


A) Change the worker type from Standard to G.2X.
B) Modify the AWS Glue ETL code to use the 'groupFiles': 'inPartition' feature.
C) Increase the fetch size setting by using AWS Glue dynamic frames.
D) Modify maximum capacity to increase the total maximum data processing units (DPUs) used.
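
For context, a minimal sketch of the small-file grouping feature named in option B, which lets the Glue driver coalesce many small S3 JSON files into larger read tasks; the bucket names and group size are hypothetical placeholders.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://example-small-json-files/"],   # hypothetical source bucket
        "recurse": True,
        "groupFiles": "inPartition",   # group small files into larger splits
        "groupSize": "134217728",      # target roughly 128 MB per group (hypothetical)
    },
    format="json",
)

glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://example-parquet-output/"},
    format="parquet",
)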


A company's marketing team has asked for help in identifying a high-performing long-term storage service for its data based on the following requirements: The data size is approximately 32 TB uncompressed. There is a low volume of single-row inserts each day. There is a high volume of aggregation queries each day. Multiple complex joins are performed. The queries typically involve a small subset of the columns in a table. Which storage service will provide the MOST performant solution?


A) Amazon Aurora MySQL
B) Amazon Redshift
C) Amazon Neptune
D) Amazon Elasticsearch


An airline has been collecting metrics on flight activities for analytics. A recently completed proof of concept demonstrates how the company provides insights to data analysts to improve on-time departures. The proof of concept used objects in Amazon S3, which contained the metrics in .csv format, and used Amazon Athena for querying the data. As the amount of data increases, the data analyst wants to optimize the storage solution to improve query performance. Which options should the data analyst use to improve performance as the data lake grows? (Choose three.)


A) Add a randomized string to the beginning of the keys in S3 to get more throughput across partitions.
B) Use an S3 bucket in the same account as Athena.
C) Compress the objects to reduce the data transfer I/O.
D) Use an S3 bucket in the same Region as Athena.
E) Preprocess the .csv data to JSON to reduce I/O by fetching only the document keys needed by the query.
F) Preprocess the .csv data to Apache Parquet to reduce I/O by fetching only the data blocks needed for predicates.
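
For reference, a minimal boto3 sketch of converting the .csv table to compressed, partitioned Parquet with an Athena CTAS statement, which is one way to apply the format and compression options above; the database, table, bucket, and column names are hypothetical placeholders.

import boto3

athena = boto3.client("athena")

ctas = """
CREATE TABLE flights_parquet
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://example-flight-metrics/parquet/',
    partitioned_by = ARRAY['year', 'month']
) AS
SELECT flight_id, origin, destination, departure_delay, year, month
FROM flights_csv
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "flight_metrics"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)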


A company has developed an Apache Hive script to batch process data stored in Amazon S3. The script needs to run once every day and store the output in Amazon S3. The company tested the script, and it completes within 30 minutes on a small local three-node cluster. Which solution is the MOST cost-effective for scheduling and executing the script?


A) Create an AWS Lambda function to spin up an Amazon EMR cluster with a Hive execution step. Set KeepJobFlowAliveWhenNoSteps to false and disable the termination protection flag. Use Amazon CloudWatch Events to schedule the Lambda function to run daily.
B) Use the AWS Management Console to spin up an Amazon EMR cluster with Python, Hue, Hive, and Apache Oozie. Set the termination protection flag to true and use Spot Instances for the core nodes of the cluster. Configure an Oozie workflow in the cluster to invoke the Hive script daily.
C) Create an AWS Glue job with the Hive script to perform the batch operation. Configure the job to run once a day using a time-based schedule.
D) Use AWS Lambda layers and load the Hive runtime to AWS Lambda and copy the Hive script. Schedule the Lambda function to run daily by creating a workflow using AWS Step Functions.
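
For context, a minimal boto3 sketch of the transient EMR cluster described in option A: the cluster runs one Hive step and terminates itself when the step completes. The script location, instance types, and role names are hypothetical placeholders.

import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="daily-hive-batch",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,   # terminate once the step finishes
        "TerminationProtected": False,
    },
    Steps=[{
        "Name": "run-hive-script",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script",
                     "--args", "-f", "s3://example-scripts/daily_batch.q"],  # hypothetical path
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)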


A streaming application is reading data from Amazon Kinesis Data Streams and immediately writing the data to an Amazon S3 bucket every 10 seconds. The application is reading data from hundreds of shards. The batch interval cannot be changed due to a separate requirement. The data is being accessed by Amazon Athena. Users are seeing degradation in query performance as time progresses. Which action can help improve query performance?


A) Merge the files in Amazon S3 to form larger files.
B) Increase the number of shards in Kinesis Data Streams.
C) Add more memory and CPU capacity to the streaming application.
D) Write the files to multiple S3 buckets.


A company wants to provide its data analysts with uninterrupted access to the data in its Amazon Redshift cluster. All data is streamed to an Amazon S3 bucket with Amazon Kinesis Data Firehose. An AWS Glue job that is scheduled to run every 5 minutes issues a COPY command to move the data into Amazon Redshift. The amount of data delivered is uneven throughout the day, and cluster utilization is high during certain periods. The COPY command usually completes within a couple of seconds. However, when a load spike occurs, locks can exist and data can be missed. Currently, the AWS Glue job is configured to run without retries, with a timeout of 5 minutes and a concurrency of 1. How should a data analytics specialist configure the AWS Glue job to optimize fault tolerance and improve data availability in the Amazon Redshift cluster?


A) Increase the number of retries. Decrease the timeout value. Increase the job concurrency.
B) Keep the number of retries at 0. Decrease the timeout value. Increase the job concurrency.
C) Keep the number of retries at 0. Decrease the timeout value. Keep the job concurrency at 1.
D) Keep the number of retries at 0. Increase the timeout value. Keep the job concurrency at 1.
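
For reference, a minimal boto3 sketch showing where the three settings the options adjust (retries, timeout in minutes, and concurrent runs) live on a Glue job; the values, job name, role, and script path are hypothetical placeholders, shown here at job creation although an existing job can be updated with the same fields.

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="redshift-copy-loader",
    Role="arn:aws:iam::111122223333:role/GlueRedshiftCopyRole",       # hypothetical role
    Command={"Name": "glueetl",
             "ScriptLocation": "s3://example-glue-scripts/copy_to_redshift.py"},
    MaxRetries=2,                                  # retry transient COPY failures
    Timeout=3,                                     # minutes; fail fast within the 5-minute schedule
    ExecutionProperty={"MaxConcurrentRuns": 3},    # allow overlapping runs during spikes
)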


A central government organization is collecting events from various internal applications using Amazon Managed Streaming for Apache Kafka (Amazon MSK). The organization has configured a separate Kafka topic for each application to separate the data. For security reasons, the Kafka cluster has been configured to only allow TLS-encrypted data, and it encrypts the data at rest. A recent application update showed that one of the applications was configured incorrectly, resulting in writing data to a Kafka topic that belongs to another application. This resulted in multiple errors in the analytics pipeline as data from different applications appeared on the same topic. After this incident, the organization wants to prevent applications from writing to a topic other than the one they are intended to write to. Which solution meets these requirements with the least amount of effort?


A) Create a different Amazon EC2 security group for each application. Configure each security group to have access to a specific topic in the Amazon MSK cluster. Attach the security group to each application based on the topic that the applications should read and write to.
B) Install Kafka Connect on each application instance and configure each Kafka Connect instance to write to a specific topic only.
C) Use Kafka ACLs and configure read and write permissions for each topic. Use the distinguished name of the clients' TLS certificates as the principal of the ACL.
D) Create a different Amazon EC2 security group for each application. Create an Amazon MSK cluster and Kafka topic for each application. Configure each security group to have access to the specific cluster.
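
As a rough illustration of the per-topic ACL described in option C, the sketch below issues the standard kafka-acls command from Python on a Kafka client machine; the broker endpoint, TLS client properties file, certificate distinguished name, and topic name are all hypothetical placeholders.

import subprocess

subprocess.run([
    "kafka-acls.sh",
    "--bootstrap-server", "b-1.example-msk.amazonaws.com:9094",   # hypothetical broker
    "--command-config", "client-tls.properties",                  # TLS client settings
    "--add",
    "--allow-principal", "User:CN=billing-app,OU=Apps,O=ExampleGov,C=US",
    "--operation", "Write",
    "--topic", "billing-events",
], check=True)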


A healthcare company uses AWS data and analytics tools to collect, ingest, and store electronic health record (EHR) data about its patients. The raw EHR data is stored in Amazon S3 in JSON format partitioned by hour, day, and year and is updated every hour. The company wants to maintain the data catalog and metadata in an AWS Glue Data Catalog to be able to access the data using Amazon Athena or Amazon Redshift Spectrum for analytics. When defining tables in the Data Catalog, the company has the following requirements: Choose the catalog table name and do not rely on the catalog table naming algorithm. Keep the table updated with new partitions loaded in the respective S3 bucket prefixes. Which solution meets these requirements with minimal effort?


A) Run an AWS Glue crawler that connects to one or more data stores, determines the data structures, and writes tables in the Data Catalog.
B) Use the AWS Glue console to manually create a table in the Data Catalog and schedule an AWS Lambda function to update the table partitions hourly.
C) Use the AWS Glue API CreateTable operation to create a table in the Data Catalog. Create an AWS Glue crawler and specify the table as the source.
D) Create an Apache Hive catalog in Amazon EMR with the table schema definition in Amazon S3, and update the table partition with a scheduled job. Migrate the Hive catalog to the Data Catalog.
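
For context, a minimal boto3 sketch of the approach in option C: create the Data Catalog table with the exact name the company wants, then point a crawler at that existing catalog table so new hourly partitions get added. The database, table, columns, paths, role, and schedule are hypothetical placeholders.

import boto3

glue = boto3.client("glue")

glue.create_table(
    DatabaseName="ehr_db",
    TableInput={
        "Name": "patient_records",                 # company-chosen table name
        "TableType": "EXTERNAL_TABLE",
        "PartitionKeys": [
            {"Name": "year", "Type": "string"},
            {"Name": "day", "Type": "string"},
            {"Name": "hour", "Type": "string"},
        ],
        "StorageDescriptor": {
            "Columns": [{"Name": "patient_id", "Type": "string"},
                        {"Name": "record", "Type": "string"}],
            "Location": "s3://example-ehr-raw/patient_records/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {"SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"},
        },
    },
)

glue.create_crawler(
    Name="ehr-partition-crawler",
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",
    Targets={"CatalogTargets": [{"DatabaseName": "ehr_db",
                                 "Tables": ["patient_records"]}]},
    Schedule="cron(5 * * * ? *)",                  # hourly, to pick up new partitions
    SchemaChangePolicy={"UpdateBehavior": "UPDATE_IN_DATABASE",
                        "DeleteBehavior": "LOG"},
)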


A company uses Amazon Redshift as its data warehouse. A new table has columns that contain sensitive data. The data in the table will eventually be referenced by several existing queries that run many times a day. A data analyst needs to load 100 billion rows of data into the new table. Before doing so, the data analyst must ensure that only members of the auditing group can read the columns containing sensitive data. How can the data analyst meet these requirements with the lowest maintenance overhead?


A) Load all the data into the new table and grant the auditing group permission to read from the table. Load all the data except for the columns containing sensitive data into a second table. Grant the appropriate users read-only permissions to the second table.
B) Load all the data into the new table and grant the auditing group permission to read from the table. Use the GRANT SQL command to allow read-only access to a subset of columns to the appropriate users.
C) Load all the data into the new table and grant all users read-only permissions to non-sensitive columns. Attach an IAM policy to the auditing group with explicit ALLOW access to the sensitive data columns.
D) Load all the data into the new table and grant the auditing group permission to read from the table. Create a view of the new table that contains all the columns, except for those considered sensitive, and grant the appropriate users read-only permissions to the table.
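
For reference, a minimal boto3 sketch of the column-level GRANT approach in option B, issued through the Redshift Data API; the cluster, database, table, group, and column names are hypothetical placeholders.

import boto3

redshift_data = boto3.client("redshift-data")

# Non-sensitive columns for the users who already query the table.
redshift_data.execute_statement(
    ClusterIdentifier="example-dw-cluster",
    Database="analytics",
    DbUser="admin",
    Sql="GRANT SELECT (customer_id, order_total, order_date) "
        "ON TABLE sales_orders TO GROUP reporting_users;",
)

# Full table, including the sensitive columns, for the auditing group only.
redshift_data.execute_statement(
    ClusterIdentifier="example-dw-cluster",
    Database="analytics",
    DbUser="admin",
    Sql="GRANT SELECT ON TABLE sales_orders TO GROUP auditing;",
)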


A company's data analyst needs to ensure that queries executed in Amazon Athena cannot scan more than a prescribed amount of data for cost control purposes. Queries that exceed the prescribed threshold must be canceled immediately. What should the data analyst do to achieve this?


A) Configure Athena to invoke an AWS Lambda function that terminates queries when the prescribed threshold is crossed.
B) For each workgroup, set the control limit for each query to the prescribed threshold.
C) Enforce the prescribed threshold on all Amazon S3 bucket policies.
D) For each workgroup, set the workgroup-wide data usage control limit to the prescribed threshold.
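
For context, a minimal boto3 sketch of the per-query scan limit that the workgroup-based options refer to: queries in the workgroup that scan more than the configured number of bytes are cancelled. The workgroup name and threshold are hypothetical placeholders.

import boto3

athena = boto3.client("athena")

athena.update_work_group(
    WorkGroup="analysts",
    ConfigurationUpdates={
        # Cancel any query in this workgroup that scans more than ~10 GB.
        "BytesScannedCutoffPerQuery": 10 * 1024 ** 3,
    },
)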


A data analyst is designing a solution to interactively query datasets with SQL using a JDBC connection. Users will join data stored in Amazon S3 in Apache ORC format with data stored in Amazon Elasticsearch Service (Amazon ES) and Amazon Aurora MySQL. Which solution will provide the MOST up-to-date results?


A) Use AWS Glue jobs to ETL data from Amazon ES and Aurora MySQL to Amazon S3. Query the data with Amazon Athena.
B) Use Amazon DMS to stream data from Amazon ES and Aurora MySQL to Amazon Redshift. Query the data with Amazon Redshift.
C) Query all the datasets in place with Apache Spark SQL running on an AWS Glue developer endpoint.
D) Query all the datasets in place with Apache Presto running on Amazon EMR.


A banking company wants to collect large volumes of transactional data using Amazon Kinesis Data Streams for real-time analytics. The company uses PutRecord to send data to Amazon Kinesis, and has observed network outages during certain times of the day. The company wants to obtain exactly-once semantics for the entire processing pipeline. What should the company do to obtain these characteristics?


A) Design the application so it can remove duplicates during processing by embedding a unique ID in each record.
B) Rely on the processing semantics of Amazon Kinesis Data Analytics to avoid duplicate processing of events.
C) Design the data producer so events are not ingested into Kinesis Data Streams multiple times.
D) Rely on the exactly-once processing semantics of Apache Flink and Apache Spark Streaming included in Amazon EMR.
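
For context, a minimal boto3 sketch of the producer-side idea in option A: embed a stable unique ID in every record so downstream consumers can discard duplicates caused by retries during network outages. The stream name and payload fields are hypothetical placeholders.

import boto3
import json
import uuid

kinesis = boto3.client("kinesis")

def put_transaction(transaction):
    # Assign the ID once per logical transaction; retried calls with the same
    # dict reuse it, so consumers can deduplicate on event_id.
    event_id = transaction.setdefault("event_id", str(uuid.uuid4()))
    kinesis.put_record(
        StreamName="example-transactions",            # hypothetical stream
        Data=json.dumps(transaction).encode("utf-8"),
        PartitionKey=transaction["account_id"],        # hypothetical field
    )
    return event_id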


A software company hosts an application on AWS, and new features are released weekly. As part of the application testing process, a solution must be developed that analyzes logs from each Amazon EC2 instance to ensure that the application is working as expected after each deployment. The collection and analysis solution should be highly available with the ability to display new information with minimal delays. Which method should the company use to collect and analyze the logs?


A) Enable detailed monitoring on Amazon EC2, use Amazon CloudWatch agent to store logs in Amazon S3, and use Amazon Athena for fast, interactive log analytics.
B) Use the Amazon Kinesis Producer Library (KPL) agent on Amazon EC2 to collect and send data to Kinesis Data Streams to further push the data to Amazon Elasticsearch Service and visualize using Amazon QuickSight.
C) Use the Amazon Kinesis Producer Library (KPL) agent on Amazon EC2 to collect and send data to Kinesis Data Firehose to further push the data to Amazon Elasticsearch Service and Kibana.
D) Use Amazon CloudWatch subscriptions to get access to a real-time feed of logs and have the logs delivered to Amazon Kinesis Data Streams to further push the data to Amazon Elasticsearch Service and Kibana.
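
For reference, a minimal boto3 sketch of the CloudWatch Logs subscription in option D, which streams an application log group into a Kinesis data stream in near real time; the log group, stream ARN, and role ARN are hypothetical placeholders.

import boto3

logs = boto3.client("logs")

logs.put_subscription_filter(
    logGroupName="/app/example-web-tier",              # hypothetical log group
    filterName="all-app-logs-to-kinesis",
    filterPattern="",    # empty pattern forwards every log event
    destinationArn="arn:aws:kinesis:us-east-1:111122223333:stream/app-log-stream",
    roleArn="arn:aws:iam::111122223333:role/CWLtoKinesisRole",
)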


A manufacturing company uses Amazon Connect to manage its contact center and Salesforce to manage its customer relationship management (CRM) data. The data engineering team must build a pipeline to ingest data from the contact center and CRM system into a data lake that is built on Amazon S3. What is the MOST efficient way to collect data in the data lake with the LEAST operational overhead?


A) Use Amazon Kinesis Data Streams to ingest Amazon Connect data and Amazon AppFlow to ingest Salesforce data.
B) Use Amazon Kinesis Data Firehose to ingest Amazon Connect data and Amazon Kinesis Data Streams to ingest Salesforce data.
C) Use Amazon Kinesis Data Firehose to ingest Amazon Connect data and Amazon AppFlow to ingest Salesforce data.
D) Use Amazon AppFlow to ingest Amazon Connect data and Amazon Kinesis Data Firehose to ingest Salesforce data.

