AWS Certified Data Analytics - Specialty practice test

Last exam update: Dec 14, 2024
Page 1 of 11. Viewing questions 1-15 of 164

Question 1

A central government organization is collecting events from various internal applications using Amazon Managed Streaming
for Apache Kafka (Amazon MSK). The organization has configured a separate Kafka topic for each application to separate
the data. For security reasons, the Kafka cluster has been configured to only allow TLS encrypted data and it encrypts the
data at rest.
A recent application update showed that one of the applications was configured incorrectly, resulting in writing data to a
Kafka topic that belongs to another application. This resulted in multiple errors in the analytics pipeline as data from different
applications appeared on the same topic. After this incident, the organization wants to prevent each application from writing
to any topic other than its own.
Which solution meets these requirements with the least amount of effort?

  • A. Create a different Amazon EC2 security group for each application. Configure each security group to have access to a specific topic in the Amazon MSK cluster. Attach the security group to each application based on the topic that the applications should read and write to.
  • B. Install Kafka Connect on each application instance and configure each Kafka Connect instance to write to a specific topic only.
  • C. Use Kafka ACLs and configure read and write permissions for each topic. Use the distinguished name of the clients' TLS certificates as the principal of the ACL.
  • D. Create a different Amazon EC2 security group for each application. Create an Amazon MSK cluster and Kafka topic for each application. Configure each security group to have access to the specific cluster.
Answer:

C


Explanation:
Kafka ACLs provide topic-level authorization. Because the cluster already requires TLS, the distinguished name of each client's TLS certificate can be used as the ACL principal, granting every application write access only to its own topic. Security groups (A, D) control network access to the cluster, not per-topic permissions, and Kafka Connect (B) does not stop a misconfigured producer from writing to the wrong topic.
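For reference, the kind of topic-level restriction option C describes is applied with Kafka's ACL tooling. A minimal sketch in Python, assuming the standard Kafka CLI tools are available on an admin host and that the bootstrap endpoint, certificate distinguished name, topic name, and properties file below are placeholder values (on older cluster versions the ACL may need to be added through the ZooKeeper connection string instead):

    import subprocess

    # Allow only the "orders" application's TLS identity to write to its own topic.
    # All names below are hypothetical examples.
    subprocess.run(
        [
            "kafka-acls.sh",
            "--bootstrap-server", "b-1.example-msk.eu-west-1.amazonaws.com:9094",
            "--command-config", "client-tls.properties",  # TLS keystore/truststore settings
            "--add",
            "--allow-principal", "User:CN=orders-app.internal.example.com",
            "--operation", "Write",
            "--topic", "orders-events",
        ],
        check=True,
    )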

Community vote: B (1 vote)

Question 2

A data analytics specialist is building an automated ETL ingestion pipeline using AWS Glue to ingest compressed files that
have been uploaded to an Amazon S3 bucket. The ingestion pipeline should support incremental data processing.
Which AWS Glue feature should the data analytics specialist use to meet this requirement?

  • A. Workflows
  • B. Triggers
  • C. Job bookmarks
  • D. Classifiers
Answer:

C


Explanation:
AWS Glue job bookmarks persist state from previous job runs so that each run processes only data that has not been processed before.
Reference: https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/build-an-etl-service-pipeline-to-load-data-incrementally-from-amazon-s3-to-amazon-redshift-using-aws-glue.html
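As a rough illustration of how bookmarks show up in a job script, here is a minimal sketch of a Glue ETL job, assuming the job is started with --job-bookmark-option job-bookmark-enable and that the database, table, and output path are placeholder names:

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # transformation_ctx is the key Glue uses to remember which S3 objects
    # this job has already processed in earlier runs.
    source = glue_context.create_dynamic_frame.from_catalog(
        database="raw_db",
        table_name="uploaded_files",
        transformation_ctx="source",
    )

    glue_context.write_dynamic_frame.from_options(
        frame=source,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/processed/"},
        format="parquet",
        transformation_ctx="sink",
    )

    job.commit()  # persists the bookmark state for the next run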

Community votes: B (1 vote), C (1 vote)

Question 3

A transportation company uses IoT sensors attached to trucks to collect vehicle data for its global delivery fleet. The
company currently sends the sensor data in small .csv files to Amazon S3. The files are then loaded into a 10-node Amazon
Redshift cluster with two slices per node and queried using both Amazon Athena and Amazon Redshift. The company wants
to optimize the files to reduce the cost of querying and also improve the speed of data loading into the Amazon Redshift
cluster.
Which solution meets these requirements?

  • A. Use AWS Glue to convert all the files from .csv to a single large Apache Parquet file. COPY the file into Amazon Redshift and query the file with Athena from Amazon S3.
  • B. Use Amazon EMR to convert each .csv file to Apache Avro. COPY the files into Amazon Redshift and query the file with Athena from Amazon S3.
  • C. Use AWS Glue to convert the files from .csv to a single large Apache ORC file. COPY the file into Amazon Redshift and query the file with Athena from Amazon S3.
  • D. Use AWS Glue to convert the files from .csv to Apache Parquet to create 20 Parquet files. COPY the files into Amazon Redshift and query the files with Athena from Amazon S3.
Answer:

D
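For context, the cluster in the question has 10 nodes x 2 slices = 20 slices, which is why 20 similarly sized Parquet files let COPY load one file per slice in parallel. A minimal sketch of the conversion in a Glue or EMR Spark job, with the bucket, table, and IAM role names as placeholder values:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

    # Convert the many small .csv files into 20 Parquet files, one per Redshift slice.
    df = spark.read.csv("s3://example-bucket/raw-sensor-csv/", header=True, inferSchema=True)
    df.repartition(20).write.mode("overwrite").parquet("s3://example-bucket/sensor-parquet/")

    # Redshift can then load all 20 files in parallel with a single COPY
    # (run via any SQL client or the Redshift Data API):
    copy_sql = """
        COPY sensor_data
        FROM 's3://example-bucket/sensor-parquet/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        FORMAT AS PARQUET;
    """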

Community vote: D (1 vote)
Haseebsayeed (3 months, 3 weeks ago): D. Use AWS Glue to convert the files from .csv to Apache Parquet to create 20 Parquet files. COPY the files into Amazon Redshift and query the files with Athena from Amazon S3.


Question 4

A mobile gaming company wants to capture data from its gaming app and make the data available for analysis immediately.
The data record size will be approximately 20 KB. The company is concerned about achieving optimal throughput from each
device. Additionally, the company wants to develop a data stream processing application with dedicated throughput for each
consumer.
Which solution would achieve this goal?

  • A. Have the app call the PutRecords API to send data to Amazon Kinesis Data Streams. Use the enhanced fan-out feature while consuming the data.
  • B. Have the app call the PutRecordBatch API to send data to Amazon Kinesis Data Firehose. Submit a support case to enable dedicated throughput on the account.
  • C. Have the app use Amazon Kinesis Producer Library (KPL) to send data to Kinesis Data Firehose. Use the enhanced fan-out feature while consuming the data.
  • D. Have the app call the PutRecords API to send data to Amazon Kinesis Data Streams. Host the stream-processing application on Amazon EC2 with Auto Scaling.
Answer:

A


Explanation:
Calling PutRecords batches records for better per-device throughput, and registering each consumer for enhanced fan-out gives it dedicated read throughput (2 MB/s per shard per consumer). Kinesis Data Firehose does not offer dedicated per-consumer throughput, and option D does not address consumer throughput at all.
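A minimal boto3 sketch of the consumer side of enhanced fan-out, with the stream name, consumer name, and shard ID as placeholder values (the consumer must reach ACTIVE status before the subscription succeeds):

    import boto3

    kinesis = boto3.client("kinesis")

    stream_arn = kinesis.describe_stream_summary(StreamName="game-events")[
        "StreamDescriptionSummary"]["StreamARN"]

    # Register a dedicated-throughput (enhanced fan-out) consumer.
    consumer = kinesis.register_stream_consumer(
        StreamARN=stream_arn, ConsumerName="analytics-app"
    )["Consumer"]

    # Subscribe to a shard; records are pushed over an HTTP/2 event stream
    # at up to 2 MB/s per shard for this consumer alone.
    response = kinesis.subscribe_to_shard(
        ConsumerARN=consumer["ConsumerARN"],
        ShardId="shardId-000000000000",
        StartingPosition={"Type": "LATEST"},
    )
    for event in response["EventStream"]:
        records = event["SubscribeToShardEvent"]["Records"]
        # process the batch of records here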

Community vote: D (1 vote)

Question 5

A company has 10-15 TB of uncompressed .csv files in Amazon S3. The company is evaluating Amazon Athena as a one-time
query engine. The company wants to transform the data to optimize query runtime and storage costs.
Which option for data format and compression meets these requirements?

  • A. CSV compressed with zip
  • B. JSON compressed with bzip2
  • C. Apache Parquet compressed with Snappy
  • D. Apache Avro compressed with LZO
Answer:

C


Explanation:
Apache Parquet is columnar and splittable, and Snappy compression further reduces both the amount of data Athena scans per query and the storage footprint; zipped CSV and bzip2-compressed JSON are row-oriented, so whole records must still be read.
Reference: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
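One low-effort way to perform the conversion option C describes is an Athena CTAS statement that writes Parquet with Snappy compression. A minimal boto3 sketch, with the database, table, and S3 locations as placeholder names (older Athena engine versions use parquet_compression instead of write_compression):

    import boto3

    athena = boto3.client("athena")

    ctas = """
    CREATE TABLE analytics.events_parquet
    WITH (
        format = 'PARQUET',
        write_compression = 'SNAPPY',
        external_location = 's3://example-bucket/events-parquet/'
    ) AS
    SELECT * FROM analytics.events_csv
    """

    athena.start_query_execution(
        QueryString=ctas,
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-query-results/"},
    )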

Community votes: B (1 vote), C (1 vote)
Comment (11 months, 1 week ago): As per the link, "For Athena, we recommend using either Apache Parquet or Apache ORC, which compress data by default and are splittable."


Question 6

A hospital uses wearable medical sensor devices to collect data from patients. The hospital is architecting a near-real-time
solution that can ingest the data securely at scale. The solution should also be able to remove the patients' protected health
information (PHI) from the streaming data and store the data in durable storage.
Which solution meets these requirements with the least operational overhead?

  • A. Ingest the data using Amazon Kinesis Data Streams, which invokes an AWS Lambda function using Kinesis Client Library (KCL) to remove all PHI. Write the data in Amazon S3.
  • B. Ingest the data using Amazon Kinesis Data Firehose to write the data to Amazon S3. Have Amazon S3 trigger an AWS Lambda function that parses the sensor data to remove all PHI in Amazon S3.
  • C. Ingest the data using Amazon Kinesis Data Streams to write the data to Amazon S3. Have the data stream launch an AWS Lambda function that parses the sensor data and removes all PHI in Amazon S3.
  • D. Ingest the data using Amazon Kinesis Data Firehose to write the data to Amazon S3. Implement a transformation AWS Lambda function that parses the sensor data to remove all PHI.
Answer:

D


Explanation:
Kinesis Data Firehose can invoke a transformation AWS Lambda function on records in flight, so PHI is removed before the data is delivered to durable storage in Amazon S3, with no stream-processing application to operate.
Reference: https://aws.amazon.com/blogs/big-data/persist-streaming-data-to-amazon-s3-using-amazon-kinesis-firehose-and-aws-lambda/
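A minimal sketch of the transformation Lambda function Firehose would invoke, assuming the sensor records are JSON and that the field names treated as PHI below are placeholder examples:

    import base64
    import json

    # Field names treated as PHI here are hypothetical examples.
    PHI_FIELDS = {"patient_name", "ssn", "date_of_birth"}

    def handler(event, context):
        output = []
        for record in event["records"]:
            payload = json.loads(base64.b64decode(record["data"]))
            cleaned = {k: v for k, v in payload.items() if k not in PHI_FIELDS}
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",  # other valid results are "Dropped" and "ProcessingFailed"
                "data": base64.b64encode(
                    (json.dumps(cleaned) + "\n").encode("utf-8")
                ).decode("utf-8"),
            })
        return {"records": output}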


Question 7

A company is building a data lake and needs to ingest data from a relational database that has time-series data. The
company wants to use managed services to accomplish this. The process needs to be scheduled daily and bring incremental
data only from the source into Amazon S3.
What is the MOST cost-effective approach to meet these requirements?

  • A. Use AWS Glue to connect to the data source using JDBC Drivers. Ingest incremental records only using job bookmarks.
  • B. Use AWS Glue to connect to the data source using JDBC Drivers. Store the last updated key in an Amazon DynamoDB table and ingest the data using the updated key as a filter.
  • C. Use AWS Glue to connect to the data source using JDBC Drivers and ingest the entire dataset. Use appropriate Apache Spark libraries to compare the dataset, and find the delta.
  • D. Use AWS Glue to connect to the data source using JDBC Drivers and ingest the full data. Use AWS DataSync to ensure the delta only is written into Amazon S3.
Answer:

A


Explanation:
AWS Glue job bookmarks also work with JDBC sources: for time-series data with an ever-increasing key, each scheduled run captures only the newly added rows, with no extra components such as a DynamoDB tracking table to build and pay for.
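With a JDBC source, bookmarks track one or more monotonically increasing key columns rather than S3 objects. A sketch of the read side, assuming the same GlueContext/Job boilerplate as the Question 2 sketch, bookmarks enabled on the job, and placeholder database, table, column, and bucket names:

    # Inside a Glue ETL script (glue_context created as in the Question 2 sketch).
    incremental = glue_context.create_dynamic_frame.from_catalog(
        database="source_db",
        table_name="orders",                    # cataloged JDBC table
        transformation_ctx="incremental",       # bookmark state is keyed on this name
        additional_options={
            "jobBookmarkKeys": ["order_ts"],    # increasing time-series column
            "jobBookmarkKeysSortOrder": "asc",
        },
    )

    glue_context.write_dynamic_frame.from_options(
        frame=incremental,
        connection_type="s3",
        connection_options={"path": "s3://example-datalake/orders/"},
        format="parquet",
        transformation_ctx="sink",
    )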


Question 8

A company has a business unit uploading .csv files to an Amazon S3 bucket. The company's data platform team has set up
an AWS Glue crawler to do discovery and create tables and schemas. An AWS Glue job writes processed data from the
created tables to an Amazon Redshift database. The AWS Glue job handles column mapping and creating the Amazon
Redshift table appropriately. When the AWS Glue job is rerun for any reason in a day, duplicate records are introduced into
the Amazon Redshift table.
Which solution will update the Redshift table without duplicates when jobs are rerun?

  • A. Modify the AWS Glue job to copy the rows into a staging table. Add SQL commands to replace the existing rows in the main table as postactions in the DynamicFrameWriter class.
  • B. Load the previously inserted data into a MySQL database in the AWS Glue job. Perform an upsert operation in MySQL, and copy the results to the Amazon Redshift table.
  • C. Use Apache Spark’s DataFrame dropDuplicates() API to eliminate duplicates and then write the data to Amazon Redshift.
  • D. Use the AWS Glue ResolveChoice built-in transform to select the most recent value of the column.
Answer:

A


Explanation:
Loading each run into a staging table and issuing merge (delete-and-insert) SQL as postactions in the DynamicFrameWriter replaces the matching rows in the main table, so reruns do not create duplicates in Amazon Redshift.
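A sketch of the staging-table pattern from option A, using the preactions and postactions connection options of the Redshift writer; the table, key, connection, and bucket names are placeholders, and transformed_frame stands for the DynamicFrame produced earlier in the job:

    # Merge-by-key: load into a staging table, then replace matching rows in the
    # main table inside a single transaction via postactions.
    post_sql = """
        BEGIN;
        DELETE FROM public.events USING public.events_staging
            WHERE public.events.event_id = public.events_staging.event_id;
        INSERT INTO public.events SELECT * FROM public.events_staging;
        DROP TABLE public.events_staging;
        END;
    """

    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=transformed_frame,
        catalog_connection="redshift-connection",   # Glue connection name
        connection_options={
            "dbtable": "public.events_staging",
            "database": "analytics",
            "preactions": "CREATE TABLE IF NOT EXISTS public.events_staging (LIKE public.events);",
            "postactions": post_sql,
        },
        redshift_tmp_dir="s3://example-bucket/glue-temp/",
    )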


Question 9

A company uses Amazon Redshift as its data warehouse. A new table includes some columns that contain sensitive data
and some columns that contain non-sensitive data. The data in the table eventually will be referenced by several existing
queries that run many times each day.
A data analytics specialist must ensure that only members of the company's auditing team can read the columns that contain
sensitive data. All other users must have read-only access to the columns that contain non-sensitive data.
Which solution will meet these requirements with the LEAST operational overhead?

  • A. Grant the auditing team permission to read from the table. Load the columns that contain non-sensitive data into a second table. Grant the appropriate users read-only permissions to the second table.
  • B. Grant all users read-only permissions to the columns that contain non-sensitive data. Use the GRANT SELECT command to allow the auditing team to access the columns that contain sensitive data.
  • C. Grant all users read-only permissions to the columns that contain non-sensitive data. Attach an IAM policy to the auditing team with an explicit Allow action that grants access to the columns that contain sensitive data.
  • D. Grant the auditing team permission to read from the table. Create a view of the table that includes the columns that contain non-sensitive data. Grant the appropriate users read-only permissions to that view.
Answer:

D


Explanation:
Creating a view over the non-sensitive columns and granting read-only access to it requires no second table to load and maintain, so the existing queries keep working against data that is stored only once. The auditing team is simply granted SELECT on the underlying table, which includes the sensitive columns.
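The view-based approach needs only standard SQL grants. A minimal sketch using the Redshift Data API, with cluster, database, user, column, and group names as placeholder values:

    import boto3

    rsd = boto3.client("redshift-data")

    statements = [
        # View exposing only the non-sensitive columns.
        """CREATE VIEW public.customer_public AS
           SELECT customer_id, region, signup_date FROM public.customer;""",
        # Auditors read the full table; all other users read only the view.
        "GRANT SELECT ON public.customer TO GROUP auditors;",
        "GRANT SELECT ON public.customer_public TO GROUP analysts;",
    ]

    for sql in statements:
        rsd.execute_statement(
            ClusterIdentifier="analytics-cluster",
            Database="dev",
            DbUser="admin_user",
            Sql=sql,
        )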


Question 10

A banking company is currently using Amazon Redshift for sensitive data. An audit found that the current cluster is
unencrypted. Compliance requires that a database with sensitive data must be encrypted using a hardware security module
(HSM) with customer managed keys.
Which modifications are required in the cluster to ensure compliance?

  • A. Create a new HSM-encrypted Amazon Redshift cluster and migrate the data to the new cluster.
  • B. Modify the DB parameter group with the appropriate encryption settings and then restart the cluster.
  • C. Enable HSM encryption in Amazon Redshift using the command line.
  • D. Modify the Amazon Redshift cluster from the console and enable encryption using the HSM option.
Answer:

A


Explanation:
Amazon Redshift cannot switch an existing cluster to HSM-based encryption. To use a hardware security module with customer managed keys, you must create a new HSM-encrypted cluster and migrate the data to it.
Reference: https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-db-encryption.html
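A minimal boto3 sketch of creating the replacement HSM-encrypted cluster, assuming an HSM client certificate and HSM configuration have already been registered with Amazon Redshift, and with all identifiers, node types, and credentials below as placeholders (the data is then migrated, for example by unloading to Amazon S3 and reloading):

    import boto3

    redshift = boto3.client("redshift")

    redshift.create_cluster(
        ClusterIdentifier="banking-dw-encrypted",
        NodeType="ra3.4xlarge",
        NumberOfNodes=4,
        MasterUsername="admin_user",
        MasterUserPassword="Replace-With-A-Strong-Password-1",
        Encrypted=True,
        # Identifiers created beforehand with create_hsm_client_certificate /
        # create_hsm_configuration for the customer managed HSM device.
        HsmClientCertificateIdentifier="banking-hsm-cert",
        HsmConfigurationIdentifier="banking-hsm-config",
    )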


Question 11

A company has developed several AWS Glue jobs to validate and transform its data from Amazon S3 and load it into
Amazon RDS for MySQL in batches once every day. The ETL jobs read the S3 data using a DynamicFrame. Currently, the
ETL developers are experiencing challenges in processing only the incremental data on every run, as the AWS Glue job
processes all the S3 input data on each run.
Which approach would allow the developers to solve the issue with minimal coding effort?

  • A. Have the ETL jobs read the data from Amazon S3 using a DataFrame.
  • B. Enable job bookmarks on the AWS Glue jobs.
  • C. Create custom logic on the ETL jobs to track the processed S3 objects.
  • D. Have the ETL jobs delete the processed objects or data from Amazon S3 after each run.
Answer:

B


Explanation:
Enabling job bookmarks lets AWS Glue remember which Amazon S3 objects were processed in previous runs, so each run reads only the new data with no custom tracking logic and no need to delete source objects.
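Bookmarks are switched on per job through the --job-bookmark-option argument (the script's reads also need transformation_ctx values, as in the earlier sketches), so the change is a job configuration rather than new ETL code. A minimal boto3 sketch with a placeholder job name:

    import boto3

    glue = boto3.client("glue")

    # Run the existing job with bookmarks enabled so this run only sees S3 data
    # not processed by a previous successful run.
    glue.start_job_run(
        JobName="daily-s3-to-rds-etl",
        Arguments={"--job-bookmark-option": "job-bookmark-enable"},
    )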

Community vote: B (1 vote)

Question 12

A company wants to use an automatic machine learning (ML) Random Cut Forest (RCF) algorithm to visualize complex
real-world scenarios, such as detecting seasonality and trends, excluding outliers, and imputing missing values.
The team working on this project is non-technical and is looking for an out-of-the-box solution that will require the LEAST
amount of management overhead.
Which solution will meet these requirements?

  • A. Use an AWS Glue ML transform to create a forecast and then use Amazon QuickSight to visualize the data.
  • B. Use Amazon QuickSight to visualize the data and then use ML-powered forecasting to forecast the key business metrics.
  • C. Use a pre-built ML AMI from the AWS Marketplace to create forecasts and then use Amazon QuickSight to visualize the data.
  • D. Use calculated fields to create a new forecast and then use Amazon QuickSight to visualize the data.
Answer:

B


Explanation:
Amazon QuickSight ML Insights provides built-in, ML-powered forecasting based on the Random Cut Forest algorithm, including seasonality and trend detection, outlier exclusion, and missing-value imputation, so a non-technical team gets this capability out of the box with nothing to manage.
Reference: https://aws.amazon.com/blogs/big-data/query-visualize-and-forecast-trufactor-web-session-intelligence-with-aws-data-exchange/


Question 13

A manufacturing company has been collecting IoT sensor data from devices on its factory floor for a year and is storing the
data in Amazon Redshift for daily analysis. A data analyst has determined that, at an expected ingestion rate of about 2 TB
per day, the cluster will be undersized in less than 4 months. A long-term solution is needed. The data analyst has indicated
that most queries only reference the most recent 13 months of data, yet there are also quarterly reports that need to query all
the data generated from the past 7 years. The chief technology officer (CTO) is concerned about the costs, administrative
effort, and performance of a long-term solution.
Which solution should the data analyst use to meet these requirements?

  • A. Create a daily job in AWS Glue to UNLOAD records older than 13 months to Amazon S3 and delete those records from Amazon Redshift. Create an external table in Amazon Redshift to point to the S3 location. Use Amazon Redshift Spectrum to join to data that is older than 13 months.
  • B. Take a snapshot of the Amazon Redshift cluster. Restore the cluster to a new cluster using dense storage nodes with additional storage capacity.
  • C. Execute a CREATE TABLE AS SELECT (CTAS) statement to move records that are older than 13 months to quarterly partitioned data in Amazon Redshift Spectrum backed by Amazon S3.
  • D. Unload all the tables in Amazon Redshift to an Amazon S3 bucket using S3 Intelligent-Tiering. Use AWS Glue to crawl the S3 bucket location to create external tables in an AWS Glue Data Catalog. Create an Amazon EMR cluster using Auto Scaling for any daily analytics needs, and use Amazon Athena for the quarterly reports, with both using the same AWS Glue Data Catalog.
Answer:

A


Explanation:
Unloading records older than 13 months to Amazon S3 and exposing them through Redshift Spectrum keeps the cluster sized for the hot 13 months of data, while the quarterly reports can still join across all seven years, at lower cost and administrative effort than repeatedly resizing the cluster.
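A sketch of the key SQL the daily job in option A would run (for example through the Redshift Data API), with table, column, bucket, schema, and IAM role names as placeholder values:

    # Archive data older than 13 months to Amazon S3 as Parquet, then trim the cluster.
    unload_old_data = """
        UNLOAD ('SELECT * FROM public.sensor_readings
                 WHERE reading_ts < DATEADD(month, -13, CURRENT_DATE)')
        TO 's3://example-archive/sensor_readings/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
        FORMAT AS PARQUET;
    """

    delete_old_data = """
        DELETE FROM public.sensor_readings
        WHERE reading_ts < DATEADD(month, -13, CURRENT_DATE);
    """

    # One-time setup: an external schema so Redshift Spectrum can join the archived
    # data with the hot data for the quarterly reports.
    create_external_schema = """
        CREATE EXTERNAL SCHEMA IF NOT EXISTS archive
        FROM DATA CATALOG DATABASE 'sensor_archive'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role';
    """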


Question 14

A company launched a service that produces millions of messages every day and uses Amazon Kinesis Data Streams as
the streaming service.
The company uses the Kinesis SDK to write data to Kinesis Data Streams. A few months after launch, a data analyst found
that write performance is significantly reduced. The data analyst investigated the metrics and determined that Kinesis is
throttling the write requests. The data analyst wants to address this issue without significant changes to the architecture.
Which actions should the data analyst take to resolve this issue? (Choose two.)

  • A. Increase the Kinesis Data Streams retention period to reduce throttling.
  • B. Replace the Kinesis API-based data ingestion mechanism with Kinesis Agent.
  • C. Increase the number of shards in the stream using the UpdateShardCount API.
  • D. Choose partition keys in a way that results in a uniform record distribution across shards.
  • E. Customize the application code to include retry logic to improve performance.
Answer:

C D


Explanation:
Throttling is addressed by adding shards to raise the stream's write capacity (UpdateShardCount) and by choosing high-cardinality partition keys so records are distributed evenly instead of overloading a few hot shards. The retention period has no effect on write throughput.
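A minimal boto3 sketch of the two fixes, with the stream name and target shard count as placeholder values:

    import json
    import uuid

    import boto3

    kinesis = boto3.client("kinesis")

    # C: add shards to raise the stream's aggregate write capacity.
    kinesis.update_shard_count(
        StreamName="service-events",
        TargetShardCount=16,
        ScalingType="UNIFORM_SCALING",
    )

    # D: a high-cardinality partition key spreads records evenly across shards
    # instead of concentrating writes on a few hot shards.
    kinesis.put_record(
        StreamName="service-events",
        Data=json.dumps({"event": "example"}).encode("utf-8"),
        PartitionKey=str(uuid.uuid4()),
    )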


Question 15

A company needs to store objects containing log data in JSON format. The objects are generated by eight applications
running in AWS. Six of the applications generate a total of 500 KiB of data per second, and two of the applications can
generate up to 2 MiB of data per second.
A data engineer wants to implement a scalable solution to capture and store usage data in an Amazon S3 bucket. The
usage data objects need to be reformatted, converted to .csv format, and then compressed before they are stored in
Amazon S3. The company requires the solution to include the least custom code possible and has authorized the data
engineer to request a service quota increase if needed.
Which solution meets these requirements?

  • A. Configure an Amazon Kinesis Data Firehose delivery stream for each application. Write AWS Lambda functions to read log data objects from the stream for each application. Have the function perform reformatting and .csv conversion. Enable compression on all the delivery streams.
  • B. Configure an Amazon Kinesis data stream with one shard per application. Write an AWS Lambda function to read usage data objects from the shards. Have the function perform .csv conversion, reformatting, and compression of the data. Have the function store the output in Amazon S3.
  • C. Configure an Amazon Kinesis data stream for each application. Write an AWS Lambda function to read usage data objects from the stream for each application. Have the function perform .csv conversion, reformatting, and compression of the data. Have the function store the output in Amazon S3.
  • D. Store usage data objects in an Amazon DynamoDB table. Configure a DynamoDB stream to copy the objects to an S3 bucket. Configure an AWS Lambda function to be triggered when objects are written to the S3 bucket. Have the function convert the objects into .csv format.
Answer:

A


Explanation:
Kinesis Data Firehose scales without shard management, supports record transformation through an AWS Lambda function for the reformatting and .csv conversion, and compresses the output natively before delivery to Amazon S3, which keeps custom code to a minimum. The per-delivery-stream ingestion quota can be raised for the two applications that produce up to 2 MiB of data per second.
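A minimal boto3 sketch of one such delivery stream with the built-in Lambda transformation and compression enabled; the role ARNs, bucket, Lambda function, and names are placeholder values, and one stream would be created per application:

    import boto3

    firehose = boto3.client("firehose")

    firehose.create_delivery_stream(
        DeliveryStreamName="app1-usage-logs",
        DeliveryStreamType="DirectPut",
        ExtendedS3DestinationConfiguration={
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
            "BucketARN": "arn:aws:s3:::example-usage-bucket",
            "Prefix": "app1/",
            "CompressionFormat": "GZIP",  # compression handled by Firehose itself
            "ProcessingConfiguration": {  # Lambda that reformats JSON records to .csv
                "Enabled": True,
                "Processors": [
                    {
                        "Type": "Lambda",
                        "Parameters": [
                            {
                                "ParameterName": "LambdaArn",
                                "ParameterValue": "arn:aws:lambda:us-east-1:123456789012:function:json-to-csv",
                            }
                        ],
                    }
                ],
            },
        },
    )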
