A central government organization is collecting events from various internal applications using Amazon Managed Streaming
for Apache Kafka (Amazon MSK). The organization has configured a separate Kafka topic for each application to separate
the data. For security reasons, the Kafka cluster has been configured to allow only TLS-encrypted data in transit and to encrypt the
data at rest.
A recent application update revealed that one of the applications was configured incorrectly, causing it to write data to a
Kafka topic that belongs to another application. This resulted in multiple errors in the analytics pipeline because data from different
applications appeared on the same topic. After this incident, the organization wants to prevent applications from writing to any
topic other than their own.
Which solution meets these requirements with the least amount of effort?
B
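Answer B here generally corresponds to enforcing authorization on the cluster, for example by issuing each application its own TLS client certificate and defining Apache Kafka ACLs so a producer can write only to its own topic. As an illustrative sketch (the broker endpoint, certificate paths, principal, and topic name are hypothetical), such an ACL could be created with the kafka-python admin client:

    from kafka.admin import (KafkaAdminClient, ACL, ACLOperation,
                             ACLPermissionType, ResourcePattern, ResourceType)

    # Connect to the MSK cluster over TLS (endpoint and file paths are placeholders).
    admin = KafkaAdminClient(
        bootstrap_servers="b-1.example-msk.amazonaws.com:9094",
        security_protocol="SSL",
        ssl_cafile="ca.pem",
        ssl_certfile="admin-cert.pem",
        ssl_keyfile="admin-key.pem",
    )

    # Allow only the principal tied to application A's client certificate
    # to write to application A's topic.
    acl = ACL(
        principal="User:CN=application-a",
        host="*",
        operation=ACLOperation.WRITE,
        permission_type=ACLPermissionType.ALLOW,
        resource_pattern=ResourcePattern(ResourceType.TOPIC, "application-a-events"),
    )
    admin.create_acls([acl])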
A data analytics specialist is building an automated ETL ingestion pipeline using AWS Glue to ingest compressed files that
have been uploaded to an Amazon S3 bucket. The ingestion pipeline should support incremental data processing.
Which AWS Glue feature should the data analytics specialist use to meet this requirement?
B
Explanation:
Reference: https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/build-an-etl-service-pipeline-to-load-data-incrementally-from-amazon-s3-to-amazon-redshift-using-aws-glue.html
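The Glue feature that enables incremental processing is job bookmarks, which track which S3 objects previous runs have already processed. A minimal sketch (the job name is hypothetical) of enabling bookmarks when starting the job with boto3:

    import boto3

    glue = boto3.client("glue")

    # Enable job bookmarks so the job only picks up S3 objects that have
    # not been processed by a previous run.
    glue.start_job_run(
        JobName="ingest-compressed-files",  # hypothetical job name
        Arguments={"--job-bookmark-option": "job-bookmark-enable"},
    )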
A transportation company uses IoT sensors attached to trucks to collect vehicle data for its global delivery fleet. The
company currently sends the sensor data in small .csv files to Amazon S3. The files are then loaded into a 10-node Amazon
Redshift cluster with two slices per node and queried using both Amazon Athena and Amazon Redshift. The company wants
to optimize the files to reduce the cost of querying and also improve the speed of data loading into the Amazon Redshift
cluster.
Which solution meets these requirements?
D
D. Use AWS Glue to convert the files from .csv to Apache Parquet to create 20 Parquet files. COPY the files into Amazon Redshift and query the files with Athena from Amazon S3.
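The reasoning behind answer D is that columnar Parquet reduces the data Athena scans, and writing exactly 20 files lets COPY load one file per slice in parallel on the 10-node, two-slices-per-node cluster. A rough PySpark sketch of the conversion step (bucket paths are hypothetical; in a Glue job this would typically run through a DynamicFrame):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

    # Read the small .csv sensor files and repartition them into 20 files,
    # matching the 20 slices (10 nodes x 2 slices) of the Redshift cluster.
    df = spark.read.csv("s3://example-sensor-bucket/raw/", header=True)
    df.repartition(20).write.mode("overwrite").parquet(
        "s3://example-sensor-bucket/parquet/"
    )

The resulting files can then be loaded in parallel with a single COPY ... FORMAT AS PARQUET statement and queried in place by Athena.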
A mobile gaming company wants to capture data from its gaming app and make the data available for analysis immediately.
The data record size will be approximately 20 KB. The company is concerned about achieving optimal throughput from each
device. Additionally, the company wants to develop a data stream processing application with dedicated throughput for each
consumer.
Which solution would achieve this goal?
D
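Answer D is generally understood to involve Kinesis Data Streams with enhanced fan-out, which gives each registered consumer its own 2 MB/s of dedicated read throughput per shard, while the KPL batches the roughly 20 KB records on the producer side. A minimal sketch (the stream ARN and consumer name are hypothetical) of registering an enhanced fan-out consumer with boto3:

    import boto3

    kinesis = boto3.client("kinesis")

    # Registering a consumer for enhanced fan-out gives it dedicated
    # throughput instead of sharing the per-shard read limit.
    response = kinesis.register_stream_consumer(
        StreamARN="arn:aws:kinesis:us-east-1:123456789012:stream/game-events",
        ConsumerName="analytics-app",
    )
    print(response["Consumer"]["ConsumerARN"])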
A company has 10-15 TB of uncompressed .csv files in Amazon S3. The company is evaluating Amazon Athena as a one-time query engine. The company wants to transform the data to optimize query runtime and storage costs.
Which option for data format and compression meets these requirements?
B
Explanation:
Reference: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
As per the linked guidance: for Athena, we recommend using either Apache Parquet or Apache ORC, which compress data by default and are splittable.
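One way to perform that transformation with no extra infrastructure is an Athena CTAS query that rewrites the .csv data as compressed Parquet. A sketch (the database, table names, and S3 locations are hypothetical) using boto3:

    import boto3

    athena = boto3.client("athena")

    # CTAS rewrites the raw .csv table as Snappy-compressed Parquet,
    # which is splittable and far cheaper for Athena to scan.
    ctas = """
        CREATE TABLE analytics.events_parquet
        WITH (
            format = 'PARQUET',
            parquet_compression = 'SNAPPY',
            external_location = 's3://example-bucket/events-parquet/'
        ) AS
        SELECT * FROM analytics.events_csv
    """
    athena.start_query_execution(
        QueryString=ctas,
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    )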
A hospital uses wearable medical sensor devices to collect data from patients. The hospital is architecting a near-real-time
solution that can ingest the data securely at scale. The solution should also be able to remove the patients' protected health
information (PHI) from the streaming data and store the data in durable storage.
Which solution meets these requirements with the least operational overhead?
C
Explanation:
Reference: https://aws.amazon.com/blogs/big-data/persist-streaming-data-to-amazon-s3-using-amazon-kinesis-firehose-and-aws-lambda/
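The low-overhead pattern behind answer C is typically Amazon Kinesis Data Firehose with an AWS Lambda transformation that strips the PHI before records are delivered to Amazon S3. A simplified sketch of such a transformation handler (the PHI field names are hypothetical and depend on the actual record schema):

    import base64
    import json

    # Fields treated as PHI in this sketch; the real list depends on the schema.
    PHI_FIELDS = {"patient_name", "date_of_birth", "ssn"}

    def lambda_handler(event, context):
        output = []
        for record in event["records"]:
            payload = json.loads(base64.b64decode(record["data"]))
            # Drop the PHI attributes before the record reaches S3.
            cleaned = {k: v for k, v in payload.items() if k not in PHI_FIELDS}
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(
                    (json.dumps(cleaned) + "\n").encode("utf-8")
                ).decode("utf-8"),
            })
        return {"records": output}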
A company is building a data lake and needs to ingest data from a relational database that has time-series data. The
company wants to use managed services to accomplish this. The process needs to be scheduled daily and bring incremental
data only from the source into Amazon S3.
What is the MOST cost-effective approach to meet these requirements?
B
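The answer usually cited for this scenario is a daily-scheduled AWS Glue job that reads the source table over a JDBC connection with job bookmarks enabled, so only new rows land in Amazon S3 on each run. A sketch (the trigger name, job name, and cron expression are hypothetical) of creating the daily schedule with boto3:

    import boto3

    glue = boto3.client("glue")

    # Run the incremental extract job once a day at 02:00 UTC.
    glue.create_trigger(
        Name="daily-timeseries-ingest",
        Type="SCHEDULED",
        Schedule="cron(0 2 * * ? *)",
        Actions=[{
            "JobName": "rdbms-to-s3-incremental",  # hypothetical Glue job
            "Arguments": {"--job-bookmark-option": "job-bookmark-enable"},
        }],
        StartOnCreation=True,
    )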
A company has a business unit uploading .csv files to an Amazon S3 bucket. The company's data platform team has set up
an AWS Glue crawler to perform discovery and to create tables and schemas. An AWS Glue job writes processed data from the
created tables to an Amazon Redshift database. The AWS Glue job handles column mapping and creating the Amazon
Redshift table appropriately. When the AWS Glue job is rerun for any reason in a day, duplicate records are introduced into
the Amazon Redshift table.
Which solution will update the Redshift table without duplicates when jobs are rerun?
B
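A common way to make reruns idempotent is to load into a staging table and merge into the target within one transaction, which the Glue Redshift writer supports through preactions/postactions SQL. A rough sketch (catalog, connection, table, and bucket names are hypothetical) of the write step in a Glue PySpark job:

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Source DynamicFrame; in the real job this is the transformed data.
    processed = glue_context.create_dynamic_frame.from_catalog(
        database="csv_catalog", table_name="uploads", transformation_ctx="src"
    )

    # Merge staging into the target so reruns do not create duplicates.
    post_actions = """
        BEGIN;
        DELETE FROM public.target USING public.staging
            WHERE public.target.id = public.staging.id;
        INSERT INTO public.target SELECT * FROM public.staging;
        DROP TABLE public.staging;
        END;
    """
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=processed,
        catalog_connection="redshift-connection",
        connection_options={
            "database": "dev",
            "dbtable": "public.staging",
            "postactions": post_actions,
        },
        redshift_tmp_dir="s3://example-temp-bucket/glue/",
    )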
A company uses Amazon Redshift as its data warehouse. A new table includes some columns that contain sensitive data
and some columns that contain non-sensitive data. The data in the table eventually will be referenced by several existing
queries that run many times each day.
A data analytics specialist must ensure that only members of the company's auditing team can read the columns that contain
sensitive data. All other users must have read-only access to the columns that contain non-sensitive data.
Which solution will meet these requirements with the LEAST operational overhead?
D
Explanation:
Users with SELECT permission on a table can view the table data. Columns that are defined as masked will display the
masked data. Grant the UNMASK permission to a user to enable them to retrieve unmasked data from the columns for
which masking is defined.
Reference: https://docs.microsoft.com/en-us/sql/relational-databases/security/dynamic-data-masking?view=sql-server-ver15
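The explanation above describes dynamic data masking semantics in general terms; in Amazon Redshift itself, the usual low-overhead mechanism for this requirement is column-level access control with GRANT (Redshift also offers native dynamic data masking). As an illustrative sketch (the cluster, database, table, column, and group names are hypothetical) using the Redshift Data API:

    import boto3

    redshift_data = boto3.client("redshift-data")

    statements = [
        # Auditors may read every column, including the sensitive ones.
        "GRANT SELECT ON sales.customer_events TO GROUP auditors;",
        # Everyone else is limited to the non-sensitive columns.
        "GRANT SELECT (event_id, event_type, event_ts) "
        "ON sales.customer_events TO GROUP analysts;",
    ]
    redshift_data.batch_execute_statement(
        ClusterIdentifier="example-cluster",
        Database="dev",
        DbUser="admin",
        Sqls=statements,
    )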
A banking company is currently using Amazon Redshift for sensitive data. An audit found that the current cluster is
unencrypted. Compliance requires that a database with sensitive data must be encrypted using a hardware security module
(HSM) with customer managed keys.
Which modifications are required in the cluster to ensure compliance?
A
Explanation:
When you modify your cluster to enable AWS KMS encryption, Amazon Redshift automatically migrates your data to a new
encrypted cluster. To encrypt a cluster with a hardware security module (HSM), however, you must create a new HSM-encrypted cluster and migrate the data to it.
Reference: https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-db-encryption.html
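For HSM-based encryption specifically, the new cluster is created with the HSM client certificate and HSM configuration attached, and the data is then migrated into it. A minimal sketch (all identifiers and the password are placeholders) of provisioning such a cluster with boto3:

    import boto3

    redshift = boto3.client("redshift")

    # Create a new cluster whose encryption keys are managed in the HSM;
    # data from the unencrypted cluster is then unloaded/copied into it.
    redshift.create_cluster(
        ClusterIdentifier="banking-dw-encrypted",
        NodeType="ra3.4xlarge",
        MasterUsername="admin",
        MasterUserPassword="ExamplePassw0rd!",  # placeholder only
        NumberOfNodes=4,
        Encrypted=True,
        HsmClientCertificateIdentifier="example-hsm-client-cert",
        HsmConfigurationIdentifier="example-hsm-config",
    )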
A company has developed several AWS Glue jobs to validate and transform its data from Amazon S3 and load it into
Amazon RDS for MySQL in batches once every day. The ETL jobs read the S3 data using a DynamicFrame. Currently, the
ETL developers are experiencing challenges in processing only the incremental data on every run, as the AWS Glue job
processes all the S3 input data on each run.
Which approach would allow the developers to solve the issue with minimal coding effort?
D
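The minimal-coding answer here is again job bookmarks: with bookmarks enabled on the job, adding a transformation_ctx to the S3 source and committing the job is enough for each run to read only new objects. A sketch of the relevant parts of the Glue script (the bucket path and context names are hypothetical):

    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # The transformation_ctx lets the bookmark remember which S3 objects
    # this source has already processed.
    source = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://example-input-bucket/daily/"]},
        format="csv",
        format_options={"withHeader": True},
        transformation_ctx="s3_source",
    )

    # ... validate, transform, and write to Amazon RDS for MySQL here ...

    # Committing advances the bookmark so the next run skips these objects.
    job.commit()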
A company wants to use an automatic machine learning (ML) Random Cut Forest (RCF) algorithm to visualize complex real-
world scenarios, such as detecting seasonality and trends, excluding outliers, and imputing missing values.
The team working on this project is non-technical and is looking for an out-of-the-box solution that will require the LEAST
amount of management overhead.
Which solution will meet these requirements?
A
Explanation:
Reference: https://aws.amazon.com/blogs/big-data/query-visualize-and-forecast-trufactor-web-session-intelligence-with-aws-data-exchange/
A manufacturing company has been collecting IoT sensor data from devices on its factory floor for a year and is storing the
data in Amazon Redshift for daily analysis. A data analyst has determined that, at an expected ingestion rate of about 2 TB
per day, the cluster will be undersized in less than 4 months. A long-term solution is needed. The data analyst has indicated
that most queries only reference the most recent 13 months of data, yet there are also quarterly reports that need to query all
the data generated from the past 7 years. The chief technology officer (CTO) is concerned about the costs, administrative
effort, and performance of a long-term solution.
Which solution should the data analyst use to meet these requirements?
B
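The long-term pattern usually given as the answer is to keep roughly the most recent 13 months in the cluster and move older data to Amazon S3, querying it through Amazon Redshift Spectrum for the quarterly reports. A sketch of the two SQL steps (schema, table, bucket, role, and cluster names are hypothetical) issued through the Redshift Data API:

    import boto3

    redshift_data = boto3.client("redshift-data")

    statements = [
        # Offload historical rows to S3 as Parquet.
        """UNLOAD ('SELECT * FROM sensors.readings
                    WHERE reading_ts < DATEADD(month, -13, GETDATE())')
           TO 's3://example-archive-bucket/readings/'
           IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
           FORMAT AS PARQUET;""",
        # Expose the archived data to the cluster via Redshift Spectrum.
        """CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
           FROM DATA CATALOG DATABASE 'sensor_archive'
           IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
           CREATE EXTERNAL DATABASE IF NOT EXISTS;""",
    ]
    redshift_data.batch_execute_statement(
        ClusterIdentifier="example-cluster",
        Database="dev",
        DbUser="admin",
        Sqls=statements,
    )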
A company launched a service that produces millions of messages every day and uses Amazon Kinesis Data Streams as
the streaming service.
The company uses the Kinesis SDK to write data to Kinesis Data Streams. A few months after launch, a data analyst found
that write performance is significantly reduced. The data analyst investigated the metrics and determined that Kinesis is
throttling the write requests. The data analyst wants to address this issue without significant changes to the architecture.
Which actions should the data analyst take to resolve this issue? (Choose two.)
A C
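Answers A and C generally correspond to increasing the shard count and distributing records more evenly across shards via the partition key, which together raise the aggregate write limit (1 MB/s or 1,000 records/s per shard) and avoid hot shards. A sketch of both with boto3 (the stream name and target shard count are hypothetical):

    import json
    import uuid
    import boto3

    kinesis = boto3.client("kinesis")

    # 1. Raise the aggregate write capacity by adding shards.
    kinesis.update_shard_count(
        StreamName="service-messages",
        TargetShardCount=64,
        ScalingType="UNIFORM_SCALING",
    )

    # 2. Use a high-cardinality partition key so writes spread evenly
    #    across shards instead of throttling on a hot shard.
    kinesis.put_record(
        StreamName="service-messages",
        Data=json.dumps({"event": "example"}).encode("utf-8"),
        PartitionKey=str(uuid.uuid4()),
    )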
A company needs to store objects containing log data in JSON format. The objects are generated by eight applications
running in AWS. Six of the applications generate a total of 500 KiB of data per second, and two of the applications can
generate up to 2 MiB of data per second.
A data engineer wants to implement a scalable solution to capture and store usage data in an Amazon S3 bucket. The
usage data objects need to be reformatted, converted to .csv format, and then compressed before they are stored in
Amazon S3. The company requires the solution to include the least custom code possible and has authorized the data
engineer to request a service quota increase if needed.
Which solution meets these requirements?
B
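The least-custom-code option is generally a Kinesis Data Firehose delivery stream whose Lambda transformation reformats the JSON records to .csv, with Firehose compressing the output before it lands in S3 (and a throughput quota increase requested if needed). A sketch of creating such a stream with boto3 (all ARNs and names are hypothetical):

    import boto3

    firehose = boto3.client("firehose")

    firehose.create_delivery_stream(
        DeliveryStreamName="usage-logs-to-s3",
        DeliveryStreamType="DirectPut",
        ExtendedS3DestinationConfiguration={
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
            "BucketARN": "arn:aws:s3:::example-usage-bucket",
            # Firehose compresses the transformed .csv output before delivery.
            "CompressionFormat": "GZIP",
            "ProcessingConfiguration": {
                "Enabled": True,
                "Processors": [{
                    "Type": "Lambda",
                    "Parameters": [{
                        "ParameterName": "LambdaArn",
                        "ParameterValue": (
                            "arn:aws:lambda:us-east-1:123456789012:"
                            "function:json-to-csv"
                        ),
                    }],
                }],
            },
        },
    )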