Databricks Certified Data Engineer Professional Practice Test

Last exam update: Oct 11, 2024
Questions 1-10 of 110

Question 1

Which statement describes Delta Lake optimized writes?

  • A. Before a Jobs cluster terminates, OPTIMIZE is executed on all tables modified during the most recent job.
  • B. An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an OPTIMIZE job is executed toward a default of 1 GB.
  • C. Data is queued in a messaging bus instead of committing data directly to memory; all data is committed from the messaging bus in one batch once the job is complete.
  • D. Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.
  • E. A shuffle occurs prior to writing to try to group similar data together resulting in fewer files instead of each executor writing multiple files based on directory partitions.
Answer: E
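Note: optimized writes add a shuffle before the write so that data destined for the same partition lands on fewer tasks, producing fewer, larger files. As a brief illustration, the session config and table property below are the documented switches; the table name is hypothetical, and `spark` is the SparkSession provided in a Databricks notebook:

    # Enable optimized writes for the current session:
    spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")

    # Or enable it as a property on an existing Delta table ("sales" is illustrative):
    spark.sql("ALTER TABLE sales SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)")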


Question 2

Assuming that the Databricks CLI has been installed and configured correctly, which Databricks CLI command can be used to upload a custom Python Wheel to object storage mounted with the DBFS for use with a production job?

  • A. configure
  • B. fs
  • C. jobs
  • D. libraries
  • E. workspace
Answer: B
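Note: the `fs` command group is what copies local files into DBFS-backed storage, for example `databricks fs cp dist/my_package-1.0-py3-none-any.whl dbfs:/mnt/prod/wheels/` (wheel name and target path are hypothetical). The `jobs` group manages job definitions and runs, and `libraries` manages cluster-installed libraries; neither uploads files.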


Question 3

A nightly batch job is configured to ingest all data files from a cloud object storage container where records are stored in a nested directory structure YYYY/MM/DD. The data for each date represents all records that were processed by the source system on that date, noting that some records may be delayed as they await moderator approval. Each entry represents a user review of a product and has the following schema:

user_id STRING, review_id BIGINT, product_id BIGINT, review_timestamp TIMESTAMP, review_text STRING

The ingestion job is configured to append all data for the previous date to a target table reviews_raw with an identical schema to the source system. The next step in the pipeline is a batch write to propagate all new records inserted into reviews_raw to a table where data is fully deduplicated, validated, and enriched.

Which solution minimizes the compute costs to propagate this batch of data?

  • A. Perform a batch read on the reviews_raw table and perform an insert-only merge using the natural composite key user_id, review_id, product_id, review_timestamp.
  • B. Configure a Structured Streaming read against the reviews_raw table using the trigger once execution mode to process new records as a batch job.
  • C. Use Delta Lake version history to get the difference between the latest version of reviews_raw and one version prior, then write these records to the next table.
  • D. Filter all records in the reviews_raw table based on the review_timestamp; batch append those records produced in the last 48 hours.
  • E. Reprocess all records in reviews_raw and overwrite the next table in the pipeline.
Answer: B
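Note: a Structured Streaming read against a Delta table tracks already-processed data in its checkpoint, so each trigger-once run touches only newly appended records. A minimal sketch of the pattern in option B (table names and checkpoint path are illustrative; `spark` is the notebook SparkSession):

    (spark.readStream
         .table("reviews_raw")                                    # picks up only new appends
         .dropDuplicates(["user_id", "review_id"])                # simplistic in-stream dedup, shown for illustration
         .writeStream
         .option("checkpointLocation", "/mnt/checkpoints/reviews_clean")
         .trigger(once=True)                                      # process available data as a batch, then stop
         .toTable("reviews_clean"))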


Question 4

A developer has successfully configured their credentials for Databricks Repos and cloned a remote Git repository. They do not have privileges to make changes to the main branch, which is the only branch currently visible in their workspace.

Which approach allows this user to share their code updates without the risk of overwriting the work of their teammates?

  • A. Use Repos to checkout all changes and send the git diff log to the team.
  • B. Use Repos to create a fork of the remote repository, commit all changes, and make a pull request on the source repository.
  • C. Use Repos to pull changes from the remote Git repository; commit and push changes to a branch that appeared as changes were pulled.
  • D. Use Repos to merge all differences and make a pull request back to the remote repository.
  • E. Use Repos to create a new branch, commit all changes, and push changes to the remote Git repository.
Answer: E
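Note: in the Repos UI this corresponds to creating a new branch from the branch selector, committing, and pushing; the equivalent Git operations are `git checkout -b <new-branch>`, `git commit`, and `git push -u origin <new-branch>`, after which the user can open a pull request against main without touching their teammates' work.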


Question 5

A junior data engineer seeks to leverage Delta Lake's Change Data Feed functionality to create a Type 1 table representing all of the values that have ever been valid for all rows in a bronze table created with the property delta.enableChangeDataFeed = true. They plan to execute the following code as a daily job:

Which statement describes the execution and results of running the above query multiple times?

  • A. Each time the job is executed, newly updated records will be merged into the target table, overwriting previous values with the same primary keys.
  • B. Each time the job is executed, the entire available history of inserted or updated records will be appended to the target table, resulting in many duplicate entries.
  • C. Each time the job is executed, the target table will be overwritten using the entire history of inserted or updated records, giving the desired result.
  • D. Each time the job is executed, the differences between the original and current versions are calculated; this may result in duplicate entries for some records.
  • E. Each time the job is executed, only those records that have been inserted or updated since the last execution will be appended to the target table, giving the desired result.
Answer: B
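Note: the code block referenced in the question is not reproduced in this dump. A sketch consistent with the keyed answer, i.e. a daily job that reads the full change feed from version 0 and appends it to the target, might look like the following (table names are hypothetical; `spark` is the notebook SparkSession):

    # The bronze table was created with delta.enableChangeDataFeed = true.
    (spark.read
         .option("readChangeFeed", "true")
         .option("startingVersion", 0)             # always reads the entire available history
         .table("bronze")
         .filter("_change_type IN ('insert', 'update_postimage')")
         .write
         .mode("append")                           # appends that full history on every run
         .saveAsTable("bronze_history"))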


Question 6

A table is registered with the following code:

Both users and orders are Delta Lake tables. Which statement describes the results of querying recent_orders?

  • A. All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.
  • B. All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried.
  • C. Results will be computed and cached when the table is defined; these cached results will incrementally update as new records are inserted into source tables.
  • D. All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.
  • E. The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.
Answer: D
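Note: the registration code is not shown here. The behavior in the keyed answer matches a standard (non-materialized) view over the two Delta tables, for example (join key and selected columns are illustrative):

    spark.sql("""
        CREATE OR REPLACE VIEW recent_orders AS
        SELECT o.*, u.email
        FROM orders o
        JOIN users u ON o.user_id = u.user_id
    """)

A view stores only the query logic, so the join runs against the current versions of the source tables each time recent_orders is queried.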


Question 7

A data engineer, User A, has promoted a new pipeline to production by using the REST API to programmatically create several jobs. A DevOps engineer, User B, has configured an external orchestration tool to trigger job runs through the REST API. Both users authorized the REST API calls using their personal access tokens.
Which statement describes the contents of the workspace audit logs concerning these events?

  • A. Because the REST API was used for job creation and triggering runs, a Service Principal will be automatically used to identify these events.
  • B. Because User B last configured the jobs, their identity will be associated with both the job creation events and the job run events.
  • C. Because these events are managed separately, User A will have their identity associated with the job creation events and User B will have their identity associated with the job run events.
  • D. Because the REST API was used for job creation and triggering runs, user identity will not be captured in the audit logs.
  • E. Because User A created the jobs, their identity will be associated with both the job creation events and the job run events.
Answer: C


Question 8

A task orchestrator has been configured to run two hourly tasks. First, an outside system writes Parquet data to a directory mounted at /mnt/raw_orders/. After this data is written, a Databricks job containing the following code is executed:



Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order, and that the time field indicates when the record was queued in the source system.

If the upstream system is known to occasionally enqueue duplicate entries for a single order hours apart, which statement is correct?

  • A. Duplicate records enqueued more than 2 hours apart may be retained and the orders table may contain duplicate records with the same customer_id and order_id.
  • B. All records will be held in the state store for 2 hours before being deduplicated and committed to the orders table.
  • C. The orders table will contain only the most recent 2 hours of records and no duplicates will be present.
  • D. Duplicate records arriving more than 2 hours apart will be dropped, but duplicates that arrive in the same batch may both be written to the orders table.
  • E. The orders table will not contain duplicates, but records arriving more than 2 hours late will be ignored and missing from the table.
Answer: A
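Note: the job code referenced in the question is not reproduced here. The keyed answer matches a streaming deduplication with a 2-hour watermark, roughly like the following sketch (schema, paths, and trigger mode are illustrative; `spark` is the notebook SparkSession):

    from pyspark.sql.types import StructType, StructField, LongType, TimestampType

    # Only the fields named in the question are included.
    orders_schema = StructType([
        StructField("customer_id", LongType()),
        StructField("order_id", LongType()),
        StructField("time", TimestampType()),
    ])

    (spark.readStream
         .schema(orders_schema)
         .format("parquet")
         .load("/mnt/raw_orders/")
         .withWatermark("time", "2 hours")                  # dedup state is only retained ~2 hours
         .dropDuplicates(["customer_id", "order_id"])
         .writeStream
         .option("checkpointLocation", "/mnt/checkpoints/orders")
         .trigger(once=True)
         .toTable("orders"))

Because the watermark bounds how long deduplication state is retained, a duplicate enqueued more than 2 hours after the original can pass through and be written to the orders table again.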


Question 9

Two of the most common data locations on Databricks are the DBFS root storage and external object storage mounted with dbutils.fs.mount().

Which of the following statements is correct?

  • A. DBFS is a file system protocol that allows users to interact with files stored in object storage using syntax and guarantees similar to Unix file systems.
  • B. By default, both the DBFS root and mounted data sources are only accessible to workspace administrators.
  • C. The DBFS root is the most secure location to store data, because mounted storage volumes must have full public read and write permissions.
  • D. Neither the DBFS root nor mounted storage can be accessed when using %sh in a Databricks notebook.
  • E. The DBFS root stores files in ephemeral block volumes attached to the driver, while mounted directories will always persist saved data to external storage between sessions.
Answer: A
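Note: both the DBFS root and mounted storage are backed by durable object storage; DBFS simply exposes them through Unix-style paths. A quick illustration in a notebook (paths are hypothetical; `dbutils` is provided by Databricks):

    dbutils.fs.ls("/")                                             # browse the DBFS root
    dbutils.fs.ls("/mnt/raw_orders/")                              # browse an external mount the same way
    dbutils.fs.put("/tmp/example.txt", "hello", overwrite=True)    # write with a file-system-like API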


Question 10

Which statement regarding Spark configuration on the Databricks platform is true?

  • A. The Databricks REST API can be used to modify the Spark configuration properties for an interactive cluster without interrupting jobs currently running on the cluster.
  • B. Spark configurations set within a notebook will affect all SparkSessions attached to the same interactive cluster.
  • C. Spark configuration properties can only be set for an interactive cluster by creating a global init script.
  • D. Spark configuration properties set for an interactive cluster with the Clusters UI will impact all notebooks attached to that cluster.
  • E. When the same Spark configuration property is set for an interactive cluster and a notebook attached to that cluster, the notebook setting will always be ignored.
Answer: D
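Note: Spark properties entered in the cluster's Spark config (Clusters UI) become part of the cluster-wide configuration and therefore apply to every notebook attached to that cluster, while session-scoped properties can still be adjusted per notebook. An illustration using a real Spark SQL property (`spark` is the notebook SparkSession):

    # Overrides the value for this notebook's SparkSession only; other notebooks on
    # the same cluster keep the cluster-level setting.
    spark.conf.set("spark.sql.shuffle.partitions", "64")
    print(spark.conf.get("spark.sql.shuffle.partitions"))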
