Databricks Certified Data Engineer Professional - Databricks-Certified-Data-Engineer-Professional Exam Practice Test

Page: 1 / 21
Total 250 questions

Question 1

A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.
The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours.
Which approach would simplify the identification of these changed records?

A. Modify the overwrite logic to include a field populated by calling spark.sql.functions.current_timestamp() as data are being written; use this field to identify records written on a particular date.

B. Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.

C. Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.

D. Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed.

E. Calculate the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers before making new predictions; only make predictions on those customers not in the previous predictions.

Correct Answer: D
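
The approach in option D can be sketched as follows. This is a minimal sketch, not a reference solution: the source view `churn_updates`, the key `customer_id`, and the value columns `tenure` and `plan` are hypothetical, and the helpers only build SQL text. On a cluster the statements would be passed to `spark.sql(...)`, assuming the table has the `delta.enableChangeDataFeed` property set to true.

```python
def build_merge_sql(target, source, key, value_cols):
    """Build a MERGE that rewrites only rows whose values actually changed."""
    # <=> is Spark SQL's null-safe equality, so NULL versus NULL is not a change.
    changed = " OR ".join(f"NOT (t.{c} <=> s.{c})" for c in value_cols)
    updates = ", ".join(f"t.{c} = s.{c}" for c in value_cols)
    columns = [key, *value_cols]
    insert_cols = ", ".join(columns)
    insert_vals = ", ".join(f"s.{c}" for c in columns)
    return (
        f"MERGE INTO {target} AS t USING {source} AS s ON t.{key} = s.{key} "
        f"WHEN MATCHED AND ({changed}) THEN UPDATE SET {updates} "
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


def build_changed_rows_sql(table, starting_version):
    """Query the change data feed for rows inserted or updated since a version."""
    return (
        f"SELECT * FROM table_changes('{table}', {starting_version}) "
        "WHERE _change_type IN ('insert', 'update_postimage')"
    )


# Hypothetical names and version number, for illustration only.
merge_sql = build_merge_sql(
    "customer_churn_params", "churn_updates", "customer_id", ["tenure", "plan"]
)
changes_sql = build_changed_rows_sql("customer_churn_params", 12)
```

Because unchanged rows never match the `WHEN MATCHED AND (...)` condition, they are not rewritten and so do not appear in the change feed, which is what keeps the prediction step small.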

Question 2

A data engineer is building a Lakeflow Declarative Pipelines pipeline to process healthcare claims data. A metadata JSON file defines data quality rules for multiple tables, including:
{
  "claims": [
    {"name": "valid_patient_id", "constraint": "patient_id IS NOT NULL"},
    {"name": "non_negative_amount", "constraint": "claim_amount >= 0"}
  ]
}
The pipeline must dynamically apply these rules to the claims table without hardcoding the rules.
How should the data engineer achieve this?

A. Load the JSON metadata, loop through its entries, and apply expectations using dlt.expect_all.

B. Use a SQL CONSTRAINT block referencing the JSON file path.

C. Reference each expectation with @dlt.expect decorators in the table declaration.

D. Invoke an external API to validate records against the metadata rules.

Correct Answer: A
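
A rough sketch of option A, under stated assumptions: the JSON is inlined as a string rather than read from a file, and the table names `claims_raw` and `claims_validated` are hypothetical. The `dlt` module and `spark` session are passed in as arguments because they are only available inside a running pipeline, so `register_claims_table` is defined here but not called.

```python
import json

# The metadata from the question, inlined so the sketch is self-contained.
RULES_JSON = """
{
  "claims": [
    {"name": "valid_patient_id", "constraint": "patient_id IS NOT NULL"},
    {"name": "non_negative_amount", "constraint": "claim_amount >= 0"}
  ]
}
"""


def load_expectations(raw_json, table):
    """Turn the rule list for one table into the name -> constraint dict
    that dlt.expect_all accepts."""
    rules = json.loads(raw_json)
    return {rule["name"]: rule["constraint"] for rule in rules[table]}


def register_claims_table(dlt, spark, rules):
    """Declare a table whose expectations come entirely from metadata.
    Call this inside a pipeline, passing the dlt module and spark session."""

    @dlt.table(name="claims_validated")  # hypothetical target name
    @dlt.expect_all(rules)
    def claims_validated():
        return spark.readStream.table("claims_raw")  # hypothetical source

    return claims_validated


claims_rules = load_expectations(RULES_JSON, "claims")
```

Adding or changing a rule then only requires editing the metadata, not the pipeline code.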

Question 3

Incorporating unit tests into a PySpark application requires upfront attention to the design of your jobs, or a potentially significant refactoring of existing code.
Which statement describes a main benefit that offsets this additional effort?

A. Troubleshooting is easier since all steps are isolated and tested individually

B. Ensures that all steps interact correctly to achieve the desired end result

C. Validates a complete use case of your application

D. Improves the quality of your data

E. Yields faster deployment and execution times

Correct Answer: A
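
To illustrate the kind of design that unit testing encourages, here is a small sketch in which each step is a pure function that can be checked in isolation. It uses plain Python strings as a stand-in for DataFrame rows so that it runs without a cluster; in a PySpark job the same shape would apply, with each function taking and returning a DataFrame. The email-cleaning logic itself is a made-up example.

```python
def normalize_email(raw):
    """Step 1: trim whitespace and lower-case an email address."""
    return raw.strip().lower()


def is_valid_email(email):
    """Step 2: a deliberately simple validity check, for illustration only."""
    local, sep, domain = email.partition("@")
    return bool(local) and sep == "@" and "." in domain


def clean_emails(raw_values):
    """Pipeline: compose the two steps above."""
    return [e for e in map(normalize_email, raw_values) if is_valid_email(e)]


# Each step has its own test, so a failure points at one function.
def test_normalize_email():
    assert normalize_email("  Ada@Example.COM ") == "ada@example.com"


def test_is_valid_email():
    assert is_valid_email("ada@example.com")
    assert not is_valid_email("broken")


test_normalize_email()
test_is_valid_email()
cleaned = clean_emails(["  Ada@Example.COM ", "broken"])
```

When a test fails, the failing function name already narrows down where the bug is, which is the troubleshooting benefit described in option A.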

Question 4

Why are Pandas UDFs often preferred over traditional PySpark UDFs in performance-critical applications involving large datasets?

A. They leverage Apache Arrow to enable vectorized operations between the JVM and Python runtimes, reducing serialization costs and improving computational efficiency.

B. They allow row-level execution of functions in Python with native Spark optimization, removing the need for columnar execution.

C. They eliminate the JVM-Python boundary by bypassing serialization entirely, thereby avoiding data conversion overhead.

D. They minimize memory usage by streaming each row individually through a lightweight Python wrapper, avoiding batch processing overhead.

Correct Answer: A
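
As a hedged illustration, the sketch below shows a Series-to-Series pandas UDF and a back-of-the-envelope count of Python invocations. The temperature conversion is a made-up example, and `make_to_celsius_udf` imports `pyspark` and `pandas` lazily, so it is only usable where those packages (and PyArrow) are installed. The call count assumes a single partition and the default Arrow batch size of 10,000 records (`spark.sql.execution.arrow.maxRecordsPerBatch`); real batch counts depend on partitioning.

```python
import math


def make_to_celsius_udf():
    """Return a Series-to-Series pandas UDF (needs pyspark, pandas, pyarrow)."""
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def to_celsius(fahrenheit: pd.Series) -> pd.Series:
        # One call receives a whole Arrow batch and runs as a vectorized expression.
        return (fahrenheit - 32) * 5.0 / 9.0

    return to_celsius


def python_calls(n_rows, batch_size=10_000):
    """Python invocations needed: (row-at-a-time UDF, batched pandas UDF)."""
    return n_rows, math.ceil(n_rows / batch_size)


calls = python_calls(1_000_000)
```

Under these assumptions a million rows need a million calls to a row-at-a-time UDF but only about a hundred calls to the pandas UDF, on top of the cheaper Arrow-based data transfer.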

Question 5

A data engineering team is migrating off its legacy Hadoop platform. As part of the process, they are evaluating storage formats for performance comparison. The legacy platform uses ORC and RCFile formats. After converting a subset of data to Delta Lake, they noticed significantly better query performance. Upon investigation, they discovered that queries reading from Delta tables leveraged a Shuffle Hash Join, whereas queries on legacy formats used Sort Merge Joins. The queries reading Delta Lake data also scanned less data.
Which reason could be attributed to the difference in query performance?

A. The queries against the Delta Lake tables were able to leverage the dynamic file pruning optimization.

B. Delta Lake enables data skipping and file pruning using a vectorized Parquet reader.

C. Shuffle Hash Joins are always more efficient than Sort Merge Joins.

D. The queries against the ORC tables leveraged the dynamic data skipping optimization but not the dynamic file pruning optimization.

Correct Answer: B
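
The idea behind data skipping can be shown with a toy model. Delta Lake records per-file column statistics such as minimum and maximum values in its transaction log, and a query can skip files whose range cannot match the filter. The sketch below is plain Python, not Delta Lake code, and the file names and value ranges are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Min/max statistics for one column in one data file (toy model)."""
    path: str
    min_value: int
    max_value: int


def files_to_scan(files, lower, upper):
    """Keep only files whose [min, max] range can overlap [lower, upper]."""
    return [f.path for f in files if f.max_value >= lower and f.min_value <= upper]


files = [
    FileStats("part-0.parquet", 1, 100),
    FileStats("part-1.parquet", 101, 200),
    FileStats("part-2.parquet", 201, 300),
]

# A filter such as "WHERE id BETWEEN 120 AND 180" only needs one file here.
selected = files_to_scan(files, 120, 180)
```

Two of the three files are never opened, which is consistent with the observation in the question that the Delta Lake queries scanned less data.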

Question 6

The data architect has mandated that all tables in the Lakehouse should be configured as external Delta Lake tables.
Which approach will ensure that this requirement is met?

A. When tables are created, make sure that the external keyword is used in the create table statement.

B. Whenever a table is being created, make sure that the location keyword is used.

C. When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.

D. When the workspace is being configured, make sure that external cloud object storage has been mounted.

E. Whenever a database is being created, make sure that the location keyword is used.

Correct Answer: B
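
For illustration, a statement of the kind option B describes might look like the output of the helper below. The table name, columns, and storage path are hypothetical, and the helper only builds the SQL text; on a cluster it would be run with `spark.sql(ddl)`.

```python
def create_external_table_sql(table, columns, location):
    """Build a CREATE TABLE statement that stores data at an explicit path.
    Supplying LOCATION is what makes the resulting table external (unmanaged)."""
    column_list = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    return f"CREATE TABLE {table} ({column_list}) USING DELTA LOCATION '{location}'"


# Hypothetical table and bucket, for illustration only.
ddl = create_external_table_sql(
    "sales",
    [("id", "BIGINT"), ("amount", "DOUBLE")],
    "s3://example-bucket/tables/sales",
)
```

Dropping an external table removes only its metadata; the files at the given location are left in place.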

Question 7

A member of the data engineering team has submitted a short notebook that they wish to schedule as part of a larger data pipeline. Assume that the commands provided below produce the logically correct results when run as presented.

Which command should be removed from the notebook before scheduling it as a job?

A. Cmd 3

B. Cmd 5

C. Cmd 2

D. Cmd 4

E. Cmd 6

Correct Answer: E

Question 8

A data pipeline uses Structured Streaming to ingest data from Kafka to Delta Lake. Data is being stored in a bronze table, and includes the Kafka-generated timestamp, key, and value. Three months after the pipeline is deployed, the data engineering team has noticed some latency issues during certain times of the day.
A senior data engineer updates the Delta table's schema and ingestion logic to include the current timestamp (as recorded by Apache Spark) as well as the Kafka topic and partition. The team plans to use the additional metadata fields to diagnose the transient processing delays.
Which limitation will the team face while diagnosing this problem?

A. Updating the table schema requires a default value provided for each file added.

B. New fields cannot be added to a production Delta table.

C. New fields will not be computed for historic records.

D. Updating the table schema will invalidate the Delta transaction log metadata.

E. Spark cannot capture the topic and partition fields from the Kafka source.

Correct Answer: C
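
The limitation in option C can be shown with a toy model. The Kafka source in Structured Streaming does expose `topic`, `partition`, and `timestamp` columns, but rows written before the schema change never stored them. The sketch below is plain Python, not Spark code; the column name `processing_time` and all row values are invented.

```python
# Schema after the change: three original fields plus three new ones.
NEW_SCHEMA = ["timestamp", "key", "value", "processing_time", "topic", "partition"]


def read_with_schema(rows, schema):
    """Project stored rows onto a schema; fields a row never stored read as None."""
    return [{column: row.get(column) for column in schema} for row in rows]


# Written during the first three months, before the new fields existed.
historic = {"timestamp": "2024-01-01T00:00:00", "key": "k1", "value": "v1"}

# Written after the ingestion logic was updated.
recent = {
    "timestamp": "2024-04-01T00:00:00",
    "key": "k2",
    "value": "v2",
    "processing_time": "2024-04-01T00:00:03",
    "topic": "events",
    "partition": 0,
}

rows = read_with_schema([historic, recent], NEW_SCHEMA)
```

Only the recent row carries the metadata needed to compare Kafka time with processing time, so the first three months of data cannot be used for the latency diagnosis.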

Question 9

To reduce storage and compute costs, the data engineering team has been tasked with curating a series of aggregate tables leveraged by business intelligence dashboards, customer-facing applications, production machine learning models, and ad hoc analytical queries.
The data engineering team has been made aware of new requirements from a customer-facing application, which is the only downstream workload they manage entirely. As a result, an aggregate table used by numerous teams across the organization will need to have a number of fields renamed, and additional fields will also be added.
Which of the solutions addresses the situation while minimally interrupting other teams in the organization without increasing the number of tables that need to be managed?

A. Configure a new table with all the requisite fields and new names and use this as the source for the customer-facing application; create a view that maintains the original data schema and table name by aliasing select fields from the new table.

B. Add a table comment warning all users that the table schema and field names will be changing on a given date; overwrite the table in place to the specifications of the customer-facing application.

C. Send all users notice that the schema for the table will be changing; include in the communication the logic necessary to revert the new table schema to match historic queries.

D. Create a new table with the required schema and new fields and use Delta Lake's deep clone functionality to sync up changes committed to one table to the corresponding table.

E. Replace the current table definition with a logical view defined with the query logic currently writing the aggregate table; create a new table to power the customer-facing application.

Correct Answer: A
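
A sketch of the compatibility view in option A, with invented names: `agg_metrics` stands for the original table name that other teams query, `agg_metrics_v2` for the new table, and the column mapping is hypothetical. The helper only builds SQL text, and it assumes the original table has already been renamed or dropped so that the view can take over its name.

```python
def compatibility_view_sql(view_name, new_table, renamed):
    """Build a view that exposes a new table under the old column names.
    `renamed` maps each new column name to the original name readers expect."""
    select_list = ", ".join(f"{new} AS {old}" for new, old in renamed.items())
    return f"CREATE OR REPLACE VIEW {view_name} AS SELECT {select_list} FROM {new_table}"


# Hypothetical names, for illustration only.
view_sql = compatibility_view_sql(
    "agg_metrics",
    "agg_metrics_v2",
    {"customer_key": "cust_id", "total_revenue": "revenue"},
)
```

Fields that were added only for the customer-facing application are simply left out of the select list, so existing queries see the schema they always did.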

Question 10

The data science team has created and logged a production model using MLflow. The following code correctly imports and applies the production model to output the predictions as a new DataFrame named preds with the schema "customer_id LONG, predictions DOUBLE, date DATE".

The data science team would like predictions saved to a Delta Lake table with the ability to compare all predictions across time. Churn predictions will be made at most once per day.
Which code block accomplishes this task while minimizing potential compute costs?

A. preds.write.format("delta").save("/preds/churn_preds")

B. preds.write.mode("append").saveAsTable("churn_preds")

Correct Answer: B

Question 11

A nightly batch job is configured to ingest all data files from a cloud object storage container where records are stored in a nested directory structure YYYY/MM/DD. The data for each date represents all records that were processed by the source system on that date, noting that some records may be delayed as they await moderator approval. Each entry represents a user review of a product and has the following schema:
user_id STRING, review_id BIGINT, product_id BIGINT, review_timestamp TIMESTAMP, review_text STRING
The ingestion job is configured to append all data for the previous date to a target table reviews_raw with an identical schema to the source system. The next step in the pipeline is a batch write to propagate all new records inserted into reviews_raw to a table where data is fully deduplicated, validated, and enriched.
Which solution minimizes the compute costs to propagate this batch of data?

A. Perform a batch read on the reviews_raw table and perform an insert-only merge using the natural composite key user_id, review_id, product_id, review_timestamp.

B. Configure a Structured Streaming read against the reviews_raw table using the trigger once execution mode to process new records as a batch job.

C. Reprocess all records in reviews_raw and overwrite the next table in the pipeline.

D. Filter all records in the reviews_raw table based on the review_timestamp; batch append those records produced in the last 48 hours.

E. Use Delta Lake version history to get the difference between the latest version of reviews_raw and one version prior, then write these records to the next table.

Correct Answer: B
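
Why option B avoids reprocessing can be shown with a toy model of checkpointing. In PySpark this pattern corresponds to `spark.readStream.table("reviews_raw")` followed by a `writeStream` with a `checkpointLocation` option and `.trigger(once=True)` (newer runtimes also offer `availableNow=True`). The sketch below is plain Python, not Spark code: the checkpoint is a dict and the table is a list.

```python
def run_triggered_batch(source_rows, checkpoint):
    """Process only rows appended since the last run, then advance the checkpoint."""
    start = checkpoint.get("offset", 0)
    new_rows = source_rows[start:]
    checkpoint["offset"] = len(source_rows)
    return new_rows


checkpoint = {}             # stands in for the stream's checkpoint directory
reviews_raw = ["r1", "r2"]  # stands in for the append-only source table

first = run_triggered_batch(reviews_raw, checkpoint)   # first nightly run
reviews_raw += ["r3"]                                  # next day's append
second = run_triggered_batch(reviews_raw, checkpoint)  # second nightly run
third = run_triggered_batch(reviews_raw, checkpoint)   # nothing new arrived
```

Each run reads only what arrived since the previous one, without filtering on timestamps or comparing table versions by hand.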

Question 12

A data engineer is using the AUTO CDC API in a Lakeflow Spark Declarative Pipelines pipeline to propagate deletions from a source table (orders_source) to a target table (orders_target). The source has Change Data Feed (CDF) enabled, but some delete events arrive out of order due to upstream delays. How does the AUTO CDC API internally ensure deletions are applied correctly despite out-of-order events?

A. It uses sequence_by to order events and retains tombstones for deleted rows until older sequences are processed.

B. It runs VACUUM on the target table to purge conflicting records.

C. It manually sorts incoming events by timestamp before applying changes.

D. It ignores deletions if they arrive after updates for the same key.

Correct Answer: A
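
The mechanism described in option A can be modelled in a few lines. This is a toy model in plain Python and makes no claim about how AUTO CDC is implemented internally; the event shape (`key`, `seq`, `op`, `row`) is invented, with `seq` playing the role of the `sequence_by` column.

```python
def apply_events(events):
    """Apply change events in arrival order, keeping the highest sequence per key.
    A delete is stored as a tombstone (row=None) so that a late, older event
    for the same key cannot bring the row back."""
    state = {}  # key -> (sequence, row or None)
    for event in events:
        key, seq = event["key"], event["seq"]
        current = state.get(key)
        if current is not None and current[0] >= seq:
            continue  # stale event: something newer was already applied
        row = None if event["op"] == "DELETE" else event["row"]
        state[key] = (seq, row)
    # Tombstones are hidden from readers of the target table.
    return {key: row for key, (seq, row) in state.items() if row is not None}


events = [
    {"key": 1, "seq": 1, "op": "UPSERT", "row": {"status": "new"}},
    {"key": 2, "seq": 1, "op": "UPSERT", "row": {"status": "new"}},
    {"key": 1, "seq": 3, "op": "DELETE", "row": None},
    {"key": 1, "seq": 2, "op": "UPSERT", "row": {"status": "paid"}},  # arrives late
]

target = apply_events(events)
```

Without the tombstone, the late update with sequence 2 would find no row for key 1 and would reinsert it, undoing the delete.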
