Validate your Databricks-Certified-Professional-Data-Engineer Exam Preparation with Databricks-Certified-Professional-Data-Engineer Practice Test (Online & Offline)
Get all the Information About Databricks Databricks-Certified-Professional-Data-Engineer Exam 2023 Practice Test Questions
The Databricks Certified Professional Data Engineer exam is a certification for data professionals who want to demonstrate their expertise in building, deploying, and maintaining data engineering solutions using Databricks. The exam covers a wide range of data engineering topics and requires a thorough understanding of Databricks data engineering concepts and techniques. It is challenging and requires candidates to demonstrate their ability to perform specific tasks using Databricks.
NEW QUESTION # 12
A data engineering team is in the process of converting their existing data pipeline to utilize Auto Loader for
incremental processing in the ingestion of JSON files. One data engineer comes across the following code
block in the Auto Loader documentation:
streaming_df = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schemaLocation)
    .load(sourcePath))
Assuming that schemaLocation and sourcePath have been set correctly, which of the following changes does
the data engineer need to make to convert this code block to use Auto Loader to ingest the data?
- A. The data engineer needs to add the .autoLoader line before the .load(sourcePath) line
- B. There is no change required. The data engineer needs to ask their administrator to turn on Auto Loader
- C. There is no change required. Databricks automatically uses Auto Loader for streaming reads
- D. The data engineer needs to change the format("cloudFiles") line to format("autoLoader")
- E. There is no change required. The inclusion of format("cloudFiles") enables the use of Auto Loader
Answer: E
NEW QUESTION # 13
You are asked to debug a Databricks job that is taking too long to run on Sundays. What steps would you take to identify the step that is taking longer to run?
- A. Once a job is launched, you cannot access the job's notebook activity.
- B. Enable debug mode in the Jobs to see the output activity of a job, output should be available to view.
- C. Use the compute's spark UI to monitor the job activity.
- D. A notebook activity of job run is only visible when using all-purpose cluster.
- E. In the Workflows UI, under Jobs, select the job you want to monitor and select the run; the notebook activity can be viewed there.
Answer: E
Explanation:
The answer is: in the Workflows UI, under Jobs, select the job you want to monitor and select the run; the notebook activity can be viewed there.
You can view currently active runs or completed runs. Click on a run to view the notebook output.
NEW QUESTION # 14
You have configured Auto Loader to process incoming IoT data from cloud object storage every 15 minutes. Recently a change was made to the notebook code to update the processing logic, but the team later realized that the notebook had been failing for the last 24 hours. What steps does the team need to take to reprocess the data that was not loaded after the notebook was corrected?
- A. Move the files that were not processed to another location and manually copy the files into the ingestion path to reprocess them
- B. Auto Loader automatically re-processes data that was not loaded
- C. Manually re-load the data
- D. Delete the checkpoint folder and run Auto Loader again
- E. Enable back_fill = TRUE to reprocess the data
Answer: B
Explanation:
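The answer is: Auto Loader automatically re-processes data that was not loaded, using the checkpoint. The checkpoint records which files have already been ingested, so files that arrived while the notebook was failing are picked up on the next successful run.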
NEW QUESTION # 15
Consider flipping a coin for which the probability of heads is p, where p is unknown, and our goal is to
estimate p. The obvious approach is to count how many times the coin came up heads and divide by the total
number of coin flips. If we flip the coin 1000 times and it comes up heads 367 times, it is very reasonable to
estimate p as approximately 0.367. However, suppose we flip the coin only twice and we get heads both times.
Is it reasonable to estimate p as 1.0? Intuitively, given that we only flipped the coin twice, it seems a bit
rash to conclude that the coin will always come up heads, and ____________ is a way of avoiding such rash
conclusions.
- A. Linear Regression
- B. Naive Bayes
- C. Laplace Smoothing
- D. Logistic Regression
Answer: C
Explanation:
Smooth the estimates: consider flipping a coin for which the probability of heads is p, where p is unknown, and
our goal is to estimate p. The obvious approach is to count how many times the coin came up heads and divide
by the total number of coin flips. If we flip the coin 1000 times and it comes up heads 367 times, it is very
reasonable to estimate p as approximately 0.367. However, suppose we flip the coin only twice and we get
heads both times. Is it reasonable to estimate p as 1.0? Intuitively, given that we only flipped the coin twice, it
seems a bit rash to conclude that the coin will always come up heads, and smoothing is a way of avoiding such
rash conclusions. A simple smoothing method, called Laplace smoothing (or Laplace's law of succession or
add-one smoothing in R&N), is to estimate p by (one plus the number of heads) / (two plus the total number of
flips). Said differently, if we are keeping count of the number of heads and the number of tails, this rule is
equivalent to starting each of our counts at one, rather than zero. Another advantage of Laplace smoothing is
that it avoids estimating any probabilities to be zero, even for events never observed in the data. A known
drawback, for example in language models, is that add-one smoothing can assign too much probability to unseen words.
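The add-one rule described above can be checked with a few lines of Python. This is a minimal sketch; the function name laplace_estimate is illustrative and not part of any library.

```python
from fractions import Fraction


def laplace_estimate(heads: int, flips: int) -> Fraction:
    """Add-one (Laplace) estimate of P(heads): (heads + 1) / (flips + 2)."""
    if not 0 <= heads <= flips:
        raise ValueError("heads must be between 0 and flips")
    # Equivalent to starting both the heads count and the tails count at one.
    return Fraction(heads + 1, flips + 2)


# Two flips, two heads: the raw estimate is 2/2 = 1, the smoothed one is 3/4.
two_flip_estimate = laplace_estimate(2, 2)

# With 1000 flips the smoothed estimate stays close to the raw 0.367.
large_sample_estimate = float(laplace_estimate(367, 1000))
```

With only two flips the smoothed estimate is pulled strongly toward 1/2, while with 1000 flips the correction is negligible, which is the behavior the explanation describes.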
NEW QUESTION # 16
You were asked to identify the number of times a temperature sensor exceeded the threshold temperature (100.00) for each device. Each row contains 5 readings collected every 5 minutes. Fill in the blanks with the appropriate functions.
Schema: deviceId INT, deviceTemp ARRAY<double>, dateTimeCollected TIMESTAMP
SELECT deviceId, __ (__ (__(deviceTemp, i -> i > 100.00)))
FROM devices
GROUP BY deviceId
- A. SUM, COUNT, SIZE
- B. SUM, SIZE, FILTER
- C. SUM, SIZE, ARRAY_FILTER
- D. SUM, SIZE, ARRAY_CONTAINS
- E. SUM, SIZE, SLICE
Answer: B
Explanation:
The FILTER function can be used to filter an array based on an expression.
The SIZE function can be used to get the size of an array.
SUM is used to calculate the total by device.
Putting these together: SELECT deviceId, SUM(SIZE(FILTER(deviceTemp, i -> i > 100.00))) FROM devices GROUP BY deviceId
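To see what the three functions compute, the same logic can be mirrored in plain Python. This is only a sketch of the semantics, not Spark code; the sample rows and the helper name count_exceedances are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (deviceId, deviceTemp), five readings per row.
rows = [
    (1, [99.0, 101.5, 100.0, 120.0, 98.0]),
    (1, [100.1, 90.0, 90.0, 90.0, 90.0]),
    (2, [50.0, 50.0, 50.0, 50.0, 50.0]),
]


def count_exceedances(rows, threshold=100.0):
    """Mirror SUM(SIZE(FILTER(deviceTemp, i -> i > threshold))) GROUP BY deviceId."""
    totals = defaultdict(int)
    for device_id, temps in rows:
        filtered = [t for t in temps if t > threshold]  # FILTER
        totals[device_id] += len(filtered)              # SIZE, then SUM per device
    return dict(totals)


exceedances = count_exceedances(rows)
```

Note that a reading of exactly 100.0 is not counted, because the predicate is strictly greater than the threshold.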
NEW QUESTION # 17
Which of the following types of tasks cannot be set up through a job?
- A. Python
- B. Databricks SQL Dashboard refresh
- C. Notebook
- D. Spark Submit
- E. DELTA LIVE PIPELINE
Answer: B
NEW QUESTION # 18
A dataset has been defined using Delta Live Tables and includes an expectations clause: CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01'). What is the expected behavior when a batch of data containing records that violate this constraint is processed?
- A. Records that violate the expectation cause the job to fail.
- B. Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset.
- C. Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table.
- D. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.
- E. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.
Answer: D
Explanation:
The answer is: records that violate the expectation are added to the target dataset and recorded as invalid in the event log.
Delta Live Tables supports three types of expectations for handling bad data in DLT pipelines: retain invalid records (the default when only EXPECT is specified), drop invalid records (ON VIOLATION DROP ROW), and fail the update (ON VIOLATION FAIL UPDATE).
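The difference between the expectation behaviors can be modeled in plain Python. This is a toy model only and does not use the Delta Live Tables API; apply_expectation, the mode names "retain", "drop" and "fail", and the sample batch are all hypothetical.

```python
from datetime import date


def apply_expectation(rows, predicate, on_violation="retain"):
    """Return (rows written to the target, number of violations logged)."""
    if on_violation not in ("retain", "drop", "fail"):
        raise ValueError(f"unknown mode: {on_violation}")
    violations = [r for r in rows if not predicate(r)]
    if on_violation == "fail" and violations:
        # Models ON VIOLATION FAIL UPDATE: nothing is written.
        raise RuntimeError(f"{len(violations)} record(s) violated the expectation")
    if on_violation == "drop":
        # Models ON VIOLATION DROP ROW: bad rows are removed but still counted.
        return [r for r in rows if predicate(r)], len(violations)
    # Models plain EXPECT: bad rows are kept and only counted.
    return list(rows), len(violations)


batch = [{"timestamp": date(2021, 5, 1)}, {"timestamp": date(2019, 3, 1)}]


def valid_timestamp(row):
    return row["timestamp"] > date(2020, 1, 1)


kept, violation_count = apply_expectation(batch, valid_timestamp)
```

With the default mode, both rows reach the target and one violation is counted, which matches answer D. Only the "drop" mode removes the 2019 row.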
NEW QUESTION # 19
When working with Auto Loader, you notice that most of the inferred columns are string data types, including columns that were supposed to be integers. How can we fix this?
- A. Update the checkpoint location
- B. Provide the schema of the source table in the cloudFiles.schemaLocation
- C. Provide schema hints
- D. Correct the incoming data by explicitly casting the data types
- E. Provide the schema of the target table in the cloudFiles.schemaLocation
Answer: C
Explanation:
The answer is: provide schema hints.
# Read CSV files with Auto Loader, hint the types of known columns, and write the stream out
spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "csv") \
    .option("header", "true") \
    .option("cloudFiles.schemaLocation", schema_location) \
    .option("cloudFiles.schemaHints", "id int, description string") \
    .load(raw_data_location) \
    .writeStream \
    .option("checkpointLocation", checkpoint_location) \
    .start(target_delta_table_location)
The line .option("cloudFiles.schemaHints", "id int, description string") provides a hint that the id column is an int and the description column is a string.
cloudFiles.schemaLocation is used to store the output of schema inference during the load process; with schema hints you can enforce data types for known columns ahead of time.
NEW QUESTION # 20
A data engineer has set up two Jobs that each run nightly. The first Job starts at 12:00 AM, and it usually
completes in about 20 minutes. The second Job depends on the first Job, and it starts at 12:30 AM. Sometimes,
the second Job fails when the first Job does not complete by 12:30 AM.
Which of the following approaches can the data engineer use to avoid this problem?
- A. They can set up the data to stream from the first Job to the second Job
- B. They can set up a retry policy on the first Job to help it run more quickly
- C. They can use cluster pools to help the Jobs run more efficiently
- D. They can limit the size of the output in the second Job so that it will not fail as easily
- E. They can utilize multiple tasks in a single job with a linear dependency
Answer: E
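A single job with two tasks and a linear dependency could be described roughly as follows. This is a sketch only: the field names (tasks, task_key, depends_on, notebook_task) follow the Databricks Jobs API 2.1 job settings, but the job name and notebook paths are hypothetical, cluster settings are omitted, and run_order is a local helper, not a Databricks function.

```python
from graphlib import TopologicalSorter

# Hypothetical job settings: one job, two tasks, the second depends on the first.
job_settings = {
    "name": "nightly_pipeline",
    "tasks": [
        {
            "task_key": "first_job",
            "notebook_task": {"notebook_path": "/Jobs/first_job"},
        },
        {
            "task_key": "second_job",
            "depends_on": [{"task_key": "first_job"}],
            "notebook_task": {"notebook_path": "/Jobs/second_job"},
        },
    ],
}


def run_order(settings):
    """Order the tasks so that each one comes after the tasks it depends on."""
    graph = {
        task["task_key"]: {dep["task_key"] for dep in task.get("depends_on", [])}
        for task in settings["tasks"]
    }
    return list(TopologicalSorter(graph).static_order())


order = run_order(job_settings)
```

Because the second task is triggered by the completion of the first rather than by a clock time, a slow first run delays the second task instead of causing it to fail.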
NEW QUESTION # 21
Kevin is the owner of both the sales table and the regional_sales_vw view, which uses the sales table as its underlying data source. Kevin wants to grant the SELECT privilege on the view regional_sales_vw to Steven, a newly joined team member. Which of the following is a true statement?
- A. Kevin cannot grant access to Steven since he does not have security admin privilege
- B. Kevin cannot grant access to Steven since he does have workspace admin privilege
- C. Kevin can grant access to the view, because he is the owner of the view and the underlying table
- D. Although Kevin is the owner, he does not have the ALL PRIVILEGES permission
- E. Steven will also require SELECT access on the underlying table
Answer: C
Explanation:
The answer is: Kevin can grant access to the view, because he is the owner of the view and the underlying table. Ownership determines whether or not you can grant privileges on derived objects to other users. A user who creates a schema, table, view, or function becomes its owner. The owner is granted all privileges and can grant privileges to other users.
NEW QUESTION # 22
The research team has put together a funnel analysis query to monitor the customer traffic on the e-commerce platform, the query takes about 30 mins to run on a small SQL endpoint cluster with max scaling set to 1 cluster. What steps can be taken to improve the performance of the query?
- A. They can turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy from "Cost optimized" to "Reliability Optimized."
- B. They can turn off the Auto Stop feature for the SQL endpoint to more than 30 mins.
- C. They can turn on the Serverless feature for the SQL endpoint.
- D. They can increase the maximum bound of the SQL endpoint's scaling range anywhere from between 1 to 100 to review the performance and select the size that meets the required SLA.
- E. They can increase the cluster size anywhere from X small to 3XL to review the performance and select the size that meets the required SLA.
Answer: E
Explanation:
The answer is: they can increase the cluster size anywhere from 2X-Small to 4X-Large (scale up) to review the performance and select the size that meets the required SLA. If you are trying to improve the performance of a single query, additional memory and additional worker nodes mean that more tasks can run in the cluster, which improves the performance of that query.
The question tests whether you know how to scale a SQL endpoint (SQL warehouse). Look for cue words that tell you whether the queries are running sequentially or concurrently. If the queries are running sequentially, scale up (increase the cluster size, from 2X-Small to 4X-Large). If the queries are running concurrently or there are more users, scale out (add more clusters).
SQL endpoint (SQL warehouse) overview:
1. A SQL warehouse should have at least one cluster.
2. A cluster comprises one driver node and one or many worker nodes.
3. The number of worker nodes in a cluster is determined by the size of the cluster (2X-Small -> 1 worker, X-Small -> 2 workers, ... up to 4X-Large -> 128 workers). This is called scale up.
4. A single cluster, irrespective of cluster size (2X-Small to 4X-Large), can only run 10 queries at any given time. If a user submits 20 queries all at once to a warehouse with 3X-Large cluster size and cluster scaling (min 1, max 1), 10 queries start running while the remaining 10 wait in a queue for those 10 to finish.
5. Increasing the warehouse cluster size can improve the performance of a query. For example, if a query runs for 1 minute on a 2X-Small warehouse, it may run in 30 seconds if we change the warehouse size to X-Small. This is because 2X-Small has 1 worker node and X-Small has 2 worker nodes, so the query runs more tasks in parallel and finishes faster (note: this is an idealized example; query performance depends on many factors and does not always scale linearly).
6. A warehouse can have more than one cluster. This is called scale out. If a warehouse is configured with X-Small cluster size and cluster scaling (min 1, max 2), Databricks spins up an additional cluster if it detects queries waiting in the queue. For example, if a user submits 20 queries, 10 queries start running and the remaining 10 are held in the queue; Databricks automatically starts the second cluster and redirects the 10 queued queries to it.
7. A single query will not span more than one cluster. Once a query is submitted to a cluster, it remains in that cluster until the query execution finishes, irrespective of how many clusters are available to scale.
Scale up -> increase the size of the SQL endpoint by changing the cluster size from 2X-Small up to 4X-Large. If you are trying to improve the performance of a single query, additional memory, worker nodes, and cores allow more tasks to run in the cluster, which ultimately improves performance.
During warehouse creation or afterwards, you can change the warehouse size (2X-Small to 4X-Large) to improve query performance, and change the maximum of the scaling range to add more clusters to a SQL endpoint (SQL warehouse), which is scale out. If you are changing an existing warehouse, you may have to restart the warehouse for the changes to take effect.
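The queueing behavior in points 4 and 6 can be illustrated with a small model. This is a deliberately simplified sketch: admit_queries is a made-up helper, the limit of 10 queries per cluster is taken from the explanation above, and the time it takes to start an extra cluster is ignored.

```python
def admit_queries(submitted: int, clusters: int, per_cluster_limit: int = 10):
    """Return (running, queued) for a warehouse with the given number of clusters."""
    if submitted < 0 or clusters < 1:
        raise ValueError("submitted must be >= 0 and clusters must be >= 1")
    capacity = clusters * per_cluster_limit
    running = min(submitted, capacity)
    return running, submitted - running


# Scaling (min 1, max 1): 20 queries submitted, 10 run and 10 wait.
single_cluster = admit_queries(20, clusters=1)

# Scaling (min 1, max 2) after the second cluster has started: all 20 run.
two_clusters = admit_queries(20, clusters=2)
```

The model also makes the scale-up case clear: for a single slow query, adding clusters changes nothing in this calculation, so the only lever left is a larger cluster size.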
NEW QUESTION # 23
Which of the following tools provides data access control, access audit, data lineage, and data discovery?
- A. DELTA LIVE Pipelines
- B. Lakehouse
- C. Data Governance
- D. DELTA lake
- E. Unity Catalog
Answer: E
NEW QUESTION # 24
What is the underlying technology that makes the Auto Loader work?
- A. Live DataFrames
- B. DataFrames
- C. Structured Streaming
- D. Delta Live Tables
- E. Loader
Answer: C
NEW QUESTION # 25
The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs UI. A new data engineering hire is onboarding to the team and has requested access to one of these notebooks to review the production logic.
What are the maximum notebook permissions that can be granted to the user without allowing accidental changes to production code or data?
- A. Can Manage
- B. No permissions
- C. Can Read
- D. Can Edit
- E. Can Run
Answer: C
Explanation:
Can Read is the correct answer because it is the maximum notebook permission that can be granted to the user without allowing accidental changes to production code or data. Notebook permissions are used to control access to notebooks in Databricks workspaces. There are four types of notebook permissions: Can Manage, Can Edit, Can Run, and Can Read. Can Manage allows full control over the notebook, including editing, running, deleting, exporting, and changing permissions. Can Edit allows modifying and running the notebook, but not changing permissions or deleting it. Can Run allows executing commands in an existing cluster attached to the notebook, but not modifying or exporting it. Can Read allows viewing the notebook content, but not running or modifying it. In this case, granting Can Read permission to the user will allow them to review the production logic in the notebook without allowing them to make any changes to it or run any commands that may affect production data. Verified References: [Databricks Certified Data Engineer Professional], under "Databricks Workspace" section; Databricks Documentation, under "Notebook permissions" section.
NEW QUESTION # 26
A new user who currently does not have access to the catalog or schema is requesting access to the customer table in the sales schema. Because the customer table contains sensitive information, you have decided to create a view on the table that excludes the sensitive columns, and you granted access to the view with GRANT SELECT ON view_name to [email protected]. When the user tries to query the view, they get the error "view does not exist". What is preventing the user from accessing the view, and how can it be fixed?
- A. User needs ADMIN privilege on the view
- B. User has to be the owner of the view
- C. User requires to be put in a special group that has access to PII data
- D. User requires SELECT on the underlying table
- E. User requires USAGE privilege on Sales schema
Answer: E
Explanation:
The answer is: the user requires the USAGE privilege on the sales schema.
Data object privileges - Azure Databricks | Microsoft Docs
GRANT USAGE ON SCHEMA sales TO [email protected];
*USAGE: does not give any abilities, but is an additional requirement to perform any action on a schema object.
NEW QUESTION # 27
......
Check Real Databricks Databricks-Certified-Professional-Data-Engineer Exam Question for Free (2023): https://pass4sure.practicedump.com/Databricks-Certified-Professional-Data-Engineer-exam-questions.html