
Associate-Developer-Apache-Spark Exam Questions Get Updated [2023] with Correct Answers
Practice Associate-Developer-Apache-Spark Questions With Certification guide Q&A from Training Expert Fast2test
The Databricks Associate-Developer-Apache-Spark certification exam is an industry-recognized certification that validates the skills of developers in the Apache Spark framework. The certification exam covers a range of topics related to Apache Spark, including data processing, data pipelines, machine learning, and more. The certification exam is an excellent opportunity for developers to showcase their skills and knowledge of the Apache Spark framework and advance their careers in the field of big data and data processing.
Databricks Associate-Developer-Apache-Spark is a certification exam that validates the skills and knowledge of developers in Apache Spark. Apache Spark is a popular open-source big data processing engine that offers faster processing speed and improved performance over traditional Hadoop MapReduce. The certification exam is designed to test the candidate's ability to work with Spark Core, Spark SQL, Spark Streaming, and MLlib.
NEW QUESTION # 24
Which of the following code blocks returns a one-column DataFrame of all values in column supplier of DataFrame itemsDf that do not contain the letter X? In the DataFrame, every value should only be listed once.
Sample of DataFrame itemsDf:
+------+--------------------+--------------------+-------------------+
|itemId|            itemName|          attributes|           supplier|
+------+--------------------+--------------------+-------------------+
|     1|Thick Coat for Wa...|[blue, winter, cozy]|Sports Company Inc.|
|     2|Elegant Outdoors ...|[red, summer, fre...|              YetiX|
|     3|   Outdoors Backpack|[green, summer, t...|Sports Company Inc.|
+------+--------------------+--------------------+-------------------+
- A. itemsDf.filter(~col('supplier').contains('X')).select('supplier').distinct()
- B. itemsDf.filter(not(col('supplier').contains('X'))).select('supplier').unique()
- C. itemsDf.select(~col('supplier').contains('X')).distinct()
- D. itemsDf.filter(!col('supplier').contains('X')).select(col('supplier')).unique()
- E. itemsDf.filter(col(supplier).not_contains('X')).select(supplier).distinct()
Answer: A
Explanation:
Output of correct code block:
+-------------------+
|           supplier|
+-------------------+
|Sports Company Inc.|
+-------------------+
The key to this question is understanding which operator does the opposite of an operation - the ~ (not) operator. In addition, you should know that there is no unique() method.
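As a rough plain-Python analogue of the correct code block (not Spark code; the list suppliers below is a hypothetical stand-in for column supplier, with values copied from the sample), the same steps are: negate the containment test, then deduplicate:

```python
# Hypothetical stand-in for column supplier of itemsDf (values taken from the sample).
suppliers = ["Sports Company Inc.", "YetiX", "Sports Company Inc."]

# `not in` plays the role of ~col('supplier').contains('X');
# building a set plays the role of distinct().
without_x = sorted({s for s in suppliers if "X" not in s})

print(without_x)
```

In PySpark itself, Python's not keyword cannot negate a Column, which is why the ~ operator is needed there.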
Static notebook | Dynamic notebook: See test 1
NEW QUESTION # 25
The code block shown below should return a single-column DataFrame with a column named consonant_ct that, for each row, shows the number of consonants in column itemName of DataFrame itemsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.
DataFrame itemsDf:
+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+
Code block:
itemsDf.select(__1__(__2__(__3__(__4__), "a|e|i|o|u|\s", "")).__5__("consonant_ct"))
- A. 1. size
2. regexp_extract
3. lower
4. col("itemName")
5. alias
- B. 1. length
2. regexp_extract
3. upper
4. col("itemName")
5. as
- C. 1. lower
2. regexp_replace
3. length
4. "itemName"
5. alias
- D. 1. length
2. regexp_replace
3. lower
4. col("itemName")
5. alias
- E. 1. size
2. regexp_replace
3. lower
4. "itemName"
5. alias
Answer: D
Explanation:
Correct code block:
itemsDf.select(length(regexp_replace(lower(col("itemName")), "a|e|i|o|u|\s", "")).alias("consonant_ct"))
Returned DataFrame:
+------------+
|consonant_ct|
+------------+
|          19|
|          16|
|          10|
+------------+
This question tries to make you think about the string functions Spark provides and in which order they should be applied. Arguably the most difficult part, the regular expression "a|e|i|o|u|\s", is not a numbered blank. However, if you are not familiar with the string functions, it may be a good idea to review those before the exam.
The size operator and the length operator can easily be confused. size works on arrays, while length works on strings. Luckily, this is something you can read up about in the documentation.
The code block works by first converting all uppercase letters in column itemName into lowercase (the lower() part). Then, it replaces all vowels and whitespace characters with "nothing" - an empty string "" (the regexp_replace() part). Now, only lowercase consonants without spaces are left in each value. Then, per row, the length operator counts these remaining characters. Note that column itemName in itemsDf does not include any numbers or other characters, so we do not need to make any provisions for these. Finally, by using the alias() operator, we rename the resulting column to consonant_ct.
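The order of operations can be checked outside Spark with a small plain-Python sketch. It is only an analogue: re.sub and len stand in for regexp_replace and length, the helper name consonant_count is made up for illustration, and the item names are copied from the DataFrame in the question.

```python
import re

# Item names taken from the DataFrame itemsDf shown in the question.
item_names = [
    "Thick Coat for Walking in the Snow",
    "Elegant Outdoors Summer Dress",
    "Outdoors Backpack",
]

def consonant_count(name: str) -> int:
    # Hypothetical helper mirroring lower -> regexp_replace -> length.
    lowered = name.lower()                            # lower()
    stripped = re.sub(r"a|e|i|o|u|\s", "", lowered)   # regexp_replace(..., "")
    return len(stripped)                              # length()

consonant_ct = [consonant_count(name) for name in item_names]
print(consonant_ct)  # [19, 16, 10], matching the returned DataFrame above
```

Swapping the first two steps would leave uppercase vowels such as the E in "Elegant" in place, which is why lower() has to be the innermost call.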
More info:
- lower: pyspark.sql.functions.lower - PySpark 3.1.2 documentation
- regexp_replace: pyspark.sql.functions.regexp_replace - PySpark 3.1.2 documentation
- length: pyspark.sql.functions.length - PySpark 3.1.2 documentation
- alias: pyspark.sql.Column.alias - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
NEW QUESTION # 26
Which of the following code blocks shuffles DataFrame transactionsDf, which has 8 partitions, so that it has 10 partitions?
- A. transactionsDf.repartition(transactionsDf.rdd.getNumPartitions()+2)
- B. transactionsDf.repartition(transactionsDf._partitions+2)
- C. transactionsDf.coalesce(transactionsDf.getNumPartitions()+2)
- D. transactionsDf.repartition(transactionsDf.getNumPartitions()+2)
- E. transactionsDf.coalesce(10)
Answer: A
Explanation:
transactionsDf.repartition(transactionsDf.rdd.getNumPartitions()+2)
Correct. The repartition operator is the correct one for increasing the number of partitions. Calling getNumPartitions() on DataFrame.rdd returns the current number of partitions.
transactionsDf.coalesce(10)
No, after this command transactionsDf will continue to have only 8 partitions. This is because coalesce() can only decrease the number of partitions, not increase it.
transactionsDf.repartition(transactionsDf.getNumPartitions()+2)
Incorrect, there is no getNumPartitions() method for the DataFrame class.
transactionsDf.coalesce(transactionsDf.getNumPartitions()+2)
Wrong, coalesce() can only be used for reducing the number of partitions and there is no getNumPartitions() method for the DataFrame class.
transactionsDf.repartition(transactionsDf._partitions+2)
No, DataFrame has no _partitions attribute. You can find out the current number of partitions of a DataFrame with the DataFrame.rdd.getNumPartitions() method.
More info: pyspark.sql.DataFrame.repartition - PySpark 3.1.2 documentation, pyspark.RDD.getNumPartitions - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
NEW QUESTION # 27
Which of the following describes tasks?
- A. A task is a command sent from the driver to the executors in response to a transformation.
- B. Tasks get assigned to the executors by the driver.
- C. Tasks transform jobs into DAGs.
- D. A task is a collection of rows.
- E. A task is a collection of slots.
Answer: B
Explanation:
Tasks get assigned to the executors by the driver.
Correct! Or, in other words: Executors take the tasks that they were assigned by the driver, run them over partitions, and report their outcomes back to the driver.
Tasks transform jobs into DAGs.
No, this statement disrespects the order of elements in the Spark hierarchy. The Spark driver transforms jobs into DAGs. Each job consists of one or more stages. Each stage contains one or more tasks.
A task is a collection of rows.
Wrong. A partition is a collection of rows. Tasks have little to do with a collection of rows. If anything, a task processes a specific partition.
A task is a command sent from the driver to the executors in response to a transformation.
Incorrect. The Spark driver does not send anything to the executors in response to a transformation, since transformations are evaluated lazily. So, the Spark driver would send tasks to executors only in response to actions.
A task is a collection of slots.
No. Executors have one or more slots to process tasks and each slot can be assigned a task.
NEW QUESTION # 28
The code block displayed below contains an error. The code block is intended to write DataFrame transactionsDf to disk as a parquet file in location /FileStore/transactions_split, using column storeId as key for partitioning. Find the error.
Code block:
transactionsDf.write.format("parquet").partitionOn("storeId").save("/FileStore/transactions_split")
- A. Partitioning data by storeId is possible with the partitionBy expression, so partitionOn should be replaced by partitionBy.
- B. The format("parquet") expression should be removed and instead, the information should be added to the write expression like so: write("parquet").
- C. Partitioning data by storeId is possible with the bucketBy expression, so partitionOn should be replaced by bucketBy.
- D. partitionOn("storeId") should be called before the write operation.
- E. The format("parquet") expression is inappropriate to use here, "parquet" should be passed as first argument to the save() operator and "/FileStore/transactions_split" as the second argument.
Answer: A
Explanation:
Correct code block:
transactionsDf.write.format("parquet").partitionBy("storeId").save("/FileStore/transactions_split")
More info: partition by - Reading files which are written using PartitionBy or BucketBy in Spark - Stack Overflow
Static notebook | Dynamic notebook: See test 1
NEW QUESTION # 29
Which of the following code blocks immediately removes the previously cached DataFrame transactionsDf from memory and disk?
- A. transactionsDf.unpersist()
- B. transactionsDf.clearCache()
- C. array_remove(transactionsDf, "*")
- D. del transactionsDf
- E. transactionsDf.persist()
Answer: A
Explanation:
transactionsDf.unpersist()
Correct. The DataFrame.unpersist() command does exactly what the question asks for - it removes all cached parts of the DataFrame from memory and disk.
del transactionsDf
False. While this option can help remove the DataFrame from memory and disk, it does not do so immediately. The reason is that this command just notifies the Python garbage collector that the transactionsDf now may be deleted from memory. However, the garbage collector does not do so immediately and, if you wanted it to run immediately, would need to be specifically triggered to do so. Find more information linked below.
array_remove(transactionsDf, "*")
Incorrect. The array_remove method from pyspark.sql.functions is used for removing elements from arrays in columns that match a specific condition. Also, the first argument would be a column, and not a DataFrame as shown in the code block.
transactionsDf.persist()
No. This code block does exactly the opposite of what is asked for: It caches (writes) DataFrame transactionsDf to memory and disk. Note that even though you do not pass in a specific storage level here, Spark will use the default storage level (MEMORY_AND_DISK).
transactionsDf.clearCache()
Wrong. Spark's DataFrame does not have a clearCache() method.
More info: pyspark.sql.DataFrame.unpersist - PySpark 3.1.2 documentation, python - How to delete an RDD in PySpark for the purpose of releasing resources? - Stack Overflow
Static notebook | Dynamic notebook: See test 3
NEW QUESTION # 30
Which of the following code blocks returns a copy of DataFrame transactionsDf that only includes columns transactionId, storeId, productId and f?
Sample of DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
+-------------+---------+-----+-------+---------+----+
- A. transactionsDf.drop(["predError", "value"])
- B. transactionsDf.drop([col("predError"), col("value")])
- C. transactionsDf.drop(col("value"), col("predError"))
- D. transactionsDf.drop("predError", "value")
- E. transactionsDf.drop(value, predError)
Answer: D
Explanation:
Output of correct code block:
+-------------+-------+---------+----+
|transactionId|storeId|productId|   f|
+-------------+-------+---------+----+
|            1|     25|        1|null|
|            2|      2|        2|null|
|            3|     25|        3|null|
+-------------+-------+---------+----+
To solve this question, you should be familiar with the drop() API. The order of column names does not matter - in this question the order differs in some answers just to confuse you. Also, drop() does not take a list. The *cols operator in the documentation means that all arguments passed to drop() are interpreted as column names.
More info: pyspark.sql.DataFrame.drop - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
NEW QUESTION # 31
Which of the following code blocks displays the 10 rows with the smallest values of column value in DataFrame transactionsDf in a nicely formatted way?
- A. transactionsDf.sort(asc(value)).show(10)
- B. transactionsDf.sort(col("value")).show(10)
- C. transactionsDf.sort(col("value").desc()).head()
- D. transactionsDf.orderBy("value").asc().show(10)
- E. transactionsDf.sort(col("value").asc()).print(10)
Answer: B
Explanation:
show() is the correct method to look for here, since the question specifically asks for displaying the rows in a nicely formatted way. Here is the output of show (only a few rows shown):
+-------------+---------+-----+-------+---------+----+---------------+
|transactionId|predError|value|storeId|productId|   f|transactionDate|
+-------------+---------+-----+-------+---------+----+---------------+
|            3|        3|    1|     25|        3|null|     1585824821|
|            5|     null|    2|   null|        2|null|     1575285427|
|            4|     null|    3|      3|        2|null|     1583244275|
+-------------+---------+-----+-------+---------+----+---------------+
With regard to the sorting, specifically in ascending order since the smallest values should be shown first, the following expressions are valid:
- transactionsDf.sort(col("value")) ("ascending" is the default sort direction in the sort method)
- transactionsDf.sort(asc(col("value")))
- transactionsDf.sort(asc("value"))
- transactionsDf.sort(transactionsDf.value.asc())
- transactionsDf.sort(transactionsDf.value)
Also, orderBy is just an alias of sort, so all of these expressions work equally well using orderBy.
Static notebook | Dynamic notebook: See test 1
NEW QUESTION # 32
Which of the following statements about RDDs is incorrect?
- A. The high-level DataFrame API is built on top of the low-level RDD API.
- B. An RDD consists of a single partition.
- C. RDDs are great for precisely instructing Spark on how to do a query.
- D. RDDs are immutable.
- E. RDD stands for Resilient Distributed Dataset.
Answer: B
Explanation:
An RDD consists of a single partition.
Quite the opposite: Spark partitions RDDs and distributes the partitions across multiple nodes.
NEW QUESTION # 33
The code block shown below should convert up to 5 rows in DataFrame transactionsDf that have the value 25 in column storeId into a Python list. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Code block:
transactionsDf.__1__(__2__).__3__(__4__)
- A. 1. filter
2. col("storeId")==25
3. collect
4. 5
- B. 1. filter
2. "storeId"==25
3. collect
4. 5
- C. 1. filter
2. col("storeId")==25
3. toLocalIterator
4. 5
- D. 1. select
2. storeId==25
3. head
4. 5
- E. 1. filter
2. col("storeId")==25
3. take
4. 5
Answer: E
Explanation:
The correct code block is:
transactionsDf.filter(col("storeId")==25).take(5)
Any of the options with collect will not work because collect does not take any arguments, and in both cases the argument 5 is given.
The option with toLocalIterator will not work because the only argument to toLocalIterator is prefetchPartitions which is a boolean, so passing 5 here does not make sense.
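The option using head will not work because the expression passed to select is not proper syntax: storeId is not a defined name. The code would run if the expression were col("storeId")==25, but select would then return a column of boolean values instead of filtering the rows.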
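The "up to 5 rows" behaviour of filter followed by take(5) can be illustrated with a plain-Python analogue. This is a sketch rather than Spark code: the list rows is made-up data standing in for transactionsDf, a generator stands in for filter, and itertools.islice stands in for take.

```python
from itertools import islice

# Hypothetical rows standing in for transactionsDf; the values are made up.
rows = [{"transactionId": i, "storeId": 25 if i % 2 else 3} for i in range(1, 21)]

# The generator plays the role of filter(col("storeId") == 25);
# islice(..., 5) plays the role of take(5): it stops after at most 5 matches.
matching = (row for row in rows if row["storeId"] == 25)
first_five = list(islice(matching, 5))

print([row["transactionId"] for row in first_five])  # [1, 3, 5, 7, 9]
```

If fewer than five rows match, the result is simply shorter, which is what "up to 5 rows" in the question refers to.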
Static notebook | Dynamic notebook: See test 1
(https://flrs.github.io/spark_practice_tests_code/#1/24.html ,
https://bit.ly/sparkpracticeexams_import_instructions)
NEW QUESTION # 34
Which of the following code blocks returns a DataFrame with a single column in which all items in column attributes of DataFrame itemsDf are listed that contain the letter i?
Sample of DataFrame itemsDf:
+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+
- A. itemsDf.explode(attributes).alias("attributes_exploded").filter(col("attributes_exploded").contains("i"))
- B. itemsDf.select(explode("attributes").alias("attributes_exploded")).filter(col("attributes_exploded").contain
- C. itemsDf.select(col("attributes").explode().alias("attributes_exploded")).filter(col("attributes_exploded").co
- D. itemsDf.select(explode("attributes").alias("attributes_exploded")).filter(attributes_exploded.contains("i"))
- E. itemsDf.select(explode("attributes")).filter("attributes_exploded".contains("i"))
Answer: B
Explanation:
Result of correct code block:
+-------------------+
|attributes_exploded|
+-------------------+
|             winter|
|            cooling|
+-------------------+
To solve this question, you need to know about explode(). This operation helps you to split up arrays into single rows. If you did not have a chance to familiarize yourself with this method yet, find more examples in the documentation (link below).
Note that explode() is a method made available through pyspark.sql.functions - it is not available as a method of a DataFrame or a Column, as written in some of the answer options.
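What explode() does to the arrays can be pictured with a plain-Python analogue. It is only a sketch, not Spark code: nested lists stand in for column attributes (values copied from the sample), and list comprehensions stand in for explode and filter.

```python
# Array values taken from column attributes in the sample DataFrame.
attributes = [
    ["blue", "winter", "cozy"],
    ["red", "summer", "fresh", "cooling"],
    ["green", "summer", "travel"],
]

# Flattening the nested lists plays the role of explode("attributes"):
# each array element becomes its own "row".
attributes_exploded = [item for row in attributes for item in row]

# The membership test plays the role of col("attributes_exploded").contains("i").
with_i = [item for item in attributes_exploded if "i" in item]

print(with_i)  # ['winter', 'cooling']
```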
More info: pyspark.sql.functions.explode - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
NEW QUESTION # 35
The code block displayed below contains one or more errors. The code block should load parquet files at location filePath into a DataFrame, only loading those files that have been modified before 2029-03-20 05:44:46. Spark should enforce a schema according to the schema shown below. Find the error.
Schema:
root
 |-- itemId: integer (nullable = true)
 |-- attributes: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- supplier: string (nullable = true)
Code block:
schema = StructType([
    StructType("itemId", IntegerType(), True),
    StructType("attributes", ArrayType(StringType(), True), True),
    StructType("supplier", StringType(), True)
])

spark.read.options("modifiedBefore", "2029-03-20T05:44:46").schema(schema).load(filePath)
- A. Columns in the schema definition use the wrong object type, the modification date threshold is specified incorrectly, and Spark cannot identify the file format.
- B. The data type of the schema is incompatible with the schema() operator and the modification date threshold is specified incorrectly.
- C. Columns in the schema are unable to handle empty values and the modification date threshold is specified incorrectly.
- D. The attributes array is specified incorrectly, Spark cannot identify the file format, and the syntax of the call to Spark's DataFrameReader is incorrect.
- E. Columns in the schema definition use the wrong object type and the syntax of the call to Spark's DataFrameReader is incorrect.
Answer: A
Explanation:
Correct code block:
schema = StructType([
    StructField("itemId", IntegerType(), True),
    StructField("attributes", ArrayType(StringType(), True), True),
    StructField("supplier", StringType(), True)
])
spark.read.options(modifiedBefore="2029-03-20T05:44:46").schema(schema).parquet(filePath)
This question is more difficult than what you would encounter in the exam. In the exam, for this question type, only one error needs to be identified and not "one or multiple" as in the question.
Columns in the schema definition use the wrong object type, the modification date threshold is specified incorrectly, and Spark cannot identify the file format.
Correct! Columns in the schema definition should use the StructField type. Building a schema from pyspark.sql.types, as here using classes like StructType and StructField, is one of multiple ways of expressing a schema in Spark. A StructType always contains a list of StructFields (see documentation linked below). So, nesting StructType and StructType as shown in the question is wrong.
The modification date threshold should be specified by a keyword argument like options(modifiedBefore="2029-03-20T05:44:46") and not two consecutive non-keyword arguments as in the original code block (see documentation linked below).
Spark cannot identify the file format correctly, because either it has to be specified by using the DataFrameReader.format(), as an argument to DataFrameReader.load(), or directly by calling, for example, DataFrameReader.parquet().
Columns in the schema are unable to handle empty values and the modification date threshold is specified incorrectly.
No. If StructField were used for the columns instead of StructType (see above), the third argument would specify whether the column is nullable. The original schema shows that columns should be nullable, and this is specified correctly by the third argument being True in the schema in the code block.
It is correct, however, that the modification date threshold is specified incorrectly (see above).
The attributes array is specified incorrectly, Spark cannot identify the file format, and the syntax of the call to Spark's DataFrameReader is incorrect.
Wrong. The attributes array is specified correctly, following the syntax for ArrayType (see linked documentation below). That Spark cannot identify the file format is correct, see correct answer above. In addition, the DataFrameReader is called correctly through the SparkSession spark.
Columns in the schema definition use the wrong object type and the syntax of the call to Spark's DataFrameReader is incorrect.
Incorrect. The columns in the schema definition do use the wrong object type (see above), but the syntax of the call to Spark's DataFrameReader is correct.
The data type of the schema is incompatible with the schema() operator and the modification date threshold is specified incorrectly.
False. The data type of the schema is StructType and an accepted data type for the DataFrameReader.schema() method. It is correct however that the modification date threshold is specified incorrectly (see correct answer above).
NEW QUESTION # 36
Which of the following describes a difference between Spark's cluster and client execution modes?
- A. In cluster mode, a gateway machine hosts the driver, while it is co-located with the executor in client mode.
- B. In cluster mode, the cluster manager resides on a worker node, while it resides on an edge node in client mode.
- C. In cluster mode, the Spark driver is not co-located with the cluster manager, while it is co-located in client mode.
- D. In cluster mode, the driver resides on a worker node, while it resides on an edge node in client mode.
- E. In cluster mode, executor processes run on worker nodes, while they run on gateway nodes in client mode.
Answer: D
Explanation:
In cluster mode, the driver resides on a worker node, while it resides on an edge node in client mode.
Correct. The idea of Spark's client mode is that workloads can be executed from an edge node, also known as gateway machine, from outside the cluster. The most common way to execute Spark however is in cluster mode, where the driver resides on a worker node.
In practice, in client mode, data transfer between the edge node and the cluster is tightly constrained compared with the data transfer speed between worker nodes in the cluster. Also, any job that is executed in client mode will fail if the edge node fails. For these reasons, client mode is usually not used in a production environment.
In cluster mode, the cluster manager resides on a worker node, while it resides on an edge node in client execution mode.
No. In both execution modes, the cluster manager may reside on a worker node, but it does not reside on an edge node in client mode.
In cluster mode, executor processes run on worker nodes, while they run on gateway nodes in client mode.
This is incorrect. Only the driver runs on gateway nodes (also known as "edge nodes") in client mode, but not the executor processes.
In cluster mode, the Spark driver is not co-located with the cluster manager, while it is co-located in client mode.
No, in client mode, the Spark driver is not co-located with the cluster manager. The whole point of client mode is that the driver is outside the cluster and not associated with the resource that manages the cluster (the machine that runs the cluster manager).
In cluster mode, a gateway machine hosts the driver, while it is co-located with the executor in client mode.
No, it is exactly the opposite: There are no gateway machines in cluster mode, but in client mode, they host the driver.
NEW QUESTION # 37
Which of the following code blocks applies the Python function to_limit on column predError in table transactionsDf, returning a DataFrame with columns transactionId and result?
- A. spark.udf.register("LIMIT_FCN", to_limit)
spark.sql("SELECT transactionId, to_limit(predError) AS result FROM transactionsDf")
spark.sql("SELECT transactionId, udf(to_limit(predError)) AS result FROM transactionsDf")
- B. spark.udf.register("LIMIT_FCN", to_limit)
spark.sql("SELECT transactionId, LIMIT_FCN(predError) FROM transactionsDf AS result")
- C. spark.udf.register(to_limit, "LIMIT_FCN")
spark.sql("SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf")
- D. spark.udf.register("LIMIT_FCN", to_limit)
spark.sql("SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf")
Answer: D
Explanation:
spark.udf.register("LIMIT_FCN", to_limit)
spark.sql("SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf")
Correct! First, you have to register to_limit as UDF to use it in a sql statement. Then, you can use it under the LIMIT_FCN name, correctly naming the resulting column result.
spark.udf.register(to_limit, "LIMIT_FCN")
spark.sql("SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf")
No. In this answer, the arguments to spark.udf.register are flipped.
spark.udf.register("LIMIT_FCN", to_limit)
spark.sql("SELECT transactionId, to_limit(predError) AS result FROM transactionsDf")
Wrong, this answer does not use the registered LIMIT_FCN in the sql statement, but tries to access the to_limit method directly. This will fail, since Spark cannot access it.
spark.sql("SELECT transactionId, udf(to_limit(predError)) AS result FROM transactionsDf")
Incorrect, there is no udf method in Spark's SQL.
spark.udf.register("LIMIT_FCN", to_limit)
spark.sql("SELECT transactionId, LIMIT_FCN(predError) FROM transactionsDf AS result")
False. In this answer, the column that results from applying the UDF is not correctly renamed to result.
Static notebook | Dynamic notebook: See test 3
NEW QUESTION # 38
Which of the following code blocks reads in the JSON file stored at filePath, enforcing the schema expressed in JSON format in variable json_schema, shown in the code block below?
Code block:
json_schema = """
{"type": "struct",
 "fields": [
    {
      "name": "itemId",
      "type": "integer",
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "supplier",
      "type": "string",
      "nullable": true,
      "metadata": {}
    }
  ]
}
"""
- A. spark.read.json(filePath, schema=json_schema)
- B. spark.read.json(filePath, schema=spark.read.json(json_schema))
- C. spark.read.schema(json_schema).json(filePath)
schema = StructType.fromJson(json.loads(json_schema))
spark.read.json(filePath, schema=schema)
- D. spark.read.json(filePath, schema=schema_of_json(json_schema))
Answer: C
Explanation:
Spark provides a way to digest JSON-formatted strings as schema. However, it is not trivial to use. Although slightly above exam difficulty, this question is beneficial to your exam preparation, since it helps you to familiarize yourself with the concept of enforcing schemas on data you are reading in - a topic within the scope of the exam.
The first answer that jumps out here is the one that uses spark.read.schema instead of spark.read.json. Looking at the documentation of spark.read.schema (linked below), we notice that the operator expects types pyspark.sql.types.StructType or str as its first argument. While variable json_schema is a string, the documentation states that the str should be "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". Variable json_schema does not contain a string in this type of format, so this answer option must be wrong.
With four potentially correct answers to go, we now look at the schema parameter of spark.read.json() (documentation linked below). Here, too, the schema parameter expects an input of type pyspark.sql.types.StructType or "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". We already know that json_schema does not follow this format, so we should focus on how we can transform json_schema into pyspark.sql.types.StructType. Hereby, we also eliminate the option where schema=json_schema.
The option that includes schema=spark.read.json(json_schema) is also a wrong pick, since spark.read.json returns a DataFrame, and not a pyspark.sql.types.StructType type.
Ruling out the option which includes schema_of_json(json_schema) is rather difficult. The operator's documentation (linked below) states that it "[p]arses a JSON string and infers its schema in DDL format". This use case is slightly different from the case at hand: json_schema already is a schema definition, it does not make sense to "infer" a schema from it. In the documentation you can see an example use case which helps you understand the difference better. Here, you pass string '{a: 1}' to schema_of_json() and the method infers a DDL-format schema STRUCT<a: BIGINT> from it.
In our case, we may end up with the output schema of schema_of_json() describing the schema of the JSON schema, instead of using the schema itself. This is not the right answer option.
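Now you may consider looking at the StructType.fromJson() method. It returns a variable of type StructType - exactly the type which the schema parameter of spark.read.json expects. The correct answer is therefore the option that first builds schema = StructType.fromJson(json.loads(json_schema)) and then calls spark.read.json(filePath, schema=schema).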
Although we could have looked at the correct answer option earlier, this explanation is kept as exhaustive as necessary to teach you how to systematically eliminate wrong answer options.
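The first half of that conversion can be tried without a Spark installation. The sketch below uses only Python's json module and stops short of Spark itself: it shows that json.loads turns the schema string into a plain dict, which is the kind of input StructType.fromJson() takes. The variable names schema_dict and field_summary are made up for illustration.

```python
import json

# The same schema as in the question, condensed onto fewer lines.
json_schema = """
{"type": "struct",
 "fields": [
   {"name": "itemId", "type": "integer", "nullable": true, "metadata": {}},
   {"name": "supplier", "type": "string", "nullable": true, "metadata": {}}
 ]
}
"""

# json.loads turns the JSON text into a plain dict. In PySpark, this dict is what
# StructType.fromJson() takes as input; the string itself is not a DDL schema.
schema_dict = json.loads(json_schema)

field_summary = [(f["name"], f["type"], f["nullable"]) for f in schema_dict["fields"]]
print(schema_dict["type"])
print(field_summary)
```

Passing json_schema directly as the schema argument skips this parsing step, which is why the options that do so are ruled out above.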
More info:
- pyspark.sql.DataFrameReader.schema - PySpark 3.1.2 documentation
- pyspark.sql.DataFrameReader.json - PySpark 3.1.2 documentation
- pyspark.sql.functions.schema_of_json - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
NEW QUESTION # 39
Which of the following code blocks stores DataFrame itemsDf in executor memory and, if insufficient memory is available, serializes it and saves it to disk?
- A. itemsDf.persist(StorageLevel.MEMORY_ONLY)
- B. itemsDf.cache(StorageLevel.MEMORY_AND_DISK)
- C. itemsDf.write.option('destination', 'memory').save()
- D. itemsDf.store()
- E. itemsDf.cache()
Answer: E
Explanation:
The key to solving this question is knowing (or reading in the documentation) that, by default, cache() stores values to memory and writes any partitions for which there is insufficient memory to disk. persist() can achieve the exact same behavior, however not with the StorageLevel.MEMORY_ONLY option listed here. It is also worth noting that cache() does not have any arguments.
If you have trouble finding the storage level information in the documentation, please also see this student Q&A thread that sheds some light here.
Static notebook | Dynamic notebook: See test 2
NEW QUESTION # 40
......
The Databricks Associate-Developer-Apache-Spark certification exam is a valuable credential for individuals who want to demonstrate their expertise in developing and deploying Spark applications using Databricks. The certification can help individuals stand out in the job market and increase their earning potential. Additionally, the certification can help organizations identify individuals who have the skills and knowledge to develop and deploy Spark applications using Databricks.
Prepare Top Databricks Associate-Developer-Apache-Spark Exam Audio Study Guide Practice Questions Edition: https://www.fast2test.com/Associate-Developer-Apache-Spark-premium-file.html
Free Databricks Associate-Developer-Apache-Spark Test Practice Test Questions Exam Dumps: https://drive.google.com/open?id=1IkdrFIH8F59ghrygR08yki58D3DycDv_