
Ultimate Guide to Prepare Associate-Developer-Apache-Spark with Accurate PDF Questions [Mar 03, 2022]
NEW QUESTION 89
The code block shown below should return a column that indicates through boolean values whether rows in DataFrame transactionsDf have values greater than or equal to 20 and less than or equal to 30 in column storeId and have the value 2 in column productId. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__((__2__.__3__) __4__ (__5__))
- A. 1. select
2. col("storeId")
3. between(20, 30)
4. &&
5. col("productId")=2
- B. 1. select
2. col("storeId")
3. between(20, 30)
4. &
5. col("productId")==2
- C. 1. select
2. col("storeId")
3. between(20, 30)
4. and
5. col("productId")==2
- D. 1. where
2. col("storeId")
3. geq(20).leq(30)
4. &
5. col("productId")==2
- E. 1. select
2. "storeId"
3. between(20, 30)
4. &&
5. col("productId")==2
Answer: B
Explanation:
Correct code block:
transactionsDf.select((col("storeId").between(20, 30)) & (col("productId")==2))
Although this question may make you think that it asks for a filter or where statement, it does not. It asks explicitly to return a column with booleans - this should point you to the select statement.
Another trick here is the rarely used between() method. It exists and resolves to ((storeId >= 20) AND (storeId <= 30)) in SQL. geq() and leq() do not exist.
Another riddle here is how to chain the two conditions. The only valid answer here is &. Operators like && or and are not valid. Other boolean operators that would be valid in Spark are | and ~.
Static notebook | Dynamic notebook: See test 1
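The point about & versus and follows from plain Python rules: a class can overload & through __and__, but and always asks its left operand for a truth value and cannot be overloaded. The sketch below illustrates this with a hypothetical Expr class. It is a toy stand-in for a column expression, not PySpark's actual Column implementation, and the SQL-like strings it builds are an assumption made for illustration only.

```python
class Expr:
    """Toy stand-in for a column expression (hypothetical, not PySpark's Column)."""

    def __init__(self, text):
        self.text = text

    def between(self, low, high):
        # Build the text of an inclusive range check.
        return Expr(f"(({self.text} >= {low}) AND ({self.text} <= {high}))")

    def __eq__(self, other):
        # "==" returns a new expression instead of True/False.
        return Expr(f"({self.text} = {other})")

    def __and__(self, other):
        # "&" is overloadable, so it can combine two expressions.
        return Expr(f"({self.text} AND {other.text})")

    def __bool__(self):
        # "and" needs a plain truth value, which an expression cannot give.
        raise ValueError("cannot convert an expression to bool; use '&', '|' or '~'")


store_id = Expr("storeId")
product_id = Expr("productId")

combined = store_id.between(20, 30) & (product_id == 2)
print(combined.text)

try:
    store_id.between(20, 30) and (product_id == 2)
    and_error = None
except ValueError as exc:
    and_error = str(exc)
print(and_error)
```

The parentheses around each condition matter as well: in Python, & binds more tightly than ==, so leaving them out changes how the expression is grouped.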
NEW QUESTION 90
The code block displayed below contains an error. The code block should configure Spark to split data into 20 parts when exchanging data between executors for joins or aggregations. Find the error.
Code block:
spark.conf.set(spark.sql.shuffle.partitions, 20)
- A. The code block expresses the option incorrectly.
- B. The code block is missing a parameter.
- C. The code block sets the wrong option.
- D. The code block uses the wrong command for setting an option.
- E. The code block sets the incorrect number of parts.
Answer: A
Explanation:
Correct code block:
spark.conf.set("spark.sql.shuffle.partitions", 20)
The code block expresses the option incorrectly.
Correct! The option should be expressed as a string.
The code block sets the wrong option.
No, spark.sql.shuffle.partitions is the correct option for the use case in the question.
The code block sets the incorrect number of parts.
Wrong, the code block correctly states 20 parts.
The code block uses the wrong command for setting an option.
No, in PySpark spark.conf.set() is the correct command for setting an option.
The code block is missing a parameter.
Incorrect, spark.conf.set() takes two parameters.
More info: Configuration - Spark 3.1.2 Documentation
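Why the unquoted option name fails can be seen without Spark at all: Python evaluates spark.sql.shuffle.partitions as a chain of attribute lookups before set() is ever called. The sketch below uses two hypothetical classes, ToySession and ToyConf, as stand-ins for a session object and its config store. They are assumptions for illustration and do not reproduce PySpark's classes.

```python
class ToyConf:
    """Hypothetical stand-in for a runtime config store keyed by strings."""

    def __init__(self):
        self._settings = {}

    def set(self, key, value):
        self._settings[key] = value

    def get(self, key):
        return self._settings[key]


class ToySession:
    """Hypothetical stand-in for a session object with `sql` and `conf` members."""

    def __init__(self):
        self.conf = ToyConf()

    def sql(self, query):
        return query


spark = ToySession()

# Quoted: the option name is an ordinary string key.
spark.conf.set("spark.sql.shuffle.partitions", 20)

# Unquoted: Python tries to look up .shuffle on the sql method first,
# so the call fails before set() runs.
try:
    spark.conf.set(spark.sql.shuffle.partitions, 20)
    unquoted_error = None
except AttributeError as exc:
    unquoted_error = exc

print(spark.conf.get("spark.sql.shuffle.partitions"))
print(type(unquoted_error).__name__)
```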
NEW QUESTION 91
The code block shown below should add column transactionDateForm to DataFrame transactionsDf. The column should express the unix-format timestamps in column transactionDate as string type like Apr 26 (Sunday). Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__, from_unixtime(__3__, __4__))
- A. 1. withColumn
2. "transactionDateForm"
3. "MMM d (EEEE)"
4. "transactionDate"
- B. 1. select
2. "transactionDate"
3. "transactionDateForm"
4. "MMM d (EEEE)"
- C. 1. withColumn
2. "transactionDateForm"
3. "transactionDate"
4. "MM d (EEE)"
- D. 1. withColumn
2. "transactionDateForm"
3. "transactionDate"
4. "MMM d (EEEE)"
- E. 1. withColumnRenamed
2. "transactionDate"
3. "transactionDateForm"
4. "MM d (EEE)"
Answer: D
Explanation:
Correct code block:
transactionsDf.withColumn("transactionDateForm", from_unixtime("transactionDate", "MMM d (EEEE)"))
The question specifically asks about "adding" a column. In the context of all presented answers, DataFrame.withColumn() is the correct command for this. In theory, DataFrame.select() could also be used for this purpose, if all existing columns are selected and a new one is added.
DataFrame.withColumnRenamed() is not the appropriate command, since it can only rename existing columns, but cannot add a new column or change the value of a column.
Once DataFrame.withColumn() is chosen, you can read in the documentation (see below) that the first input argument to the method should be the column name of the new column.
The final difficulty is the date format. The question indicates that the date format Apr 26 (Sunday) is desired. The answers give "MMM d (EEEE)" and "MM d (EEE)" as options. It can be hard to know the details of the date format that is used in Spark. Specifically, knowing the differences between MMM and MM is probably not something you deal with every day. But, there is an easy way to remember the difference: M (one letter) is usually the shortest form: 4 for April. MM includes padding: 04 for April. MMM (three letters) is the three-letter month abbreviation: Apr for April. And MMMM is the longest possible form: April. Knowing this four-letter sequence helps you select the correct option here.
More info: pyspark.sql.DataFrame.withColumn - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
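The M / MM / MMM / MMMM progression described above can be tried out with Python's own datetime module. The sketch below is only an analogy: strftime codes are not Spark pattern letters, and the pairing between the two in the dictionary is an assumption made for illustration. The sample timestamp (26 April 2020, a Sunday, in UTC) is likewise made up.

```python
from datetime import datetime, timezone

# 2020-04-26 00:00:00 UTC, which fell on a Sunday.
moment = datetime(2020, 4, 26, tzinfo=timezone.utc)
unix_seconds = int(moment.timestamp())

# Convert the unix timestamp back into a date, as from_unixtime would need to.
parsed = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)

# Rough strftime counterparts of the Spark month patterns (an analogy only).
month_forms = {
    "M": str(parsed.month),         # shortest form, no padding
    "MM": parsed.strftime("%m"),    # zero-padded number
    "MMM": parsed.strftime("%b"),   # three-letter abbreviation
    "MMMM": parsed.strftime("%B"),  # full month name
}
print(month_forms)

# Counterpart of "MMM d (EEEE)": abbreviated month, day, full weekday name.
formatted = f"{parsed.strftime('%b')} {parsed.day} ({parsed.strftime('%A')})"
print(formatted)  # Apr 26 (Sunday)
```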
NEW QUESTION 92
Which of the following statements about Spark's DataFrames is incorrect?
- A. Spark's DataFrames are equal to Python's DataFrames.
- B. Spark's DataFrames are immutable.
- C. The data in DataFrames may be split into multiple chunks.
- D. Data in DataFrames is organized into named columns.
- E. RDDs are at the core of DataFrames.
Answer: A
Explanation:
Spark's DataFrames are equal to Python's or R's DataFrames.
No, they are not equal. They are only similar. A major difference between Spark and Python is that Spark's DataFrames are distributed, whereas Python's are not.
NEW QUESTION 93
Which of the following code blocks returns a one-column DataFrame for which every row contains an array of all integer numbers from 0 up to and including the number given in column predError of DataFrame transactionsDf, and null if predError is null?
Sample of DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+
- A.
def count_to_target(target):
    if target is None:
        return

    result = list(range(target))
    return result

count_to_target_udf = udf(count_to_target)

transactionsDf.select(count_to_target_udf('predError'))
- B.
def count_to_target(target):
    if target is None:
        return

    result = list(range(target))
    return result

transactionsDf.select(count_to_target(col('predError')))
- C.
def count_to_target(target):
    if target is None:
        return

    result = [range(target)]
    return result

count_to_target_udf = udf(count_to_target, ArrayType[IntegerType])

transactionsDf.select(count_to_target_udf(col('predError')))
- D.
def count_to_target(target):
    if target is None:
        return

    result = list(range(target))
    return result

count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))

transactionsDf.select(count_to_target_udf('predError'))
- E.
def count_to_target(target):
    result = list(range(target))
    return result

count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))

df = transactionsDf.select(count_to_target_udf('predError'))
Answer: D
Explanation:
Correct code block:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

def count_to_target(target):
    if target is None:
        return
    result = list(range(target))
    return result

count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))
transactionsDf.select(count_to_target_udf('predError'))
Output of correct code block:
+--------------------------+
|count_to_target(predError)|
+--------------------------+
|                 [0, 1, 2]|
|        [0, 1, 2, 3, 4, 5]|
|                 [0, 1, 2]|
|                      null|
|                      null|
|                 [0, 1, 2]|
+--------------------------+
This question is not exactly easy. You need to be familiar with the syntax around UDFs (user-defined functions). Specifically, in this question it is important to pass the correct types to the udf method - returning an array of a specific type rather than just a single type means you need to think harder about type implications than usual.
Remember that in Spark, you always pass types in an instantiated way like ArrayType(IntegerType()), not like ArrayType(IntegerType). The parentheses () are the key here - make sure you do not forget those.
You should also pay attention that you actually pass the UDF count_to_target_udf, and not the Python method count_to_target to the select() operator.
Finally, null values are always a tricky case with UDFs. So, take care that the code can handle them correctly.
More info: How to Turn Python Functions into PySpark Functions (UDF) - Chang Hsin Lee - Committing my thoughts to words.
Static notebook | Dynamic notebook: See test 3
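The null handling mentioned above can be checked in plain Python before the function is ever wrapped in a UDF. The sketch below runs the same function body over the predError values from the sample, with None standing in for null. It assumes nothing about Spark itself and only shows what the Python function returns for each input.

```python
def count_to_target(target):
    # Mirror the null handling: a missing input yields a missing output.
    if target is None:
        return None
    # range(target) stops before target, so 3 gives [0, 1, 2].
    return list(range(target))


# predError values from the sample of transactionsDf (None stands for null).
pred_errors = [3, 6, 3, None, None, 3]
results = [count_to_target(value) for value in pred_errors]
for row in results:
    print(row)

# Without the None check, range(None) raises a TypeError.
try:
    list(range(None))
    unguarded_error = None
except TypeError as exc:
    unguarded_error = exc
```

The results line up with the output table shown for the correct code block, which is a quick way to confirm the function logic independently of the type declaration passed to udf.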
NEW QUESTION 94
In which order should the code blocks shown below be run in order to read a JSON file from location jsonPath into a DataFrame and return only the rows that do not have value 3 in column productId?
1. importedDf.createOrReplaceTempView("importedDf")
2. spark.sql("SELECT * FROM importedDf WHERE productId != 3")
3. spark.sql("FILTER * FROM importedDf WHERE productId != 3")
4. importedDf = spark.read.option("format", "json").path(jsonPath)
5. importedDf = spark.read.json(jsonPath)
- A. 5, 1, 3
- B. 4, 1, 3
- C. 5, 1, 2
- D. 4, 1, 2
- E. 5, 2
Answer: C
Explanation:
Correct code block:
importedDf = spark.read.json(jsonPath)
importedDf.createOrReplaceTempView("importedDf")
spark.sql("SELECT * FROM importedDf WHERE productId != 3")
Option 5 is the only correct way listed of reading in a JSON file in PySpark. The option("format", "json") is not the correct way to tell Spark's DataFrameReader that you want to read a JSON file. You would do this through format("json") instead. Also, you can communicate the specific path of the JSON file to the DataFrameReader using the load() method, not the path() method.
In order to use a SQL command through the SparkSession spark, you first need to create a temporary view through DataFrame.createOrReplaceTempView().
The SQL statement should start with the SELECT operator. The FILTER operator SQL provides is not the correct one to use here.
Static notebook | Dynamic notebook: See test 2
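The difference between the SELECT and FILTER statements can be tried out with any SQL engine. The sketch below uses Python's built-in sqlite3 module as a stand-in for Spark SQL, which is an assumption made purely so the example runs without a cluster. The table contents are made up, and SQLite's error type differs from what Spark would raise.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE importedDf (transactionId INTEGER, productId INTEGER)")
# Made-up rows for illustration.
conn.executemany(
    "INSERT INTO importedDf VALUES (?, ?)",
    [(1, 1), (2, 2), (3, 3), (4, 2), (5, 2), (6, 2)],
)

# A query starts with SELECT; the WHERE clause does the filtering.
kept = conn.execute(
    "SELECT * FROM importedDf WHERE productId != 3 ORDER BY transactionId"
).fetchall()
print(kept)

# A statement starting with FILTER is rejected as a syntax error.
try:
    conn.execute("FILTER * FROM importedDf WHERE productId != 3")
    filter_error = None
except sqlite3.OperationalError as exc:
    filter_error = exc
print(filter_error)
```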
NEW QUESTION 95
Which of the following code blocks creates a new DataFrame with 3 columns, productId, highest, and lowest, that shows the biggest and smallest values of column value per value in column productId from DataFrame transactionsDf?
Sample of DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+
- A. transactionsDf.groupby("productId").agg({"highest": max("value"), "lowest": min("value")})
- B. transactionsDf.agg(max('value').alias('highest'), min('value').alias('lowest'))
- C. transactionsDf.groupby(col(productId)).agg(max(col(value)).alias("highest"), min(col(value)).alias("lowest"))
- D. transactionsDf.groupby('productId').agg(max('value').alias('highest'), min('value').alias('lowest'))
- E. transactionsDf.max('value').min('value')
Answer: D
Explanation:
transactionsDf.groupby('productId').agg(max('value').alias('highest'), min('value').alias('lowest'))
Correct. groupby and aggregate is a common pattern to investigate aggregated values of groups.
transactionsDf.groupby("productId").agg({"highest": max("value"), "lowest": min("value")})
Wrong. While DataFrame.agg() accepts dictionaries, the syntax of the dictionary in this code block is wrong. If you use a dictionary, the syntax should be like {"value": "max"}, so using the column name as the key and the aggregating function as value.
transactionsDf.agg(max('value').alias('highest'), min('value').alias('lowest'))
Incorrect. While this is valid Spark syntax, it does not achieve what the question asks for. The question specifically asks for values to be aggregated per value in column productId - this column is not considered here. Instead, the max() and min() values are calculated as if the entire DataFrame was a group.
transactionsDf.max('value').min('value')
Wrong. There is no DataFrame.max() method in Spark, so this command will fail.
transactionsDf.groupby(col(productId)).agg(max(col(value)).alias("highest"), min(col(value)).alias("lowest"))
No. While this may work if the column names are expressed as strings, this will not work as is. Python will interpret the column names as variables and, as a result, PySpark will not understand which columns you want to aggregate.
More info: pyspark.sql.DataFrame.agg - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
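To see what the group-then-aggregate pattern should produce for the sample, the computation can be worked through by hand in plain Python. The sketch below is an emulation, not Spark code: it groups the (productId, value) pairs from the sample and takes the largest and smallest non-null value per group, on the assumption that nulls are skipped and an all-null group yields null.

```python
from collections import defaultdict

# (productId, value) pairs from the sample of transactionsDf; None stands for null.
rows = [(1, 4), (2, 7), (3, None), (2, None), (2, None), (2, 2)]

# Step 1: group the values by productId.
values_by_product = defaultdict(list)
for product_id, value in rows:
    values_by_product[product_id].append(value)

# Step 2: aggregate each group into (highest, lowest).
summary = {}
for product_id, values in sorted(values_by_product.items()):
    present = [v for v in values if v is not None]  # nulls are skipped
    highest = max(present) if present else None
    lowest = min(present) if present else None
    summary[product_id] = (highest, lowest)

print(summary)
```

Dropping step 1 and aggregating over all rows at once corresponds to the answer option that calls agg() without groupby(): it yields a single highest and lowest value for the whole DataFrame.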
NEW QUESTION 96
Which of the following is one of the big performance advantages that Spark has over Hadoop?
- A. Spark achieves higher resiliency for queries since, different from Hadoop, it can be deployed on Kubernetes.
- B. Spark achieves performance gains for developers by extending Hadoop's DataFrames with a user-friendly API.
- C. Spark achieves great performance by storing data in the HDFS format, whereas Hadoop can only use parquet files.
- D. Spark achieves great performance by storing data in the DAG format, whereas Hadoop can only use parquet files.
- E. Spark achieves great performance by storing data and performing computation in memory, whereas large jobs in Hadoop require a large amount of relatively slow disk I/O operations.
Answer: E
Explanation:
Spark achieves great performance by storing data in the DAG format, whereas Hadoop can only use parquet files.
Wrong, there is no "DAG format". DAG stands for "directed acyclic graph". The DAG is a means of representing computational steps in Spark. However, it is true that Hadoop does not use a DAG.
The introduction of the DAG in Spark was a result of the limitation of Hadoop's MapReduce framework in which data had to be written to and read from disk continuously.
Graph DAG in Apache Spark - DataFlair
Spark achieves great performance by storing data in the HDFS format, whereas Hadoop can only use parquet files.
No. Spark can certainly store data in HDFS (as well as other formats), but this is not a key performance advantage over Hadoop. Hadoop can use multiple file formats, not only parquet.
Spark achieves higher resiliency for queries since, different from Hadoop, it can be deployed on Kubernetes.
No, resiliency is not asked for in the question. The question is about performance improvements.
Both Hadoop and Spark can be deployed on Kubernetes.
Spark achieves performance gains for developers by extending Hadoop's DataFrames with a user-friendly API.
No. DataFrames are a concept in Spark, but not in Hadoop.
NEW QUESTION 97
Which of the following code blocks produces the following output, given DataFrame transactionsDf?
Output:
root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- productId: integer (nullable = true)
 |-- f: integer (nullable = true)
DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
+-------------+---------+-----+-------+---------+----+
- A. transactionsDf.rdd.formatSchema()
- B. print(transactionsDf.schema)
- C. transactionsDf.printSchema()
- D. transactionsDf.rdd.printSchema()
- E. transactionsDf.schema.print()
Answer: C
Explanation:
The output is the typical output of a DataFrame.printSchema() call. The DataFrame's RDD representation does not have a printSchema or formatSchema method (find available methods in the RDD documentation linked below). The output of print(transactionsDf.schema) is this:
StructType(List(StructField(transactionId,IntegerType,true),StructField(predError,IntegerType,true),StructField(value,IntegerType,true),StructField(storeId,IntegerType,true),StructField(productId,IntegerType,true),StructFiel
It includes the same information as the nicely formatted original output, but is not nicely formatted itself. Lastly, the DataFrame's schema attribute does not have a print() method.
More info:
- pyspark.RDD: pyspark.RDD - PySpark 3.1.2 documentation
- DataFrame.printSchema(): pyspark.sql.DataFrame.printSchema - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
NEW QUESTION 98
Which of the following code blocks reads the parquet file stored at filePath into DataFrame itemsDf, using a valid schema for the sample of itemsDf shown below?
Sample of itemsDf:
+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+
- A.
itemsDfSchema = StructType([
    StructField("itemId", IntegerType()),
    StructField("attributes", StringType()),
    StructField("supplier", StringType())])

itemsDf = spark.read.schema(itemsDfSchema).parquet(filePath)
- B.
itemsDfSchema = StructType([
    StructField("itemId", IntegerType()),
    StructField("attributes", ArrayType([StringType()])),
    StructField("supplier", StringType())])

itemsDf = spark.read(schema=itemsDfSchema).parquet(filePath)
- C. itemsDf = spark.read.schema('itemId integer, attributes <string>, supplier string').parquet(filePath)
- D.
itemsDfSchema = StructType([
    StructField("itemId", IntegerType),
    StructField("attributes", ArrayType(StringType)),
    StructField("supplier", StringType)])

itemsDf = spark.read.schema(itemsDfSchema).parquet(filePath)
- E.
itemsDfSchema = StructType([
    StructField("itemId", IntegerType()),
    StructField("attributes", ArrayType(StringType())),
    StructField("supplier", StringType())])

itemsDf = spark.read.schema(itemsDfSchema).parquet(filePath)
Answer: E
Explanation:
The challenge in this question comes from there being an array variable in the schema. In addition, you should know how to pass a schema to the DataFrameReader that is invoked by spark.read.
The correct way to define an array of strings in a schema is through ArrayType(StringType()). A schema can be passed to the DataFrameReader by simply appending schema(structType) to spark.read. Alternatively, you can also define a schema as a string. For example, for the schema of itemsDf, the following string would make sense: itemId integer, attributes array<string>, supplier string.
A thing to keep in mind is that in schema definitions, you always need to instantiate the types, like so: StringType(). Just using StringType does not work in PySpark and will fail.
Another concern with schemas is whether columns should be nullable, so allowed to have null values. In the case at hand, however, this is not a concern, since the question just asks for a "valid" schema. Both non-nullable and nullable column schemas would be valid here, since no null value appears in the DataFrame sample.
More info: Learning Spark, 2nd Edition, Chapter 3
Static notebook | Dynamic notebook: See test 3
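The difference between StringType and StringType() is the ordinary Python difference between a class object and an instance of that class. The sketch below illustrates this with hypothetical toy classes that merely share names with the PySpark types. Their internals are an assumption for illustration and do not reproduce PySpark's implementation or its error messages.

```python
class DataType:
    """Hypothetical base class standing in for a schema data type."""


class StringType(DataType):
    pass


class ArrayType(DataType):
    def __init__(self, element_type):
        # Reject a bare class such as StringType; require StringType().
        if not isinstance(element_type, DataType):
            raise TypeError(f"element type {element_type!r} must be an instance")
        self.element_type = element_type


ok = ArrayType(StringType())  # instance: accepted

try:
    ArrayType(StringType)  # class object: rejected
    class_error = None
except TypeError as exc:
    class_error = exc

print(type(ok.element_type).__name__)
print(class_error)
```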
NEW QUESTION 99
The code block shown below should return a DataFrame with two columns, itemId and col. In this DataFrame, for each element in column attributes of DataFrame itemsDf there should be a separate row in which the column itemId contains the associated itemId from DataFrame itemsDf. The new DataFrame should only contain rows for rows in DataFrame itemsDf in which the column attributes contains the element cozy.
A sample of DataFrame itemsDf is below.
Code block:
itemsDf.__1__(__2__).__3__(__4__, __5__(__6__))
- A. 1. filter
2. "array_contains(attributes, 'cozy')"
3. select
4. "itemId"
5. map
6. "attributes"
- B. 1. filter
2. "array_contains(attributes, cozy)"
3. select
4. "itemId"
5. explode
6. "attributes"
- C. 1. filter
2. "array_contains(attributes, 'cozy')"
3. select
4. "itemId"
5. explode
6. "attributes"
- D. 1. filter
2. array_contains("cozy")
3. select
4. "itemId"
5. explode
6. "attributes"
- E. 1. where
2. "array_contains(attributes, 'cozy')"
3. select
4. itemId
5. explode
6. attributes
Answer: C
Explanation:
The correct code block is:
itemsDf.filter("array_contains(attributes, 'cozy')").select("itemId", explode("attributes"))
The key here is understanding how to use array_contains(). You can either use it as an expression in a string, or you can import it from pyspark.sql.functions. In that case, the following would also work:
itemsDf.filter(array_contains("attributes", "cozy")).select("itemId", explode("attributes"))
Static notebook | Dynamic notebook: See test 1
(https://flrs.github.io/spark_practice_tests_code/#1/29.html ,
https://bit.ly/sparkpracticeexams_import_instructions)
NEW QUESTION 100
The code block shown below should convert up to 5 rows in DataFrame transactionsDf that have the value 25 in column storeId into a Python list. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Code block:
transactionsDf.__1__(__2__).__3__(__4__)
- A. 1. filter
2. col("storeId")==25
3. take
4. 5
- B. 1. filter
2. "storeId"==25
3. collect
4. 5
- C. 1. filter
2. col("storeId")==25
3. toLocalIterator
4. 5
- D. 1. select
2. storeId==25
3. head
4. 5
- E. 1. filter
2. col("storeId")==25
3. collect
4. 5
Answer: A
Explanation:
The correct code block is:
transactionsDf.filter(col("storeId")==25).take(5)
Any of the options with collect will not work because collect does not take any arguments, and in both cases the argument 5 is given.
The option with toLocalIterator will not work because the only argument to toLocalIterator is prefetchPartitions which is a boolean, so passing 5 here does not make sense.
The option using head will not work because the expression passed to select is not proper syntax. It would run if the expression were col("storeId")==25, but select would then return a column of booleans rather than filter the rows.
Static notebook | Dynamic notebook: See test 1
(https://flrs.github.io/spark_practice_tests_code/#1/24.html ,
https://bit.ly/sparkpracticeexams_import_instructions)
NEW QUESTION 101
Which of the following code blocks returns a DataFrame with a single column in which all items in column attributes of DataFrame itemsDf are listed that contain the letter i?
Sample of DataFrame itemsDf:
+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+
- A. itemsDf.select(explode("attributes")).filter("attributes_exploded".contains("i"))
- B. itemsDf.select(explode("attributes").alias("attributes_exploded")).filter(attributes_exploded.contains("i"))
- C. itemsDf.select(col("attributes").explode().alias("attributes_exploded")).filter(col("attributes_exploded").co
- D. itemsDf.select(explode("attributes").alias("attributes_exploded")).filter(col("attributes_exploded").contain
- E. itemsDf.explode(attributes).alias("attributes_exploded").filter(col("attributes_exploded").contains("i"))
Answer: D
Explanation:
Result of correct code block:
+-------------------+
|attributes_exploded|
+-------------------+
| winter|
| cooling|
+-------------------+
To solve this question, you need to know about explode(). This operation helps you to split up arrays into single rows. If you did not have a chance to familiarize yourself with this method yet, find more examples in the documentation (link below).
Note that explode() is a method made available through pyspark.sql.functions - it is not available as a method of a DataFrame or a Column, as written in some of the answer options.
More info: pyspark.sql.functions.explode - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
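What explode() followed by the contains filter does to the sample can be emulated in a few lines of plain Python. The sketch below is an emulation rather than Spark code: it flattens the attributes arrays from the sample into one element per row and then keeps the elements containing the letter i.

```python
# attributes column from the sample of itemsDf.
attributes_column = [
    ["blue", "winter", "cozy"],
    ["red", "summer", "fresh", "cooling"],
    ["green", "summer", "travel"],
]

# "Explode": one output row per array element.
attributes_exploded = [item for row in attributes_column for item in row]

# Keep only the elements containing the letter i.
with_i = [item for item in attributes_exploded if "i" in item]
print(with_i)  # ['winter', 'cooling']
```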
NEW QUESTION 102
Which of the following code blocks returns only rows from DataFrame transactionsDf in which values in column productId are unique?
- A. transactionsDf.unique("productId")
- B. transactionsDf.dropDuplicates(subset=["productId"])
- C. transactionsDf.distinct("productId")
- D. transactionsDf.dropDuplicates(subset="productId")
- E. transactionsDf.drop_duplicates(subset="productId")
Answer: B
Explanation:
Although one of the answer options uses a method called unique(), that method does not actually exist in PySpark. In PySpark, the corresponding method is called distinct(). But this method is not the right one to use here either: it takes no column argument, and combining it with select() would only list the unique values of a specific column.
However, we want to return the entire rows here. So the trick is to use dropDuplicates with the subset keyword parameter. In the documentation for dropDuplicates, the examples show that subset should be used with a list. And this is exactly the key to solving this question: The productId column needs to be fed into the subset argument in a list, even though it is just a single column.
More info: pyspark.sql.DataFrame.dropDuplicates - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1
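The idea of deduplicating on one column while keeping entire rows can be sketched in plain Python. The example below is an emulation with made-up (transactionId, productId) pairs, not Spark code. It keeps the first row it meets for each productId, which is an assumption of this sketch: in Spark, which of the duplicate rows survives should not be relied upon.

```python
# Made-up (transactionId, productId) pairs for illustration.
rows = [(1, 1), (2, 2), (3, 3), (4, 2), (5, 2), (6, 2)]

seen = set()
deduplicated = []
for row in rows:
    product_id = row[1]
    if product_id not in seen:  # first row met for this productId
        seen.add(product_id)
        deduplicated.append(row)  # the entire row is kept, not just the key

print(deduplicated)
```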
NEW QUESTION 103
In which order should the code blocks shown below be run in order to create a table of all values in column attributes next to the respective values in column supplier in DataFrame itemsDf?
1. itemsDf.createOrReplaceView("itemsDf")
2. spark.sql("FROM itemsDf SELECT 'supplier', explode('Attributes')")
3. spark.sql("FROM itemsDf SELECT supplier, explode(attributes)")
4. itemsDf.createOrReplaceTempView("itemsDf")
- A. 0
- B. 1, 3
- C. 1, 2
- D. 4, 2
- E. 4, 3
Answer: E
Explanation:
Static notebook | Dynamic notebook: See test 1
NEW QUESTION 104
Which of the following describes the role of tasks in the Spark execution hierarchy?
- A. Within one task, the slots are the unit of work done for each partition of the data.
- B. Tasks are the smallest element in the execution hierarchy.
- C. Tasks are the second-smallest element in the execution hierarchy.
- D. Tasks with wide dependencies can be grouped into one stage.
- E. Stages with narrow dependencies can be grouped into one task.
Answer: B
Explanation:
Stages with narrow dependencies can be grouped into one task.
Wrong, tasks with narrow dependencies can be grouped into one stage.
Tasks with wide dependencies can be grouped into one stage.
Wrong, since a wide transformation causes a shuffle which always marks the boundary of a stage. So, you cannot bundle multiple tasks that have wide dependencies into a stage.
Tasks are the second-smallest element in the execution hierarchy.
No, they are the smallest element in the execution hierarchy.
Within one task, the slots are the unit of work done for each partition of the data.
No, tasks are the unit of work done per partition. Slots help Spark parallelize work. An executor can have multiple slots which enable it to process multiple tasks in parallel.
NEW QUESTION 105
Which of the following code blocks returns a single-column DataFrame of all entries in Python list throughputRates which contains only float-type values?
- A. spark.createDataFrame(throughputRates, FloatType)
- B. spark.createDataFrame((throughputRates), FloatType)
- C. spark.createDataFrame(throughputRates)
- D. spark.DataFrame(throughputRates, FloatType)
- E. spark.createDataFrame(throughputRates, FloatType())
Answer: E
Explanation:
spark.createDataFrame(throughputRates, FloatType())
Correct! spark.createDataFrame is the correct operator to use here and the type FloatType() which is passed in for the command's schema argument is correctly instantiated using the parentheses.
Remember that it is essential in PySpark to instantiate types when passing them to SparkSession.createDataFrame. And, in Databricks, spark returns a SparkSession object.
spark.createDataFrame((throughputRates), FloatType)
No. While packing throughputRates in parentheses does not do anything to the execution of this command, not instantiating the FloatType with parentheses as in the previous answer will make this command fail.
spark.createDataFrame(throughputRates, FloatType)
Incorrect. Given that it does not matter whether you pass throughputRates in parentheses or not, see the explanation of the previous answer for further insights.
spark.DataFrame(throughputRates, FloatType)
Wrong. There is no SparkSession.DataFrame() method in Spark.
spark.createDataFrame(throughputRates)
False. Omitting the schema argument will have PySpark try to infer the schema. However, as you can see in the documentation (linked below), the inference will only work if you pass in an "RDD of either Row, namedtuple, or dict" for data (the first argument to createDataFrame). But since you are passing a Python list of plain float values, Spark's schema inference will fail.
More info: pyspark.sql.SparkSession.createDataFrame - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
NEW QUESTION 106
Which of the following code blocks returns about 150 randomly selected rows from the 1000-row DataFrame transactionsDf, assuming that any row can appear more than once in the returned DataFrame?
- A. transactionsDf.sample(0.15)
- B. transactionsDf.resample(0.15, False, 3142)
- C. transactionsDf.sample(0.15, False, 3142)
- D. transactionsDf.sample(True, 0.15, 8261)
- E. transactionsDf.sample(0.85, 8429)
Answer: D
Explanation:
Answering this question correctly depends on whether you understand the arguments to the DataFrame.sample() method (link to the documentation below). The arguments are as follows:
DataFrame.sample(withReplacement=None, fraction=None, seed=None).
The first argument withReplacement specifies whether a row can be drawn from the DataFrame multiple times. By default, this option is disabled in Spark. But we have to enable it here, since the question asks for a row being able to appear more than once. So, we need to pass True for this argument.
About replacement: "Replacement" is easiest explained with the example of removing random items from a box. When you remove those "with replacement" it means that after you have taken an item out of the box, you put it back inside. So, essentially, if you would randomly take 10 items out of a box with 100 items, there is a chance you take the same item twice or more times. "Without replacement" means that you would not put the item back into the box after removing it. So, every time you remove an item from the box, there is one less item in the box and you can never take the same item twice.
The second argument to the sample() method is fraction. This refers to the fraction of items that should be returned. In the question we are asked for 150 out of 1000 items - a fraction of 0.15.
The last argument is a random seed. A random seed makes a randomized process repeatable. This means that if you re-ran the same sample() operation with the same random seed, you would get the same rows returned from the sample() command. There is no behavior around the random seed specified in the question. The varying random seeds are only there to confuse you!
More info: pyspark.sql.DataFrame.sample - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1
NEW QUESTION 107
Which of the following describes Spark's way of managing memory?
- A. Storage memory is used for caching partitions derived from DataFrames.
- B. Spark uses a subset of the reserved system memory.
- C. Disabling serialization potentially greatly reduces the memory footprint of a Spark application.
- D. As a general rule for garbage collection, Spark performs better on many small objects than few big objects.
- E. Spark's memory usage can be divided into three categories: Execution, transaction, and storage.
Answer: A
Explanation:
Spark's memory usage can be divided into three categories: Execution, transaction, and storage.
No, it is either execution or storage.
As a general rule for garbage collection, Spark performs better on many small objects than few big objects.
No, Spark's garbage collection runs faster on fewer big objects than many small objects.
Disabling serialization potentially greatly reduces the memory footprint of a Spark application.
The opposite is true - serialization reduces the memory footprint, but may impact performance in a negative way.
Spark uses a subset of the reserved system memory.
No, the reserved system memory is separate from Spark memory. Reserved memory stores Spark's internal objects.
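The relationship between reserved, execution, and storage memory can be made concrete with a little arithmetic. The sketch below assumes the unified memory model with the defaults given in the Spark tuning documentation (spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5, and 300 MB of reserved memory). The 4096 MB executor heap is an arbitrary example value and the function name memory_regions is hypothetical.

```python
RESERVED_MB = 300  # reserved memory, set aside before Spark memory is sized

def memory_regions(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Rough sizes (in MB) of the regions in the unified memory model."""
    usable = heap_mb - RESERVED_MB
    unified = usable * memory_fraction    # shared by execution and storage
    storage = unified * storage_fraction  # storage share that is immune to eviction
    execution = unified - storage         # remaining share, initially for execution
    user = usable - unified               # user data structures and internal metadata
    return {"unified": unified, "storage": storage,
            "execution": execution, "user": user}

regions = memory_regions(4096)  # hypothetical executor heap of 4096 MB
for name, size in regions.items():
    print(f"{name}: {size:.1f} MB")
```

The boundary between execution and storage is soft: either side can borrow from the other when space is free, so these numbers are starting points rather than hard limits.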
More info: Tuning - Spark 3.1.2 Documentation, Spark Memory Management | Distributed Systems Architecture, Learning Spark, 2nd Edition, Chapter 7
NEW QUESTION 108
The code block shown below should return a new 2-column DataFrame that shows one attribute from column attributes per row next to the associated itemName, for all suppliers in column supplier whose name includes Sports. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Sample of DataFrame itemsDf:
+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+
Code block:
itemsDf.__1__(__2__).select(__3__, __4__)
- A. 1. where
2. "Sports".isin(col("Supplier"))
3. "itemName"
4. array_explode("attributes")
- B. 1. filter
2. col("supplier").isin("Sports")
3. "itemName"
4. explode(col("attributes"))
- C. 1. where
2. col("supplier").contains("Sports")
3. "itemName"
4. "attributes"
- D. 1. where
2. col(supplier).contains("Sports")
3. explode(attributes)
4. itemName
- E. 1. filter
2. col("supplier").contains("Sports")
3. "itemName"
4. explode("attributes")
Answer: E
Explanation:
Output of correct code block:
+----------------------------------+------+
|itemName                          |col   |
+----------------------------------+------+
|Thick Coat for Walking in the Snow|blue  |
|Thick Coat for Walking in the Snow|winter|
|Thick Coat for Walking in the Snow|cozy  |
|Outdoors Backpack                 |green |
|Outdoors Backpack                 |summer|
|Outdoors Backpack                 |travel|
+----------------------------------+------+
The key to solving this question is knowing about Spark's explode operator. Using this operator, you can extract values from arrays into single rows. The following guidance steps through the answers systematically from the first to the last gap. Note that there are many ways to solve gap questions and filter out wrong answers. You do not always have to start with the first gap; you can also exclude some answers based on obvious problems you see with them.
The answers to the first gap present you with two options: filter and where. These two are actually synonyms in PySpark, so using either of those is fine. The answer options to this gap therefore do not help us in selecting the right answer.
The second gap is more interesting. One answer option includes "Sports".isin(col("Supplier")). This construct does not work, since Python strings do not have an isin method. Another option contains col(supplier). Here, Python will try to interpret supplier as a variable. We have not set this variable, so this is not a viable answer. Then, you are left with answer options that include col("supplier").contains("Sports") and col("supplier").isin("Sports"). The question states that we are looking for suppliers whose name includes Sports, so we have to go for the contains operator here.
We would use the isin operator if we wanted to filter for supplier names that exactly match any entry in a list of supplier names.
Finally, we are left with two answers that fill the third gap both with "itemName" and the fourth gap either with explode("attributes") or "attributes". While both are correct Spark syntax, only explode("attributes") will help us achieve our goal. Specifically, the question asks for one attribute from column attributes per row - this is what the explode() operator does.
One answer option also includes array_explode() which is not a valid operator in PySpark.
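What the correct code block does can be traced in plain Python. The sketch below is a model of the filter and explode steps, not PySpark code: the rows are the three sample rows from the question, a substring test stands in for contains, and a nested loop stands in for explode.

```python
# The three sample rows of itemsDf, as plain dictionaries.
items = [
    {"itemName": "Thick Coat for Walking in the Snow",
     "attributes": ["blue", "winter", "cozy"],
     "supplier": "Sports Company Inc."},
    {"itemName": "Elegant Outdoors Summer Dress",
     "attributes": ["red", "summer", "fresh", "cooling"],
     "supplier": "YetiX"},
    {"itemName": "Outdoors Backpack",
     "attributes": ["green", "summer", "travel"],
     "supplier": "Sports Company Inc."},
]

# Models filter(col("supplier").contains("Sports")): substring match on supplier.
sports_items = [row for row in items if "Sports" in row["supplier"]]

# Models select("itemName", explode("attributes")): one output row per array element.
exploded = [
    (row["itemName"], attribute)
    for row in sports_items
    for attribute in row["attributes"]
]

for item_name, attribute in exploded:
    print(item_name, "|", attribute)
```

In this model, an exact-match test such as row["supplier"] in ["Sports"] would keep no rows at all, which mirrors why isin("Sports") is the wrong choice for the second gap.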
More info: pyspark.sql.functions.explode - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
NEW QUESTION 109
Which of the following statements about executors is correct?
- A. Executors store data in memory only.
- B. Executors stop upon application completion by default.
- C. Executors are launched by the driver.
- D. Each node hosts a single executor.
- E. An executor can serve multiple applications.
Answer: B
Explanation:
Executors stop upon application completion by default.
Correct. Executors only persist during the lifetime of an application.
A notable exception to that is when Dynamic Resource Allocation is enabled (which it is not by default). With Dynamic Resource Allocation enabled, executors are terminated when they are idle, independent of whether the application has been completed or not.
An executor can serve multiple applications.
Wrong. An executor is always specific to one application. It is terminated when the application completes (for the exception, see above).
Each node hosts a single executor.
No. Each node can host one or more executors.
Executors store data in memory only.
No. Executors can store data in memory or on disk.
Executors are launched by the driver.
Incorrect. Executors are launched by the cluster manager on behalf of the driver.
More info: Job Scheduling - Spark 3.1.2 Documentation, How Applications are Executed on a Spark Cluster | Anatomy of a Spark Application | InformIT, and Spark Jargon for Starters. This blog is to clear some of the... | by Mageswaran D | Medium