Create MapType Column from Existing Columns in PySpark
In PySpark, a MapType column stores data as key-value pairs, much like a Python dictionary. Various situations occur in which you have numerous columns and need to combine them into a single map-type column. This can be done easily using the create_map() function, which takes alternating map-key literals and column references as arguments. Continue reading the article further to know about it in detail.
Syntax:

df.withColumn("map_column_name",
              create_map(lit("mapkey_1"), col("column_1"),
                         lit("mapkey_2"), col("column_2"))
              ).drop("column_1", "column_2").show(truncate=False)
Here,
- column_1, column_2: These are the names of the existing columns that need to be converted to a map.
- mapkey_1, mapkey_2: These are the names of the map keys under which the corresponding column values are stored when the map is created.
- map_column_name: It is the name given to the new column in which the map is stored.
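As a quick illustration of this syntax, here is a minimal, runnable sketch; the data frame, its column names, and the map keys are made up for the example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, create_map

spark = SparkSession.builder.getOrCreate()

# Hypothetical two-column data frame
df = spark.createDataFrame([("Arun", 11), ("Meera", 12)],
                           ["name", "class"])

# Pack the two columns into one MapType column and drop the originals
df = df.withColumn("details",
                   create_map(lit("student_name"), col("name"),
                              lit("student_class"), col("class"))
                   ).drop("name", "class")

df.show(truncate=False)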
Example 1:
In this example, we have used a data set (link), which is a 5x5 data frame, as follows:

Then, we converted the columns 'name', 'class' and 'fees' to a map using the create_map() function, stored it in the column 'student_details', and dropped the existing 'name', 'class' and 'fees' columns.
# PySpark - Create MapType Column from existing columns

# Import the libraries SparkSession, col, lit, create_map
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, create_map

# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv(
    '/content/class_data.csv', sep=',', inferSchema=True, header=True)

# Convert the name, class and fees columns to a single map column
data_frame = data_frame.withColumn(
    "student_details",
    create_map(lit("student_name"), col("name"),
               lit("student_class"), col("class"),
               lit("student_fees"), col("fees"))
).drop("name", "class", "fees")

# Display the data frame
data_frame.show(truncate=False)
Output:

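Once the map column exists, individual values can be looked up by key with getItem(), the standard PySpark accessor for map entries. Below is a short sketch, continuing from the data_frame and the 'student_details' column produced above:

# Look up a single value in the map by key
data_frame.select(
    col("student_details").getItem("student_name").alias("student_name")
).show(truncate=False)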
Example 2:
In this example, we have created a data frame with columns emp_id, name, superior_emp_id, year_joined, emp_dept_id, gender, and salary as follows:

Then, we converted the columns name, superior_emp_id, year_joined, emp_dept_id, gender, and salary to a map using the create_map() function, stored it in the column 'employee_details', and dropped the existing name, superior_emp_id, year_joined, emp_dept_id, gender, and salary columns. Note that, since all values of a map must share a single type, Spark implicitly casts the numeric columns (such as salary) to strings here.
# PySpark - Create MapType Column from existing columns

# Import the libraries SparkSession, col, lit, create_map
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, create_map

# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Define the data set
emp = [(1, "Smith", -1, "2018", "10", "M", 3000),
       (2, "Rose", 1, "2010", "20", "M", 4000),
       (3, "Williams", 1, "2010", "10", "M", 1000),
       (4, "Jones", 2, "2005", "10", "F", 2000),
       (5, "Brown", 2, "2010", "40", "F", 4000),
       (6, "Brown", 2, "2010", "50", "M", 2000)]

# Define the schema of the data set
empColumns = ["emp_id", "name", "superior_emp_id",
              "year_joined", "emp_dept_id",
              "gender", "salary"]

# Create the data frame from the data set and schema
empDF = spark_session.createDataFrame(data=emp, schema=empColumns)

# Convert the name, superior_emp_id, year_joined, emp_dept_id,
# gender and salary columns to a single MapType column
empDF = empDF.withColumn(
    "employee_details",
    create_map(lit("name"), col("name"),
               lit("superior_emp_id"), col("superior_emp_id"),
               lit("year_joined"), col("year_joined"),
               lit("emp_dept_id"), col("emp_dept_id"),
               lit("gender"), col("gender"),
               lit("salary"), col("salary"))
).drop("name", "superior_emp_id", "year_joined",
       "emp_dept_id", "gender", "salary")

# Display the data frame
empDF.show(truncate=False)
Output:
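Once columns are packed into a map, PySpark's built-in map functions can take the data apart again. Here is a short sketch, continuing from the empDF produced in Example 2; map_keys(), map_values(), and explode() are standard functions from pyspark.sql.functions:

from pyspark.sql.functions import explode, map_keys, map_values

# List the keys and the values of each employee's map
empDF.select("emp_id",
             map_keys("employee_details").alias("keys"),
             map_values("employee_details").alias("values")
             ).show(truncate=False)

# Flatten each map entry into its own (key, value) row
empDF.select("emp_id", explode("employee_details")).show(truncate=False)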