### Software Versions and Tools

- JDK 8
- Spark 2.4.3: [Download](https://blog-1310034074.cos.ap-hongkong.myqcloud.com/BigData/spark-2.4.3-bin-hadoop2.7.tgz)
- Hadoop 2.7.1: [Download](https://blog-1310034074.cos.ap-hongkong.myqcloud.com/BigData/hadoop-2.7.1.tar.gz)
- winutils-master: [Download](https://blog-1310034074.cos.ap-hongkong.myqcloud.com/BigData/winutils-master.zip)

---

### Installation Steps

**1. Install Hadoop**

Unzip the *winutils* and *Hadoop* archives. When developing **Spark** programs in IDEA, you need to simulate the *Hadoop* environment locally; otherwise every debugging run would require packaging the program into a jar and submitting it to the cluster, which seriously hurts development efficiency. *winutils* provides the Hadoop debugging environment for Windows and contains the essential tools needed to debug Hadoop and Spark there.

<!-- more -->

Enter the winutils directory and copy all of its contents into the **bin** directory of the Hadoop installation, adding or replacing files as prompted.

Right-click *My Computer - Properties - Advanced System Settings - Environment Variables*, create a new system variable named **HADOOP_HOME**, and set its value to the Hadoop directory from the previous step.

![](https://mujj.site/image/public/image/2024/03/22/65fd80eda5cce.png)

Find the **Path** variable, double-click it to open the edit dialogue, click *New*, and add the **bin** directory of Hadoop.

![](https://mujj.site/image/public/image/2024/03/22/65fd812bd528e.png)

Open the **etc\hadoop** directory under the Hadoop folder, edit the **hadoop-env.cmd** file, and set JAVA_HOME to the path your JAVA_HOME system variable points to.

![](https://mujj.site/image/public/image/2024/03/22/65fd813c7afc6.png)

---

**2. Install Python**

As Hadoop 2.7 and Spark 2.4 require Python 3.6, we use **Anaconda** to build the Python environment. After installing Anaconda, open Anaconda Navigator, select **Environments**, and create a new Python 3.6.13 environment.

![](https://mujj.site/image/public/image/2024/03/22/65fd815a576a8.png)

---

**3. Install Spark**

Unzip Spark into the same directory as Hadoop and configure the environment variables in the same way as for Hadoop. Copy the **pyspark** package from the **python** folder of the Spark directory into the **Lib** directory of the Python environment.

![](https://mujj.site/image/public/image/2024/03/22/65fd816c57775.png)

Enter the Scripts directory of the Python environment and run *pip install py4j* to install **py4j**. Py4J is a library written in Python and Java: through Py4J, a Python program can dynamically access Java objects inside the Java virtual machine, and Java programs can call back into Python objects. A short sketch of this bridge follows below.

![](https://mujj.site/image/public/image/2024/03/22/65fd818f6bc14.png)
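To make the bridge concrete, here is a minimal sketch (my own illustration, not part of the original setup). It uses `sc._jvm`, the Py4J gateway PySpark opens into the driver JVM; since `_jvm` is an internal PySpark attribute, treat this as a demonstration rather than a public API.

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("Py4JDemo").setMaster("local")
sc = SparkContext(conf=conf)

# Call a static Java method through the Py4J gateway
print(sc._jvm.java.lang.System.currentTimeMillis())

# Instantiate a Java object in the JVM and call its methods from Python
jlist = sc._jvm.java.util.ArrayList()
jlist.add("hello")
jlist.add("py4j")
print(jlist.size())  # 2

sc.stop()
```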
Open cmd and enter *spark-shell*. If the shell starts, the Spark configuration succeeded.

---

**4. Running PySpark**

Write a test program in **Spyder**. Here we use the classic word count program, which counts how many times each word occurs. Save it after writing so it is easy to run.

```python
"""
@author: JackyMu
"""
from pyspark import SparkConf, SparkContext

# Run Spark locally with an application named "WordCount"
conf = SparkConf().setAppName("WordCount").setMaster("local")
sc = SparkContext(conf=conf)

inputFile = ""  # file location

# Split each line into words, map each word to (word, 1),
# then sum the counts per word
textFile = sc.textFile(inputFile)
wordCount = (textFile.flatMap(lambda line: line.split(" "))
                     .map(lambda word: (word, 1))
                     .reduceByKey(lambda a, b: a + b))
wordCount.foreach(print)
```

Open the **Anaconda Prompt**, enter *activate python36* to activate the Python environment, and run the script saved in the previous step (e.g. *python wordcount.py*, assuming that is the file name you chose).

![](https://mujj.site/image/public/image/2024/03/22/65fd81a2a49c9.png)

---

### Summary

This post briefly introduced the installation and configuration of PySpark under Windows. While a task is running, you can also enter <u>localhost:4040</u> in the browser to open the monitoring page of the Spark application.
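One detail worth knowing about that page: 4040 is only the default. If the port is occupied, Spark retries on 4041, 4042, and so on, so the reliable way to find the UI is to ask the running SparkContext. A minimal sketch, assuming the environment configured above:

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("UiCheck").setMaster("local")
sc = SparkContext(conf=conf)

# uiWebUrl reports the address the UI actually bound to,
# e.g. http://<hostname>:4040
print(sc.uiWebUrl)

sc.stop()  # the UI is only reachable while the application runs
```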