Debugging AWS Glue on PyCharm/Windows

Raymond 'Red' Reddington
3 min read · Feb 4, 2021

AWS Glue is a powerful service that handles various ETL tasks across many source/destination combinations. Although the Python community mostly interacts with it through the boto3 API, it can also be run and debugged locally. AWS documentation is quite silent about this and, as we’ll see later, doesn’t cover Windows at all…

In this example, we’re going to take the arguably most problematic integration, MS SQL Server, and transform its data into a CSV file stored in an AWS S3 bucket.

Java Components

We’re going to put all Java-related components in the C:\Java folder. Download JDK 8 from Oracle, Maven, and Spark 2.4.3, and extract them as in the picture. To get Spark working, we also need the Hadoop components: create a hadoop\bin directory inside the Spark folder, download any 2.8.* versioned folder from this repo, and put all the files there.

Environment Variables

JAVA_HOME: C:\Java\jdk1.8.0_281
SPARK_HOME: C:\Java\spark-2.4.3-bin-hadoop2.8
HADOOP_HOME: %SPARK_HOME%\hadoop

And add the following folders to the PATH:
C:\Java\apache-maven-3.6.0\bin
%SPARK_HOME%\bin
%HADOOP_HOME%\bin
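
Optionally, a tiny Python sketch like the following can confirm that the variables above resolve and that winutils.exe landed where Spark expects it (paths assume the layout described above):

import os

# Sanity check for the Java/Spark/Hadoop environment variables defined above.
for var in ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME"):
    value = os.environ.get(var)
    print(f"{var} = {value}")
    if not value or not os.path.isdir(value):
        print(f"  -> {var} is missing or does not point to an existing folder")

# Spark on Windows expects winutils.exe under %HADOOP_HOME%\bin
hadoop_home = os.environ.get("HADOOP_HOME", "")
winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
print("winutils.exe found" if os.path.isfile(winutils) else "winutils.exe NOT found")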

Python

Download Python 3.7 and put it in the C:\Python\Python3.7.9 folder. (I have newer versions of Python installed, so I skipped the environment-variable setup.) Then create a virtualenv in PyCharm using the interpreter path: C:\Python\Python3.7.9\python.exe
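
If you prefer the command line to PyCharm’s UI, the same interpreter can create the virtualenv manually, for example:
C:\Python\Python3.7.9\python.exe -m venv venv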

And install specific versions of pypandoc and pyspark via pip:
pip install pypandoc==1.5
pip install pyspark==2.4.3

If all the steps above were done correctly, we should be able to use pyspark:

(venv) C:\Users\...>pyspark
Python 3.7.9 (tags/v3.7.9:13c94747c7, Aug 17 2020, 18:58:18) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

Using Python version 3.7.9 (tags/v3.7.9:13c94747c7, Aug 17 2020 18:58:18)
SparkSession available as 'spark'.
>>>
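
The same check can be done from a script inside the virtualenv; here is a minimal smoke-test sketch that just builds a local session and a tiny DataFrame:

from pyspark.sql import SparkSession

# Build a local Spark session and run one trivial action to confirm
# that Spark, Hadoop/winutils and the Python bindings all work together.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("smoke-test") \
    .getOrCreate()

df = spark.createDataFrame([(1, "ok")], ["id", "status"])
df.show()

spark.stop()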

AWS Glue

Returning to the AWS documentation, complete the following steps:
— Download the AWS Glue Python library from GitHub
— Check out branch glue-1.0
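
Assuming the library’s usual GitHub location (awslabs/aws-glue-libs), that amounts to something like:
git clone https://github.com/awslabs/aws-glue-libs.git
cd aws-glue-libs
git checkout glue-1.0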

Then open PowerShell in the cloned repo folder and run the command:
mvn -f ./pom.xml -DoutputDirectory=jars dependency:copy-dependencies

Now we have the dependency libraries for AWS Glue (the \jars folder), which I prefer to copy into my project directory. The \awsglue folder, which contains the Python API, should also be copied into our virtualenv. Thus, we get the following structure:
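
Roughly, the resulting layout looks like this (folder and file names here are just illustrative):

MyGlueProject\
    jars\                      <- Glue dependency JARs copied by Maven
    etl_job.py                 <- the script below
    venv\
        Lib\
            site-packages\
                awsglue\       <- copied from the aws-glue-libs repo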

ETL

Returning to our data migration, here’s the full code of the Glue ETL job:

from copy import deepcopy
import os

from awsglue.transforms import *
from awsglue.job import Job
from pyspark.sql import SparkSession
from awsglue.context import GlueContext

# Create Spark session with Glue JARs included
jars_path = os.path.join(os.getcwd(), "jars", "*")
spark = SparkSession \
    .builder \
    .appName("MSSQL to CSV") \
    .config("spark.driver.extraClassPath", jars_path) \
    .config("spark.executor.extraClassPath", jars_path) \
    .getOrCreate()
sc = spark.sparkContext
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init('Convert')

# MS SQL Server connection template
connection_src = {
    "url": "jdbc:sqlserver://localhost:1433;database=ResortsDB;",
    "user": "...",
    "password": "..."
}
country_conn = deepcopy(connection_src)
country_conn.update({"dbtable": "Resort.Country"})
address_conn = deepcopy(connection_src)
address_conn.update({"dbtable": "Resort.Address"})

cty = glueContext.create_dynamic_frame.from_options(
    connection_type="sqlserver",
    connection_options=country_conn
)
add = glueContext.create_dynamic_frame.from_options(
    connection_type="sqlserver",
    connection_options=address_conn
)

cty_norm = cty \
    .drop_fields(['CreatedDate', 'ModifiedDate']) \
    .rename_field('CountryId', 'c_id') \
    .rename_field('Name', 'c_name')

full_add = Join.apply(add, cty_norm, 'CountryId', 'c_id') \
    .drop_fields(['CountryId', 'c_id']) \
    .rename_field('c_name', 'Country') \
    .select_fields(['Name', 'City', 'State', 'Zip', 'Country'])

# Combine partitions into one to get a single output file
full_addresses_repartitioned = full_add.repartition(1)
glueContext.write_dynamic_frame.from_options(
    frame=full_addresses_repartitioned,
    connection_type="s3",
    connection_options={"path": "s3://.../addresses/"},
    format="csv",
    format_options={"withHeader": True, "separator": ","}
)
job.commit()

With correct MS SQL Server and AWS S3 configurations in place, the job can now be run or debugged locally, producing the output CSV in the S3 bucket.
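
One last note: the job name is hard-coded above for local debugging; when the same script runs as an actual Glue job, the name and arguments are normally resolved from sys.argv, roughly like this:

import sys

from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# On the Glue service, JOB_NAME (and any custom arguments) arrive via sys.argv.
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# ... the same transformations as above ...
job.commit()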
