Author: Tamar Janssens
This project concerns my work on an assessment for an Analytics Engineering role.
The project proved to be a fair amount of work and challenging at times, but it has been a pleasure to work on. I hope you will enjoy going through my work as much as I enjoyed developing it.
Note
The DEV branch is AHEAD of main. I have intentionally not merged them, for the sake of an honest and clear evaluation.
Changes concern the following:
- Additional cleaning of the shipState source data (replacing characters that do not match a-z, A-Z, or 0-9 with '_'); see the sketch after this list
- The location Dimensions that were still missing in the main branch
- Country data loaded from the web (JSON), processed with Pandas into a seed
- A cityState dimension, with an explanation of this choice at the bottom of this README.md
- A pivoted mart model that counts the number of sales by SKU and year
- A staging model that joins orders and related tables
No other relevant modifications were made to this MAIN branch after I invited my future colleagues ;)
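As a minimal sketch of the shipState cleaning rule (assuming it is applied with pandas during source preparation; the example values and helper name are mine, not from the repo):

```python
import re

import pandas as pd

# Replace every character that does not match a-z, A-Z or 0-9 with '_'.
NON_ALNUM = re.compile(r"[^a-zA-Z0-9]")

def clean_ship_state(value: str) -> str:
    return NON_ALNUM.sub("_", value)

# Illustrative values only.
df = pd.DataFrame({"shipState": ["NEW DELHI", "RAJASTHAN,", "PONDICHERRY"]})
df["shipState"] = df["shipState"].astype(str).map(clean_ship_state)
print(df["shipState"].tolist())  # ['NEW_DELHI', 'RAJASTHAN_', 'PONDICHERRY']
```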
After cloning this repo, activate the virtual environment:
source ./venv/bin/activate
# provide execution permissions to the install script
chmod +x install.sh
SQL Edge is a lightweight SQL Server that can be installed on IoT devices.
# cd into the main project directory
# run:
docker-compose up -d

A free database client that can connect to the Dockerized database is Azure Data Studio, which can be downloaded from here: https://learn.microsoft.com/en-us/azure-data-studio/download-azure-data-studio?tabs=win-install%2Cwin-user-install%2Credhat-install%2Cwindows-uninstall%2Credhat-uninstall
- Copy the following configuration into profiles.yml
dbt_fedex:
  target: dev
  outputs:
    dev:
      type: sqlserver
      driver: 'ODBC Driver 18 for SQL Server'
      server: localhost
      port: 57000
      database: master
      schema: fdx
      user: sa
      password: tamar-janssens2024
      threads: 4
      TrustServerCertificate: yes

Important
For the sake of ease I have included the password here as well as in docker-compose.yml and the src/upload_data.py script. I acknowledge this is very bad practice, and I would never do this in real life. I would use a (mounted) environment variable or a cloud parameter store, never upload it to Git, and make sure it would not appear in any logs.
The same applies to the data that is uploaded to Git.
Or use the command line to paste it into a temporary file:
sudo pbpaste > ~/.dbt/RENAME_profiles.yml
cat ~/.dbt/RENAME_profiles.yml

Now ensure that dbt is correctly configured:
dbt debug

Install packages:

dbt deps

cd src
python load_countries.py
dbt seed
Note
The notebook fedex_dwh/notebooks/clean_source.ipynb contains this analysis.
Warning
Ensure that the Docker container running the database is up
run
python ./src/upload_data.py
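The actual logic lives in src/upload_data.py and is not reproduced here; purely as a hedged sketch of what an upload into the Dockerized SQL Server could look like (assuming sqlalchemy with the pyodbc driver; the file and table names are illustrative):

```python
import pandas as pd
from sqlalchemy import create_engine

# Connection details mirror profiles.yml above (the hard-coded password is
# for this exercise only; see the note on secrets).
engine = create_engine(
    "mssql+pyodbc://sa:tamar-janssens2024@localhost:57000/master"
    "?driver=ODBC+Driver+18+for+SQL+Server&TrustServerCertificate=yes"
)

# Illustrative: read a cleaned source file and push it into the fdx schema.
df = pd.read_csv("data/orders_clean.csv")
df.to_sql("orders", engine, schema="fdx", if_exists="replace", index=False)
```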
Installation instructions can be found here: https://superset.apache.org/docs/installation/installing-superset-using-docker-compose
- cd into the project directory
- clone the Superset repository
git clone https://github.com/apache/superset.git
cd superset
Warning
At the time of writing, there is a recent (two-week-old) issue in the Superset build related to the sqlglot package.
A fix for this issue is posted on GitHub:
https://github.com/apache/superset/issues/26997#issuecomment-1933769661
- fix the issue
cd docker
echo "sqlglot==20.8.0" > requirements-local.txt
- after applying the fix, run Superset
docker-compose up -d
Note
In case of port conflicts, the following command shows other processes running on the default Superset ports:
lsof -i :8088 # 5432, 8080
Or, to list all ports in the LISTEN state:
sudo lsof -i -P | grep LISTEN
- Change the docker-compose file to use unused ports
- Once the installation is successful, log in to Superset with the default user:pass of admin:admin
Based on evaluating which fields remain constant per order, I have identified the following facts and dimensions.

Orders:
- orderId
- date
- status
- b2b
- fulfilment
- salesChannel
- shipServiceLevel
- country_id
- cityState_id
- shipPostalCode
Order items:
- orderId [FK]
- sku [FK]
- qty
- promotionIds *
- currency
- amount
- courierStatus
- fulfilledBy
Note
- I could likely reconstruct a promotion table based on a price analysis that defines a base price and evaluates the price markup/discount within a given time period, based on the promotionId and the price paid for the order items. This is, however, beyond the scope of this exercise.
To prevent gaps in the date dimension, I used a macro (found via Google) to generate a date series and modified it to fit my use case; a Python illustration of the idea follows below.
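The macro itself is SQL/Jinja and lives in the dbt project; purely as an illustration of the gap-free series idea, the same spine can be built in Python (the date bounds are illustrative):

```python
import pandas as pd

# A gap-free daily spine: every calendar day between the bounds appears
# exactly once, even if no orders exist on that day.
spine = pd.date_range(start="2022-03-31", end="2022-06-29", freq="D")
dim_date = pd.DataFrame({
    "date": spine,
    "year": spine.year,
    "month": spine.month,
    "day": spine.day,
})
print(len(dim_date))  # 91 rows, one per day
```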
- sku
- style
- category
- size
- asin
Note
Since locations can have different names, it is relevant to create separate dimensions for country, state, and city. This way, synonyms can be stored across the dimensions to enforce uniqueness with respect to similar locations.
Since city names can occur in multiple states within a country, I chose to create a cityState dimension with unique city-and-state combinations; a sketch of this deduplication follows below.
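A minimal sketch of this deduplication (assuming the source columns are named shipCity and shipState; the surrogate-key scheme is illustrative):

```python
import pandas as pd

orders = pd.DataFrame({
    "shipCity":  ["MUMBAI", "MUMBAI", "AURANGABAD", "AURANGABAD"],
    "shipState": ["MAHARASHTRA", "MAHARASHTRA", "MAHARASHTRA", "BIHAR"],
})

# One row per unique (city, state) pair: the same city name in two
# different states yields two separate dimension rows.
city_state = (
    orders[["shipCity", "shipState"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
city_state["cityState_id"] = city_state.index + 1
print(city_state)
```

Note how AURANGABAD appears twice, once per state, which is exactly why a plain city dimension would not be unique.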
Country data is obtained from the internet (datahub.io) and retrieved with the script in src/load_countries.py.
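A minimal sketch of that retrieval (the exact datahub.io URL and JSON field names are assumptions; the real ones live in src/load_countries.py):

```python
import pandas as pd
import requests

# Assumed endpoint for the datahub.io country list.
URL = "https://datahub.io/core/country-list/r/data.json"

records = requests.get(URL, timeout=30).json()
countries = pd.DataFrame(records)  # assumed fields: Name, Code

# Write the result as a dbt seed so `dbt seed` can load it.
countries.to_csv("seeds/countries.csv", index=False)
```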
I've added a mart model that computes the number of orders by SKU and year and pivots the results so that each year gets its own column. For now, the data is filtered to country 'IN'.
model mrt/pivot_orders_by_sku_year_IN
To support this model and keep it readable, a view model is created in stg/orders_joined.sql. This model joins the orders table with all related tables, so it can be re-used for other mart models. A pandas illustration of the pivot logic follows below.
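The model itself is SQL; purely as a pandas illustration of the same count-and-pivot logic (data values are made up):

```python
import pandas as pd

orders = pd.DataFrame({
    "sku":  ["SKU-1", "SKU-1", "SKU-2", "SKU-1"],
    "year": [2021, 2022, 2022, 2022],
})

# Count orders per (sku, year), then pivot so each year becomes a column.
pivoted = (
    orders.groupby(["sku", "year"]).size().rename("n_orders")
    .reset_index()
    .pivot(index="sku", columns="year", values="n_orders")
    .fillna(0)
    .astype(int)
)
print(pivoted)
# year   2021  2022
# sku
# SKU-1     1     2
# SKU-2     0     1
```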