TamarJanssens/fdx_dwh

Analytics Engineer Assessment 2024

Author: Tamar Janssens

This project is my worked solution to an assessment for an Analytics Engineering role.

The project has proven to be quite some work and challenging at times, but it has been a pleasure to work on. I hope you will enjoy going through my work as much as I did while developing it.

Note

The DEV branch is AHEAD of main. I have intentionally not merged them for the sake of an honest and clear evaluation.

Changes concern the following:

  • Additional cleaning on shipState source data (replacing characters outside a-z, A-Z and 0-9 with '_')
  • The location Dimensions that were still missing in the main branch
  • Country data loaded from the web (JSON), processed with Pandas into a seed
  • The cityState dimension, with an explanation of this choice at the bottom of this README.md
  • A pivoted Mart model that counts the number of sales by SKU and Year
  • A Staging model that joins orders and related tables
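The shipState cleaning rule from the first bullet can be sketched in Python as follows (the function name is mine; the actual cleaning lives in the dbt models):

```python
import re

def clean_ship_state(value: str) -> str:
    """Replace every character outside a-z, A-Z and 0-9 with '_'."""
    return re.sub(r"[^a-zA-Z0-9]", "_", value)

print(clean_ship_state("NEW DELHI"))  # NEW_DELHI
```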

No other relevant modifications were made in this MAIN branch after I had invited my future colleagues ;)

Installation

Activate Virtual Environment

After cloning this repo, activate the virtual environment

source ./venv/bin/activate 

Install dependencies

# provide execution permissions to the install script
chmod +x install.sh

# run it to install the dependencies
./install.sh

Launch the Docker containing a local SQL Edge database

Azure SQL Edge is a lightweight version of SQL Server that can also run on IoT devices.

# cd into the main project directory
# run:

docker-compose up -d

A free database client that can connect to the Dockerized database is Azure Data Studio, which can be downloaded from here: https://learn.microsoft.com/en-us/azure-data-studio/download-azure-data-studio?tabs=win-install%2Cwin-user-install%2Credhat-install%2Cwindows-uninstall%2Credhat-uninstall

Set up your profiles.yml in ~/.dbt

  1. Copy the following configuration into profiles.yml
dbt_fedex:
  target: dev
  outputs:
    dev:
      type: sqlserver
      driver: 'ODBC Driver 18 for SQL Server' 
      server: localhost
      port: 57000
      database: master
      schema: fdx
      user: sa
      password: tamar-janssens2024
      threads: 4
      TrustServerCertificate: yes

Important

For the sake of ease I have included the password here as well as in docker-compose.yml and the src/upload_data.py script. I acknowledge this is very bad practice and I would never do this in real life. I would use a (mounted) environment variable or a cloud parameter store, never upload it to Git, and make sure it would not appear in any logs.

Same applies to the data which is uploaded to Git.
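As an illustration of the environment-variable approach mentioned above, a script like src/upload_data.py could read the password as follows (DB_PASSWORD is a hypothetical variable name, not one the actual scripts use):

```python
import os

# Read the database password from the environment instead of hard-coding it.
# DB_PASSWORD is an assumed variable name for this sketch.
password = os.environ.get("DB_PASSWORD", "")
if not password:
    print("Warning: DB_PASSWORD is not set; refusing to connect")
```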

Or use the command line to paste the copied configuration into a temporary file

sudo pbpaste > ~/.dbt/RENAME_profiles.yml
cat ~/.dbt/RENAME_profiles.yml

Now ensure that dbt is correctly configured

dbt debug 

Install packages

 dbt deps

Load Countries from web

cd src
python load_countries.py
dbt seed

Data Cleaning and Preliminary Analysis

Note

The notebook fedex_dwh/notebooks/clean_source.ipynb contains this analysis

Inserting the raw data into the local database

Warning

Ensure that the Docker container with the database is running

run

python ./src/upload_data.py

Installing Apache Superset

Installation instructions can be found here: https://superset.apache.org/docs/installation/installing-superset-using-docker-compose

  1. cd into project directory
  2. clone the Superset repository
git clone https://github.com/apache/superset.git
cd superset

Warning

There is a recent (2 weeks old) issue in the Superset build related to the sqlglot package

A fix to this issue is posted here on Github

https://github.com/apache/superset/issues/26997#issuecomment-1933769661

  3. fix the issue
cd docker
echo "sqlglot==20.8.0" > requirements-local.txt
  4. after fixing this issue, run Superset
docker-compose up -d

Note

In case of conflicts, the following command can show other processes running at the default SuperSet ports:

lsof -i :8088 # 5432, 8080

Or to LISTEN to open ports:

sudo lsof -i -P | grep LISTEN
  5. In case of conflicts, change the docker-compose file to use unused ports
  6. Once the installation is successful, log in to Superset with the default user:pass admin:admin

Building up the Kimball DWH

Based on evaluating which fields remain constant per entity, I have identified the following Facts and Dimensions:

Orders [FACT]

  • orderId
  • date
  • status
  • b2b
  • fulfilment
  • salesChannel
  • shipServiceLevel
  • country_id
  • cityState_id
  • shipPostalCode

OrderItems [FACT]

  • orderId [FK]
  • sku [FK]
  • qty
  • promotionIds *
  • currency
  • amount
  • courierStatus
  • fulfilledBy

Note

  • I could likely re-construct a promotion table based on a price analysis that defines a base price and evaluates the price markup/discount within a given time period, based on the promotionId and the price paid for the orderItems. This is however beyond the scope of this exercise.

Date [Dimension]

To prevent gaps in the date dimension, I used/Googled a macro that generates a date series and modified it to fit my use case.
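The idea behind the macro can be illustrated in Python with pandas: generate one row per calendar day between two boundary dates, with no gaps, and derive attributes from each date (the date range and column names here are assumptions for the sketch):

```python
import pandas as pd

# One row per calendar day, gap-free, between assumed boundary dates.
date_dim = pd.DataFrame(
    {"date": pd.date_range("2023-01-01", "2023-12-31", freq="D")}
)
date_dim["year"] = date_dim["date"].dt.year
date_dim["month"] = date_dim["date"].dt.month
date_dim["day"] = date_dim["date"].dt.day

print(len(date_dim))  # 365
```

In the project itself this logic runs as a dbt macro in SQL, but the output shape is the same: a complete, contiguous date spine.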

SKU [Dimension]

  • sku
  • style
  • category
  • size
  • asin

Location related Dimensions

Note

Since locations can have different names, it is relevant to create separate dimensions for Country, State and City. This way synonyms can be stored across the dimensions to enforce uniqueness with respect to similar locations.

CityState [Dimension]

Since city names can occur in multiple states within a country, I chose to create a CityState dimension with unique City and State combinations.
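The grain of this dimension can be illustrated with pandas: the same city name may appear in several states, so a row exists per unique (city, state) pair (the column names and sample values below are illustrative, not taken from the source data):

```python
import pandas as pd

# The same city name can exist in multiple states, so the dimension grain
# is the unique (city, state) pair. Sample values are made up.
orders = pd.DataFrame({
    "shipCity":  ["Springfield", "Springfield", "Springfield", "Portland"],
    "shipState": ["IL", "MO", "IL", "OR"],
})
city_state = (
    orders[["shipCity", "shipState"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
city_state["cityState_id"] = city_state.index + 1
print(city_state)
```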

Country [Dimension]

Country data is obtained from datahub.io and retrieved with the script in src/load_countries.py

link:

https://pkgstore.datahub.io/core/country-codes/country-codes_json/data/616b1fb83cbfd4eb6d9e7d52924bb00a/country-codes_json.json
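A minimal sketch of what src/load_countries.py does, assuming it parses the country-codes JSON records and writes them out as a dbt seed CSV (the field names and output filename below are assumptions based on the datahub.io dataset, not copied from the actual script):

```python
import json
import pandas as pd

# Assumed field names from the datahub.io country-codes dataset; the real
# script downloads the JSON from the URL above instead of this inline sample.
raw = json.loads('[{"ISO3166-1-Alpha-2": "IN", "official_name_en": "India"}]')
countries = pd.DataFrame(raw).rename(columns={
    "ISO3166-1-Alpha-2": "country_code",
    "official_name_en": "country_name",
})
countries.to_csv("countries_seed.csv", index=False)  # picked up by `dbt seed`
print(countries)
```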

Orders By Product [MART]

I've added a Mart model that computes the number of orders by SKU and Year and pivots the results such that each year has its own column. For now, the data is filtered to country IN.

model mrt/pivot_orders_by_sku_year_IN

To support this model and keep it readable, a view model is created in stg/orders_joined.sql. This model joins orders with all related tables, such that it can be re-used for other Mart models.
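The pivot logic of the Mart model can be sketched in pandas: count orders per SKU per year, then spread the years into columns (the sample data is made up; the real model does this in SQL):

```python
import pandas as pd

# Count orders per SKU and year, then give each year its own column,
# mirroring the shape of mrt/pivot_orders_by_sku_year_IN. Sample data.
orders = pd.DataFrame({
    "sku":  ["A1", "A1", "B2", "A1", "B2"],
    "year": [2022, 2022, 2022, 2023, 2023],
})
pivot = (
    orders
    .pivot_table(index="sku", columns="year", aggfunc="size", fill_value=0)
    .reset_index()
)
print(pivot)
```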
