This project builds an end-to-end data pipeline using Python and Pandas to process a retail orders dataset. The cleaned data is loaded into SQL Server, where SQL queries are used to analyze top-performing products, regional sales, monthly trends, and year-over-year growth.
The dataset is sourced from Kaggle and contains retail order records including product details, pricing, discounts, order dates, regions, and categories.
The data is downloaded and extracted using the Kaggle API. It is loaded into a Pandas DataFrame for preprocessing. The following transformations are applied:
- Missing values are handled.
- New columns are derived: discount, sale price, and profit.
- The `order_date` column is converted to datetime format.
- Unnecessary columns such as `list_price`, `cost_price`, and `discount_percent` are removed.
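
The transformations above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code. The derived column names (`discount`, `sale_price`, `profit`), the formulas used to compute them, and the choice to drop rows with missing pricing values are all assumptions, since the steps above do not spell them out.

```python
import pandas as pd


def preprocess_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the preprocessing steps to a raw orders DataFrame."""
    df = df.copy()

    # Assumption: rows missing any pricing input are dropped.
    df = df.dropna(subset=["list_price", "cost_price", "discount_percent"])

    # Assumed formulas for the derived columns.
    df["discount"] = df["list_price"] * df["discount_percent"] / 100
    df["sale_price"] = df["list_price"] - df["discount"]
    df["profit"] = df["sale_price"] - df["cost_price"]

    # Convert order_date from text to datetime.
    df["order_date"] = pd.to_datetime(df["order_date"])

    # Remove the columns that are no longer needed.
    return df.drop(columns=["list_price", "cost_price", "discount_percent"])


# Tiny made-up sample, for illustration only.
sample = pd.DataFrame(
    {
        "order_date": ["2023-03-01", "2023-03-02"],
        "list_price": [100.0, 200.0],
        "cost_price": [70.0, None],
        "discount_percent": [10, 5],
    }
)
cleaned = preprocess_orders(sample)
print(cleaned)
```

In this sample the second row is dropped because its cost price is missing, and the remaining row gets a discount of 10.0, a sale price of 90.0, and a profit of 20.0.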
After preprocessing, the data is loaded into SQL Server using SQLAlchemy with an ODBC connection.
The connection to SQL Server is established through SQLAlchemy and PyODBC with ODBC Driver 17 for SQL Server. After a successful connection test, the processed data is written to a table named df_orders.
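
A connection and load step along these lines might look like the sketch below. The database name (`master`), the use of Windows authentication (`Trusted_Connection=yes`), the helper function names, and the `if_exists="replace"` behavior are assumptions for illustration; only the driver, the server name, and the table name come from this project.

```python
from urllib.parse import quote_plus


def build_odbc_string(
    server: str = r"localhost\SQLEXPRESS",
    database: str = "master",  # assumption: the target database is not stated
    driver: str = "ODBC Driver 17 for SQL Server",
) -> str:
    """Assemble a raw ODBC connection string (Windows authentication assumed)."""
    return (
        f"DRIVER={{{driver}}};"
        f"SERVER={server};"
        f"DATABASE={database};"
        "Trusted_Connection=yes;"
    )


def build_sqlalchemy_url(odbc_string: str) -> str:
    """Wrap the ODBC string in a SQLAlchemy URL for the mssql+pyodbc dialect."""
    return "mssql+pyodbc:///?odbc_connect=" + quote_plus(odbc_string)


def load_orders(df, odbc_string: str, table_name: str = "df_orders"):
    """Test the connection, then write the DataFrame to SQL Server."""
    # Imported here so the helpers above work without SQLAlchemy installed.
    from sqlalchemy import create_engine, text

    engine = create_engine(build_sqlalchemy_url(odbc_string))

    # Simple connection test before writing any data.
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))

    # Assumption: an existing table of the same name is replaced.
    df.to_sql(table_name, con=engine, index=False, if_exists="replace")
    return engine


odbc_string = build_odbc_string()
print(build_sqlalchemy_url(odbc_string))
```

Calling `load_orders(cleaned_df, odbc_string)` requires a reachable SQL Server instance and the `pyodbc` package. Here `cleaned_df` stands for the preprocessed DataFrame.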
To allow remote connections to the SQL Server instance SQLEXPRESS, the TCP/IP protocol was first enabled via SQL Server Configuration Manager under 'Protocols for SQLEXPRESS'.

After enabling TCP/IP, the SQL Server (SQLEXPRESS) service was restarted from the Services panel to apply the configuration changes.

An ODBC data source named myserver was configured using the server name localhost\SQLEXPRESS.

In Microsoft SQL Server Management Studio (SSMS), the connection to the SQL Server instance was made by entering the server name localhost\SQLEXPRESS.

Once the data is available in the SQL Server table df_orders, SQL queries are used to analyze top-performing products, regional sales, monthly trends, and year-over-year growth.
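
As a rough illustration of this kind of analysis, the sketch below runs two aggregate queries through Pandas. The column names `product_id`, `region`, and `sale_price` are assumptions, and the queries are written in portable SQL rather than being the project's actual T-SQL. An in-memory SQLite database stands in for SQL Server so the example runs anywhere; in the real pipeline the SQLAlchemy engine would be passed instead.

```python
import sqlite3

import pandas as pd

# Revenue per product, highest first (column names are assumed).
TOP_PRODUCTS_SQL = """
SELECT product_id, SUM(sale_price) AS total_sales
FROM df_orders
GROUP BY product_id
ORDER BY total_sales DESC
"""

# Revenue per region, highest first.
REGIONAL_SALES_SQL = """
SELECT region, SUM(sale_price) AS total_sales
FROM df_orders
GROUP BY region
ORDER BY total_sales DESC
"""


def run_query(sql, con, limit=None):
    """Run a query and optionally keep only the first `limit` rows."""
    result = pd.read_sql(sql, con)
    return result if limit is None else result.head(limit)


# Stand-in data and database, for illustration only.
demo = pd.DataFrame(
    {
        "product_id": ["A", "B", "A"],
        "region": ["East", "West", "East"],
        "sale_price": [90.0, 50.0, 30.0],
    }
)
con = sqlite3.connect(":memory:")
demo.to_sql("df_orders", con, index=False)

top_products = run_query(TOP_PRODUCTS_SQL, con, limit=10)
regional_sales = run_query(REGIONAL_SALES_SQL, con)
print(top_products)
print(regional_sales)
```

Monthly and year-over-year queries would follow the same pattern, grouping on the year and month of `order_date`. Those date functions differ between database engines, so they are left out of this portable sketch.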
The project uses the following tools:

- Python
- Pandas
- SQLAlchemy
- PyODBC
- SQL Server
- Kaggle API




