Updates for 2024 #63

mrdbourke · 2023-09-07T04:12:37Z

mrdbourke
Sep 7, 2023
Maintainer

mrdbourke · 2023-09-08T00:55:58Z

mrdbourke
Sep 8, 2023
Maintainer Author

Pandas notebook updates

Specifying numeric data types

Since pandas 2.0, numeric_only defaults to False when calling .mean() on a DataFrame, so non-numeric columns are no longer silently dropped. Specify numeric_only=True when calling .mean(), e.g. car_prices.mean(numeric_only=True), see: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html
The same goes for .sum() on a DataFrame, where numeric_only also defaults to False, see: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html
- In light of the above, added numeric_only=True to most sections which call .mean() or .sum() on a DataFrame.
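
A minimal sketch of the numeric_only behaviour, assuming a small made-up DataFrame (the column values below are hypothetical stand-ins, not the course data):

```python
import pandas as pd

# Hypothetical stand-in for the course's car sales data
car_prices = pd.DataFrame({
    "Make": ["Toyota", "Honda", "BMW"],
    "Odometer (KM)": [150043, 87899, 11179],
    "Price": [4000.0, 5000.0, 22000.0],
})

# Only numeric columns are aggregated; the string column "Make" is skipped
mean_values = car_prices.mean(numeric_only=True)
sum_values = car_prices.sum(numeric_only=True)

print(mean_values)
print(sum_values)
```

Without numeric_only=True, recent pandas versions try to aggregate the "Make" column as well, which for .mean() typically ends in an error.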

String replacement

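To replace a str in pandas, you should now state whether the pattern is a regular expression. The default is regex=False (literal matching), so pass regex=True for regex patterns, for example:

# Remove price column symbols (raw string so the backslashes reach the regex engine unchanged)
car_sales["Price"] = car_sales["Price"].str.replace(r'[\$\,\.]', '', 
                                                    regex=True) # Tell pandas to replace using regex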

See the documentation for more: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html
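
To make the difference between the two modes concrete, here is a small sketch on a made-up Series (the values are hypothetical). It shows the same cleanup done with literal matching and with a regex character class:

```python
import pandas as pd

# Hypothetical price strings in the same shape as the course data
prices = pd.Series(["$4,000.00", "$5,000.00"])

# regex=False: the pattern is a literal string, so each symbol needs its own call
literal = prices.str.replace("$", "", regex=False).str.replace(",", "", regex=False)

# regex=True: one character class removes both "$" and "," in a single pass
pattern = prices.str.replace(r"[\$,]", "", regex=True)

print(literal.tolist())
print(pattern.tolist())
```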

4 replies

mrdbourke Nov 26, 2023
Maintainer Author

A comment from Mophers, is this taken care of?

tmcm31416 Dec 20, 2023

Hey all - just starting out on the AI/ML course and was also banging my head against a wall on the replace of the $ from the Price dataframe.
Here is the solution as of today, nothing else worked for me - and I spent like 5 hours trying permutations of all solutions.
I could get the replace to work on the commas and the decimal points, but NOT ON the $ sign.
I don't know why the "r" makes a difference, but it does. Maybe someone with more mastery of pandas/python might be able to add context. Here's what worked:
car_sales["Price"] = car_sales["Price"].str.replace(r"\$", "", regex=True)

Note the backslash before the $ sign inside the quotes (it tends to disappear when this comment is saved). Without it the code won't work.

To remove the commas from the string, make the same call with "," in place of the backslash-$ pattern.

Once these two replaces are done, convert to float to use as numbers for the plot() and hist() function calls:

car_sales["Price"] = car_sales["Price"].astype(float)

Hope this helps folks.

Marcogoodie Dec 24, 2023

~~Think the r stands for regular syntax which nullifies special chars~~

Edit: Apologies, r stands for raw string, it's a python thing, more info linked here: https://gaurav-patil21.medium.com/why-do-we-use-r-in-regular-expressions-regex-52100dc7ce41
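
For anyone curious why the backslash matters, here is a rough sketch using Python's built-in re module rather than pandas (an illustration of the regex rules, not the course code). In a regex, a bare $ means "end of string", so it has to be escaped to match a literal dollar sign:

```python
import re

# An unescaped "$" is an end-of-string anchor, not a literal dollar sign
no_escape = re.sub("$", "", "$4,000")   # nothing visible is removed
escaped = re.sub(r"\$", "", "$4,000")   # the literal "$" is removed

# A raw string keeps the backslash as typed; a normal string needs it doubled
same_pattern = (r"\$" == "\\$")

print(no_escape, escaped, same_pattern)
```

So the r prefix itself is not what removes the $; it just makes it easier to write the backslash that the regex needs.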

gitjbnguyen Apr 1, 2024

For anyone having trouble with Section 6: Viewing and Selecting Data with Pandas Part 2 video

Here is the solution to the invalid escape sequence error. It involves converting the column from object to string, applying the str.replace() method, and finally converting it to the integer data type so that numeric calculations work:

# Price's datatype is object, so convert the 'Price' column to string data type first
car_sales['Price'] = car_sales['Price'].astype(str)

# Remove price column symbols with .str.replace(), using double backslashes (or a raw string, r'[\$\,\.]') so Python does not report an invalid escape sequence
car_sales["Price"] = car_sales["Price"].str.replace('[\\$\\,\\.]', '', regex=True)

# Now convert it to an integer
car_sales['Price'] = car_sales['Price'].astype(int)

# Optional: removing the decimal point leaves the two cents digits attached, so the prices are 100 times too large; integer-divide by 100 to fix this
car_sales['Price'] = (car_sales['Price'].astype(int) // 100)
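
A quick sketch of why the integer division by 100 is needed, assuming prices formatted like "$4,000.00" (the two rows below are made up):

```python
import pandas as pd

# Hypothetical rows in the same format as the course's Price column
car_sales = pd.DataFrame({"Price": ["$4,000.00", "$22,000.00"]})

# Stripping "$", "," and "." glues the cents digits onto the dollars
cleaned = car_sales["Price"].str.replace(r"[\$,\.]", "", regex=True).astype(int)

# Integer-dividing by 100 drops the two cents digits again
dollars = cleaned // 100

print(cleaned.tolist())
print(dollars.tolist())
```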

mrdbourke · 2023-09-15T02:07:36Z

mrdbourke
Sep 15, 2023
Maintainer Author

Matplotlib notebook updates

General workflow

In later (2022 onwards) versions of Jupyter Notebooks, %matplotlib inline is no longer needed, see this Stack Overflow thread for more

Trying to plot non-numeric columns

In previous versions of matplotlib, trying to plot strings on certain axes would return an error. It now looks like it's possible (however, the plot will likely not look as good as it does when plotting numerics). For example:

# Note: In previous versions of matplotlib and pandas, having the "Price" column as a string would
# return an error
car_sales["Price"] = car_sales["Price"].astype(str)
# car_sales["Price"] = car_sales["Price"].astype(int) # Turning the Price column into an integer looks better

# Plot a scatter plot (does not look as good as with .astype(int))
car_sales.plot(x="Odometer (KM)", y="Price", kind="scatter");

Seaborn plotting styles namespace change

The seaborn plot styles that ship with matplotlib have been renamed to include a version prefix (e.g. seaborn-whitegrid -> seaborn-v0_8-whitegrid). If you'd like seaborn styling, matplotlib now recommends going directly to the seaborn API.
- You can see the new seaborn styles namespace by using plt.style.available and then use them with plt.style.use("seaborn-v0_8-whitegrid")
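
A small sketch of looking up the renamed styles. The fallback to "default" is my own assumption, added so the snippet also runs on matplotlib versions that do not ship the v0_8 names:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# List whichever seaborn-derived styles this matplotlib install ships
seaborn_styles = [name for name in plt.style.available if name.startswith("seaborn")]
print(seaborn_styles)

# Use the renamed style if present, otherwise fall back to matplotlib's default
preferred = "seaborn-v0_8-whitegrid"
style_name = preferred if preferred in plt.style.available else "default"
plt.style.use(style_name)
```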

0 replies

mrdbourke · 2023-09-17T23:13:39Z

mrdbourke
Sep 17, 2023
Maintainer Author

Scikit-Learn notebook updates

plot_roc_curve is not available in Scikit-Learn 1.2+, see: The plot_roc_curve is not supported in the shown version #65
Default number of n_estimators for RandomForestClassifier and RandomForestRegressor is now n_estimators=100, changed the notebook to reflect this as the baseline (e.g. for hyperparameter tuning, instead of going from 10 -> 100, went from 100 -> 200)
Renamed dictionary with hyperparameters to search across with RandomizedSearchCV from grid to param_distributions
Renamed dictionary with hyperparameters to search across with GridSearchCV from grid_2 to param_grid

RandomForestClassifier

The "max_features" parameter no longer supports "auto". The options are: {"sqrt", "log2", None}, int or float, default="sqrt", see: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
Updated the RandomizedSearchCV section to include this, see:

# Hyperparameter grid RandomizedSearchCV will search over
param_distributions = {"n_estimators": [10, 100, 200, 500, 1000, 1200],
                       "max_depth": [None, 5, 10, 20, 30],
                       "max_features": ["sqrt", "log2", None],
                       "min_samples_split": [2, 4, 6],
                       "min_samples_leaf": [1, 2, 4]}

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

np.random.seed(42)

# Split into X & y
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Set n_jobs to -1 to use all available cores on your machine (if this causes errors, try n_jobs=1)
clf = RandomForestClassifier(n_jobs=-1)

# Setup RandomizedSearchCV
rs_clf = RandomizedSearchCV(estimator=clf,
                            param_distributions=param_distributions,
                            n_iter=20, # try 20 models total
                            cv=5, # 5-fold cross-validation
                            verbose=2) # print out results

# Fit the RandomizedSearchCV version of clf
rs_clf.fit(X_train, y_train);

Creation of train/validation/test set

Changed creation of train/validation/test sets from indexing to random splitting.

I find this cleaner and less prone to error.

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Set the seed
np.random.seed(42)

# Read in the data
heart_disease = pd.read_csv("../data/heart-disease.csv")

# Split into X (features) & y (labels)
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Training and test split (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Create validation and test split by splitting testing data in half (30% test -> 15% validation, 15% test)
X_valid, X_test, y_valid, y_test = train_test_split(X_test, y_test, test_size=0.5)

clf = RandomForestClassifier()
clf.fit(X_train, y_train)

# Make predictions
y_preds = clf.predict(X_valid)

# Evaluate the classifier (evaluate_preds is a helper function defined earlier in the notebook)
baseline_metrics = evaluate_preds(y_valid, y_preds)
baseline_metrics
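
The evaluate_preds helper is not shown in this post. Below is a hypothetical stand-in, assuming it returns the four metrics imported above as a dictionary. The notebook's own version may differ in rounding and printing:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate_preds(y_true, y_preds):
    """Hypothetical stand-in: return common classification metrics as a dict."""
    return {
        "accuracy": round(float(accuracy_score(y_true, y_preds)), 2),
        "precision": round(float(precision_score(y_true, y_preds)), 2),
        "recall": round(float(recall_score(y_true, y_preds)), 2),
        "f1": round(float(f1_score(y_true, y_preds)), 2),
    }

# Tiny made-up example: 3 of 4 labels predicted correctly
example_metrics = evaluate_preds([1, 0, 1, 1], [1, 0, 0, 1])
print(example_metrics)
```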

Pipeline upgrades

Removed "auto" from pipeline grid search dictionary (old: "model__max_features": ["auto", "sqrt"], new: "model__max_features": ["sqrt"])
- Full code:

pipe_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"], # note the double underscore after each prefix "preprocessor__"
    "model__n_estimators": [100, 1000],
    "model__max_depth": [None, 5],
    "model__max_features": ["sqrt"],
    "model__min_samples_split": [2, 4]
}
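
To show where the double-underscore names come from, here is a sketch of a pipeline whose step names match the grid keys. The step names "preprocessor", "num", "imputer" and "model" are taken from the keys above; the "Doors" column and the use of RandomForestRegressor are assumptions for illustration:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Hypothetical numeric column name
numeric_features = ["Doors"]
numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="mean"))])
preprocessor = ColumnTransformer(
    transformers=[("num", numeric_transformer, numeric_features)]
)
model = Pipeline(steps=[("preprocessor", preprocessor),
                        ("model", RandomForestRegressor())])

pipe_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "model__n_estimators": [100, 1000],
    "model__max_depth": [None, 5],
    "model__max_features": ["sqrt"],
    "model__min_samples_split": [2, 4],
}

# Every grid key should be a parameter name the pipeline actually exposes
valid_names = set(model.get_params().keys())
unknown = [name for name in pipe_grid if name not in valid_names]
print(unknown)
```

Checking the keys against get_params() like this is a cheap way to catch a typo in a step name before starting a long grid search.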

4.2.1 Classification model evaluation metrics - ROC Curve

Added code example of using sklearn.metrics.RocCurveDisplay.from_estimator (available since Scikit-Learn 1.0; it replaces plot_roc_curve, which was removed in 1.2), as discussed in The plot_roc_curve is not supported in the shown version #65

from sklearn.metrics import RocCurveDisplay
roc_curve_display = RocCurveDisplay.from_estimator(estimator=clf, 
                                                   X=X_test, 
                                                   y=y_test)

1 reply

gitjbnguyen Apr 18, 2024

Using these lines of code results in the following warning (the message is shown after the code):
df_large_error = df.copy()
df_large_error.iloc[0]["squared_differences"] = 16 # increase squared differences for 1 sample

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use df.loc[row_indexer, "col"] = values instead, to perform the assignment in a single step and ensure this keeps updating the original df.

But this can be fixed with this solution:
df_large_error = df.copy()
df_large_error.iloc[0, df_large_error.columns.get_loc("squared_differences")] = 16
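
Here is a runnable sketch of the single-step assignment on a made-up DataFrame (the values are hypothetical). It shows both the label-based .loc form that the warning suggests and the position-based .iloc form from the fix above:

```python
import pandas as pd

# Hypothetical stand-in for the notebook's df
df = pd.DataFrame({"actual": [3.0, 5.0], "squared_differences": [1.0, 4.0]})
df_large_error = df.copy()

# Label-based: row label and column name in one .loc call
df_large_error.loc[df_large_error.index[0], "squared_differences"] = 16

# Position-based: look up the column position, then use one .iloc call
col_pos = df_large_error.columns.get_loc("squared_differences")
df_large_error.iloc[1, col_pos] = 20

print(df_large_error)
```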

sithucodes · 2023-10-01T00:21:17Z

sithucodes
Oct 1, 2023

In the lecture Hyperparameter tuning with RandomizedSearchCV

Remove auto in max_features.

Changed in version 1.1: The default of max_features changed from "auto" to 1.0.

Read more in scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

0 replies

sithucodes · 2023-10-16T12:07:05Z

sithucodes
Oct 16, 2023

Getting `False` in `pd.api.types.is_string_dtype(df_tmp["UsageBand"])`

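Daniel got True in the video lecture, but with newer versions of pandas you will get False.

As far as I can tell, since pandas 2.0 an object-dtype column only counts as a string dtype if all of its elements are strings, and UsageBand contains missing values. So, you can check for the object dtype instead: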

pd.api.types.is_object_dtype(df_tmp["UsageBand"])

Read more here -> https://pandas.pydata.org/docs/reference/api/pandas.api.types.is_object_dtype.html
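
A small sketch of the two checks on a made-up column with a missing value (the values are hypothetical, and dtype=object is set explicitly to mimic the UsageBand column). The is_string_dtype result depends on your pandas version, so it is printed rather than assumed:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_tmp["UsageBand"]: strings plus a missing value
usage_band = pd.Series(["Low", np.nan, "High"], dtype=object)

is_object = pd.api.types.is_object_dtype(usage_band)
is_string = pd.api.types.is_string_dtype(usage_band)  # version-dependent result

print(pd.__version__, is_object, is_string)
```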

2 replies

mkchong0710 Apr 2, 2024

Thanks a lot for the highlight! Really helps!

SleepyKumiho Apr 22, 2024

I was so confused here thank you!

mrdbourke · 2023-10-26T02:32:12Z

mrdbourke
Oct 26, 2023
Maintainer Author

TensorFlow Notebook Updates

Due to changes in workflow/TensorFlow library updates, going to remake the Dog Vision project.

This will use the latest version of TensorFlow (2.14.0, as of October 2023).

Currently the notebook is saved under the filename end-to-end-dog-vision-v2.ipynb.

See: https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-4-unstructured-data-projects/end-to-end-dog-vision-v2.ipynb

0 replies

arpadikuma · 2023-11-22T00:30:46Z

arpadikuma
Nov 22, 2023

There's an increase in students encountering errors when installing jupyter or creating a conda env with jupyter, with Python 3.12 somehow already installed.

The error message always is something like this:

pin-1 is not installable because it requires
    python 3.12.*, which conflicts with any installable versions previously reported.
    
Pins seem to be involved in the conflict. Currently pinned specs:
 - python 3.12.* (labeled as 'pin-1')

You can install jupyter with pip: pip install jupyter

However, since Python 3.12 is still very new and the possibility of encountering compatibility issues is still high, I recommend the following:

create a new env with conda with just python 3.10 or 3.11
activate this new env
install the required libraries

i.e., go to the folder you want to create the env in and execute:
conda create -p ./env python=3.10 (for python 3.10, for python 3.11 replace the 10 with 11)

Then activate that env you just created
(from the same location, do conda activate ./env)

Then install the libraries you wanted to have on that env, i.e. conda install scikit-learn numpy pandas matplotlib jupyter
And any others if necessary
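
Once the env is activated, a quick way to confirm which Python it uses is a check like the one below. The function name is made up for illustration, and the accepted versions simply mirror the 3.10/3.11 recommendation above:

```python
import sys

def is_recommended_python(version_info=sys.version_info):
    """Return True for Python 3.10 or 3.11, the versions suggested above."""
    return tuple(version_info[:2]) in {(3, 10), (3, 11)}

# Prints the running interpreter's version and whether it matches the suggestion
print(sys.version.split()[0], is_recommended_python())
```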

0 replies


Working on updates for 2024

TODO

Working on

Done
