
Commit 0d2a836

update to use new arviz
1 parent 6788e3c commit 0d2a836

14 files changed: +3592, -3262 lines

examples/bart/bart_categorical_hawks.ipynb

Lines changed: 652 additions & 1733 deletions
Large diffs are not rendered by default.

examples/bart/bart_categorical_hawks.myst.md

Lines changed: 19 additions & 51 deletions
@@ -5,9 +5,9 @@ jupytext:
     format_name: myst
     format_version: 0.13
 kernelspec:
-  display_name: pymc-examples
+  display_name: pymc
   language: python
-  name: pymc-examples
+  name: python3
 myst:
   substitutions:
     conda_dependencies: pymc-bart
@@ -35,7 +35,7 @@ In this example, we will model outcomes with more than two categories.
 import os
 import warnings
 
-import arviz as az
+import arviz.preview as az
 import matplotlib.pyplot as plt
 import numpy as np
 import pandas as pd
@@ -51,7 +51,7 @@ warnings.simplefilter(action="ignore", category=FutureWarning)
 ```{code-cell} ipython3
 # set formats
 RANDOM_SEED = 8457
-az.style.use("arviz-darkgrid")
+az.style.use("arviz-variat")
 ```
 
 ## Hawks dataset
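
The notebooks in this commit switch from the classic `arviz` namespace to `arviz.preview` and from the `arviz-darkgrid` style to `arviz-variat`. A rough sketch of that import pattern follows, with a fallback for environments whose ArviZ release does not yet ship the preview namespace; the try/except fallback is an assumption, not part of this commit.

```python
# Sketch of the import pattern adopted in this commit; the fallback branch is
# illustrative only.
try:
    import arviz.preview as az  # new ArviZ interface
    az.style.use("arviz-variat")
except ImportError:
    import arviz as az  # classic interface
    az.style.use("arviz-darkgrid")
```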
@@ -134,7 +134,11 @@ with model_hawks:
 
 ### Variable Importance
 
-It may be that some of the input variables are not informative for classifying by species, so in the interest of parsimony and in reducing the computational cost of model estimation, it is useful to quantify the importance of each variable in the dataset. PyMC-BART provides the function {func}`~pymc_bart.plot_variable_importance()`, which generates a plot that shows on his x-axis the number of covariables and on the y-axis the R$^2$ (the square of the Pearson correlation coefficient) between the predictions made for the full model (all variables included) and the restricted models, those with only a subset of the variables. The error bars represent the 94 % HDI from the posterior predictive distribution.
+It may be that some of the input variables are not informative for classifying by species, so in the interest of parsimony, and to reduce the computational cost of model estimation, it is useful to quantify the importance of each variable in the dataset. PyMC-BART provides the functions {func}`~pymc_bart.compute_variable_importance()` and {func}`~pymc_bart.plot_variable_importance()`, which work together to generate a plot with the number of covariables on the x-axis and, on the y-axis, the R$^2$ between the predictions of the full model (all variables included) and those of the restricted models (the ones with only a subset of the variables). The error bars represent the 94% HDI from the posterior predictive distribution.
+
+```{code-cell} ipython3
+idata.posterior_predictive
+```
 
 ```{code-cell} ipython3
 ---
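
The two variable-importance functions are meant to be chained: the compute step measures the R$^2$ between full-model and restricted-model predictions, and the plot step renders those results. A minimal sketch is below; the argument names and the `mu_bart` / `x_data` variables are assumptions for illustration, not taken from this notebook, so check the pymc-bart documentation for the exact signature.

```python
# Illustrative sketch only: "mu_bart" stands for the BART random variable in
# model_hawks and "x_data" for the covariate matrix; both names are hypothetical.
vi_results = pmb.compute_variable_importance(idata, bartrv=mu_bart, X=x_data)
pmb.plot_variable_importance(vi_results);
```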
@@ -190,37 +194,19 @@ Hawks.groupby("Species").count()
 
 ### Predicted vs Observed
 
-Now we are going to compare the predicted data with the observed data to evaluate the fit of the model, we do this with the Arviz function `az.plot_ppc()`.
+We are going to evaluate the model by comparing the predicted data against the observed data. This can be tricky to do with categorical data (and binary or ordinal data as well). For this reason, we use PAV-adjusted calibration plots as described by {cite:t}`dimitriadis2021` and, in a Bayesian context, by {cite:t}`säilynoja2025`.
 
-```{code-cell} ipython3
-ax = az.plot_ppc(idata, kind="kde", num_pp_samples=200, random_seed=123)
-# plot aesthetics
-ax.set_ylim(0, 0.7)
-ax.set_yticks([0, 0.2, 0.4, 0.6])
-ax.set_ylabel("Probability")
-ax.set_xticks([0.5, 1.5, 2.5])
-ax.set_xticklabels(["CH", "RT", "SS"])
-ax.set_xlabel("Species");
-```
+In these plots the observed events are replaced with conditional event probabilities (CEPs): the probability that an event occurs given that the classifier has assigned a specific predicted probability.
 
-We can see a good agreement between the observed data (black line) and those predicted by the model (blue and orange lines). As we mentioned before, the difference in the values between species is influenced by the amount of data for each one. Here there is no observed dispersion in the predicted data as we saw in the previous two plots.
-
-+++
-
-Below we see that the in-sample predictions provide very good agreement with the observations.
+We can easily generate this type of plot with the ArviZ function `az.plot_ppc_pava()`. Since this function also works for binary and ordinal data, we need to specify `data_type="categorical"`. You can read more about these plots [here](https://arviz-devs.github.io/EABM/Chapters/Prior_posterior_predictive_checks.html#posterior-predictive-checks-for-discrete-data).
 
 ```{code-cell} ipython3
-np.mean((idata.posterior_predictive["y"] - y_0) == 0) * 100
+az.plot_ppc_pava(idata, data_type="categorical");
 ```
 
-```{code-cell} ipython3
-all = 0
-for i in range(3):
-    perct_per_class = np.mean(idata.posterior_predictive["y"].where(y_0 == i) == i) * 100
-    all += perct_per_class
-    print(perct_per_class)
-all
-```
+Each subplot shows one category versus the others. The ideal calibration curve is the diagonal, represented by the gray dashed line, where the predicted probabilities equal the observed frequencies. If the curve is above the diagonal the model is underestimating the probabilities, and if it is below the diagonal the model is overestimating them.
+
++++
 
 So far we have a very good result concerning the classification of the species based on the 5 covariables. However, if we want to select a subset of covariable to perform future classifications is not very clear which of them to select. Maybe something sure is that `Tail` could be eliminated. At the beginning when we plot the distribution of each covariable we said that the most important variables to make the classification could be `Wing`, `Weight` and, `Culmen`, nevertheless after running the model we saw that `Hallux`, `Culmen` and, `Wing`, proved to be the most important ones.
 
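For intuition on what `az.plot_ppc_pava()` computes, here is a self-contained sketch of the PAV idea for a single class, using isotonic regression (fitted by the pool-adjacent-violators algorithm) to turn binary outcomes into conditional event probabilities. This is only an illustration of the concept on synthetic data, not the ArviZ implementation.

```python
# Minimal PAV-calibration sketch on synthetic data (illustration only).
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
p_pred = np.sort(rng.uniform(0, 1, size=500))  # predicted probabilities for one class
y_obs = rng.binomial(1, 0.1 + 0.8 * p_pred)    # synthetic, slightly miscalibrated outcomes

# Isotonic regression of outcomes on predicted probabilities yields the CEPs;
# plotting p_pred against cep and comparing with the identity line gives the
# calibration curve.
cep = IsotonicRegression(out_of_bounds="clip").fit(p_pred, y_obs).predict(p_pred)
```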
@@ -270,35 +256,17 @@ With all these together, we can select `Hallux`, `Culmen`, and, `Wing` as covari
 
 +++
 
-Concerning the comparison between observed and predicted data, we obtain the same good result with less uncertainty for the predicted values (blue lines). And the same counts for the in-sample comparison.
-
-```{code-cell} ipython3
-ax = az.plot_ppc(idata_t, kind="kde", num_pp_samples=100, random_seed=123)
-ax.set_ylim(0, 0.7)
-ax.set_yticks([0, 0.2, 0.4, 0.6])
-ax.set_ylabel("Probability")
-ax.set_xticks([0.5, 1.5, 2.5])
-ax.set_xticklabels(["CH", "RT", "SS"])
-ax.set_xlabel("Species");
-```
-
-```{code-cell} ipython3
-np.mean((idata_t.posterior_predictive["y"] - y_0) == 0) * 100
-```
+Concerning the comparison between observed and predicted data, we can see that the model shows better calibration, as the curves are closer to the diagonal and the bands are generally narrower.
 
 ```{code-cell} ipython3
-all = 0
-for i in range(3):
-    perct_per_class = np.mean(idata_t.posterior_predictive["y"].where(y_0 == i) == i) * 100
-    all += perct_per_class
-    print(perct_per_class)
-all
+az.plot_ppc_pava(idata_t, data_type="categorical");
 ```
 
 ## Authors
 - Authored by [Pablo Garay](https://github.com/PabloGGaray) and [Osvaldo Martin](https://aloctavodia.github.io/) in May, 2024
 - Updated by Osvaldo Martin in Dec, 2024
 - Expanded by [Alex Andorra](https://github.com/AlexAndorra) in Feb, 2025
+- Updated by Osvaldo Martin in Dec, 2025
 
 +++
 
examples/bart/bart_heteroscedasticity.ipynb

Lines changed: 1459 additions & 118 deletions
Large diffs are not rendered by default.

Lines changed: 166 additions & 0 deletions (new file)
@@ -0,0 +1,166 @@
+---
+jupyter:
+  jupytext:
+    formats: ipynb,md
+    text_representation:
+      extension: .md
+      format_name: markdown
+      format_version: '1.3'
+      jupytext_version: 1.16.1
+  kernelspec:
+    display_name: Python 3 (ipykernel)
+    language: python
+    name: python3
+---
+
+(bart_heteroscedasticity)=
+# Modeling Heteroscedasticity with BART
+
+:::{post} January, 2023
+:tags: BART, regression
+:category: beginner, reference
+:author: Juan Orduz
+:::
+
+
+In this notebook we show how to use BART to model heteroscedasticity as described in Section 4.1 of [`pymc-bart`](https://github.com/pymc-devs/pymc-bart)'s paper {cite:p}`quiroga2022bart`. We use the `marketing` data set provided by the R package `datarium` {cite:p}`kassambara2019datarium`. The idea is to model a marketing channel contribution to sales as a function of budget.
+
+```python
+import os
+
+import arviz as az
+import matplotlib.pyplot as plt
+import numpy as np
+import pandas as pd
+import pymc as pm
+import pymc_bart as pmb
+```
+
+```python
+%config InlineBackend.figure_format = "retina"
+az.style.use("arviz-darkgrid")
+plt.rcParams["figure.figsize"] = [10, 6]
+rng = np.random.default_rng(42)
+```
+
+## Read Data
+
+```python
+try:
+    df = pd.read_csv(os.path.join("..", "data", "marketing.csv"), sep=";", decimal=",")
+except FileNotFoundError:
+    df = pd.read_csv(pm.get_data("marketing.csv"), sep=";", decimal=",")
+
+n_obs = df.shape[0]
+
+df.head()
+```
+
+## EDA
+
+We start by looking into the data. We are going to focus on *Youtube*.
+
+```python
+fig, ax = plt.subplots()
+ax.plot(df["youtube"], df["sales"], "o", c="C0")
+ax.set(title="Sales as a function of Youtube budget", xlabel="budget", ylabel="sales");
+```
+
+We clearly see that both the mean and variance are increasing as a function of budget. One possibility is to manually select an explicit parametrization of these functions, e.g. square root or logarithm. However, in this example we want to learn these functions from the data using a BART model.
+
+
+## Model Specification
+
+We proceed to prepare the data for modeling. We are going to use the `budget` as the predictor and `sales` as the response.
+
+```python
+X = df["youtube"].to_numpy().reshape(-1, 1)
+Y = df["sales"].to_numpy()
+```
+
+Next, we specify the model. Note that we just need one BART distribution which can be vectorized to model both the mean and variance. We use a Gamma distribution as likelihood as we expect the sales to be positive.
+
+```python
+with pm.Model() as model_marketing_full:
+    w = pmb.BART("w", X=X, Y=np.log(Y), m=100, shape=(2, n_obs))
+    y = pm.Gamma("y", mu=pm.math.exp(w[0]), sigma=pm.math.exp(w[1]), observed=Y)
+
+pm.model_to_graphviz(model=model_marketing_full)
+```
+
+We now fit the model.
+
+```python
+with model_marketing_full:
+    idata_marketing_full = pm.sample(2000, random_seed=rng, compute_convergence_checks=False)
+    posterior_predictive_marketing_full = pm.sample_posterior_predictive(
+        trace=idata_marketing_full, random_seed=rng
+    )
+```
+
+## Results
+
+We can now visualize the posterior predictive distribution of the mean and the likelihood.
+
+```python
+posterior_mean = idata_marketing_full.posterior["w"].mean(dim=("chain", "draw"))[0]
+
+w_hdi = az.hdi(ary=idata_marketing_full, group="posterior", var_names=["w"], hdi_prob=0.5)
+
+pps = az.extract(
+    posterior_predictive_marketing_full, group="posterior_predictive", var_names=["y"]
+).T
+```
+
+```python
+idx = np.argsort(X[:, 0])
+
+
+fig, ax = plt.subplots()
+az.plot_hdi(
+    x=X[:, 0],
+    y=pps,
+    ax=ax,
+    hdi_prob=0.90,
+    fill_kwargs={"alpha": 0.3, "label": r"Observations $90\%$ HDI"},
+)
+az.plot_hdi(
+    x=X[:, 0],
+    hdi_data=np.exp(w_hdi["w"].sel(w_dim_0=0)),
+    ax=ax,
+    fill_kwargs={"alpha": 0.6, "label": r"Mean $50\%$ HDI"},
+)
+ax.plot(df["youtube"], df["sales"], "o", c="C0", label="Raw Data")
+ax.legend(loc="upper left")
+ax.set(
+    title="Sales as a function of Youtube budget - Posterior Predictive",
+    xlabel="budget",
+    ylabel="sales",
+);
+```
+
+The fit looks good! In fact, we see that the mean and variance increase as a function of the budget.
+
+
+## Authors
+- Authored by [Juan Orduz](https://juanitorduz.github.io/) in Feb, 2023
+- Rerun by Osvaldo Martin in Mar, 2023
+- Rerun by Osvaldo Martin in Nov, 2023
+- Rerun by Osvaldo Martin in Dec, 2024
+
+
+## References
+:::{bibliography}
+:filter: docname in docnames
+:::
+
+
+## Watermark
+
+```python
+%load_ext watermark
+%watermark -n -u -v -iv -w -p pytensor
+```
+
+:::{include} ../page_footer.md
+:::
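
For reference, a minimal sketch (not part of this commit) of how both BART components of this model could be inspected after sampling; it assumes the `w_dim_0` coordinate indexes the two components, matching the `w_hdi["w"].sel(w_dim_0=0)` call above.

```python
# Illustrative sketch: w[0] models log(mu) and w[1] models log(sigma) of the
# Gamma likelihood, so exponentiating recovers both on the sales scale.
w_mean = idata_marketing_full.posterior["w"].mean(dim=("chain", "draw"))
mu_hat = np.exp(w_mean.sel(w_dim_0=0)).to_numpy()     # estimated mean of sales vs budget
sigma_hat = np.exp(w_mean.sel(w_dim_0=1)).to_numpy()  # estimated sd of sales vs budget

order = np.argsort(X[:, 0])
fig, ax = plt.subplots()
ax.plot(X[order, 0], mu_hat[order], label="mean")
ax.plot(X[order, 0], sigma_hat[order], label="standard deviation")
ax.set(xlabel="budget", ylabel="sales")
ax.legend();
```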

examples/bart/bart_heteroscedasticity.myst.md

Lines changed: 19 additions & 33 deletions
@@ -6,7 +6,7 @@ jupytext:
     format_name: myst
     format_version: 0.13
 kernelspec:
-  display_name: Python 3 (ipykernel)
+  display_name: pymc
   language: python
   name: python3
 ---
@@ -27,7 +27,7 @@ In this notebook we show how to use BART to model heteroscedasticity as describe
 ```{code-cell} ipython3
 import os
 
-import arviz as az
+import arviz.preview as az
 import matplotlib.pyplot as plt
 import numpy as np
 import pandas as pd
@@ -37,7 +37,7 @@ import pymc_bart as pmb
 
 ```{code-cell} ipython3
 %config InlineBackend.figure_format = "retina"
-az.style.use("arviz-darkgrid")
+az.style.use("arviz-variat")
 plt.rcParams["figure.figsize"] = [10, 6]
 rng = np.random.default_rng(42)
 ```
@@ -103,40 +103,26 @@ with model_marketing_full:
 We can now visualize the posterior predictive distribution of the mean and the likelihood.
 
 ```{code-cell} ipython3
-posterior_mean = idata_marketing_full.posterior["w"].mean(dim=("chain", "draw"))[0]
-
-w_hdi = az.hdi(ary=idata_marketing_full, group="posterior", var_names=["w"], hdi_prob=0.5)
-
-pps = az.extract(
-    posterior_predictive_marketing_full, group="posterior_predictive", var_names=["y"]
-).T
+posterior_predictive_marketing_full
 ```
 
 ```{code-cell} ipython3
-idx = np.argsort(X[:, 0])
-
-
-fig, ax = plt.subplots()
-az.plot_hdi(
-    x=X[:, 0],
-    y=pps,
-    ax=ax,
-    hdi_prob=0.90,
-    fill_kwargs={"alpha": 0.3, "label": r"Observations $90\%$ HDI"},
-)
-az.plot_hdi(
-    x=X[:, 0],
-    hdi_data=np.exp(w_hdi["w"].sel(w_dim_0=0)),
-    ax=ax,
-    fill_kwargs={"alpha": 0.6, "label": r"Mean $50\%$ HDI"},
+dt_plot = az.from_dict(
+    {
+        "posterior_predictive": {
+            "y": posterior_predictive_marketing_full.posterior_predictive["y"]
+        },
+        "observed_data": {"y": df["sales"].values},
+        "constant_data": {"budget": X[:, 0]},
+    },
+    dims={
+        "mean": ["budget_dim"],
+        "y": ["budget_dim"],
+        "budget": ["budget_dim"],
+    },
 )
-ax.plot(df["youtube"], df["sales"], "o", c="C0", label="Raw Data")
-ax.legend(loc="upper left")
-ax.set(
-    title="Sales as a function of Youtube budget - Posterior Predictive",
-    xlabel="budget",
-    ylabel="sales",
-);
+
+az.plot_lm(dt_plot, x="budget", y="y");
 ```
 
 The fit looks good! In fact, we see that the mean and variance increase as a function of the budget.