Description
When dplyr
or dbplyr
joins data frames or tables, the column names are case sensitive, so two columns with the same name but different case are considered different. When generating the SQL, table aliases are only prefixed to column names in the select clause when they are not unique column names. However, since the case of column names in a SQL statement is not considered, at least not in MariaDB, it results in an error as seen below.
library(odbc)
library(dbplyr)
conn <- dbConnect(drv = odbc::odbc(),
Driver = "MySQL")
# create a simple table called test_table1 with one column called COL_1 (all caps)
dbExecute(conn, "create table test_table1 as
select 1 as `COL_1`
union select 2 as `COL_1`
union select 3 as `COL_1`")
# create another table called test_table2 with the same structure of test_table1 but with the column called col_1 (lowercase)
dbExecute(conn, "create table test_table2 as
select 1 as `col_1`
union select 2 as `col_1`
union select 3 as `col_1`")
# join the two tables and collect results
tbl(conn, "test_table1") %>%
left_join(tbl(conn, "test_table2"), by = c("COL_1" = "col_1")) %>%
collect()
# Error in new_result(connection@ptr, statement, immediate) :
# nanodbc/nanodbc.cpp:1412: 00000: [RStudio][MySQL] Column 'COL_1' in field list is ambiguous
# The query being executed
tbl(conn, "test_table1") %>%
left_join(tbl(conn, "test_table2"), by = c("COL_1" = "col_1")) %>%
show_query()
# <SQL>
# SELECT `COL_1`
# FROM `test_table1` AS `LHS`
# LEFT JOIN `test_table2` AS `RHS`
# ON (`LHS`.`COL_1` = `RHS`.`col_1`)
The error occurs because both tables have a column named col_1
and case is not considered when deciding which one to select. If the table alias was included in the select clause, it would have worked.
When the column names are identical, there's no issue.
# create test_table3 which is identical to test_table1
dbExecute(conn, "create table test_table3 as
select 1 as `COL_1`
union select 2 as `COL_1`
union select 3 as `COL_1`")
# join the two tables and collect results
tbl(conn, "test_table1") %>%
left_join(tbl(conn, "test_table3"), by = "COL_1") %>%
collect()
## A tibble: 3 × 1
# COL_1
# <int>
#1 1
#2 2
#3 3
# The query being executed
tbl(conn, "test_table1") %>%
left_join(tbl(conn, "test_table3"), by = "COL_1") %>%
show_query()
# <SQL>
# SELECT `LHS`.`COL_1` AS `COL_1`
# FROM `test_table1` AS `LHS`
# LEFT JOIN `test_table3` AS `RHS`
# ON (`LHS`.`COL_1` = `RHS`.`COL_1`)
Now there's no issue because the column name is prefixed with the table alias in the select clause.
This doesn't work as expected due to the case differences:
tbl(conn, "test_table1") %>%
left_join(tbl(conn, "test_table2"), by = "COL_1") %>%
collect()
# Error: `by` can't contain join column `COL_1` which is missing from RHS.
It would be useful if column names were always prefixed with the table alias, or if there was the option to allow this.