BUG: `read_stata` always uses 'utf8' #21244

#### Code Sample, a copy-pastable example if possible

```python
import pandas
data = pandas.read_stata(file_with_latin1_encoding, chunksize=1048576)
for chunk in data:
    pass # do something with chunk (never reached)
```
This raises `UnicodeDecodeError: 'utf8' codec can't decode byte 0x?? in position ?: invalid start byte`.
OK, so the file isn't **utf8** encoded (even though `StataReader` doesn't advertise any Unicode support in the first place). I then tried to open it with the **latin-1** encoding:
```python
import pandas
data = pandas.read_stata(file_with_latin1_encoding, chunksize=1048576, encoding='latin-1')
for chunk in data:
    pass # do something with chunk (never reached)
```
This raises the same exception at exactly the same place (still **utf-8**).

#### Problem description

This is a problem because `read_stata` appears to ignore the `encoding` argument.
I think this line introduced a bug. The `StataReader` doesn't handle any encodings other than **ascii** and **latin-1**.

Changing line **1338** of the `pandas.io.stata` module:
```python
        return s.decode('utf-8')
```
to:
```python
        return s.decode('latin-1')
```
seemed to make everything work, and I could read the data from the given file.
Even better, changing it to the following:
```python
        return s.decode(self._encoding or self._default_encoding)
```
also seems to have made it work.
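
To show the failure and the fallback outside of pandas, here is a minimal standalone sketch. The `decode_field` helper and its parameters are hypothetical stand-ins for the decoding step discussed above, not the actual pandas code, and the null-byte padding is an assumption about how fixed-width string fields are stored.

```python
def decode_field(raw, encoding=None, default_encoding="latin-1"):
    """Hypothetical stand-in for the reader's string-decoding step."""
    # Assumption: fixed-width string fields are padded with null bytes,
    # so only the bytes before the first null are decoded.
    raw = raw.partition(b"\x00")[0]
    # Prefer the caller's encoding; fall back to the default otherwise.
    return raw.decode(encoding or default_encoding)


# A latin-1 encoded value as it might sit in a fixed-width field.
raw = "café".encode("latin-1") + b"\x00\x00"

try:
    raw.decode("utf-8")  # what a hard-coded utf-8 codec effectively does
except UnicodeDecodeError as exc:
    print("utf-8 failed:", exc.reason)

print(decode_field(raw))                      # uses the latin-1 default
print(decode_field(raw, encoding="latin-1"))  # honours an explicit encoding
```

With a hard-coded codec, the `encoding` argument can never influence the result, which matches the behaviour reported here.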

I believe, though, that to make this work with **Unicode** too, the following encodings would have to be added to `VALID_ENCODINGS`: **utf-8**, **utf8**, **iso10646**.
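
As an alternative to listing every alias by hand, a check could resolve names through Python's codec registry. This is only a sketch: `canonical`, `is_supported`, and `ALLOWED` are hypothetical names, and the allow-list below is illustrative rather than the real contents of `VALID_ENCODINGS`. It also gives a quick way to check whether a given alias is registered with Python at all.

```python
import codecs


def canonical(name):
    """Return Python's canonical codec name, or None if the name is unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


# Illustrative allow-list only; not the contents of pandas' VALID_ENCODINGS.
ALLOWED = {canonical(n) for n in ("ascii", "latin-1", "utf-8")}


def is_supported(name):
    """Accept any alias that resolves to one of the allowed codecs."""
    resolved = canonical(name)
    return resolved is not None and resolved in ALLOWED


for alias in ("utf8", "utf-8", "latin1", "iso-8859-1", "us-ascii"):
    print(alias, is_supported(alias))
```

The design choice is that aliases such as **utf8** and **utf-8** resolve to the same codec, so only one entry per codec needs to be maintained.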

#### Expected Output
The file should be read and parsed correctly.

#### Output of ``pd.show_versions()``

<details>
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-37-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: ro_RO.UTF-8
LANG: ro_RO.UTF-8
LOCALE: None.None

pandas: 0.24.0.dev0+41.gb2eec25
pytest: 3.2.3
pip: 9.0.3
setuptools: 36.6.0
Cython: 0.28.2
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.6.3
patsy: None
dateutil: 2.7.3
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.4.9
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 3.8.0
bs4: None
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
</details>
