BUG: read_stata always uses 'utf8' #21244

@adrian-castravete

Code Sample, a copy-pastable example if possible

import pandas
data = pandas.read_stata(file_with_latin1_encoding, chunksize=1048576)
for chunk in data:
    pass # do something with chunk (never reached)

This raises UnicodeDecodeError: 'utf8' codec can't decode byte 0x?? in position ?: invalid start byte.
OK, so the file isn't a utf8 one. StataReader doesn't document any Unicode support anyway, so I then tried to open the file with a latin-1 encoding:

import pandas
data = pandas.read_stata(file_with_latin1_encoding, chunksize=1048576, encoding='latin-1')
for chunk in data:
    pass # do something with chunk (never reached)

This raises the same exception at exactly the same place (still utf-8).
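
The failure mode can be reproduced without a Stata file. The following is a minimal sketch, assuming a made-up byte string `raw` that stands in for a latin-1 encoded string stored in the file; it is not taken from the reporter's data.

```python
# Hypothetical sample: the text "Año" as it would be stored in a latin-1 file.
raw = "Año".encode("latin-1")  # b'A\xf1o'

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # Same exception class as the one reported above.
    utf8_ok = False

# Decoding with the encoding the bytes were written in recovers the text.
latin1_text = raw.decode("latin-1")
print(utf8_ok, latin1_text)
```

If the decoder is hard-coded to utf-8, passing a different `encoding` argument cannot change this outcome, which matches the behaviour described above.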

Problem description

This is a problem because read_stata doesn't appear to honour the encoding argument.
I think this line introduced a bug. StataReader doesn't handle any data other than ascii or latin-1.

Changing line 1338 of the pandas.io.stata module:

        return s.decode('utf-8')

to:

        return s.decode('latin-1')

seemed to make everything work, and I could read the data from the given file.
Even better, changing it to the following:

        return s.decode(self._encoding or self._default_encoding)

also seems to have made it work.
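
As a standalone illustration of the proposed change, here is a minimal sketch of a null-terminating decoder that prefers a caller-supplied encoding and otherwise falls back to a default. The function name `null_terminate` and the constant `DEFAULT_ENCODING` are assumptions made for this example; this is not the pandas implementation.

```python
from typing import Optional

DEFAULT_ENCODING = "latin-1"  # assumed fallback, mirroring the suggestion above


def null_terminate(raw: bytes, encoding: Optional[str] = None) -> str:
    """Cut a fixed-width field at the first NUL byte and decode the rest."""
    head = raw.partition(b"\0")[0]
    return head.decode(encoding or DEFAULT_ENCODING)


# A NUL-padded latin-1 field, decoded with the default.
print(null_terminate(b"caf\xe9\x00\x00\x00"))
# A NUL-padded utf-8 field, decoded with an explicit encoding.
print(null_terminate("café".encode("utf-8") + b"\x00", encoding="utf-8"))
```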

I believe, though, that to make this work with Unicode too you'd have to add the following encodings to VALID_ENCODINGS: utf-8, utf8, iso10646.
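
One way to avoid listing every spelling of an encoding name is to normalize names through Python's codec registry before checking them. The sketch below assumes a hypothetical allow-list `ALLOWED` and helper `normalize_encoding`; it does not reflect how pandas validates encodings, and it only accepts spellings that Python's `codecs` module itself recognizes.

```python
import codecs

# Hypothetical allow-list; entries are stored under Python's canonical codec names.
ALLOWED = {codecs.lookup(name).name for name in ("ascii", "latin-1", "utf-8")}


def normalize_encoding(name: str) -> str:
    """Return the canonical codec name, or raise ValueError if it is not allowed."""
    try:
        canonical = codecs.lookup(name).name
    except LookupError:
        raise ValueError(f"unknown encoding: {name!r}") from None
    if canonical not in ALLOWED:
        raise ValueError(f"unsupported encoding: {name!r}")
    return canonical


# Different spellings of the same codec normalize to the same name.
print(normalize_encoding("utf8") == normalize_encoding("UTF-8"))
print(normalize_encoding("ISO-8859-1") == normalize_encoding("latin-1"))
```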

Expected Output

The file should be correctly read and parsed

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-37-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: ro_RO.UTF-8
LANG: ro_RO.UTF-8
LOCALE: None.None

pandas: 0.24.0.dev0+41.gb2eec25
pytest: 3.2.3
pip: 9.0.3
setuptools: 36.6.0
Cython: 0.28.2
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.6.3
patsy: None
dateutil: 2.7.3
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.4.9
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 3.8.0
bs4: None
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Activity

added this to the 0.24.0 milestone on May 30, 2018
removed this from the 0.24.0 milestone on Jun 5, 2018

leolovethewayyoulie commented on Mar 11, 2020

Having the same issue just today. Changing line 1339 from site-packages/pandas/io/stata.py fixed it:

def _null_terminate(self, s):
    # have bytes not strings, so must decode
    s = s.partition(b"\0")[0]
    return s.decode('latin-1')  # instead of s.decode(self._encoding)

Hi, I am having the same issue. When I exported the Stata file to a CSV file and used pd.read_csv("file csv", encoding="latin-1"), it worked. But when I passed the same argument to pd.read_stata("file dta", encoding="latin-1"), I got "FutureWarning encoding is...". Even when I tried your approaches (including the _null_terminate change), nothing changed.
Do you have any suggestions for me? Thank you!

bashtage commented on Mar 11, 2020
Contributor

What version is the DTA file you are creating?
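
For readers who want to answer this question for their own file, the format version can usually be read from the first bytes. The sketch below is an assumption-laden illustration: it assumes that newer files (format 117 and later) begin with the ASCII text `<stata_dta><header><release>` followed by three digits, and that older files store the version number in the first byte. The helper name `dta_format_version` is made up for this example; check Stata's dta documentation before relying on the layout.

```python
def dta_format_version(head: bytes) -> int:
    """Return the format version found in the first bytes of a .dta file."""
    prefix = b"<stata_dta><header><release>"
    if head.startswith(prefix):
        # Newer XML-like header: three ASCII digits follow the release tag.
        return int(head[len(prefix):len(prefix) + 3])
    # Older binary header: the first byte is the version number.
    return head[0]


# Synthetic headers, not real files.
print(dta_format_version(b"<stata_dta><header><release>118</release>"))
print(dta_format_version(bytes([114, 2, 1, 0])))
```

With a real file, the input would be something like `open(path, "rb").read(31)`, where `path` is a placeholder for the file location.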

leolovethewayyoulie commented on Mar 11, 2020

What version is the DTA file you are creating?

Stata 16.
I read up on this version and found that its encoding is "ISO-8859-1".
I have already exported the dta to CSV, and passing the encoding worked there.
But the problem with encoding in read_stata is this warning:
C:\Users\USER\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: the 'encoding' keyword is deprecated and will be removed in a future version. Please take steps to stop the use of 'encoding'
  """Entry point for launching an IPython kernel.
:(

bashtage commented on Mar 11, 2020
Contributor

FWIW "ISO-8859-1" is latin-1.
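
A quick way to check this in Python, using only the standard library: both names decode every possible byte value to the same text, and each byte maps directly to the code point with the same number. This is also why decoding as latin-1 never raises, even when the bytes were written in some other encoding.

```python
all_bytes = bytes(range(256))

as_latin1 = all_bytes.decode("latin-1")
as_iso = all_bytes.decode("ISO-8859-1")

# Both spellings give identical results for every byte value.
same = as_latin1 == as_iso
# Byte value N always becomes code point N, so no byte can fail to decode.
maps_directly = [ord(ch) for ch in as_latin1] == list(range(256))
print(same, maps_directly)
```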

leolovethewayyoulie commented on Mar 11, 2020

Sure, but since the file is really large, I could send it by email. Can I have your email address? I will send my CSV as well.
Thank you so much

leolovethewayyoulie commented on Mar 11, 2020

FWIW "ISO-8859-1" is latin-1.

Yep, so what I'm trying to say is that the dta file is encoded as "latin-1", since the CSV file exported from this dta file can be read with encoding "ISO-8859-1". In other words, here is my situation:

  • a = pd.read_stata(r"E:\file.dta", encoding="ISO-8859-1") --> doesn't work, result:
    C:\Users\USER\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: the 'encoding' keyword is deprecated and will be removed in a future version. Please take steps to stop the use of 'encoding'
      """Entry point for launching an IPython kernel.
  • b = pd.read_csv(r"E:\file(exported from file dta).csv", encoding="ISO-8859-1") --> works

bashtage commented on Mar 11, 2020
Contributor

You could share with dropbox or google drive as well to kevin.k.sheppard@gmail.com

leolovethewayyoulie commented on Mar 11, 2020

You could share with dropbox or google drive as well to kevin.k.sheppard@gmail.com

I have sent you my data through google drive
Thank you so much for your help!

bashtage commented on Mar 11, 2020
Contributor

AFAICT pandas reads the file correctly. You get a warning that the file does not have the correct format. This warning is correct, since this is a Stata DTA 118 file, which must be utf-8 encoded per Stata's dta documentation. However, it is latin-1 encoded. This happens when an older dta file is loaded into Stata and then saved in 118 format. If you think this should be fixed, you should contact Stata, since this is their bug.
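
The behaviour described here (try the encoding the format requires, warn if that fails, then fall back to latin-1) can be sketched as follows. The function name `decode_with_fallback`, the warning text, and the choice of `UnicodeWarning` are assumptions for illustration; this is not copied from pandas.

```python
import warnings


def decode_with_fallback(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode with the expected encoding; warn and use latin-1 if that fails."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        warnings.warn(
            f"could not decode using {encoding}; falling back to latin-1",
            UnicodeWarning,
            stacklevel=2,
        )
        # latin-1 accepts every byte value, so this step cannot raise.
        return raw.decode("latin-1")


# Hypothetical latin-1 bytes found in a file that should have been utf-8.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    text = decode_with_fallback(b"caf\xe9")

print(text, len(caught))
```

Under this design the data is still returned, and the warning is the only sign that the file does not follow its declared format.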

bashtage commented on Mar 11, 2020
Contributor

Works in pandas 1.0.1.

leolovethewayyoulie commented on Mar 11, 2020

Works in pandas 1.0.1.

Okie, I'll install pandas 1.0.1 to try
In the meantime, can you give me your command?
Thank you so much

bashtage commented on Mar 11, 2020
Contributor

import pandas as pd
pd.read_stata("data.dta")

leolovethewayyoulie commented on Mar 11, 2020

import pandas as pd
pd.read_stata("data.dta")

Haha, thank you so much dude,
since I installed the newest version, it works, although it still shows the warning, but I guess that's alright @@
Thank you so much ❤️❤️❤️❤️❤️

Metadata

Assignees: No one assigned
Labels: IO Stata (read_stata, to_stata), Unicode (Unicode strings)
Type: No type
Projects: No projects
Relationships: None yet
Participants: @jreback, @jorisvandenbossche, @adrian-castravete, @toobaz, @hudcap

BUG: `read_stata` always uses 'utf8' · Issue #21244 · pandas-dev/pandas