Description
Code Sample, a copy-pastable example if possible
import pandas

data = pandas.read_stata(file_with_latin1_encoding, chunksize=1048576)
for chunk in data:
    pass  # do something with chunk (never reached)
This raises UnicodeDecodeError: 'utf8' codec can't decode byte 0x?? in position ?: invalid start byte.
OK, so the file isn't a UTF-8 one. Even though the StataReader doesn't claim any Unicode support, I then try to open it with a latin-1 encoding:
import pandas

data = pandas.read_stata(file_with_latin1_encoding, chunksize=1048576, encoding='latin-1')
for chunk in data:
    pass  # do something with chunk (never reached)
This raises the same exception at exactly the same place (still utf-8).
Problem description
This is a problem because it appears that read_stata doesn't honour the encoding argument.
I think this line introduced a bug. The StataReader doesn't handle any encoding other than ascii or latin-1.
Changing line 1338 of the pandas.io.stata module:
return s.decode('utf-8')
to:
return s.decode('latin-1')
Seemed to make everything work and I could read the data from the given file.
Even better, changing it to the following:
return s.decode(self._encoding or self._default_encoding)
also seems to have made it work.
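To illustrate why the hard-coded codec matters, here is a minimal standalone sketch. It is not the pandas source: null_terminate and its encoding parameter are hypothetical names used only for this example, which assumes a fixed-width, NUL-padded string field like the ones stored in a dta file.

```python
def null_terminate(raw: bytes, encoding: str = "latin-1") -> str:
    """Cut a fixed-width byte field at the first NUL byte, then decode it.

    Hypothetical helper for illustration; not the pandas implementation.
    """
    end = raw.find(b"\x00")
    if end != -1:
        raw = raw[:end]
    return raw.decode(encoding)


# A latin-1 encoded field, padded with NUL bytes to a fixed width.
field = "café".encode("latin-1") + b"\x00\x00\x00"

print(null_terminate(field, "latin-1"))  # café

try:
    null_terminate(field, "utf-8")
except UnicodeDecodeError as exc:
    # Byte 0xe9 is not valid UTF-8 here, so a hard-coded utf-8 decode fails.
    print("utf-8 decode failed:", exc.reason)
```

Passing the reader's configured encoding through to the decode call, as in the last suggested change, is what lets the same bytes decode cleanly.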
I believe, though, that if you want to make this work with Unicode too, you'd have to add the following encodings to VALID_ENCODINGS: utf-8, utf8, iso10646.
Expected Output
The file should be correctly read and parsed
Output of pd.show_versions()
pandas: 0.24.0.dev0+41.gb2eec25
pytest: 3.2.3
pip: 9.0.3
setuptools: 36.6.0
Cython: 0.28.2
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.6.3
patsy: None
dateutil: 2.7.3
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.4.9
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 3.8.0
bs4: None
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Activity
BUG: Fix handling of encoding for the StataReader pandas-dev#21244
leolovethewayyoulie commented on Mar 11, 2020
Hi, I am having the same issue. When I exported the Stata file to a CSV file and read it with pd.read_csv("file csv", encoding="latin-1"), it worked. But when I passed the same argument to pd.read_stata("file dta", encoding="latin-1"), I got "FutureWarning: encoding is...". Even when I tried your fixes, nothing changed (even the _null_terminate....)
Do you have any suggestions for me? Thank you!
bashtage commented on Mar 11, 2020
What version is the DTA file you are creating?
leolovethewayyoulie commented on Mar 11, 2020
Stata 16
I looked into this version and found that its encoding is "ISO-8859-1".
I have already exported the dta to CSV, and reading it with that encoding worked.
But the problem with encoding in read_stata is:
C:\Users\USER\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: the 'encoding' keyword is deprecated and will be removed in a future version. Please take steps to stop the use of 'encoding'
"""Entry point for launching an IPython kernel."""
:(
bashtage commented on Mar 11, 2020
bashtage commented on Mar 11, 2020
FWIW "ISO-8859-1" is latin-1.
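If you want to check this yourself, a quick sketch using only the standard library's codecs module is below. same_codec is a hypothetical helper name for this example; it compares the canonical codec names that Python resolves each alias to.

```python
import codecs


def same_codec(a: str, b: str) -> bool:
    """Return True if two encoding names resolve to the same Python codec."""
    return codecs.lookup(a).name == codecs.lookup(b).name


print(same_codec("ISO-8859-1", "latin-1"))  # True
print(same_codec("latin-1", "utf-8"))       # False
```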
leolovethewayyoulie commented on Mar 11, 2020
Sure, but since the file is really large, I might send it by email. Can I have your email address? I will send my CSV as well.
Thank you so much
leolovethewayyoulie commented on Mar 11, 2020
Yep, so what I'm trying to say is that the dta file is encoded as "latin-1", since the CSV file exported from this dta file can be read with the encoding "ISO-8859-1". In other words, here is my situation:
"""Entry point for launching an IPython kernel."""
bashtage commented on Mar 11, 2020
You could share with dropbox or google drive as well to kevin.k.sheppard@gmail.com
leolovethewayyoulie commented on Mar 11, 2020
I have sent you my data through Google Drive.
Thank you so much for your help!
bashtage commented on Mar 11, 2020
AFAICT pandas reads the file correctly. You get a warning that the file does not have the correct format. This warning is correct, since this is a Stata DTA 118 file, which must be utf-8 encoded per Stata's dta documentation. However, it is latin-1 encoded. This happens when an older dta file is loaded into Stata and then saved in 118 format. If you think this should be fixed, you should contact Stata, since this is their bug.
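The behaviour described here (a warning about the format, but the data still being read) can be sketched roughly as follows. This is an illustration under assumptions, not the actual pandas code: decode_with_fallback is a hypothetical name, and the warning class and message are placeholders.

```python
import warnings


def decode_with_fallback(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode with the expected encoding; warn and fall back to latin-1 on failure.

    Hypothetical helper for illustration; not the pandas implementation.
    """
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        warnings.warn(
            f"Field is not valid {encoding}; falling back to latin-1.",
            UnicodeWarning,
        )
        # latin-1 maps every byte value to a character, so this cannot fail.
        return raw.decode("latin-1")


# Bytes written as latin-1 inside a file whose format says utf-8.
mislabeled = "café".encode("latin-1")
print(decode_with_fallback(mislabeled))  # café (after emitting a UnicodeWarning)
```

The fallback is safe in the sense that it never raises, but it can only recover the intended text if the bytes really were latin-1.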
bashtage commented on Mar 11, 2020
Works in pandas 1.0.1.
leolovethewayyoulie commented on Mar 11, 2020
Okay, I'll install pandas 1.0.1 and try it.
In the meantime, can you share the command you used?
Thank you so much
bashtage commented on Mar 11, 2020
leolovethewayyoulie commented on Mar 11, 2020
Haha, thank you so much, dude.
After I installed the newest version, it worked. It still shows the warning, but I guess that's alright @@
Thank you so much ❤️❤️❤️❤️❤️