Python: UnicodeDecodeError: 'utf8' codec can't decode byte
I'm reading a bunch of RTF files into Python strings. On some of the texts, I get this error: Traceback (most recent call last): File '11.08.py', line 47, in X = vect
Solution 1:
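Open the file with an explicit encoding so that it is decoded as it is read:
import codecs
f = codecs.open(dir+location, 'r', encoding='utf-8')
txt = f.read()
f.close()
From that point on, txt is a unicode string and you can use it anywhere in your code. Note that this only works if the file really is UTF-8. If it is not, read() raises the same UnicodeDecodeError, and you need to pass the file's actual encoding (or errors='ignore') to codecs.open.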
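If you want to generate UTF-8 files after your processing, write the encoded text to a separate file opened in binary mode (the f above was opened for reading), and close it when you are done:
out = open(out_path, 'wb')  # out_path is a placeholder for your output file
out.write(txt.encode('utf-8'))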
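For readers on Python 3, here is a minimal sketch of the same idea that also copes with files that are not UTF-8. The helper name read_text, the fallback order, and the sample bytes are assumptions made for illustration; they do not come from the question.

```python
import os
import tempfile


def read_text(path, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes cleanly."""
    with open(path, "rb") as fh:
        raw = fh.read()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Only reached if no candidate worked (latin-1 itself never fails).
    return raw.decode(encodings[0], errors="replace"), encodings[0] + " (lossy)"


# Hypothetical sample file: "café" stored as Latin-1, which is not valid UTF-8.
with tempfile.NamedTemporaryFile(delete=False, suffix=".rtf") as tmp:
    tmp.write(b"caf\xe9")
    sample_path = tmp.name

text, used_encoding = read_text(sample_path)
os.remove(sample_path)
print(repr(text), used_encoding)
```

Trying UTF-8 first is deliberate: it is strict enough to reject most non-UTF-8 input, whereas latin-1 accepts anything and therefore has to come last.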
Solution 2:
As I said on the mailing list, it is probably easiest to use the charset_error option and set it to 'ignore'.
If the file is actually UTF-16, you can also set the charset to utf-16 in the Vectorizer. (In more recent scikit-learn releases these two options are named decode_error and encoding.)
See the docs.
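To see what setting the error handling to ignore actually does, here is a small Python 3 sketch using plain bytes.decode, whose error modes have the same names. The sample strings are made up for illustration, and the sketch does not call the vectorizer itself.

```python
# Bytes that are valid Latin-1 but not valid UTF-8 (hypothetical sample text).
raw = "na\u00efve caf\u00e9".encode("latin-1")

try:
    raw.decode("utf-8")              # default mode is 'strict'
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False                # this is the error from the question

ignored = raw.decode("utf-8", errors="ignore")    # silently drops bad bytes
replaced = raw.decode("utf-8", errors="replace")  # marks them with U+FFFD

# If the data is really UTF-16, naming the right codec needs no error handler.
utf16_raw = "h\u00e9llo".encode("utf-16")
utf16_text = utf16_raw.decode("utf-16")

print(strict_ok, repr(ignored), repr(replaced), repr(utf16_text))
```

Note that ignore loses characters without warning, so it is best kept for cases where a few dropped letters do not matter, such as bag-of-words features.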
Solution 3:
You can dump the CSV file rows into a JSON file without any encoding error as follows (Python 2 only; json.dump has no encoding argument in Python 3):
import json
json.dump(row, jsonfile, encoding="ISO-8859-1")
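On Python 3 the decoding has to happen when the CSV is read, not when the JSON is written. The following is a minimal sketch of that, assuming the CSV is encoded as ISO-8859-1; the column names and values are invented, and an in-memory buffer stands in for the real file.

```python
import csv
import io
import json

# Stand-in for the bytes of a CSV file on disk (hypothetical data).
csv_bytes = "name,city\nJos\u00e9,M\u00e1laga\n".encode("iso-8859-1")

# Decode on the way in. With a real file this would be
# open(path, newline="", encoding="iso-8859-1") instead of io.StringIO.
reader = csv.DictReader(io.StringIO(csv_bytes.decode("iso-8859-1")))
rows = [dict(row) for row in reader]

# The rows are ordinary str values now, so json needs no encoding hint.
json_text = json.dumps(rows, ensure_ascii=False)
print(json_text)
```

With ensure_ascii=False the accented characters are written as-is; when writing json_text to a file, open that file with encoding="utf-8".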
Solution 4:
Keep this line:
vectorizer = TfidfVectorizer(encoding='latin-1', sublinear_tf=True, max_df=0.5, stop_words='english')
Setting encoding='latin-1' worked for me.
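The reason latin-1 makes the error go away is that it maps every possible byte to a character, so decoding can never fail. That is not the same as decoding correctly, as this small Python 3 sketch shows (the sample character is chosen only for illustration):

```python
# Every byte value 0..255 decodes under latin-1, and the result round-trips.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")
round_trip = decoded.encode("latin-1") == all_bytes

# But text that was really UTF-8 comes out as mojibake, not as an error.
utf8_bytes = "\u00e9".encode("utf-8")        # the two bytes b'\xc3\xa9'
mojibake = utf8_bytes.decode("latin-1")      # two characters instead of one

print(len(decoded), round_trip, repr(mojibake))
```

So latin-1 is a safe choice when the files really are Latin-1 (or when a few wrong characters are acceptable), but it will hide an encoding mismatch instead of reporting it.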