
Python: UnicodeDecodeError: 'utf8' codec can't decode byte

I'm reading a bunch of RTF files into Python strings. On SOME texts, I get this error:

Traceback (most recent call last):
File '11.08.py', line 47, in X = vect

Solution 1:

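If the files really are UTF-8, open them with an explicit encoding so the bytes are decoded as they are read:

import codecs

# dir and location are the variables from the question's code
f = codecs.open(dir+location, 'r', encoding='utf-8')
txt = f.read()
f.close()

From that point on, txt is a unicode string and you can use it anywhere in your code.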

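If you want to generate UTF-8 files after your processing, encode the text and write it to a separate file opened for binary writing (f above was opened read-only):

with open(out_path, 'wb') as out:  # out_path is a hypothetical output file name
    out.write(txt.encode('utf-8'))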
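When only some files fail, it can help to see which byte is being rejected before picking a fix. The following is a minimal sketch, assuming Python 3 (it should also run on 2.7). read_text is a hypothetical helper, and the cp1252 fallback is only a guess about how the RTF files were saved, not something the question confirms.

```python
import io


def read_text(path, encodings=("utf-8", "cp1252")):
    """Return (text, encoding_used), trying each encoding in turn.

    The default encoding list is an assumption for illustration only.
    """
    # Read raw bytes first so a failed decode can be inspected and retried.
    with io.open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first byte the codec rejected.
            bad = raw[exc.start:exc.start + 1]
            print("%s failed at byte %d: %r" % (enc, exc.start, bad))
    raise ValueError("none of %r could decode %s" % (encodings, path))
```

Calling text, used = read_text(dir+location) then tells you which encoding actually worked for each file, which is usually more informative than silently forcing one.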

Solution 2:

As I said on the mailing list, it is probably easiest to set the Vectorizer's charset_error option to 'ignore'. If the file is actually UTF-16, you can also set the charset to utf-16 in the Vectorizer. In newer scikit-learn releases these options are named decode_error and encoding. See the docs.
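To see what "ignore" actually does to the text, here is a small standard-library sketch, assuming Python 3. It does not use the Vectorizer itself; as far as I can tell the option is simply handed to the same error-handler mechanism that bytes.decode uses, but treat that as an assumption. The sample bytes are made up for illustration.

```python
# Latin-1 encoded bytes (made-up sample); 0xEF and 0xE9 are not valid UTF-8 here.
raw = b"na\xefve caf\xe9"

try:
    raw.decode("utf-8")  # the default handler is "strict"
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)

# "ignore" silently drops the bytes it cannot decode.
ignored = raw.decode("utf-8", errors="ignore")    # 'nave caf'

# "replace" substitutes U+FFFD so the damage stays visible.
replaced = raw.decode("utf-8", errors="replace")  # 'na\ufffdve caf\ufffd'

# If the data is really UTF-16, decoding with the right codec loses nothing.
utf16_bytes = u"caf\u00e9".encode("utf-16")
roundtrip = utf16_bytes.decode("utf-16")
```

Note that "ignore" loses characters, so it is fine for bag-of-words features but not if you need the original text back.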

Solution 3:

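You can dump the CSV rows to a JSON file without an encoding error by telling json.dump how the byte strings are encoded (Python 2 only; json.dump no longer accepts an encoding argument in Python 3):

json.dump(row, jsonfile, encoding="ISO-8859-1")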
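On Python 3 the same idea moves to the point where the file is opened. A rough sketch, assuming the CSV has a header row and really is ISO-8859-1; csv_to_json and its parameter names are hypothetical, not taken from the original answer.

```python
import csv
import json


def csv_to_json(csv_path, json_path, encoding="ISO-8859-1"):
    """Read a CSV file in the given encoding and write its rows as UTF-8 JSON."""
    # Decoding happens here, so the csv module only ever sees str objects.
    with open(csv_path, "r", encoding=encoding, newline="") as src:
        rows = list(csv.DictReader(src))
    # ensure_ascii=False keeps accented characters readable in the output.
    with open(json_path, "w", encoding="utf-8") as dst:
        json.dump(rows, dst, ensure_ascii=False)
    return rows
```

Leaving ensure_ascii at its default of True also works; the output is then pure ASCII with \uXXXX escapes.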

Solution 4:

Pass the encoding to the vectorizer:

vectorizer = TfidfVectorizer(encoding='latin-1', sublinear_tf=True, max_df=0.5, stop_words='english')

encoding='latin-1' worked for me.
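A likely reason latin-1 "works" is that it maps every possible byte value to a character, so decoding can never raise. That does not mean the result is right. A short standard-library sketch, assuming Python 3, with made-up sample data:

```python
# Latin-1 assigns a code point to each of the 256 byte values,
# so decoding arbitrary bytes never raises UnicodeDecodeError.
every_byte = bytes(range(256))
text = every_byte.decode("latin-1")

# The catch: bytes that are really UTF-8 come out as mojibake.
utf8_bytes = u"\u00e9".encode("utf-8")    # b'\xc3\xa9'
mojibake = utf8_bytes.decode("latin-1")   # two characters: U+00C3, U+00A9

# The wrong decoding is reversible because latin-1 is a 1:1 byte mapping.
repaired = mojibake.encode("latin-1").decode("utf-8")
```

So this option is safe against crashes, but if the files are actually UTF-8 or cp1252, some tokens will be garbled.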
