How to avoid the "decoding to str: need a bytes-like object" error in pandas?

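import pandas as pd

# error_bad_lines was removed in pandas 2.0; use on_bad_lines='skip' there
data = pd.read_csv('asscsv2.csv', encoding="ISO-8859-1", error_bad_lines=False)
data_text = data[['content']].copy()  # copy so the new column is not set on a view
data_text['index'] = data_text.index
documents = data_text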


                                              content  index
 0  Pretty extensive background in Egyptology and ...      0
 1  Have you guys checked the back end of the Sphi...      1


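import gensim
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()

def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        # Keep tokens that are not stopwords and are longer than 3 characters
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

processed_docs = documents['content'].map(preprocess)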


TypeError: decoding to str: need a bytes-like object, float found

This line:

processed_docs = documents['content'].map(preprocess)

fails because some cells in the data frame contain NaN values, which are floats and cannot be preprocessed as text. Drop those rows first:

documents.dropna(subset=["content"], inplace=True)  # drop rows whose content is NaN

and then apply the preprocessing.
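
To confirm that missing values are the cause before dropping anything, it can help to count them. Here is a minimal sketch, assuming a small made-up frame in place of asscsv2.csv and a hypothetical stand-in tokenizer (simple_tokens) in place of preprocess, so that it runs without gensim or NLTK:

```python
import numpy as np
import pandas as pd

# Made-up sample data standing in for the real CSV; one cell is missing.
documents = pd.DataFrame({
    "content": ["Pretty extensive background", np.nan, "Have you guys checked"],
})

def simple_tokens(text):
    # Hypothetical stand-in for preprocess(): lowercase, keep words longer than 3 chars.
    return [tok for tok in text.lower().split() if len(tok) > 3]

# Count the missing cells that would break a string-only function.
n_missing = documents["content"].isna().sum()

# Drop the rows with missing content, then map the function.
cleaned = documents.dropna(subset=["content"])
processed_docs = cleaned["content"].map(simple_tokens)
```

Note that the surviving rows keep their original index labels, so there is a gap where the dropped row was. Call reset_index(drop=True) afterwards if you need a contiguous index.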

Your data contains NaNs (Not a Number), which is how pandas marks missing values.

You can either drop them first:

documents = documents.dropna(subset=['content'])

Or you can fill all NaNs with an empty string, convert the column to string type, and then map your string-based function.
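
A sketch of the fill-and-convert option, again assuming made-up data and a hypothetical stand-in (simple_tokens) for preprocess:

```python
import numpy as np
import pandas as pd

# Made-up sample data; the second cell is missing.
documents = pd.DataFrame({"content": ["Some text here", np.nan]})

def simple_tokens(text):
    # Hypothetical stand-in for preprocess(); any str-only function behaves the same way.
    return text.lower().split()

# Replace NaN with "", force every cell to str, then apply the function.
processed_docs = documents["content"].fillna("").astype(str).map(simple_tokens)
```

Unlike dropping, this keeps every row, so the result stays aligned with the original frame. Rows that were missing simply produce an empty token list.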


Either step is needed because preprocess calls functions that accept only strings.
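
The "float found" in the message is the NaN itself, since a missing cell reaches the mapped function as the float nan. As far as I can tell, gensim converts non-str input with the decoding form of str(), which is what raises this exact message; treat that detail about gensim's internals as an assumption. The same error can be reproduced in plain Python:

```python
nan_cell = float("nan")  # what a missing cell looks like to the mapped function

try:
    # Decoding form of str(): only valid for bytes-like objects.
    str(nan_cell, "utf-8")
    message = None
except TypeError as exc:
    message = str(exc)

print(message)
```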

