Original:
Goodreads-books
comprehensive list of all books listed in goodreads
The primary reason for creating this dataset is the requirement of a good clean dataset of books. Being a bookie myself (see what I did there?) I had searched for datasets on books in kaggle itself - and I found out that while most of the datasets had a good amount of books listed, there were either a) major columns missing or b) grossly unclean data. I mean, you can't determine how good a book is just from a few text reviews, come on! What I needed were numbers, solid integers and floats that say how many people liked the book or hated it, how much did they like it, and stuff like that. Even the good dataset that I found was well-cleaned, it had a number of interlinked files, which increased the hassle. This prompted me to use the Goodreads API to get a well-cleaned dataset, with the promising features only ( minus the redundant ones ), and the result is the dataset you're at now.
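Translation:
Goodreads-books
A comprehensive list of all the books listed on Goodreads
The main reason for creating this dataset is the need for a good, clean dataset of books. As a book lover myself, I searched Kaggle for book datasets and found that, although most of them listed a good number of books, they either a) were missing major columns or b) contained grossly unclean data. You can't judge how good a book is from a few text reviews alone. What I needed were numbers: solid integers and floats that show how many people liked or hated a book, how much they liked it, and so on. Even the one good, well-cleaned dataset I found was split across a number of interlinked files, which added to the hassle. This prompted me to use the Goodreads API to build a well-cleaned dataset containing only the promising features (minus the redundant ones), and the result is the dataset you are looking at now.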
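You can download the dataset from the official site, and I have also shared a copy on Baidu Netdisk. Follow my official account and reply "2020122901" to get the download link.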