Lianjia Analysis Project

Ⅰ. Project Introduction

Lianjia is a real estate agency whose market share ranks first or second in many big Chinese cities. Using supervised machine-learning techniques, I built statistical graphs, a price-prediction model, and a sentiment analysis of app reviews to study the second-hand housing market in Hangzhou, and I made some suggestions for Lianjia’s website and app. In the process, I learned how to apply the technical skills acquired in my marketing analytics class to real-world problems.

Ⅱ. Datasets and Preparation

1. Second-hand house listings in Hangzhou published on Lianjia’s website

I scraped the data myself from hz.lianjia.com.
First, I imported the packages and set up ChromeDriver. I used Selenium with ChromeDriver to scrape the listing pages and then processed the data with the BeautifulSoup and pandas packages.

#library
import pandas as pd
import re
import time

from bs4 import BeautifulSoup
from selenium import webdriver

#set pandas display options and start ChromeDriver
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 5)
pd.set_option('display.width',800)
driver = webdriver.Chrome(executable_path='C:\\Users\\W\\Desktop\\chromedriver-85.exe')

I observed that the URL changes by appending ‘/pgn’ when clicking ‘next page’, so I used a ‘for’ loop with an ‘if/else’ statement to build the URL for each page.

#set link
info_one_store=[]
for page in range(1, 101):
    one_info={}
    if page == 1:
        url = 'https://hz.lianjia.com/ershoufang/rs/'
    else:
        url = 'https://hz.lianjia.com/ershoufang/pg'+str(page)+'/'
    driver.get(url)

After loading each page, I located every house card with an XPath selector, parsed each card’s HTML with BeautifulSoup, and extracted the fields I needed, partly with regular expressions. I stored each listing in a dictionary and appended that dictionary to a list.

    #find information
    info=driver.find_elements_by_xpath("//div[@class='info clear']")
    r=0
    #clean
    for r in range(len(info)):
        one_info={}
        soup=BeautifulSoup(info[r].get_attribute('innerHTML'),'html.parser')
        
        try:
            one_info_totalprice=soup.find('div',attrs={'class':'totalPrice'}).text[0:-1]
        except:
            one_info_totalprice=""
        one_info['total_price']=float(one_info_totalprice) if one_info_totalprice else None
    
        try:
            one_info_unitprice=soup.find('div',attrs={'class':'unitPrice'}).text[2:-4]
        except:
            one_info_unitprice=""
        one_info['unit_price']=one_info_unitprice
    
        try:
            one_info_title=soup.find('div',attrs={'class':'title'}).text
        except:
            one_info_title=""
        one_info['title']=one_info_title
        
        try:
            one_info_location=soup.find('div',attrs={'class':'positionInfo'}).text
        except:
            one_info_location=""
        one_info['detail_address']=one_info_location.split('    -  ')[0]
        one_info['district']=one_info_location.split('    -  ')[1]
    
        try:
            one_info_houseinfo=soup.find('div',attrs={'class':'houseInfo'}).text
        except:
            one_info_houseinfo=""
        one_info['number_of_rooms']=one_info_houseinfo.split(' | ')[0][0]
        one_info['number_of_halls']=one_info_houseinfo.split(' | ')[0][2]
        one_info['size']=one_info_houseinfo.split(' | ')[1][0:-2]
        one_info['direction']=one_info_houseinfo.split(' | ')[2]
        one_info['renovation']=one_info_houseinfo.split(' | ')[3]
        one_info['floor']=one_info_houseinfo.split(' | ')[-2][0]
        one_info['building_structure']=one_info_houseinfo.split(' | ')[-1]
        
        try:
            one_info_followinfo=soup.find('div',attrs={'class':'followInfo'}).text
        except:
            one_info_followinfo="" 
        #extract the digits of the follower count from the first part of the follow info
        nf_list=re.findall(r'\d',one_info_followinfo.split(' / ')[0])
        one_info['number_of_followers']=int("".join(nf_list)) if nf_list else 0
        
        info_one_store.append(one_info)
    #pause
    time.sleep(3)

At the end, I closed the browser, converted the list into a dataframe with pandas, and saved it as a CSV file.

driver.close()
dta=pd.DataFrame.from_dict(info_one_store)
#output to csv file
dta.to_csv("C:\\Users\\W\\Desktop\\ma_datafile.csv",encoding='utf_8_sig')

2. Ratings and Reviews of the Lianjia App on the Apple App Store over the Last Four Years

I downloaded this dataset from qimai.cn.
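As a quick sanity check, the downloaded file can be previewed with pandas. This is a minimal sketch; the file path and the ‘Reviews’/‘N_stars’ columns are the ones used in the sentiment-analysis section below.

import pandas as pd

#preview the review dataset downloaded from qimai.cn
reviews = pd.read_csv('C:\\Users\\W\\Desktop\\review_lianjia_20161207_20201207.csv')
print(reviews.shape)                           #number of reviews and columns
print(reviews[['N_stars', 'Reviews']].head())  #star rating and review text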

Ⅲ. Analysis Process

1. Statistical graphs

(1) Preparation

I imported several packages: pandas and numpy to clean the data, matplotlib.pyplot and seaborn to make plots, and jieba to segment Chinese sentences for the word cloud.
I used the second-hand house listings in Hangzhou scraped from Lianjia’s website.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import jieba  #Chinese word segmentation
from wordcloud import WordCloud

During cleaning, I checked for nulls and duplicates, translated some Chinese values in the file into English, and removed some unreasonable rows.

##clean
#
dta=pd.read_csv('C:\\Users\\W\\Desktop\\ma_datafile.csv')
#check null: none
dta.info()
#check duplicate: none
check_dup=dta.loc[dta.duplicated()==True]
check_dup.head()
#translate Chinese to English
dta.loc[dta.renovation=='简装','renovation']='simple'
dta.loc[dta.renovation=='精装','renovation']='hardbound'
dta.loc[dta.renovation=='毛坯','renovation']='roughcast'
dta.loc[dta.renovation=='其他','renovation']='other'
dta.loc[dta.floor=='低','floor']='low'
dta.loc[dta.floor=='中','floor']='middle'
dta.loc[dta.floor=='高','floor']='high'
#remove some unreasonable values
dta.drop(index=(dta.loc[(dta['floor']!='low')&(dta['floor']!='middle')&(dta['floor']!='high')].index),inplace=True)
dta=dta.drop(dta[dta.building_structure=='暂无数据'].index)  #'暂无数据' means 'no data available'

(2) Overview

#general
dta.describe()
#distribution of size, total price and unit price
sns.set_style({'font.sans-serif':['simhei','Arial']})
f, [ax1,ax2,ax3] = plt.subplots(3,1, figsize=(20, 10))
sns.distplot(dta['size'], bins=30, ax=ax1, color='r')
sns.kdeplot(dta['size'], shade=True, ax=ax1)
ax1.set_title('distribution of sizes')
sns.distplot(dta['total_price'], bins=30, ax=ax2, color='r')
sns.kdeplot(dta['total_price'], shade=True, ax=ax2)
ax2.set_title('distribution of total prices')
sns.distplot(dta['unit_price'], bins=30, ax=ax3, color='r')
sns.kdeplot(dta['unit_price'], shade=True, ax=ax3)
ax3.set_title('distribution of unit prices')
plt.show()

I plotted the distributions of size, total price, and unit price of second-hand houses in Hangzhou. (Total price is in units of ¥10,000; unit price is in ¥ per m².)
[Figure: distributions of size, total price, and unit price]
From this chart, we can see that house sizes are concentrated between 50 m² and 100 m², total prices mostly fall between ¥1,500,000 and ¥3,500,000, and unit prices are concentrated between ¥20,000 and ¥40,000 per m².

(3) Influencing Factors of Price and Attention

a. Location

I found the top 10 districts by average price and by average number of followers, using the number of followers on the website as a measure of attention.

#top 10 districts(price/attention)
totalp_village = dta.groupby(['district'])['total_price'].mean().sort_values(ascending = False).reset_index().head(10)
totalp_village
follower_village = dta.groupby(['district'])['number_of_followers'].mean().sort_values(ascending = False).reset_index().head(10)
follower_village
  district  total_price
0    钱江新城    828.000000
1      南星    655.500000
2      学军    604.777778
3      西溪    595.578947
4   滨江区*    571.809302
5      申花    562.253731
6      文教    555.911765
7      奥体    542.000000
8    文一西路    540.195122
9   钱江世纪城    533.577551

  district  number_of_followers
0      城站            137.500000
1      湖滨             75.750000
2      潮鸣             71.157895
3      朝晖             62.687500
4     翡翠城             60.760000
5     拱宸桥             59.547619
6      和平             56.000000
7     白马湖             53.300000
8      望江             50.000000
9     彩虹城             49.416667

According to the results, there is no overlap between the two top-10 lists.
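This can be verified directly with a set intersection over the two top-10 tables; a small check reusing the dataframes built above.

#check whether any district appears in both top-10 lists
top_price_districts = set(totalp_village['district'])
top_follower_districts = set(follower_village['district'])
print(top_price_districts & top_follower_districts)  #an empty set means no overlap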

b. Renovation

I plotted the distribution of renovation types and their relationship with total price, unit price, and number of followers.

#price,attention and renovation
f, [ax1,ax2,ax3,ax4] = plt.subplots(1, 4, figsize=(20, 10))
sns.countplot(dta['renovation'], ax=ax1)
ax1.set_title('distribution of renovations')
sns.barplot(x='renovation', y='total_price', data=dta, ax=ax2)
ax2.set_title('total price and renovation')
sns.boxplot(x='renovation', y='unit_price', data=dta, ax=ax3)
ax3.set_title('unit price and renovations')
sns.boxplot(x='renovation', y='number_of_followers', data=dta, ax=ax4)
ax4.set_title('attention and renovations')
plt.show()

[Figure: renovation type counts and their relationship with price and attention]
According to the plot, ‘hardbound’ is the most common renovation type, which makes sense for second-hand houses. Interestingly, although ‘simple’ renovation is more finished than ‘roughcast’, simply renovated houses are priced lower.

c. Floor

I plotted the distribution of floor levels and their relationship with total price, unit price, and number of followers.

#price,attention and floor
f, [ax1,ax2,ax3,ax4] = plt.subplots(1,4, figsize=(20, 10))
sns.countplot(dta['floor'], ax=ax1)
ax1.set_title('distribution of floors')
sns.barplot(x='floor', y='total_price', data=dta, ax=ax2)
ax2.set_title('total price and floors')
sns.boxplot(x='floor', y='unit_price', data=dta, ax=ax3)
ax3.set_title('unit price and floors')
sns.boxplot(x='floor', y='number_of_followers', data=dta, ax=ax4)
ax4.set_title('attention and floors')
plt.show()

[Figure: floor level counts and their relationship with price and attention]
According to the results, the influence of floors on price and attention is insignificant.

d. Size

I divided size into five categories: mini small, small, medium, big and huge.

#price,attention and size level
dta.loc[(dta['size']>=0)&(dta['size']<50),'size_level']="Mini Small"
dta.loc[(dta['size']>=50)&(dta['size']<100),'size_level']="Small"
dta.loc[(dta['size']>=100)&(dta['size']<150),'size_level']="Medium"
dta.loc[(dta['size']>=150)&(dta['size']<200),'size_level']="Big"
dta.loc[(dta['size']>=200),'size_level']="Huge"
dta_house_count1 = dta.groupby('size_level')['total_price'].count().sort_values(ascending=False).to_frame().reset_index()
dta_house_mean1 = dta.groupby('size_level')['unit_price'].mean().sort_values(ascending=False).to_frame().reset_index()
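As a side note, the same bucketing could be done more compactly with pd.cut; this is only a sketch of an alternative to the chained .loc assignments above, using the same bin edges.

#alternative: assign size levels with pd.cut instead of repeated .loc assignments
bins = [0, 50, 100, 150, 200, np.inf]
labels = ["Mini Small", "Small", "Medium", "Big", "Huge"]
dta['size_level'] = pd.cut(dta['size'], bins=bins, labels=labels, right=False)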

Then I plotted the distribution of size levels and their relationship with total price, unit price, and the number of followers.

f, [ax1,ax2,ax3,ax4] = plt.subplots(1,4,figsize=(15,5))
sns.barplot(x='size_level', y='total_price', palette="Blues_d", data=dta_house_count1, ax=ax1)
ax1.set_title('distribution of size levels')
ax1.set_xlabel('size level')
ax1.set_ylabel('count')
sns.boxplot(x='size_level', y='total_price', data=dta, ax=ax2)
ax2.set_title('total price and size level')
ax2.set_xlabel('size level')
ax2.set_ylabel('total price')
sns.barplot(x='size_level', y='unit_price', palette="Blues_d", data=dta_house_mean1, ax=ax3)
ax3.set_title('unit price and size level')
ax3.set_xlabel('size level')
ax3.set_ylabel('unit price')
sns.barplot(x='size_level', y='number_of_followers', palette="Greens_d", data=dta, ax=ax4)
ax4.set_title('attention and size level')
ax4.set_xlabel('size level')
ax4.set_ylabel('attention')
plt.show()

[Figure: size level counts and their relationship with price and attention]
According to the results, mini small houses have the highest unit price, and huge houses attract the most attention.

(4) Word Cloud for Titles

I segmented the Chinese titles with the jieba.cut() function and downloaded a Chinese font so that WordCloud() can display the characters.

#generate title text
title_text=" ".join(dta.title.to_list())
#split text
title_sp = " ".join(jieba.cut(title_text))

I generated word cloud visualization to show high-frequency words in titles.

#generate plot
wordcloud = WordCloud(
    font_path='simsun.ttf',  
    max_words=2000, 
    stopwords={'必看好房','看','必','好','房','诚心','出售','看好','出售'},  # stop words, which are excluded from the word cloud
    max_font_size=150,  # maximum font size
    random_state=1,  # number of random states, i.e. how many color schemes
    scale=1  # scale of the generated image
    ).generate(title_sp)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

[Figure: word cloud of high-frequency words in listing titles]
According to the picture, the most frequent words are ‘hardbound’, ‘owner-occupied’, ‘middle floor’, and ‘the property certificate is over five years old and it is the owner’s only house’ (a reference to a tax policy).

2. House Price Prediction

(1) Preparation

a. Imported Packages and Dataset

I used the second-hand house listings in Hangzhou scraped from Lianjia’s website.

# %%% 1. preparation
import pandas as pd
import numpy as np

from sklearn                         import linear_model
from sklearn.linear_model            import LinearRegression

pd.set_option('display.max_rows',     20)
pd.set_option('display.max_columns',  20)
pd.set_option('display.width',       800)
pd.set_option('display.max_colwidth', 20)
np.random.seed(1)

#input data
rawdta=pd.read_csv('C:\\Users\\W\\Desktop\\ma_datafile2.csv')

b. Variable Transformation

I transformed the categorical variables into dummies using pd.get_dummies().

#transform categorical variables to dummies
dta_numeric=rawdta.loc[:,['total_price','number_of_rooms','number_of_halls','size']]
dta_categorical=rawdta.loc[:,['district','direction','renovation','floor','building_structure']]
dta_dummy=pd.get_dummies(dta_categorical)
dta=dta_numeric.join(dta_dummy)

c. TVT Split

I split the data into training, validation, and test sets (train: estimate the model parameters; valid: choose hyper-parameters; test: evaluate performance).

#Performing the TVT-SPLIT
dta['ML_group']=np.random.randint(100,size = dta.shape[0])
dta               = dta.sort_values(by='ML_group')
inx_train         = dta.ML_group<80                     
inx_valid         = (dta.ML_group>=80)&(dta.ML_group<90)
inx_test          = (dta.ML_group>=90)
#generate X and Y
Y_train   = dta.total_price[inx_train].to_list()
Y_valid   = dta.total_price[inx_valid].to_list()
Y_test    = dta.total_price[inx_test].to_list()
X=dta.iloc[:,1:-1]  #all columns except the target (total_price) and ML_group
X_train   = X[inx_train]
X_valid   = X[inx_valid]
X_test    = X[inx_test]

(2) Prediction Algorithms

a. Linear Regression

# %%% 2. linear regression
model  = LinearRegression()
clf = model.fit(X_train, Y_train)
clf.predict(X_test)

dta['price_level_hat_reg'] = np.concatenate(
        [
                clf.predict(X_train),
                clf.predict(X_valid),
                clf.predict(X_test )
        ]
        ).astype(float)

RMSPE_OLS=np.sqrt(np.mean((dta['price_level_hat_reg']-dta['total_price'])**2))

b. LASSO

# %%% 3. lasso regression
clf = linear_model.Lasso(alpha=0.1)
clf.fit(X_train, Y_train)
dta['price_level_hat_lasso'] = np.concatenate(
        [
                clf.predict(X_train),
                clf.predict(X_valid),
                clf.predict(X_test )
        ]
        ).astype(float)

RMSPE_lasso=np.sqrt(np.mean((dta['price_level_hat_lasso']-dta['total_price'])**2))

(3) Results Comparison

I compared the two algorithms by computing their RMSPE. LASSO had the lower RMSPE, so I chose it to predict prices.

print(RMSPE_OLS>RMSPE_lasso)
#therefore, use lasso to predict price
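For a more readable comparison, the two error values can also be printed side by side; a small addition using the variables computed above.

print('RMSPE (OLS):  ', RMSPE_OLS)
print('RMSPE (LASSO):', RMSPE_lasso)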

3. Sentiment Analysis

(1) Data Preparation

a. Imported Packages and Data

I used the dataset of ratings and reviews of the Lianjia app on the Apple App Store.

# %%% 1. Importing the data
import pandas as pd
import os
import numpy as np

from sklearn                         import tree
from sklearn                         import linear_model
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model            import LinearRegression
from sklearn.neighbors               import KNeighborsClassifier
from sklearn.datasets                import load_iris
from sklearn.model_selection         import train_test_split
from sklearn.naive_bayes             import GaussianNB
import jieba

pd.set_option('display.max_rows',     20)
pd.set_option('display.max_columns',  20)
pd.set_option('display.width',       800)
pd.set_option('display.max_colwidth', 20)

np.random.seed(1)

b. Data Cleaning

I deleted some meaningless reviews and then segmented the Chinese review text with jieba.cut() so that CountVectorizer() can identify the words in the next step.

# %%% 2. Data splitting
dta               = pd.read_csv('C:\\Users\\W\\Desktop\\review_lianjia_20161207_20201207.csv').reset_index()
#delete some meaningless reviews
dta=dta[~dta['Reviews'].str.contains('该条评论已经被删除')].reset_index()
#Chinese reviews must be segmented first because, unlike English, words are not separated by spaces
dta['Review_split']=[" ".join(jieba.cut(review)) for review in dta.Reviews]
dta['ML_group']   = np.random.randint(100,size = dta.shape[0])
dta               = dta.sort_values(by='ML_group')
inx_train         = dta.ML_group<80                     
inx_valid         = (dta.ML_group>=80)&(dta.ML_group<90)
inx_test          = (dta.ML_group>=90)

c. Text Vectorizer

Using the CountVectorizer() function, I converted the review text into word counts. I set max_df and min_df to remove words that appear too frequently (effectively stop words) or too rarely.

# %%% 3. Putting structure in the text
corpus          = dta.Review_split
ngram_range     = (1,1)
max_df          = 0.85
min_df          = 0.01
vectorizer      = CountVectorizer(lowercase   = True,
                                  ngram_range = ngram_range,
                                  max_df      = max_df     ,
                                  min_df      = min_df     );
                                  
X               = vectorizer.fit_transform(corpus)
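After fitting, it is worth checking how many words survive the max_df/min_df thresholds and the shape of the resulting document-term matrix; a quick inspection of the objects created above.

#inspect the fitted vocabulary and the document-term matrix
print(len(vectorizer.get_feature_names()))  #number of words kept after filtering
print(X.shape)                              #(number of reviews, vocabulary size)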

d. TVT Split

# %%% 4. Performing the TVT - SPLIT
Y_train   = dta.N_stars[inx_train].to_list()
Y_valid   = dta.N_stars[inx_valid].to_list()
Y_test    = dta.N_stars[inx_test].to_list()

X_train   = X[np.where(inx_train)[0],:]
X_valid   = X[np.where(inx_valid)[0],:]
X_test    = X[np.where(inx_test) [0],:]

(2) Prediction Algorithms

a. Linear Regression

# %%% 5. Sentiment analysis using linear regression
model  = LinearRegression()
clf = model.fit(X_train, Y_train)
clf.predict(X_test)

dta['N_star_hat_reg'] = np.concatenate(
        [
                clf.predict(X_train),
                clf.predict(X_valid),
                clf.predict(X_test )
        ]
        ).round().astype(int)

dta.loc[dta['N_star_hat_reg']>5,'N_star_hat_reg'] = 5
dta.loc[dta['N_star_hat_reg']<1,'N_star_hat_reg'] = 1

I measured the performance of the algorithm by building a confusion matrix on the test data.

#confusion matrix
conf_matrix      = np.zeros([5,5])

for i in range(5):
    for j in range(5):
        conf_matrix[i,j] = np.sum((dta[inx_test].N_stars==i+1)&(dta[inx_test].N_star_hat_reg==j+1))
#count exact matches, plus predictions one star below the true rating, as 'right'
right=conf_matrix.diagonal(offset=-1).sum()+conf_matrix.diagonal(offset=0).sum()
total=conf_matrix.sum()
print('linear regression:',right/total)

coef_reg=clf.coef_

b. K-NN

# %%%% 6. Sentiment analysis using k-nn
k            = 1;
results_list = [];
max_k_nn     = 10
for k in range(1,max_k_nn):
    clf      = KNeighborsClassifier(n_neighbors=k).fit(X_train, Y_train)
    results_list.append(
            np.concatenate(
                    [
                            clf.predict(X_train),
                            clf.predict(X_valid),
                            clf.predict(X_test)
                    ])
    )

dta_results_knn              = pd.DataFrame(results_list).transpose()
dta_results_knn['inx_train'] = inx_train.to_list()
dta_results_knn['inx_valid'] = inx_valid.to_list()
dta_results_knn['inx_test']  = inx_test.to_list()
dta_results_knn['N_stars'] = dta.N_stars.copy().astype(int)

I chose the hyper-parameter k on the validation data to help avoid overfitting.

#confusion matrix using valid data
conf_list = []
for e in range(9):
    conf_matrix      = np.zeros([5,5])
    for i in range(5):
        for j in range(5):
            conf_matrix[i,j] = np.sum((dta_results_knn[inx_valid].N_stars==(i+1))*(dta_results_knn[inx_valid][e]==(j+1)))
    right=conf_matrix.diagonal(offset=-1).sum()+conf_matrix.diagonal(offset=0).sum()
    total=conf_matrix.sum()
    conf_list.append(right/total)

I then built the final confusion matrix on the test data, using the value of k that performed best on the validation data.

#confusion matrix using test data
e_final=np.argmax(conf_list)
conf_matrix_final      = np.zeros([5,5])
for i in range(5):
    for j in range(5):
        conf_matrix_final[i,j] = np.sum((dta_results_knn[inx_test].N_stars==i+1)&(dta_results_knn[inx_test][e_final]==j+1))
right=conf_matrix_final.diagonal(offset=-1).sum()+conf_matrix_final.diagonal(offset=0).sum()
total=conf_matrix_final.sum()
print('k-nn:',right/total)
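Since the models were fitted for k = 1 to max_k_nn-1 in order, the index e_final maps directly back to the chosen number of neighbours; a small addition to recover it.

#e_final indexes the models fitted for k = 1 .. max_k_nn-1, in order
print('best k:', e_final + 1)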

c. Naive Bayes Classification

# %%% 7. Sentiment analysis using Naive Bayes Classification
clf                              = GaussianNB().fit(X_train.toarray(), Y_train)
dta['N_star_hat_NB']             = np.concatenate(
        [
                clf.predict(X_train.toarray()),
                clf.predict(X_valid.toarray()),
                clf.predict(X_test.toarray( ))
        ]).round().astype(int)
dta.loc[dta['N_star_hat_NB']>5,'N_star_hat_NB'] = 5
dta.loc[dta['N_star_hat_NB']<1,'N_star_hat_NB'] = 1

conf_matrix      = np.zeros([5,5])
for i in range(5):
    for j in range(5):
        conf_matrix[i,j] = np.sum((dta[inx_test].N_stars==i+1)&(dta[inx_test].N_star_hat_NB==j+1))
right=conf_matrix.diagonal(offset=-1).sum()+conf_matrix.diagonal(offset=0).sum()
total=conf_matrix.sum()
print('NB:',right/total)

d. Decision Trees

# %%% 8. Sentiment analysis using trees
criterion     = ['entropy','gini']
random_state         = 96
max_depth            = 10
results_list         = []
conf_list = []
for criterion_chosen in criterion:  
    for depth in range(2,max_depth):
        clf    = tree.DecisionTreeClassifier(
                criterion    = criterion_chosen, 
                max_depth    = depth,
                random_state = 96).fit(X_train.toarray(), Y_train)

        results_list.append(
                np.concatenate(
                        [
                                clf.predict(X_train.toarray()),
                                clf.predict(X_valid.toarray()),
                                clf.predict(X_test.toarray( ))
                        ]).round().astype(int)
                )
        
dta_results_tree              = pd.DataFrame(results_list).transpose()
dta_results_tree['inx_train'] = inx_train.to_list()
dta_results_tree['inx_valid'] = inx_valid.to_list()
dta_results_tree['inx_test']  = inx_test.to_list()
dta_results_tree['N_stars'] = dta.N_stars.copy().astype(int)

#confusion matrix using valid data
for e in range(16):
    conf_matrix      = np.zeros([5,5])
    for i in range(5):
        for j in range(5):
            conf_matrix[i,j] = np.sum((dta_results_tree[inx_valid].N_stars==(i+1))*(dta_results_tree[inx_valid][e]==(j+1)))
    right=conf_matrix.diagonal(offset=-1).sum()+conf_matrix.diagonal(offset=0).sum()
    total=conf_matrix.sum()
    conf_list.append(right/total)
#confusion matrix using test data
e_final=np.argmax(conf_list)
conf_matrix_final      = np.zeros([5,5])
for i in range(5):
    for j in range(5):
        conf_matrix_final[i,j] = np.sum((dta_results_tree[inx_test].N_stars==i+1)&(dta_results_tree[inx_test][e_final]==j+1))
right=conf_matrix_final.diagonal(offset=-1).sum()+conf_matrix_final.diagonal(offset=0).sum()
total=conf_matrix_final.sum()
print('tree:',right/total)
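The index e_final can likewise be mapped back to the winning criterion and depth, because the trees were fitted in a fixed (criterion, depth) order; a small addition that relies on that loop order.

#trees were fitted for each criterion (2 options) x each depth in 2..max_depth-1 (8 options), in that order
best_criterion = criterion[e_final // (max_depth - 2)]
best_depth = 2 + (e_final % (max_depth - 2))
print('best tree:', best_criterion, 'with depth', best_depth)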

e. LASSO

# %%% 9. Sentiment analysis using lasso regression
clf = linear_model.Lasso(alpha=0.1)
clf.fit(X_train, Y_train)
dta['N_star_hat_lasso'] = np.concatenate(
        [
                clf.predict(X_train),
                clf.predict(X_valid),
                clf.predict(X_test )
        ]
        ).round().astype(int)

dta.loc[dta['N_star_hat_lasso']>5,'N_star_hat_lasso'] = 5
dta.loc[dta['N_star_hat_lasso']<1,'N_star_hat_lasso'] = 1

conf_matrix      = np.zeros([5,5])
for i in range(5):
    for j in range(5):
        conf_matrix[i,j] = np.sum((dta[inx_test].N_stars==i+1)&(dta[inx_test].N_star_hat_lasso==j+1))
right=conf_matrix.diagonal(offset=-1).sum()+conf_matrix.diagonal(offset=0).sum()
total=conf_matrix.sum()
print('lasso:',right/total)

(3) Further Analysis

I compared the performance of these algorithms using the accuracy derived from their confusion matrices; linear regression and LASSO performed best.

#results: 
linear regression: 0.7169117647058824
k-nn: 0.6121323529411765
NB: 0.6011029411764706
tree: 0.7077205882352942
lasso: 0.7463235294117647

I built a dataframe containing the words in the reviews and their corresponding coefficients in the linear regression model.

# %%% 10. results analysis after choosing linear regression
results=pd.DataFrame()
results['words']=vectorizer.get_feature_names()
results['coefficient']=coef_reg
results.loc[results['coefficient']>0,'words'].to_list()
results.loc[results['coefficient']<0,'words'].to_list()

According to the results, words such as ‘customer service hotline’, ‘agent’, and ‘phone call’ are negatively correlated with the app’s ratings on the App Store, which provides some insights for the company’s management.
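To see which words pull the predicted rating down (or up) the most, the coefficient table can simply be sorted; this reuses the results dataframe built above.

#words with the most negative and the most positive coefficients
print(results.sort_values('coefficient').head(10))
print(results.sort_values('coefficient', ascending=False).head(10))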

Ⅳ. Reflections and Conclusions

1. Reflections

(1) Chinese Sentence Segmentation

The main problem I encountered during the analysis was how to transform Chinese text into a form that can be processed. For example, I needed to download a Chinese font so that the characters could be displayed in the word cloud picture. And because Chinese words are not separated by spaces the way English words are, I also needed to import the jieba package to segment them so that CountVectorizer() could identify them.
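As a small illustration (the sample phrase here is made up for demonstration), jieba inserts the word boundaries that Chinese text lacks:

import jieba

#hypothetical example: a short listing-style phrase written without spaces
sample = "精装修采光好业主自住"
print(" ".join(jieba.cut(sample)))  #prints the phrase with spaces between the detected words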

(2) House Price Prediction Function on the Website

Lianjia’s website offers a valuation function so that people can see an approximate price before posting their houses. My house-price-prediction part tries to simulate this function, but there are still some problems. For example, I did not have enough variables because I did not scrape enough data, and the ‘district’ variable is not detailed enough. These problems introduced some error into my prediction results.

2. Conclusions

(1) Second-hand housing market in Hangzhou

The second-hand housing market has its own characteristics compared with the market for new houses. First of all, hardbound is the most common renovation type for second-hand houses, while most new houses in China are sold as roughcast; this makes the investigation of renovation more complicated and important. Second, even though ‘middle floor’ appears frequently in titles as a selling point, the influence of floor level on price is insignificant, whereas the price of a new house varies from floor to floor. Third, the policies relevant to second-hand houses, such as purchase qualifications and property taxes, are more complicated.

(2) Suggestions Based on App Reviews on the App Store

The company can use sentiment analysis to monitor the app’s ratings and reviews and identify management problems in a timely manner.

Sources

Python数据分析实战-链家北京二手房价分析 (Python data analysis in practice: analyzing second-hand house prices in Beijing with Lianjia data)
如何用Python做中文词云 (How to make a Chinese word cloud with Python)
python词云 wordcloud+jieba生成中文词云图 (Generating a Chinese word cloud with wordcloud + jieba)
对中文汉字进行特征提取 (Feature extraction for Chinese characters)
七麦数据 (Qimai Data, qimai.cn)
