我又又又搬运了以前的存货来分享啦!
项目来源:
https://www.kaggle.com/hugomathien/soccer
项目简介:
利用SQL和pandas对11支球队在7个赛季中的25979场比赛数据,分析各球队在每个赛季的主客场得分情况以及联赛积分情况。
PS: 项目本身提供的是sqlite文件,为了更贴近现实中的工作场景,我将其中的表都输出成csv文件,再利用Navicat for MySQL 导入到MYSQL当中。(文末附资源链接)
本次项目同样在jupyter上运行。
导入模块:
import pymysql
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei'] # 设置汉字字体,优先使用黑体
plt.rcParams['font.size'] = 12 # 设置字体大小
plt.rcParams['axes.unicode_minus'] = False # 设置正常显示负号
数据库中的表格:
conn = pymysql.connect(
host = 'localhost',
user = 'root',
password = '',
db = 'data',
port = 3306
)
df = pd.read_sql("show tables",conn)
df
国家名单:
countries = pd.read_sql("""SELECT *
FROM Country;""", conn)
countries
将country表和league表连接起来:
leagues = pd.read_sql("""SELECT *
FROM League
JOIN Country ON Country.id = League.country_id;""", conn)
leagues
按队名排序中的前十名:
teams = pd.read_sql("""SELECT *
FROM Team
ORDER BY team_long_name
LIMIT 10;""", conn)
teams
输出spain主客队比赛的信息:
detailed_matches = pd.read_sql("""SELECT Match.id,
Country.name AS country_name,
League.name AS league_name,
season,
stage,
date,
HT.team_long_name AS home_team,
AT.team_long_name AS away_team,
home_team_goal,
away_team_goal
FROM `match`
JOIN Country on Country.id = Match.country_id
JOIN League on League.id = Match.league_id
LEFT JOIN Team AS HT on HT.team_api_id = Match.home_team_api_id
LEFT JOIN Team AS AT on AT.team_api_id = Match.away_team_api_id
WHERE country.name = 'Spain'
ORDER by date
LIMIT 10;""", conn)
detailed_matches
统计各个国家的各个联赛的各个赛季中stage大于10的球队主客队平均得分,主客队平均分之和与差,以及总和:
leages_by_season = pd.read_sql("""SELECT Country.name AS country_name,
League.name AS league_name,
season,
count(distinct stage) AS number_of_stages,
count(distinct HT.team_long_name) AS number_of_teams,
avg(home_team_goal) AS avg_home_team_scors,
avg(away_team_goal) AS avg_away_team_goals,
avg(home_team_goal-away_team_goal) AS avg_goal_dif,
avg(home_team_goal+away_team_goal) AS avg_goals,
sum(home_team_goal+away_team_goal) AS total_goals
FROM `match`
JOIN Country on Country.id = Match.country_id
JOIN League on League.id = Match.league_id
LEFT JOIN Team AS HT on HT.team_api_id = Match.home_team_api_id
LEFT JOIN Team AS AT on AT.team_api_id = Match.away_team_api_id
WHERE country.name in ('Spain', 'Germany', 'France', 'Italy', 'England')
GROUP BY Country.name, League.name, season
HAVING count(distinct stage) > 10
ORDER BY Country.name, League.name, season DESC
;""", conn)
leages_by_season.head(10)
列出几支球队的各赛季平均得分趋势图:
df = pd.DataFrame(index=np.sort(leages_by_season['season'].unique()), columns=leages_by_season['country_name'].unique())
df.loc[:,'Germany'] = list(leages_by_season.loc[leages_by_season['country_name']=='Germany','avg_goals'])
df.loc[:,'Spain'] = list(leages_by_season.loc[leages_by_season['country_name']=='Spain','avg_goals'])
df.loc[:,'France'] = list(leages_by_season.loc[leages_by_season['country_name']=='France','avg_goals'])
df.loc[:,'Italy'] = list(leages_by_season.loc[leages_by_season['country_name']=='Italy','avg_goals'])
df.loc[:,'England'] = list(leages_by_season.loc[leages_by_season['country_name']=='England','avg_goals'])
df.plot(figsize=(12,5),title='各赛季平均得分趋势图')
列出几支球队的各赛季平均主客队分差趋势图:
df = pd.DataFrame(index=np.sort(leages_by_season['season'].unique()), columns=leages_by_season['country_name'].unique())
df.loc[:,'Germany'] = list(leages_by_season.loc[leages_by_season['country_name']=='Germany','avg_goal_dif'])
df.loc[:,'Spain'] = list(leages_by_season.loc[leages_by_season['country_name']=='Spain','avg_goal_dif'])
df.loc[:,'France'] = list(leages_by_season.loc[leages_by_season['country_name']=='France','avg_goal_dif'])
df.loc[:,'Italy'] = list(leages_by_season.loc[leages_by_season['country_name']=='Italy','avg_goal_dif'])
df.loc[:,'England'] = list(leages_by_season.loc[leages_by_season['country_name']=='England','avg_goal_dif'])
df.plot(figsize=(12,5),title='各赛季平均主客队分差趋势图')
求各赛季各球队的积分(赢球得3分,平局得1分,输球不得分):
team_season_score = pd.read_sql("""SELECT
m.season,
t.team_long_name,
t.team_short_name,
t.team_api_id,
COUNT(*) AS 'matches_played',
SUM(
CASE
WHEN (t.team_api_id=m.home_team_api_id AND m.home_team_goal>m.away_team_goal) OR
(t.team_api_id=m.away_team_api_id AND m.home_team_goal<m.away_team_goal) THEN 3
WHEN (t.team_api_id=m.home_team_api_id AND m.home_team_goal=m.away_team_goal) OR
(t.team_api_id=m.away_team_api_id AND m.home_team_goal=m.away_team_goal) THEN 1
WHEN (t.team_api_id=m.home_team_api_id AND m.home_team_goal<m.away_team_goal) OR
(t.team_api_id=m.away_team_api_id AND m.home_team_goal>m.away_team_goal) THEN 0
END
) AS 'points',
SUM(IF(t.team_api_id=m.home_team_api_id, m.home_team_goal, m.away_team_goal)) AS 'goal_for',
SUM(IF(t.team_api_id=m.home_team_api_id, m.away_team_goal, m.home_team_goal)) AS 'goal_against',
SUM(IF(t.team_api_id=m.home_team_api_id, m.home_team_goal, m.away_team_goal))-SUM(IF(t.team_api_id=m.home_team_api_id, m.away_team_goal, m.home_team_goal)) AS 'goal_diff'
FROM Team t JOIN `match` m ON t.team_api_id=m.home_team_api_id OR t.team_api_id=m.away_team_api_id
WHERE t.team_api_id in (10260, 8634, 8650, 9904, 8636)
GROUP BY team_api_id, season
ORDER BY season, points DESC, goal_diff DESC, team_long_name;""", conn)
team_season_score.head(10)
列出几支球队的各赛季积分趋势图:
df = pd.DataFrame(index=np.sort(team_season_score['season'].unique()), columns=team_season_score['team_short_name'].unique())
df.loc[:,'MUN'] = list(team_season_score.loc[team_season_score['team_short_name']=='MUN','points'])
df.loc[:,'BAR'] = list(team_season_score.loc[team_season_score['team_short_name']=='BAR','points'])
df.loc[:,'LIV'] = list(team_season_score.loc[team_season_score['team_short_name']=='LIV','points'])
df.loc[:,'HAN'] = list(team_season_score.loc[team_season_score['team_short_name']=='HAN','points'])
df.loc[:,'INT'] = list(team_season_score.loc[team_season_score['team_short_name']=='INT','points'])
df.plot(figsize=(12,5),title='各赛季积分趋势图')
结束语
此次项目的数据提供了不少数据,就连球员的EA SPORT游戏中的数据都有,不过这个项目我主要也是拿来练手,并没有刻意去将全部的数据都拿来分析,大概感兴趣的话也可以自己尝试一下。
我把最初kaggle提供的sqlite文件,后面我自己输出的csv文件和sql文件以及本次完整代码都放到了网盘上,需要的朋友自行下载:
链接:https://pan.baidu.com/s/1CbrdX0q18W9v-PgNyZfRPg
提取码:1024