基于SQL和pandas的欧洲足球数据分析【附图详解】

我又又又搬运了以前的存货来分享啦!

项目来源:

https://www.kaggle.com/hugomathien/soccer

项目简介:

利用SQL和pandas对11支球队在7个赛季中的25979场比赛数据,分析各球队在每个赛季的主客场得分情况以及联赛积分情况。

PS: 项目本身提供的是sqlite文件,为了更贴近现实中的工作场景,我将其中的表都输出成csv文件,再利用Navicat for MySQL 导入到MYSQL当中。(文末附资源链接)
本次项目同样在jupyter上运行。

导入模块:

import pymysql
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei'] # 设置汉字字体,优先使用黑体
plt.rcParams['font.size'] = 12  # 设置字体大小
plt.rcParams['axes.unicode_minus'] = False   # 设置正常显示负号

数据库中的表格:

conn = pymysql.connect(
    host = 'localhost',
    user = 'root',
    password = '',
    db = 'data',
    port = 3306
)
df = pd.read_sql("show tables",conn)
df

基于SQL和pandas的欧洲足球数据分析【附图详解】
国家名单:

countries = pd.read_sql("""SELECT *
                        FROM Country;""", conn)
countries

基于SQL和pandas的欧洲足球数据分析【附图详解】
将country表和league表连接起来:

leagues = pd.read_sql("""SELECT *
                        FROM League
                        JOIN Country ON Country.id = League.country_id;""", conn)
leagues

基于SQL和pandas的欧洲足球数据分析【附图详解】
按队名排序中的前十名:

teams = pd.read_sql("""SELECT *
                        FROM Team
                        ORDER BY team_long_name
                        LIMIT 10;""", conn)
teams

基于SQL和pandas的欧洲足球数据分析【附图详解】
输出spain主客队比赛的信息:

detailed_matches = pd.read_sql("""SELECT Match.id, 
                                        Country.name AS country_name, 
                                        League.name AS league_name, 
                                        season, 
                                        stage, 
                                        date,
                                        HT.team_long_name AS  home_team,
                                        AT.team_long_name AS away_team,
                                        home_team_goal, 
                                        away_team_goal                                        
                                FROM `match`
                                JOIN Country on Country.id = Match.country_id
                                JOIN League on League.id = Match.league_id
                                LEFT JOIN Team AS HT on HT.team_api_id = Match.home_team_api_id
                                LEFT JOIN Team AS AT on AT.team_api_id = Match.away_team_api_id
                                WHERE country.name = 'Spain'
                                ORDER by date
                                LIMIT 10;""", conn)
detailed_matches

基于SQL和pandas的欧洲足球数据分析【附图详解】
统计各个国家的各个联赛的各个赛季中stage大于10的球队主客队平均得分,主客队平均分之和与差,以及总和:

leages_by_season = pd.read_sql("""SELECT Country.name AS country_name, 
                                        League.name AS league_name, 
                                        season,
                                        count(distinct stage) AS number_of_stages,
                                        count(distinct HT.team_long_name) AS number_of_teams,
                                        avg(home_team_goal) AS avg_home_team_scors, 
                                        avg(away_team_goal) AS avg_away_team_goals, 
                                        avg(home_team_goal-away_team_goal) AS avg_goal_dif, 
                                        avg(home_team_goal+away_team_goal) AS avg_goals, 
                                        sum(home_team_goal+away_team_goal) AS total_goals                                       
                                FROM `match`
                                JOIN Country on Country.id = Match.country_id
                                JOIN League on League.id = Match.league_id
                                LEFT JOIN Team AS HT on HT.team_api_id = Match.home_team_api_id
                                LEFT JOIN Team AS AT on AT.team_api_id = Match.away_team_api_id
                                WHERE country.name in ('Spain', 'Germany', 'France', 'Italy', 'England')
                                GROUP BY Country.name, League.name, season
                                HAVING count(distinct stage) > 10
                                ORDER BY Country.name, League.name, season DESC
                                ;""", conn)
leages_by_season.head(10)

基于SQL和pandas的欧洲足球数据分析【附图详解】
列出几支球队的各赛季平均得分趋势图:

df = pd.DataFrame(index=np.sort(leages_by_season['season'].unique()), columns=leages_by_season['country_name'].unique())

df.loc[:,'Germany'] = list(leages_by_season.loc[leages_by_season['country_name']=='Germany','avg_goals'])
df.loc[:,'Spain']   = list(leages_by_season.loc[leages_by_season['country_name']=='Spain','avg_goals'])
df.loc[:,'France']   = list(leages_by_season.loc[leages_by_season['country_name']=='France','avg_goals'])
df.loc[:,'Italy']   = list(leages_by_season.loc[leages_by_season['country_name']=='Italy','avg_goals'])
df.loc[:,'England']   = list(leages_by_season.loc[leages_by_season['country_name']=='England','avg_goals'])

df.plot(figsize=(12,5),title='各赛季平均得分趋势图')

基于SQL和pandas的欧洲足球数据分析【附图详解】
列出几支球队的各赛季平均主客队分差趋势图:

df = pd.DataFrame(index=np.sort(leages_by_season['season'].unique()), columns=leages_by_season['country_name'].unique())

df.loc[:,'Germany'] = list(leages_by_season.loc[leages_by_season['country_name']=='Germany','avg_goal_dif'])
df.loc[:,'Spain']   = list(leages_by_season.loc[leages_by_season['country_name']=='Spain','avg_goal_dif'])
df.loc[:,'France']   = list(leages_by_season.loc[leages_by_season['country_name']=='France','avg_goal_dif'])
df.loc[:,'Italy']   = list(leages_by_season.loc[leages_by_season['country_name']=='Italy','avg_goal_dif'])
df.loc[:,'England']   = list(leages_by_season.loc[leages_by_season['country_name']=='England','avg_goal_dif'])

df.plot(figsize=(12,5),title='各赛季平均主客队分差趋势图')

基于SQL和pandas的欧洲足球数据分析【附图详解】
求各赛季各球队的积分(赢球得3分,平局得1分,输球不得分):

team_season_score = pd.read_sql("""SELECT 
		m.season,
    t.team_long_name,
		t.team_short_name,	
        t.team_api_id,
    COUNT(*) AS 'matches_played', 
    SUM(
        CASE 
            WHEN (t.team_api_id=m.home_team_api_id AND m.home_team_goal>m.away_team_goal) OR
                (t.team_api_id=m.away_team_api_id AND m.home_team_goal<m.away_team_goal) THEN 3
            WHEN (t.team_api_id=m.home_team_api_id AND m.home_team_goal=m.away_team_goal) OR
                (t.team_api_id=m.away_team_api_id AND m.home_team_goal=m.away_team_goal) THEN 1
            WHEN (t.team_api_id=m.home_team_api_id AND m.home_team_goal<m.away_team_goal) OR
                (t.team_api_id=m.away_team_api_id AND m.home_team_goal>m.away_team_goal) THEN 0
        END
    ) AS 'points', 
    SUM(IF(t.team_api_id=m.home_team_api_id, m.home_team_goal, m.away_team_goal)) AS 'goal_for', 
    SUM(IF(t.team_api_id=m.home_team_api_id, m.away_team_goal, m.home_team_goal)) AS 'goal_against', 
    SUM(IF(t.team_api_id=m.home_team_api_id, m.home_team_goal, m.away_team_goal))-SUM(IF(t.team_api_id=m.home_team_api_id, m.away_team_goal, m.home_team_goal)) AS 'goal_diff'
FROM Team t JOIN `match` m ON t.team_api_id=m.home_team_api_id OR t.team_api_id=m.away_team_api_id
WHERE t.team_api_id in (10260, 8634, 8650, 9904, 8636)
GROUP BY team_api_id, season
ORDER BY season, points DESC, goal_diff DESC, team_long_name;""", conn)
team_season_score.head(10)

基于SQL和pandas的欧洲足球数据分析【附图详解】
列出几支球队的各赛季积分趋势图:

df = pd.DataFrame(index=np.sort(team_season_score['season'].unique()), columns=team_season_score['team_short_name'].unique())

df.loc[:,'MUN'] = list(team_season_score.loc[team_season_score['team_short_name']=='MUN','points'])
df.loc[:,'BAR'] = list(team_season_score.loc[team_season_score['team_short_name']=='BAR','points'])
df.loc[:,'LIV'] = list(team_season_score.loc[team_season_score['team_short_name']=='LIV','points'])
df.loc[:,'HAN'] = list(team_season_score.loc[team_season_score['team_short_name']=='HAN','points'])
df.loc[:,'INT'] = list(team_season_score.loc[team_season_score['team_short_name']=='INT','points'])

df.plot(figsize=(12,5),title='各赛季积分趋势图')

基于SQL和pandas的欧洲足球数据分析【附图详解】
结束语

此次项目的数据提供了不少数据,就连球员的EA SPORT游戏中的数据都有,不过这个项目我主要也是拿来练手,并没有刻意去将全部的数据都拿来分析,大概感兴趣的话也可以自己尝试一下。
我把最初kaggle提供的sqlite文件,后面我自己输出的csv文件和sql文件以及本次完整代码都放到了网盘上,需要的朋友自行下载:
链接:https://pan.baidu.com/s/1CbrdX0q18W9v-PgNyZfRPg
提取码:1024

上一篇:【2019牛客暑期多校训练营(第二场)- F】Partition problem(dfs,均摊时间优化)


下一篇:CodeForces 96A Football (C)