How to Analyze FIFA 19 Data in Python

Ever wondered what makes a FIFA player great on paper? In this post, we’ll walk through a full data analysis and machine learning pipeline on FIFA 19 player data in Python: cleaning messy data, visualizing key performance indicators, and finally building a model that predicts player ratings. ⚽📊

🧹 Step 1: Load and Explore the FIFA 19 Dataset

Let’s kick things off by importing the necessary libraries and loading the dataset.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import eli5
from eli5.sklearn import PermutationImportance
from collections import Counter
import missingno as msno

import warnings
warnings.filterwarnings('ignore')
import plotly
sns.set_style('darkgrid')

df = pd.read_csv('../input/data.csv')
df.head().T  # Preview top records transposed
df.columns  # List all column names
df.info()  # Check data types and missing values
df.describe().T  # Get descriptive stats

🧼 Step 2: Data Cleaning

Let’s drop unnecessary visual columns that don’t help with analysis.

df.drop(['Unnamed: 0','Photo','Flag','Club Logo'], axis=1, inplace=True)

Visualize missing data

msno.bar(df, figsize=(28, 10), color='red')  # no need to sample: plot every row
df.isnull().sum()
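
If the raw counts are hard to scan, a small table of missing counts and percentages per column can help decide what to drop. The sketch below is one way to build it; `missing_summary` and the tiny `demo` frame are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only the columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Tiny made-up frame standing in for the real dataset
demo = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Height": ["5'7", None, "6'2", None],
    "Loaned From": [None, None, None, "Club X"],
})
report = missing_summary(demo)
print(report)
```

On the real data you would call `missing_summary(df)` instead of using the demo frame.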

We noticed a pattern: some rows are completely missing height and weight data. Let’s see if it’s the same rows.

missing_height = df[df['Height'].isnull()].index.tolist()
missing_weight = df[df['Weight'].isnull()].index.tolist()

if missing_height == missing_weight:
    print('They are the same rows')

Since the rows with missing height and weight are identical, we’ll drop them.

df.drop(index=missing_height, inplace=True)  # drop by index label, not by position
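
The same check and drop can also be written without building index lists. This is a minimal sketch on a made-up frame (the `demo` data and its non-default index are assumptions for illustration): it compares the two null masks directly and then lets `dropna` remove the rows.

```python
import pandas as pd

# Made-up rows; the index deliberately does not start at 0
demo = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D"],
        "Height": ["5'7", None, "6'2", None],
        "Weight": ["159lbs", None, "183lbs", None],
    },
    index=[10, 11, 12, 13],
)

# True when exactly the same rows are missing in both columns
same_rows = demo["Height"].isnull().equals(demo["Weight"].isnull())

# Drop every row that lacks a height or a weight
cleaned = demo.dropna(subset=["Height", "Weight"])
print(same_rows, cleaned.index.tolist())
```

Because `dropna` works on labels, it keeps behaving correctly even after earlier filtering has changed the index.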

Also, let’s remove some more sparse columns:

df.drop(['Loaned From','Release Clause','Joined'], axis=1, inplace=True)

📊 Step 3: Basic Data Analysis

Top 5 Countries and Clubs with Most Players

print('Total number of countries:', df['Nationality'].nunique())
print(df['Nationality'].value_counts().head(5))
print('Total number of clubs:', df['Club'].nunique())
print(df['Club'].value_counts().head(5))

Best Player by Potential and Overall Rating

print('Maximum Potential:', df.loc[df['Potential'].idxmax(), 'Name'])
print('Maximum Overall Performance:', df.loc[df['Overall'].idxmax(), 'Name'])

Find the best player per skill category

pr_cols = [...]  # List of performance columns
for col in pr_cols:
    print(f'Best {col}: {df.loc[df[col].idxmax(), "Name"]}')
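
If you would rather have the result as a table than as printed lines, `idxmax` can be applied to all skill columns at once. The sketch below uses a made-up three-player frame and a hypothetical `skill_cols` subset; the real list of performance columns is not reproduced here.

```python
import pandas as pd

# Made-up players and scores for illustration only
demo = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Dribbling": [97, 88, 90],
    "SprintSpeed": [86, 91, 96],
    "Strength": [59, 79, 70],
})
skill_cols = ["Dribbling", "SprintSpeed", "Strength"]  # hypothetical subset

# For each skill column, the index label of the row with the highest score
best_rows = demo[skill_cols].idxmax()

best = pd.DataFrame(
    {
        "player": demo.loc[best_rows, "Name"].to_numpy(),
        "score": demo[skill_cols].max().to_numpy(),
    },
    index=skill_cols,
)
print(best)
```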

Convert Value and Wage into numeric format

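def value_to_int(df_value):
    """Convert strings like '€110.5M' or '€565K' to a number of euros."""
    multipliers = {'M': 1_000_000, 'K': 1_000}
    try:
        text = df_value.lstrip('€')
        if text[-1:] in multipliers:
            return float(text[:-1]) * multipliers[text[-1]]
        return float(text)  # plain amounts such as '€0' have no suffix
    except (AttributeError, ValueError):
        return 0  # missing or unparseable entries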

df['Value'] = df['Value'].apply(value_to_int)
df['Wage'] = df['Wage'].apply(value_to_int)
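
For larger frames, the same conversion can be done on the whole column at once instead of row by row. This is a sketch under the assumption that values look like `€110.5M`, `€565K`, or `€0`; `parse_money` is a hypothetical helper name, and anything that does not match is treated as 0.

```python
import pandas as pd

def parse_money(series: pd.Series) -> pd.Series:
    """Convert strings such as '€110.5M' or '€565K' to floats in euros."""
    # Pull out the numeric part and an optional M/K suffix
    parts = series.astype(str).str.extract(
        r"(?P<number>\d+(?:\.\d+)?)(?P<suffix>[MK]?)"
    )
    number = pd.to_numeric(parts["number"], errors="coerce")
    multiplier = parts["suffix"].map({"M": 1_000_000, "K": 1_000, "": 1})
    # Rows that did not match at all become 0
    return (number * multiplier).fillna(0.0)

demo = pd.Series(["€110.5M", "€565K", "€0", None])
parsed = parse_money(demo)
print(parsed.tolist())
```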

Top Earners

print('Most valued player:', df.loc[df['Value'].idxmax(), 'Name'])
print('Highest earner:', df.loc[df['Wage'].idxmax(), 'Name'])

📈 Step 4: Analyze FIFA 19 Data with Visualizations

Age vs Potential

sns.jointplot(x=df['Age'], y=df['Potential'],
              joint_kws={'alpha':0.1,'s':5,'color':'red'},
              marginal_kws={'color':'red'})

Age vs Sprint Speed

sns.lmplot(data=df, x='Age', y='SprintSpeed', lowess=True,
           scatter_kws={'alpha':0.01, 's':5,'color':'green'},
           line_kws={'color':'red'})
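
A quick numeric companion to this plot is the average sprint speed per age band. The sketch below uses made-up ages and speeds, and the bin edges are an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up values; on the real data use df[['Age', 'SprintSpeed']]
demo = pd.DataFrame({
    "Age": [18, 21, 24, 27, 30, 33, 36],
    "SprintSpeed": [70, 76, 78, 77, 72, 65, 58],
})

# Right-closed bands: (15, 20], (20, 25], ...
bands = pd.cut(demo["Age"], bins=[15, 20, 25, 30, 35, 40])
by_band = demo.groupby(bands, observed=True)["SprintSpeed"].mean()
print(by_band)
```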

Ball Control vs Dribbling by Preferred Foot

sns.lmplot(x='BallControl', y='Dribbling', data=df, col='Preferred Foot',
           scatter_kws={'alpha':0.1,'color':'orange'}, line_kws={'color':'red'})

Dribbling vs Crossing (Hexbin)

sns.jointplot(x=df['Dribbling'], y=df['Crossing'], kind="hex", color="#4CB391")

Age vs Potential with Value as Hue

sns.relplot(x="Age", y="Potential", hue=df.Value / 100000,
            alpha=.5, height=6, data=df)  # hue is Value in units of 100K

Correlation Heatmap

corr = df.select_dtypes(include='number').corr()  # correlate numeric columns only
mask = np.triu(np.ones_like(corr, dtype=bool))  # hide the redundant upper triangle
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(15, 15))
    ax = sns.heatmap(corr, mask=mask, square=True, linewidths=.8, cmap="YlGnBu")
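
A heatmap with dozens of columns is hard to read precisely, so it can help to list the strongest pairs as numbers too. The sketch below is one way to do that; `top_correlations` is a hypothetical helper and the `demo` frame is made up.

```python
import numpy as np
import pandas as pd

def top_correlations(frame: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n column pairs with the largest absolute correlation."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep each pair once by masking everything except the strict upper triangle
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Made-up data: b is exactly twice a, c trends the other way
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
    "label": list("vwxyz"),  # non-numeric column, ignored
})
top = top_correlations(demo)
print(top)
```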

Boxenplot: Overall vs Age (by Preferred Foot)

sns.boxenplot(x=df['Overall'], y=df['Age'], hue=df['Preferred Foot'], palette='rocket')
plt.title('Comparison of Overall Scores and Age wrt Preferred foot', fontsize=20)

Pairplot: Physical Stats

cols = ['Age','Overall','Potential','Acceleration','SprintSpeed',"Agility","Stamina",'Strength','Preferred Foot']
df_small = df[cols]
sns.pairplot(df_small, hue='Preferred Foot', palette=["black", "red"],
             plot_kws=dict(s=50, alpha=0.8), markers=['^','v'])

🤖 Step 5: Predicting Overall Rating with Linear Regression

Reload and Clean Dataset Again

df = pd.read_csv('../input/data.csv')
drop_cols = df.columns[28:54]
df.drop(drop_cols, axis=1, inplace=True)
df.drop([...], axis=1, inplace=True)  # Drop irrelevant or sparse columns
df.dropna(inplace=True)

Feature Engineering

# Each helper receives one row of the DataFrame (via apply with axis=1)
def face_to_num(row): return 1 if row['Real Face'] == 'Yes' else 0
def right_footed(row): return 1 if row['Preferred Foot'] == 'Right' else 0

def simple_position(row):
    if row['Position'] == 'GK': return 'GK'
    elif row['Position'] in ['RB','LB','CB','LCB','RCB','RWB','LWB']: return 'DF'
    elif row['Position'] in ['LDM','CDM','RDM']: return 'DM'
    elif row['Position'] in ['LM','LCM','CM','RCM','RM']: return 'MF'
    elif row['Position'] in ['LAM','CAM','RAM','LW','RW']: return 'AM'
    elif row['Position'] in ['RS','ST','LS','CF','LF','RF']: return 'ST'
    return row['Position']

nat_list = df.Nationality.value_counts()[lambda x: x > 250].index.tolist()
def major_nation(row): return 1 if row.Nationality in nat_list else 0

df1 = df.copy()
df1['Real_Face'] = df1.apply(face_to_num, axis=1)
df1['Right_Foot'] = df1.apply(right_footed, axis=1)
df1['Simple_Position'] = df1.apply(simple_position, axis=1)
df1['Major_Nation'] = df1.apply(major_nation, axis=1)

tempwork = df1["Work Rate"].str.split("/ ", n=1, expand=True)
df1["WorkRate1"] = tempwork[0]
df1["WorkRate2"] = tempwork[1]

df1.drop(['Work Rate','Preferred Foot','Real Face', 'Position','Nationality'], axis=1, inplace=True)
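
Row-wise `apply` is easy to read but slow on large frames. The same features can be built with column operations, as in this sketch. The position groups are copied from the function above; the `demo` rows (including the unknown `XX` position) are made up for illustration.

```python
import pandas as pd

POSITION_GROUPS = {
    "GK": ["GK"],
    "DF": ["RB", "LB", "CB", "LCB", "RCB", "RWB", "LWB"],
    "DM": ["LDM", "CDM", "RDM"],
    "MF": ["LM", "LCM", "CM", "RCM", "RM"],
    "AM": ["LAM", "CAM", "RAM", "LW", "RW"],
    "ST": ["RS", "ST", "LS", "CF", "LF", "RF"],
}
# Invert the mapping: detailed position -> simplified group
POSITION_TO_GROUP = {
    pos: group for group, members in POSITION_GROUPS.items() for pos in members
}

# Made-up rows for illustration
demo = pd.DataFrame({
    "Position": ["GK", "RCB", "CDM", "LCM", "RW", "ST", "XX"],
    "Preferred Foot": ["Right", "Left", "Right", "Right", "Left", "Right", "Left"],
    "Real Face": ["Yes", "No", "No", "Yes", "Yes", "No", "No"],
    "Work Rate": ["Medium/ Medium", "High/ Low", "Low/ High", "High/ High",
                  "High/ Medium", "Medium/ Low", "Low/ Low"],
})

# Unknown positions fall back to their original label
demo["Simple_Position"] = demo["Position"].map(POSITION_TO_GROUP).fillna(demo["Position"])
demo["Right_Foot"] = (demo["Preferred Foot"] == "Right").astype(int)
demo["Real_Face"] = (demo["Real Face"] == "Yes").astype(int)

work = demo["Work Rate"].str.split("/ ", n=1, expand=True)
demo["WorkRate1"] = work[0]
demo["WorkRate2"] = work[1]
print(demo[["Simple_Position", "Right_Foot", "Real_Face", "WorkRate1", "WorkRate2"]])
```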

Train-Test Split and Model Training

target = df1['Overall']
df2 = df1.drop(['Overall'], axis=1)

from sklearn.model_selection import train_test_split
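X_train, X_test, y_train, y_test = train_test_split(df2, target, test_size=0.2, random_state=42)  # fixed seed for reproducibility

X_train = pd.get_dummies(X_train)
# Encode the test set separately, then align it to the training columns
X_test = pd.get_dummies(X_test).reindex(columns=X_train.columns, fill_value=0)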

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

from sklearn.metrics import r2_score, mean_squared_error
print('r2 score:', r2_score(y_test, predictions))
print('RMSE:', np.sqrt(mean_squared_error(y_test, predictions)))
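
One-hot encoding the two splits separately can produce different columns whenever a category appears in only one of them. An alternative that avoids this is to put the encoder inside a scikit-learn pipeline, so it learns the categories from the training data only. The sketch below runs on synthetic data: the column names echo the dataset, but the values and the target formula are invented purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Potential": rng.integers(60, 95, n),
    "Age": rng.integers(17, 38, n),
    "Simple_Position": rng.choice(["GK", "DF", "MF", "ST"], n),
})
# Invented target, not the real relationship in the FIFA data
demo["Overall"] = demo["Potential"] - 0.3 * (38 - demo["Age"]) + rng.normal(0, 1, n)

X = demo.drop(columns="Overall")
y = demo["Overall"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Encode the categorical column, pass numeric columns through unchanged
encode = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["Simple_Position"])],
    remainder="passthrough",
)
pipe = make_pipeline(encode, LinearRegression())
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)  # R^2 on the held-out split
print("r2 score:", score)

# A category never seen in training is encoded as all zeros instead of failing
unseen = pd.DataFrame({"Potential": [80], "Age": [25], "Simple_Position": ["AM"]})
unseen_pred = pipe.predict(unseen)
```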

Feature Importance with ELI5

perm = PermutationImportance(model, random_state=1).fit(X_test, y_test)
eli5.show_weights(perm, feature_names=X_test.columns.tolist())
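
If `eli5` is not available in your environment, scikit-learn ships its own `permutation_importance` function that measures the same idea: how much the score drops when one feature is shuffled. The sketch below runs on synthetic data with an invented target, so the ranking it prints says nothing about the real FIFA model.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 300
X = pd.DataFrame({
    "Potential": rng.normal(70, 6, n),
    "Reactions": rng.normal(60, 9, n),
    "Noise": rng.normal(0, 1, n),  # unrelated to the target by construction
})
# Invented target for illustration
y = 0.8 * X["Potential"] + 0.2 * X["Reactions"] + rng.normal(0, 0.5, n)

model = LinearRegression().fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=1)

# Mean drop in R^2 per feature, largest first
ranking = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(ranking)
```

On the real model you would pass `X_test` and `y_test`, as in the ELI5 call above.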

Which features does the model lean on most when predicting a player’s rating? According to the permutation importances: Potential, Age, and Reactions.

Final Visualization

sns.regplot(x=predictions, y=y_test, scatter_kws={'color':'red','edgecolor':'blue','linewidth':0.7}, line_kws={'color':'black','alpha':0.5})
plt.xlabel('Predictions')
plt.ylabel('Overall')
plt.title("Linear Prediction of Player Rating")

🏁 Conclusion

This complete walkthrough showed you how to analyze FIFA 19 data using Python — from data cleaning to machine learning predictions. Try it out and level up your data science game!

If you’re into football and data science — this is how you score a goal in both fields.

⚡ Ready to build your own FIFA analytics dashboard?
Check out our other Python projects or contact Ossels AI to build AI-powered tools for your business.

Got questions or want to suggest another dataset? Drop a comment below!

Posted by Ananya Rajeev
