Ever wondered what makes a FIFA player great on paper? In this post, we’ll walk through a full data analysis and machine learning pipeline in Python using FIFA 19 player data: cleaning messy data, visualizing key performance indicators, and finally building a model that predicts player ratings. ⚽📊
🧹 Step 1: Load and Explore the FIFA 19 Dataset
Let’s kick things off by importing the necessary libraries and loading the dataset.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import eli5
from eli5.sklearn import PermutationImportance
from collections import Counter
import missingno as msno
import warnings
warnings.filterwarnings('ignore')
import plotly
sns.set_style('darkgrid')
df = pd.read_csv('../input/data.csv')
df.head().T # Preview top records transposed
df.columns # List all column names
df.info() # Check data types and missing values
df.describe().T # Get descriptive stats
🧼 Step 2: Data Cleaning
Let’s drop the leftover index column and the image columns (photo, flag, club logo), which don’t help with analysis.
df.drop(['Unnamed: 0','Photo','Flag','Club Logo'], axis=1, inplace=True)
Visualize missing data

msno.bar(df, figsize=(28, 10), color='red')  # bar chart of non-missing counts per column
df.isnull().sum()
We noticed a pattern: some rows are completely missing height and weight data. Let’s see if it’s the same rows.
missing_height = df[df['Height'].isnull()].index.tolist()
missing_weight = df[df['Weight'].isnull()].index.tolist()
if missing_height == missing_weight:
    print('They are same')
Since the rows with missing height and weight are identical, we’ll drop them.
df.drop(index=missing_height, inplace=True)  # drop by index label
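The same check can also be done with boolean masks instead of index lists. Here is a minimal sketch on a made-up toy DataFrame (the rows and values are assumptions for illustration, not the real FIFA data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the FIFA data; the values are invented for illustration.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Height": ["5'7", np.nan, "6'2", np.nan],
    "Weight": ["159lbs", np.nan, "183lbs", np.nan],
})

# Boolean masks marking the rows where each column is missing.
height_missing = toy["Height"].isna()
weight_missing = toy["Weight"].isna()

# True only if both columns are missing on exactly the same rows.
same_rows = bool((height_missing == weight_missing).all())

# Drop every row where either column is missing.
cleaned = toy.dropna(subset=["Height", "Weight"])
print(same_rows, cleaned["Name"].tolist())
```

Using `dropna(subset=...)` avoids handling index lists by hand, which is convenient once rows have already been removed and the index is no longer contiguous.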
Also, let’s remove some more sparse columns:
df.drop(['Loaned From','Release Clause','Joined'], axis=1, inplace=True)
📊 Step 3: Basic Data Analysis
Top 5 Countries and Clubs with Most Players
print('Total number of countries:', df['Nationality'].nunique())
print(df['Nationality'].value_counts().head(5))
print('Total number of clubs:', df['Club'].nunique())
print(df['Club'].value_counts().head(5))
Best Player by Potential and Overall Rating
print('Maximum Potential:', df.loc[df['Potential'].idxmax(), 'Name'])
print('Maximum Overall Performance:', df.loc[df['Overall'].idxmax(), 'Name'])
Find the best player per skill category
pr_cols = [...] # List of performance columns
for col in pr_cols:
    print(f"Best {col}: {df.loc[df[col].idxmax(), 'Name']}")
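To see how the `idxmax` lookup behaves, here is a small sketch on a toy DataFrame. The player names, skill values, and the `skill_cols` list are all hypothetical. Selecting the name by column label rather than by position keeps the lookup working even if the column order changes.

```python
import pandas as pd

# Hypothetical players and skill values, for illustration only.
toy = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Finishing": [70, 91, 85],
    "Stamina": [88, 60, 95],
})

skill_cols = ["Finishing", "Stamina"]  # assumed subset of skill columns

# idxmax returns the index label of the row holding the column maximum;
# .loc then picks that row's "Name" by label.
best = {col: toy.loc[toy[col].idxmax(), "Name"] for col in skill_cols}
print(best)
```

Note that `idxmax` returns only the first row when several players tie for the maximum.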
Convert Value and Wage into numeric format
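def value_to_int(df_value):
    """Convert strings such as '€110.5M' or '€565K' to a number; return 0 if unparseable."""
    try:
        value = float(df_value[1:-1])  # strip the currency sign and the suffix
        suffix = df_value[-1:]
        if suffix == 'M':
            value *= 1_000_000
        elif suffix == 'K':
            value *= 1_000
    except (TypeError, ValueError):  # missing values or strings like '€0'
        value = 0
    return value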
df['Value'] = df['Value'].apply(value_to_int)
df['Wage'] = df['Wage'].apply(value_to_int)
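To make the conversion concrete, here is a standalone sketch of the same idea on a few sample strings. The helper name `parse_money` and the sample values are assumptions for illustration. Unlike the slicing approach above, this variant also keeps amounts that have no `M` or `K` suffix intact.

```python
def parse_money(text):
    """Convert strings like '€110.5M' or '€565K' to a float; return 0.0 if unparseable."""
    multipliers = {"M": 1_000_000, "K": 1_000}
    if not isinstance(text, str):
        return 0.0  # missing values are not strings
    amount = text.lstrip("€")
    factor = 1
    if amount[-1:] in multipliers:  # slicing is safe on an empty string
        factor = multipliers[amount[-1]]
        amount = amount[:-1]
    try:
        return float(amount) * factor
    except ValueError:
        return 0.0

samples = ["€110.5M", "€565K", "€500", "€0", None]
parsed = [parse_money(s) for s in samples]
print(parsed)
```

With pandas, a helper like this could be applied the same way as in the post, for example `df['Value'].apply(parse_money)`.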
Top Earners
print('Most valued player:', df.loc[df['Value'].idxmax(), 'Name'])
print('Highest earner:', df.loc[df['Wage'].idxmax(), 'Name'])
📈 Step 4: Analyze FIFA 19 Data with Visualizations
Age vs Potential

sns.jointplot(x=df['Age'], y=df['Potential'],
              joint_kws={'alpha':0.1,'s':5,'color':'red'},
              marginal_kws={'color':'red'})
Age vs Sprint Speed

sns.lmplot(data=df, x='Age', y='SprintSpeed', lowess=True,
           scatter_kws={'alpha':0.01, 's':5,'color':'green'},
           line_kws={'color':'red'})
Ball Control vs Dribbling by Preferred Foot

sns.lmplot(x='BallControl', y='Dribbling', data=df, col='Preferred Foot',
           scatter_kws={'alpha':0.1,'color':'orange'}, line_kws={'color':'red'})
Dribbling vs Crossing (Hexbin)

sns.jointplot(x=df['Dribbling'], y=df['Crossing'], kind="hex", color="#4CB391")
Age vs Potential with Value as Hue

sns.relplot(x="Age", y="Potential", hue=df.Value / 100000,
            sizes=(40, 400), alpha=.5, height=6, data=df)
Correlation Heatmap

corr = df.corr(numeric_only=True)  # skip text columns such as Name and Club
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True  # hide the upper triangle and the diagonal
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(15, 15))
    ax = sns.heatmap(corr, mask=mask, square=True, linewidths=.8, cmap="YlGnBu")
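The mask is what keeps the heatmap from showing every correlation twice. Here is a small sketch of how it works, using random toy data instead of the FIFA columns (the column names and values are assumptions) and leaving out the plotting:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Random toy data; only the shape of the result matters here.
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["Age", "Overall", "Potential"])
toy["Club"] = "Some Club"  # a text column that must not enter the correlation

# Correlate the numeric columns only.
corr = toy.select_dtypes("number").corr()

# True on the diagonal and above it; these are the cells the heatmap hides.
mask = np.triu(np.ones_like(corr, dtype=bool))

# Blank out the masked cells to see what would remain visible.
lower_only = corr.mask(mask)
print(lower_only)
```

For three numeric columns, only the three cells below the diagonal survive, which is one cell per distinct pair of columns.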
Boxenplot: Overall vs Age (by Preferred Foot)

sns.boxenplot(x=df['Overall'], y=df['Age'], hue=df['Preferred Foot'], palette='rocket')
plt.title('Comparison of Overall Scores and Age wrt Preferred foot', fontsize=20)
Pairplot: Physical Stats

cols = ['Age','Overall','Potential','Acceleration','SprintSpeed',"Agility","Stamina",'Strength','Preferred Foot']
df_small = df[cols]
sns.pairplot(df_small, hue='Preferred Foot', palette=["black", "red"],
             plot_kws=dict(s=50, alpha=0.8), markers=['^','v'])
🤖 Step 5: Predicting Overall Rating with Linear Regression
Reload and Clean Dataset Again
df = pd.read_csv('../input/data.csv')
drop_cols = df.columns[28:54]
df.drop(drop_cols, axis=1, inplace=True)
df.drop([...], axis=1, inplace=True) # Drop irrelevant or sparse columns
df.dropna(inplace=True)
Feature Engineering
def face_to_num(df): return 1 if df['Real Face'] == 'Yes' else 0
def right_footed(df): return 1 if df['Preferred Foot'] == 'Right' else 0
def simple_position(df):
    if df['Position'] == 'GK': return 'GK'
    elif df['Position'] in ['RB','LB','CB','LCB','RCB','RWB','LWB']: return 'DF'
    elif df['Position'] in ['LDM','CDM','RDM']: return 'DM'
    elif df['Position'] in ['LM','LCM','CM','RCM','RM']: return 'MF'
    elif df['Position'] in ['LAM','CAM','RAM','LW','RW']: return 'AM'
    elif df['Position'] in ['RS','ST','LS','CF','LF','RF']: return 'ST'
    return df['Position']
nat_list = df.Nationality.value_counts()[lambda x: x > 250].index.tolist()
def major_nation(df): return 1 if df.Nationality in nat_list else 0
df1 = df.copy()
df1['Real_Face'] = df1.apply(face_to_num, axis=1)
df1['Right_Foot'] = df1.apply(right_footed, axis=1)
df1['Simple_Position'] = df1.apply(simple_position, axis=1)
df1['Major_Nation'] = df1.apply(major_nation, axis=1)
tempwork = df1["Work Rate"].str.split("/ ", n=1, expand=True)
df1["WorkRate1"] = tempwork[0]
df1["WorkRate2"] = tempwork[1]
df1.drop(['Work Rate','Preferred Foot','Real Face', 'Position','Nationality'], axis=1, inplace=True)
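Row-wise `apply` is easy to read but can be slow on large frames. As an alternative, here is a sketch of the same feature engineering with vectorized pandas operations on a toy DataFrame. The sample rows are made up, and the `groups` mapping simply restates the position lists used above.

```python
import pandas as pd

# Made-up sample rows for illustration.
toy = pd.DataFrame({
    "Position": ["GK", "RCB", "CDM", "CM", "LW", "ST"],
    "Preferred Foot": ["Right", "Left", "Right", "Right", "Left", "Right"],
    "Work Rate": ["Medium/ Medium", "High/ Low", "Low/ High",
                  "High/ High", "Medium/ Low", "High/ Medium"],
})

# Same grouping as simple_position, expressed as data.
groups = {
    "GK": ["GK"],
    "DF": ["RB", "LB", "CB", "LCB", "RCB", "RWB", "LWB"],
    "DM": ["LDM", "CDM", "RDM"],
    "MF": ["LM", "LCM", "CM", "RCM", "RM"],
    "AM": ["LAM", "CAM", "RAM", "LW", "RW"],
    "ST": ["RS", "ST", "LS", "CF", "LF", "RF"],
}
position_to_group = {pos: grp for grp, members in groups.items() for pos in members}

# Unmapped positions fall back to their original label.
toy["Simple_Position"] = toy["Position"].map(position_to_group).fillna(toy["Position"])
toy["Right_Foot"] = (toy["Preferred Foot"] == "Right").astype(int)

# Split "attack/ defence" work rates into two columns.
work = toy["Work Rate"].str.split("/ ", n=1, expand=True)
toy["WorkRate1"] = work[0]
toy["WorkRate2"] = work[1]
print(toy)
```

Keeping the grouping in a dictionary also makes it easier to adjust which positions count as, say, attacking midfielders.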
Train-Test Split and Model Training
target = df1['Overall']
df2 = df1.drop(['Overall'], axis=1)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df2, target, test_size=0.2)
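X_train = pd.get_dummies(X_train)
# Align the test columns with the training columns so both splits share the same features
X_test = pd.get_dummies(X_test).reindex(columns=X_train.columns, fill_value=0)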
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
from sklearn.metrics import r2_score, mean_squared_error
print('r2 score:', r2_score(y_test, predictions))
print('RMSE:', np.sqrt(mean_squared_error(y_test, predictions)))
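One thing to watch in this step: one-hot encoding the train and test splits independently can produce different columns whenever a category shows up in only one split, and the model then cannot predict on the test set. Here is a minimal sketch of the problem and one way to handle it, on made-up toy data:

```python
import pandas as pd

# Toy splits; "AM" appears only in the test split and "GK"/"ST" only in train.
train = pd.DataFrame({"Age": [21, 30, 25], "Simple_Position": ["GK", "DF", "ST"]})
test = pd.DataFrame({"Age": [28, 19], "Simple_Position": ["DF", "AM"]})

train_encoded = pd.get_dummies(train)
test_encoded = pd.get_dummies(test)

# The two encodings do not share the same columns.
mismatch = list(train_encoded.columns) != list(test_encoded.columns)

# Reindex the test frame to the training columns: categories unseen in
# training are dropped, and categories missing from the test split become 0.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(mismatch)
print(test_aligned)
```

Another option is to call `pd.get_dummies` once on the full feature table before splitting, which avoids the mismatch entirely.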
Feature Importance with ELI5
perm = PermutationImportance(model, random_state=1).fit(X_test, y_test)
eli5.show_weights(perm, feature_names=X_test.columns.tolist())
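If ELI5 isn’t available in your environment, scikit-learn ships its own `permutation_importance` function in `sklearn.inspection`. This is a different tool from the one used above, shown here as a sketch on synthetic data (the features and coefficients are made up) so the expected ranking is known in advance:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
# Synthetic target: column 0 matters most, column 1 a little, column 2 not at all.
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

model = LinearRegression().fit(X, y)

# Shuffle each column in turn and measure how much the R^2 score drops.
# For brevity this scores on the training data; a held-out set is preferable.
result = permutation_importance(model, X, y, n_repeats=10, random_state=1)

# Feature indices sorted from most to least important.
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking, result.importances_mean)
```

The idea is the same as in the ELI5 call: a feature is important if shuffling its values makes the model’s score noticeably worse.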
Which features influence the player rating most? In this run, Potential, Age, and Reactions come out on top.
Final Visualization

sns.regplot(x=predictions, y=y_test, scatter_kws={'color':'red','edgecolor':'blue','linewidth':0.7}, line_kws={'color':'black','alpha':0.5})
plt.xlabel('Predictions')
plt.ylabel('Overall')
plt.title("Linear Prediction of Player Rating")
🏁 Conclusion
This complete walkthrough showed you how to analyze FIFA 19 data using Python — from data cleaning to machine learning predictions. Try it out and level up your data science game!
If you’re into football and data science — this is how you score a goal in both fields.
Got questions or want to suggest another dataset? Drop a comment below!