Ever wondered what stories Olympic medalist data from 1896 to 2024 can tell us? In this beginner-friendly journey, we’ll use Python for data analysis to explore 128 years of Olympic history – a fun dive into sports analytics. Together, we’ll walk through cleaning a real dataset, doing exploratory analysis, and creating Olympic data visualizations. By the end, you’ll see how Python can uncover trends in the Olympics: which countries dominate the podium, how gender participation has changed, and even how long athletes stay at the top. Grab a coffee and let’s explore this Python data analysis of Olympic medalists like we’re just two friends geeking out over sports stats!
Introduction to the Olympic Medalist Dataset (1896–2024)
Let’s start with the data. Our dataset is a comprehensive collection of Olympic medal winners from the very first modern Games in 1896 up to the 2024 Olympics (yes, Paris 2024 is included!). Each row represents an athlete who won a medal in an Olympic event, including details about the athlete and the event. It’s a rich dataset spanning Summer and Winter Olympics, covering multiple generations of sports heroes. To keep it simple, here’s a breakdown of the key columns you’ll encounter:
| Column | Description |
|---|---|
| Name | Athlete’s name (the medalist) |
| ID | Unique ID for each athlete |
| Sex | Gender of the athlete (M or F) |
| Age | Age of the athlete at the time of the Olympics (in years) |
| Height | Athlete’s height (in centimeters) |
| Weight | Athlete’s weight (in kilograms) |
| Team | Team or country name the athlete represented |
| NOC | National Olympic Committee 3-letter country code (e.g., USA, GBR) |
| Games | Year and season of the Olympics (e.g., 2016 Summer) |
| Year | The year of the Olympic Games (e.g., 2016) |
| Season | Summer or Winter Olympics indicator |
| City | Host city of the Olympics |
| Sport | Sport category (e.g., Athletics, Swimming) |
| Event | Specific event name (e.g., 100m Freestyle Women) |
| Medal | Medal won: Gold, Silver, or Bronze |
Table: Key fields in the Olympic medalist dataset (1896–2024). Each row of data is one athlete’s medal in a specific event.
This dataset is quite large (tens of thousands of medal records!), reflecting over a century of Olympic competitions. For example, Athens 1896 had only 43 events across a handful of sports, whereas Tokyo 2020 featured 339 events – the largest Olympic program to date. That means the number of medals (and rows in our dataset) grew enormously over time along with the Games!
Data source: The data is available thanks to open Olympic datasets (such as those on Kaggle), so you can follow along or try your own analysis. Now, let’s get our hands dirty with the data using Python.
Data Cleaning and Preprocessing with Python
Before jumping into charts and insights, we need to clean and preprocess the data. Think of this step as warming up before a game – it ensures our analysis runs smoothly. Here’s how we tackle it:
- Loading the data: We’ll use pandas to load the dataset (often a CSV file). For example:

  ```python
  import pandas as pd

  df = pd.read_csv('olympic_medalists_1896_2024.csv')
  print(df.head(5))  # sneak peek at first 5 rows
  ```

  This gives us a preview of the data structure and values. At a glance, we might see athletes’ names, their country, sport, year, medal, etc.
- Handling missing values: Olympic data goes way back, so expect some missing entries (for instance, early Games might not have recorded athletes’ ages or weights). We check for nulls with `df.isnull().sum()`. If certain columns like Age or Height have many blanks, we can decide to fill them (e.g., with an average or placeholder) or drop them if they’re not crucial to our analysis. For beginner-friendly analysis, we might ignore those columns if not needed.
- Standardizing data: Make sure text is consistent. For example, the Team field might list countries by name, but some country names changed over time (e.g., Soviet Union vs Russia). The dataset provides a standardized country code in NOC, which we can use for consistency. We might create a separate lookup from NOC to full country name for readability.
- Removing duplicates: It’s unlikely to have duplicate entries in a medalist list (each athlete-event-medal is unique). But if our data had any accidental duplicates, `df.drop_duplicates()` fixes that.
- Data types: Ensure numeric fields (Year, Age, etc.) are numeric and that dates, if any, are properly formatted. In our dataset, Year is already an integer and doesn’t need conversion, but if we had a date column, we’d parse it to datetime.
- Filtering medalists: If our dataset included all participants (with Medal value NA for non-winners), we would filter to just medal winners for analysis, e.g. `df = df[df['Medal'].notna()]`. In our medalist-only dataset, this is already taken care of: every row has a Gold, Silver, or Bronze.
- Accounting for team events: One tricky part is that in team events (like relay races or hockey), each team member appears as a separate row with a medal. That means a single team victory can show up as multiple Gold medal entries (one per athlete). Keep this in mind: if we simply count medals by country using this data, team events will inflate the counts. We need to adjust for this if we want medal counts by event rather than by athlete. A simple approach for now: treat each row as an individual medal achievement (since we are focusing on medalists themselves).
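The checklist above can be sketched in a few lines of pandas. This is a minimal illustration on a handful of made-up rows that borrow the dataset’s column names, not the real file. The last step is one possible way to build an event-level view; using `Year`, `Event`, `NOC`, and `Medal` as the de-duplication key is an assumption, not something the dataset prescribes.

```python
import pandas as pd

# Toy rows standing in for the real CSV (made-up values, real column names).
df = pd.DataFrame({
    "Name":  ["Athlete A", "Athlete B", "Athlete C", "Athlete C", "Athlete D"],
    "Age":   [24.0, None, 31.0, 31.0, 19.0],
    "NOC":   ["USA", "USA", "GBR", "GBR", "CHN"],
    "Year":  [2016, 2016, 2016, 2016, 2016],
    "Event": ["4x100m Relay Men", "4x100m Relay Men",
              "200m Backstroke Men", "200m Backstroke Men",
              "100m Freestyle Women"],
    "Medal": ["Gold", "Gold", "Silver", "Silver", "Bronze"],
})

# 1. Count missing values per column.
missing = df.isnull().sum()

# 2. Drop exact duplicate rows (Athlete C was entered twice).
df = df.drop_duplicates()

# 3. Keep medal winners only (a no-op on a medalist-only file).
df = df[df["Medal"].notna()]

# 4. Event-level view: one row per medal per event and country, so a relay
#    gold counts once instead of once per team member.
event_medals = df.drop_duplicates(subset=["Year", "Event", "NOC", "Medal"])

print(missing["Age"])     # 1 missing age
print(len(df))            # 4 athlete-level rows after de-duplication
print(len(event_medals))  # 3 event-level medals
```

On the real data this key would also merge two medals of the same color won by one country in the same event (possible with tied placings), so treat the event-level counts as an approximation.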
With clean data in hand, we’re ready for the fun part – exploration and visualization!
Exploratory Data Analysis (EDA) and Olympic Data Visualization
Now for the main event: let’s explore the data to uncover patterns and insights. We’ll use Python libraries like pandas, Matplotlib, or Seaborn to crunch numbers and create some visualizations. No heavy statistics needed – just grouping data and plotting to see trends. Below, we break down a few key questions and findings from the Olympic medalist data.
Medal Counts Over Time
How have the Olympics grown over the years? A quick group-by can tell us how many medals were awarded each year:
```python
medals_per_year = df.groupby('Year')['Medal'].count().reset_index()
print(medals_per_year.tail(5))
```
This might show, for example, that 1896 has somewhere over 100 medal rows (there were 43 events with up to three medals each, and team events add one row per team member), whereas recent Summer Games such as 2016 or 2020 have well over 900.
Plotted as a line, the counts rise over time. Early 20th-century Games had fewer events (and medals), but by the late 20th and 21st centuries the line shoots up. This reflects the Olympics adding more sports and events over time. Two things to watch for: since 1994 the Summer and Winter Games have been held in different years, so plotting both seasons as one line produces a zigzag (filter on Season for a clean trend), and some years are missing entirely because the 1916, 1940, and 1944 Games were canceled due to the World Wars. Overall, the trend is clear: the Olympics have expanded dramatically, offering more medals than ever in recent decades!
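Here is a minimal sketch of that line chart. It runs on a few made-up rows rather than the real file, and it keeps only Summer Games so that the smaller Winter Games (held in separate years since 1994) do not make the line jump up and down. The output file name is just a placeholder.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; drop it in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy rows (made-up counts) with the columns the real dataset provides.
df = pd.DataFrame({
    "Year":   [1996, 1996, 1998, 2000, 2000, 2000, 2002],
    "Season": ["Summer", "Summer", "Winter", "Summer", "Summer", "Summer", "Winter"],
    "Medal":  ["Gold", "Silver", "Gold", "Gold", "Silver", "Bronze", "Gold"],
})

# Keep one season so consecutive points are comparable.
summer = df[df["Season"] == "Summer"]
medals_per_year = summer.groupby("Year")["Medal"].count()

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(medals_per_year.index, medals_per_year.values, marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Medal rows")
ax.set_title("Summer Olympic medal rows per Games")
fig.savefig("medals_per_year.png", dpi=100)  # placeholder file name
```

With the real dataset, swapping `"Summer"` for `"Winter"` gives the matching Winter chart.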
Top Countries in Olympic History
Which countries have the most Olympic hardware? We can rank nations by total medals using the Team or NOC field:
```python
top_countries = df.groupby('Team')['Medal'].count().sort_values(ascending=False).head(10)
print(top_countries)
```
This gives the top 10 medal-winning teams. Unsurprisingly, the United States leads by a mile. Keep in mind that this code counts rows, so every member of a medal-winning team adds one, and the totals come out higher than official tallies. By official event-level counts, Team USA accumulated over 3,000 Olympic medals across the 1896–2024 period (combining Summer & Winter Games), a testament to their long-running sports prowess. Other top countries include the former Soviet Union (and Russia), Germany, Great Britain, and China, all with hefty medal counts in the high hundreds or thousands.
A bar chart of the all-time top medal countries shows a steep drop-off after the US. For example, the Soviet Union on its own has roughly 1,200 medals by official counts, which is impressive but still well under half of the U.S. total. One thing to note is that the ranking depends on how you treat historical nations (e.g., USSR) and unified teams: counting them separately or merging them with their successors changes the totals. But as a whole, a small group of countries has won a large share of Olympic medals.
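To see how much team events can inflate a country’s total, here is a small sketch that counts medals two ways: once per athlete row and once per event. The rows are made up for illustration, grouping uses NOC rather than Team, and the event-level key (`Year`, `Event`, `NOC`, `Medal`) is an assumption that works for this toy data.

```python
import pandas as pd

# Made-up rows: three basketball players share one team gold.
df = pd.DataFrame({
    "NOC":   ["USA", "USA", "USA", "USA", "GBR", "GBR", "CHN"],
    "Year":  [2016] * 7,
    "Event": ["Basketball Men", "Basketball Men", "Basketball Men",
              "100m Men", "200m Men", "400m Men", "Platform Women"],
    "Medal": ["Gold", "Gold", "Gold", "Silver", "Gold", "Gold", "Gold"],
})

# Athlete-level: every row counts, so the basketball gold counts three times.
by_athlete = df.groupby("NOC")["Medal"].count().sort_values(ascending=False)

# Event-level: collapse team members into one medal per event and country.
events = df.drop_duplicates(subset=["Year", "Event", "NOC", "Medal"])
by_event = events.groupby("NOC")["Medal"].count().sort_values(ascending=False)

comparison = pd.DataFrame({"athlete_rows": by_athlete, "event_medals": by_event})
print(comparison)
```

In this toy example the USA drops from 4 athlete rows to 2 event medals, while GBR and CHN stay the same because all of their medals came from individual events.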
Gender Trends in Olympic Medalists
The Olympics started as an almost exclusively male affair – but how have things changed in terms of female participation? Our data has a Sex column (M/F), so we can examine the number of medals won by women over time versus men:
```python
medals_by_gender_year = df.groupby(['Year', 'Sex'])['Medal'].count().reset_index()
# Pivot so each sex gets its own column (F and M), indexed by Year
medals_by_gender = medals_by_gender_year.pivot(index='Year', columns='Sex', values='Medal').fillna(0)
print(medals_by_gender.tail(5))
```
Plotting the trends for Summer Games, for example:
We’d see that women’s medal counts were extremely low in early Olympics; women first competed in 1900, and only in a few events. Through the early 20th century, female medalists were a tiny fraction. But the line steadily rises: by the mid-20th century more women are on the podium, and in recent Games women’s medal counts approach or equal men’s. In fact, in the Olympics of the 2020s, women’s events make up almost half of the program as more women’s sports have been added. This is a remarkable shift toward gender equality in Olympic sports over 100+ years. (The Winter Olympics show a similar catch-up pattern, just offset since Winter Games started in 1924.)
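A plot like the one described could be produced along these lines. This is a sketch on made-up rows, restricted to Summer Games; the `female_share` column and the output file name are additions for illustration and are not part of the dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; drop it in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows; only Year, Season, Sex and Medal are needed here.
df = pd.DataFrame({
    "Year":   [1900, 1900, 1900, 1960, 1960, 1960, 2020, 2020],
    "Season": ["Summer"] * 8,
    "Sex":    ["M", "M", "M", "M", "M", "F", "M", "F"],
    "Medal":  ["Gold"] * 8,
})

summer = df[df["Season"] == "Summer"]

# One row per Games, one column per sex; years with no women get 0.
counts = summer.groupby(["Year", "Sex"])["Medal"].count().unstack(fill_value=0)
counts["female_share"] = counts["F"] / (counts["F"] + counts["M"])

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(counts.index, counts["M"], marker="o", label="Men")
ax.plot(counts.index, counts["F"], marker="o", label="Women")
ax.set_xlabel("Year")
ax.set_ylabel("Medal rows")
ax.set_title("Summer Olympic medal rows by sex")
ax.legend()
fig.savefig("medals_by_sex.png", dpi=100)  # placeholder file name

print(counts)
```

Plotting `female_share` instead of the raw counts is a handy alternative, since it removes the effect of the Games simply getting bigger.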
In short, women athletes have gone from modest beginnings to near parity today. Visualizing this trend is both inspiring and a clear indicator of social progress in sports.
Athlete Age and Longevity
Olympic glory isn’t just for the young – so what does the data say about athlete ages and longevity? We can explore the age distribution of medalists:
```python
print(df['Age'].describe())
```
This will give us the average, min, and max ages of medal winners (missing ages are skipped). We might find the average age is somewhere in the mid-20s (many athletes peak in their 20s). But the range can be surprising: medalists can be in their early teens (13-year-olds reached the podium in skateboarding in 2020!), whereas the oldest medalist on record was a 72-year-old shooter in 1920. Yes, someone grandpa-age won an Olympic medal, proving age is just a number!
We can also plot a histogram of ages:
It would likely show a roughly bell-shaped curve centered around 20–30 years, with a longer tail towards older ages (for sports like shooting or equestrian, where athletes in their 40s, 50s or beyond have succeeded). This analysis might prompt us to dig deeper: do certain sports skew older (e.g., equestrian, shooting) while others skew younger (e.g., gymnastics, which often has teen champions)? A quick breakdown by sport could confirm that, but we’ll keep things simple here.
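Both ideas, the histogram and the per-sport breakdown, fit in a short sketch. The ages below are made up for illustration, the five-year bin width is an arbitrary choice, and the output file name is a placeholder.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; drop it in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up ages; one value is missing, as often happens for early Games.
df = pd.DataFrame({
    "Sport": ["Gymnastics", "Gymnastics", "Swimming", "Swimming",
              "Shooting", "Shooting", "Equestrian"],
    "Age":   [16, 18, 22, 24, 38, 45, None],
})

# Drop rows without an age before plotting or aggregating.
known = df.dropna(subset=["Age"])
ages = known["Age"]

fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(ages, bins=list(range(10, 81, 5)), edgecolor="black")
ax.set_xlabel("Age (years)")
ax.set_ylabel("Number of medalists")
ax.set_title("Age distribution of medalists")
fig.savefig("age_histogram.png", dpi=100)  # placeholder file name

# Median age per sport, youngest first.
median_age_by_sport = known.groupby("Sport")["Age"].median().sort_values()
print(median_age_by_sport)
```

The median is used rather than the mean so that a single very old or very young medalist does not pull a sport’s figure too far.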
Athlete longevity can also refer to how long athletes compete. Some legends appear in multiple Olympics over decades. With a bit more advanced analysis, we could track athletes by ID to see how many different years they won medals. That could uncover the longest Olympic careers (there are athletes with 5+ Olympic appearances!). However, even without crunching that, the age data already shows that Olympic medalists can span generations – from school-age prodigies to seasoned veterans.
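Tracking athletes by ID could look something like this. It is a sketch on made-up rows, and because the dataset only lists medalists, it measures the span between an athlete’s first and last medal rather than every Olympic appearance. The column names `games_with_medal`, `first_medal`, `last_medal`, and `span_years` are invented for this example.

```python
import pandas as pd

# Made-up medal rows; Athlete A wins two medals in 2000 and one in 2012.
df = pd.DataFrame({
    "ID":    [1, 1, 1, 2, 2, 3],
    "Name":  ["Athlete A", "Athlete A", "Athlete A",
              "Athlete B", "Athlete B", "Athlete C"],
    "Year":  [2000, 2000, 2012, 2008, 2012, 2016],
    "Medal": ["Gold", "Silver", "Bronze", "Gold", "Gold", "Silver"],
})

careers = (
    df.groupby(["ID", "Name"])
      .agg(
          games_with_medal=("Year", "nunique"),  # distinct years with a medal
          first_medal=("Year", "min"),
          last_medal=("Year", "max"),
      )
      .reset_index()
)
careers["span_years"] = careers["last_medal"] - careers["first_medal"]
careers = careers.sort_values("span_years", ascending=False)

print(careers)
```

Grouping by ID as well as Name keeps two different athletes who share a name from being merged into one career.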

Feature Engineering: Adding New Insights
To level up our analysis, let’s create a few new features from the data. Feature engineering means adding extra columns or metrics that help reveal patterns. Here are a few beginner-friendly ones that yield cool insights:
- Decade: Instead of looking year by year (which can be a lot of data points), we can group results by decade. For example, create a column `Decade` that floors the Year to its decade start (e.g., 1996 -> 1990s). This helps to compare eras: the 1890s vs 1990s vs 2010s, etc. We might find that the 2010s had an explosion of medals compared to, say, the 1950s, due to the Olympic expansion. It smooths out year-to-year fluctuations (like missing Games during wars) and highlights long-term growth.
- Team vs Individual Event: We mentioned earlier that team sports can skew medal counts. If our dataset has an indicator of whether an event is team-based or individual (some datasets have a column for participant type or we can infer from the Sport), we can create a feature `EventType` with values “Team” or “Individual”. This lets us analyze medals from team events separately. For example, do team events contribute a large share of medals? We might discover that team sports award fewer medals overall (because there are fewer team events than individual events), or see how team event medals are distributed among countries. If not directly given, we could infer: sports like Basketball, Hockey, Relay (in Athletics/Swimming) are team events, whereas most track, swimming races, etc., are individual. With this feature, one analysis could be a bar chart comparing the count of gold/silver/bronze from team vs individual events, perhaps showing team events tend to yield multiple medalists at once (one team win = many athletes get medals, which is why our dataset has more medalist entries for those events).
- Medal Points: Not all medals shine equally in the eyes of some analysts. A common sports analytics technique is to assign point values to medals to create a weighted score. For instance, Gold = 3 points, Silver = 2, Bronze = 1 (you can choose any scheme, but this one is simple). We can add a new column `Points` using a map:

  ```python
  medal_points = {'Gold': 3, 'Silver': 2, 'Bronze': 1}
  df['Points'] = df['Medal'].map(medal_points)
  ```

  Now, if we sum `Points` by country, we get a medal points tally. This can sometimes change the ranking of top countries (since a country with fewer but mostly gold medals might outrank one with many bronzes). It’s an interesting lens: for example, countries like China often rank higher by golds (and thus by points) than by total count. You could create a table of top 10 countries by points versus by raw count to see the difference. It’s a neat feature for Olympic data visualization too: imagine a stacked bar where each medal type’s contribution is weighted.
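The three features above can be sketched together on a few made-up rows. The `TEAM_SPORTS` set and the “Relay or Team in the event name” rule are hypothetical heuristics for this example; a real analysis would need a more careful list of team events.

```python
import pandas as pd

# Made-up rows: one country with three bronzes, another with two golds.
df = pd.DataFrame({
    "NOC":   ["USA", "USA", "USA", "CHN", "CHN", "GBR"],
    "Year":  [1996, 2012, 2016, 2008, 2016, 2020],
    "Sport": ["Basketball", "Athletics", "Athletics", "Diving", "Diving", "Hockey"],
    "Event": ["Basketball Men", "4x100m Relay Men", "100m Men",
              "Platform Women", "Springboard Men", "Hockey Women"],
    "Medal": ["Bronze", "Bronze", "Bronze", "Gold", "Gold", "Silver"],
})

# Decade: floor the year to the start of its decade (1996 -> 1990).
df["Decade"] = (df["Year"] // 10) * 10

# EventType: hypothetical heuristic based on sport and event name.
TEAM_SPORTS = {"Basketball", "Hockey", "Football", "Volleyball"}
is_team = df["Sport"].isin(TEAM_SPORTS) | df["Event"].str.contains("Relay|Team", case=False)
df["EventType"] = is_team.map({True: "Team", False: "Individual"})

# Points: weight medals 3/2/1 as described above.
medal_points = {"Gold": 3, "Silver": 2, "Bronze": 1}
df["Points"] = df["Medal"].map(medal_points)

# Compare rankings by raw count and by weighted points.
count_by_country = df.groupby("NOC")["Medal"].count().sort_values(ascending=False)
points_by_country = df.groupby("NOC")["Points"].sum().sort_values(ascending=False)

print(df[["Year", "Decade", "Event", "EventType", "Medal", "Points"]])
print(count_by_country)
print(points_by_country)
```

In this toy data the country with three bronzes leads on raw count, while the country with two golds leads on points, which is exactly the kind of reshuffle the points scheme is meant to surface.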

By engineering these features, we extend our analysis beyond what’s directly in the dataset. It’s like asking new questions: Which decade was the most competitive? Do some countries specialize in team sports? Who wins when we value golds more? Feature engineering helps transform raw Olympic medalist data into deeper sports analytics insights.
Key Insights Recap
We’ve crunched a lot of numbers and plotted some graphs. Let’s summarize the key insights we discovered from the Olympic medalist data:
- The Olympics have grown massively: From just 43 events in 1896 to 339 events in 2020, the number of medals available each Games has skyrocketed. Our line chart of medals over time showed a dramatic rise, especially post-1950s as new sports were added.
- A few countries dominate the medal counts: The United States is the all-time leader in Olympic medals (over 3,000 to date), far ahead of the next countries. The Soviet Union/Russia, Germany, Great Britain, and China round out the top contenders, but no one comes close to the U.S. in total hardware. A small group of nations takes a large share of medals, reflecting historical sports investment and population.
- Gender balance has improved over time: Early Olympics had very few female medalists, but today women win almost as many medals as men. The gap has closed significantly, especially from the late 20th century onward, thanks to the inclusion of more women’s events and broader female participation. This is visible in our male vs female trends chart – the lines are converging.
- Athletes come in all ages: While most medalists are in their 20s, we saw that teens can triumph and older athletes can still snag medals. The oldest Olympic medalist was 72 years old, a reminder that experience can beat youth in some sports. Different sports have different age profiles, but overall, Olympic champions aren’t limited to a narrow age range.
- Custom metrics add perspective: By creating features like decade groupings, team vs individual, or medal points, we can reinterpret the data. For instance, scoring medals by points can shuffle country rankings; distinguishing team events helps us avoid over-counting in medal stats. These engineered insights deepen our understanding beyond the basics.
Conclusion: Exploring Olympic Data with Python
Analyzing Olympic medalist data with Python has shown us how much we can learn from over a century of sports history. Using simple Python data analysis techniques (pandas for data crunching, plotting libraries for visualization), we uncovered trends that turn raw numbers into stories – the rise of nations, the empowerment of women in sports, and the timeless quest for gold. This journey barely scratches the surface of Olympic sports analytics, but it’s a solid start that any beginner can replicate and build upon.
Ready to explore further? This dataset is a goldmine (pun intended) for curious minds. Try slicing the data in different ways: maybe look at your favorite sport, or see how host countries perform when the Games are at home. You could even use advanced tools to predict future medal counts! The possibilities are endless.
In summary, Python makes it easy (and fun) to turn historical data into insights. Why not give it a try? Grab the Olympic dataset, follow the steps we outlined, and see what Olympic data visualization you come up with. We encourage you to play with the data and share your own Olympic insights – after all, the best way to learn data analysis is by diving into a topic you find exciting. Happy analyzing, and may the sports (data) be ever in your favor! 🎉