Okay, so I got this itch a while back to really dig into some baseball stuff. Not just the box scores, you know? I wanted to see trends, maybe figure out some patterns myself. So, I thought, “Where do people get this data?” That started the whole journey into finding MLB data sets.

First off, I just started searching online. Typed in things like “baseball stats download,” “historical mlb data,” stuff like that. A few names popped up pretty quick. Lahman database, Retrosheet, Statcast. Sounded promising.
Getting Started: The Lahman Database
I decided to tackle the Lahman one first. Heard it was kinda the standard for historical data, going way back. Found a place to download it – it was just a bunch of CSV files zipped up. Seemed easy enough, right?
Well, I unzipped it and holy cow, there were a lot of files. Like, tons. Files for batting, pitching, fielding, teams, salaries, awards, managers… even stuff about the Hall of Fame and player appearances. It was overwhelming at first glance.
- Opened the main ‘People’ file (older releases call it ‘Master’, which is probably why my memory’s a bit fuzzy). Lots of player IDs, names, birth dates. Okay, makes sense.
- Then opened the ‘Batting’ file. More IDs, stats year by year.
- Tried to figure out how to link them. Like, find Babe Ruth’s batting stats. Took me a bit to realize I needed to match the ‘playerID’ across the files.
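
Here's a minimal sketch of that linking step, using Python's built-in sqlite3 module. The handful of rows are hand-typed stand-ins for the real People and Batting files (only a few columns each), and the batting_lines helper is a name made up for illustration, so treat it as a rough outline rather than the one right way to do it.

```python
import sqlite3

# Tiny hand-typed stand-ins for Lahman's People and Batting tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE People  (playerID TEXT PRIMARY KEY, nameFirst TEXT, nameLast TEXT);
    CREATE TABLE Batting (playerID TEXT, yearID INTEGER, teamID TEXT, HR INTEGER);
    INSERT INTO People  VALUES ('ruthba01', 'Babe', 'Ruth'),
                               ('gehrilo01', 'Lou', 'Gehrig');
    INSERT INTO Batting VALUES ('ruthba01', 1927, 'NYA', 60),
                               ('ruthba01', 1928, 'NYA', 54),
                               ('gehrilo01', 1927, 'NYA', 47);
""")

def batting_lines(conn, first, last):
    """Return (yearID, teamID, HR) rows for a player looked up by name."""
    query = """
        SELECT b.yearID, b.teamID, b.HR
        FROM People AS p
        JOIN Batting AS b ON b.playerID = p.playerID  -- the shared key
        WHERE p.nameFirst = ? AND p.nameLast = ?
        ORDER BY b.yearID
    """
    return conn.execute(query, (first, last)).fetchall()

ruth_rows = batting_lines(conn, "Babe", "Ruth")
for year, team, hr in ruth_rows:
    print(year, team, hr)
```

The name only lives in People and the stats only live in Batting, so playerID is the one thing tying them together.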
I initially tried just using my regular spreadsheet program. It kinda worked for looking at one file, but trying to combine info from the ‘People’ file and the ‘Batting’ file? It choked. Or maybe I choked, trying to figure out VLOOKUPs across massive tables. It got messy fast.
Dipping Toes into Other Waters
While wrestling with Lahman, I kept reading about Retrosheet. People said it had play-by-play data. Like, every play in games going back decades, and for a lot of seasons the pitch sequences too. That sounded amazing, but also way more complicated. The data wasn’t just simple tables; it was in its own event-file format. You needed specific tools to parse it. I downloaded some of their files, took one look, and thought, “Nope, not today.” Put that on the back burner.
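
To give a feel for why it looked intimidating, here's a rough sketch of reading that kind of file in Python. It assumes the comma-separated "play" record layout as I understand it (inning, visitor/home flag, batter ID, count, pitch sequence, event code); the sample lines, team codes, and player IDs are invented for illustration, and parse_plays is a hypothetical helper, not one of Retrosheet's own tools. Check the layout against Retrosheet's documentation before trusting it.

```python
import csv
import io

# Invented lines written in the style of a Retrosheet event file.
SAMPLE = '''id,ZZZ199901010
info,visteam,YYY
start,batta001,"Al Batter",0,1,8
play,1,0,batta001,12,CBFX,S7
play,1,0,battb001,02,CSS,K
'''

# Assumed field order for "play" records (everything after the record type).
PLAY_FIELDS = ("inning", "is_home", "batter_id", "count", "pitches", "event")

def parse_plays(text):
    """Collect the 'play' records from event-file text as dictionaries."""
    plays = []
    for row in csv.reader(io.StringIO(text)):
        if row and row[0] == "play":
            play = dict(zip(PLAY_FIELDS, row[1:]))
            play["inning"] = int(play["inning"])
            play["is_home"] = play["is_home"] == "1"  # "0" means visiting team
            plays.append(play)
    return plays

plays = parse_plays(SAMPLE)
for play in plays:
    print(play["inning"], play["batter_id"], play["event"])
```

Splitting the fields is the easy part. The event codes themselves are their own little language, which is where the dedicated tools earn their keep.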

Then there’s Statcast. This is the new hotness, right? All the detailed tracking data – exit velocity, launch angle, fielder movements. Super cool stuff you see on broadcasts. But getting access to that in bulk? Seemed like you needed some programming skills, maybe using Python or R libraries people have built to grab it from the web. Again, looked like a bigger project than just downloading some CSVs.
What I Actually Did
So, I stuck with Lahman for a bit. Fired up a simple database tool on my computer – just something basic to handle SQL queries. That made things way easier than spreadsheets.
I started simple:
- Found players with the most home runs in a specific decade.
- Looked at team win totals over time.
- Tried to see if batting averages have generally gone up or down.
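
Those first questions boil down to grouping and summing. A minimal sketch with Python's sqlite3, assuming Batting columns named playerID, yearID, AB, H, and HR; the rows and player IDs are made-up toy data, and the two function names are hypothetical:

```python
import sqlite3

# Made-up rows in the shape of Lahman's Batting table (a few columns only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Batting (playerID TEXT, yearID INTEGER, AB INTEGER, H INTEGER, HR INTEGER);
    INSERT INTO Batting VALUES
        ('playera01', 1990, 500, 150, 30),
        ('playera01', 1991, 480, 120, 25),
        ('playerb01', 1990, 400, 100, 40),
        ('playerb01', 2001, 450, 135, 50);
""")

def top_home_runs(conn, decade_start, limit=10):
    """Total home runs per player over the ten seasons starting at decade_start."""
    query = """
        SELECT playerID, SUM(HR) AS total_hr
        FROM Batting
        WHERE yearID BETWEEN ? AND ?
        GROUP BY playerID
        ORDER BY total_hr DESC, playerID
        LIMIT ?
    """
    return conn.execute(query, (decade_start, decade_start + 9, limit)).fetchall()

def batting_average_by_year(conn):
    """Overall batting average per season: total hits divided by total at-bats."""
    query = """
        SELECT yearID, ROUND(SUM(H) * 1.0 / SUM(AB), 3) AS avg
        FROM Batting
        GROUP BY yearID
        HAVING SUM(AB) > 0
        ORDER BY yearID
    """
    return conn.execute(query).fetchall()

print(top_home_runs(conn, 1990))
print(batting_average_by_year(conn))
```

Summing hits and at-bats before dividing matters here: averaging each player's average would give a part-timer with 20 at-bats the same weight as an everyday starter.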
It was fun! But even simple questions sometimes needed joining three or four different tables (like getting the player’s name, their stats, and the team name for that year). It took time just to understand the structure, figure out what columns meant (some abbreviations weren’t obvious), and clean up little inconsistencies.
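
A sketch of what one of those multi-table joins looks like, again with Python's sqlite3. The rows are hand-typed stand-ins for the real People, Batting, and Teams files with most columns left out, and season_lines is a hypothetical helper name:

```python
import sqlite3

# Hand-typed stand-ins for three Lahman tables (a few columns each).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE People  (playerID TEXT, nameFirst TEXT, nameLast TEXT);
    CREATE TABLE Batting (playerID TEXT, yearID INTEGER, teamID TEXT, HR INTEGER);
    CREATE TABLE Teams   (yearID INTEGER, teamID TEXT, name TEXT);
    INSERT INTO People  VALUES ('ruthba01', 'Babe', 'Ruth');
    INSERT INTO Batting VALUES ('ruthba01', 1919, 'BOS', 29),
                               ('ruthba01', 1920, 'NYA', 54);
    INSERT INTO Teams   VALUES (1919, 'BOS', 'Boston Red Sox'),
                               (1920, 'BOS', 'Boston Red Sox'),
                               (1919, 'NYA', 'New York Yankees'),
                               (1920, 'NYA', 'New York Yankees');
""")

def season_lines(conn, player_id):
    """Return (player name, year, team name, HR) for each of a player's seasons."""
    query = """
        SELECT p.nameFirst || ' ' || p.nameLast AS player,
               b.yearID, t.name, b.HR
        FROM Batting AS b
        JOIN People AS p ON p.playerID = b.playerID
        JOIN Teams  AS t ON t.teamID = b.teamID
                        AND t.yearID = b.yearID  -- Teams has one row per team per season
        WHERE b.playerID = ?
        ORDER BY b.yearID
    """
    return conn.execute(query, (player_id,)).fetchall()

lines = season_lines(conn, "ruthba01")
for line in lines:
    print(line)
```

The easy mistake is joining Teams on teamID alone. Since the table repeats each team once per season, leaving out yearID quietly multiplies your rows.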
Honestly, it was a bigger task than I expected. It wasn’t just plug-and-play. You gotta invest some time just learning the landscape of the data itself before you can even ask interesting questions. I didn’t get as far as building some fancy prediction model or anything. Mostly just explored and got a feel for what’s available and how messy real-world data can be, even curated stuff like Lahman.

It’s a deep well, this MLB data stuff. You can spend ages just cleaning and understanding it. But pretty cool what’s out there if you’re willing to roll up your sleeves.