Little update because I got at least some interest. I have been working on the Retrosheet event files, which give a play-by-play account of games in varying degrees of detail. You can reconstruct quite a bit from these, though the further back you go the lower the quality (e.g., you don't get pitch data until ~1990).
Anyway, I am now giving every game a uniquely generated ID to simplify the DB operations. While testing my play-by-play processor, it turned out that game 100,000 was played between Baltimore and Boston on September 6th, 1963. The output I am generating looks like this:
applyPlay B1 BOS 1 BAL 0 2 outs
BAL poweb101 at bat BOS monbb101 pitching
onBase: 1: 2: 3:
play= D/78
double
play runs= 0 outs= 0
applyPlay B1 BOS 1 BAL 0 2 outs
BAL gentj101 at bat BOS monbb101 pitching
onBase: 1: 2: poweb101 3:
play= W
walk
runner on 2 does not advance
play runs= 0 outs= 0
applyPlay B1 BOS 1 BAL 0 2 outs
BAL branj101 at bat BOS monbb101 pitching
onBase: 1: gentj101 2: poweb101 3:
play= HR.2-H;1-H
home run
runner on 2 ( poweb101 ) scored
runner on 1 ( gentj101 ) scored
basesAdv= {1, 2}
play runs= 3 outs= 0
The "play" line is what I'm getting from Retrosheet; from there I have to infer just about everything else. (It does have inning and batter info, but I also keep track of this internally and verify I am doing the right thing. For example, when it says a certain batter is up, I check that they were next in the order.)
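A minimal sketch of that kind of batting-order check, in Python. The function name `check_batter`, the list-of-IDs lineup, and the error message are all hypothetical, picked just for illustration; a real version would also need to update the lineup whenever the event file records a substitution.

```python
def check_batter(lineup, last_index, batter_id):
    """Return the lineup index of batter_id, or raise if they are not due up.

    lineup     -- player IDs in batting order (hypothetical structure)
    last_index -- index of this team's previous batter, -1 before the first at bat
    """
    expected_index = (last_index + 1) % len(lineup)  # wrap around after the last slot
    expected = lineup[expected_index]
    if batter_id != expected:
        raise ValueError(f"expected {expected} at bat, event file says {batter_id}")
    return expected_index


# Truncated, illustrative lineup: only the three batters visible in the log,
# listed in the order they came up. A real lineup has nine slots.
lineup = ["poweb101", "gentj101", "branj101"]
idx = -1
for batter in ["poweb101", "gentj101", "branj101"]:
    idx = check_batter(lineup, idx, batter)
```

Raising on a mismatch, rather than silently correcting it, makes it obvious when either the internal state or the event file is off.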
Anyway, parsing some of these "play" lines can be quite challenging... for this particular game I think everything matched Baseball Reference (I summarize at the end), except that one catcher assist was credited as a putout.
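To give a feel for the parsing, here is a rough Python sketch that handles only the simple cases shown in the log. It assumes the usual Retrosheet layout of a basic play, then "/"-separated modifiers, then a "." followed by ";"-separated runner advances. The name `parse_play`, the returned dict, and the small `EVENT_NAMES` table are hypothetical and cover only a handful of event codes; the hard cases (errors, fielding sequences, parenthesized annotations on advances) are not handled.

```python
import re

# Small, incomplete table of event codes; just enough for the plays in the log.
EVENT_NAMES = {
    "S": "single", "D": "double", "T": "triple",
    "HR": "home run", "W": "walk", "K": "strikeout",
}

# One runner advance such as "2-H" (safe) or "1X2" (out on the bases).
ADVANCE_RE = re.compile(r"([B123])([-X])([123H])")


def parse_play(play):
    """Split a play string into event, modifiers, and runner advances."""
    main, _, adv_part = play.partition(".")
    basic, *modifiers = main.split("/")

    # Leading letters are the event code ("D7" -> "D"); anything else,
    # such as a pure fielding sequence like "63", is passed through as-is.
    m = re.match(r"[A-Z]+", basic)
    code = m.group(0) if m else basic

    advances = []
    for adv in filter(None, adv_part.split(";")):
        m = ADVANCE_RE.match(adv)
        if m:
            start, sep, end = m.groups()
            advances.append((start, end, sep == "-"))  # (from, to, safe?)

    return {
        "event": EVENT_NAMES.get(code, code),
        "modifiers": modifiers,
        "advances": advances,
    }


home_run = parse_play("HR.2-H;1-H")
double = parse_play("D/78")
```

Here `home_run["advances"]` holds one tuple per runner, which is the part that has to be reconciled with the current `onBase` state to work out who scored.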