RoSIN is a formal grammar for describing baseball plays. It is based on a language that has been used to describe baseball play-by-play for over 10 years. That language, used by RetroSheet (the baseball stats archive found at retrosheet.org), has a somewhat larger syntax, and has been used to account for every Major League Baseball game from 1908 to 1992.
The original RetroSheet syntax is part of a larger data format called the RetroSheet "Event File." These Event Files contain several components, including:
Within any given Play Description, the RetroSheet Event File includes a RoSIN-esque string, but also includes an inning value, a visiting/home team flag, a Player ID, the ball-strike count, and the pitch sequence. For example:
play,8,0,santr001,21,BBCX,S1.2-H(NR);1-H(NR);B-3(E1)
The RoSIN-esque string is the final field in the line: S1.2-H(NR);1-H(NR);B-3(E1). RoSIN itself was created as a simplified (but fully functional) subset of all RetroSheet Event File play strings.
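To make the layout of the example record concrete, here is a minimal Python sketch that splits a play line into its comma-separated fields and then separates the final field into the play description and its runner advances. The names PlayRecord, parse_play_line, and split_event are hypothetical, as are the field labels; the split on the first "." and on ";" reflects only what the example above shows, not the full RoSIN grammar.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PlayRecord:
    """One 'play' line; field labels are illustrative assumptions."""
    inning: int
    team_flag: str   # assumed: 0 = visiting team, 1 = home team
    player_id: str
    count: str       # ball-strike count, e.g. "21"
    pitches: str     # pitch sequence, e.g. "BBCX"
    event: str       # the RoSIN-esque play string


def parse_play_line(line: str) -> PlayRecord:
    # Limit the split to 6 commas so the event string stays in one piece.
    parts = line.strip().split(",", 6)
    if len(parts) != 7 or parts[0] != "play":
        raise ValueError(f"not a play record: {line!r}")
    _, inning, team_flag, player_id, count, pitches, event = parts
    return PlayRecord(int(inning), team_flag, player_id, count, pitches, event)


def split_event(event: str) -> Tuple[str, List[str]]:
    # Text before the first "." describes the play; the rest lists
    # runner advances separated by ";".
    description, _, advances = event.partition(".")
    return description, advances.split(";") if advances else []


record = parse_play_line("play,8,0,santr001,21,BBCX,S1.2-H(NR);1-H(NR);B-3(E1)")
description, advances = split_event(record.event)
print(description)  # S1
print(advances)     # ['2-H(NR)', '1-H(NR)', 'B-3(E1)']
```

This only separates the pieces; deciding whether each piece is valid is the job of a real RoSIN parser.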
By restricting itself to a select subset of the characters available in RetroSheet's play descriptions, RoSIN enables software parsers to validate play-by-play reporting unambiguously.
Such validation should help ensure the accuracy of data entry tools, and also simplify the construction of statistical presentation tools. Catching errors within play descriptions should also help keep data integrity problems out of databases.
Because one of RoSIN's design objectives is to represent any play found in any RetroSheet Event File, it should be possible to normalize every RetroSheet play into RoSIN.
An agreed-upon RoSIN specification can help software developers work with baseball content in many ways. This section focuses on the use of RoSIN along with two other sports data standards:
SportsML is a robust, readily extensible standard for exchanging information about all sorts of sporting events. The schema is architected such that data properties that are common across most or many sports are included in the "SportsML Core," whereas sport-specific items reside within plugin schemas. SportsML documents can house lineups, injury reports, box scores, batter-by-batter coverage, cumulative season stats, contextual stats, and much more.
Within SportsML's <event-actions> section, there are constructs to describe at-bats and player substitutions. The <action-baseball-play> and <action-baseball-score> elements contain attributes for many high-level datapoints of the play, but do not hold attributes that can trace the path of the ball among defensive players, as RoSIN does. These SportsML elements do contain an attribute for "scorekeeper notation" that should house the full RoSIN string.
SportsML can also house details of every pitch, much like the RetroSheet Event Files do. Current SportsML pitch descriptions include pitch-type and ball-location. Other attributes could be added in the future, including pitch-velocity and vector data.
SportsDB is a relational database schema whose design objectives are to:
Not every SportsML attribute and RoSIN substring will or should be excerpted into its own SportsDB database field. Only those that SportsDB users would clearly want to index and query independently need to be parsed into a unique field.
SportsDB can store a pointer to the original, full SportsML file, along with a field for the full RoSIN string (as well as a YAML version of the RoSIN string), to enable further querying if necessary.
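The storage approach described here can be sketched with Python's built-in sqlite3 module. Everything about the schema below is an assumption for illustration: the table name baseball_play, its columns, and the file path are hypothetical and are not taken from the actual SportsDB schema. The idea shown is simply that a few values get their own indexed columns while the full RoSIN string and a pointer to the SportsML file are kept alongside them.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical table: only values worth querying on their own get a
    # dedicated column; the full RoSIN string is kept whole.
    conn.execute(
        """
        CREATE TABLE baseball_play (
            id           INTEGER PRIMARY KEY,
            inning       INTEGER NOT NULL,
            batter_id    TEXT NOT NULL,
            rosin        TEXT NOT NULL,  -- full RoSIN string
            rosin_yaml   TEXT,           -- YAML form; left unset in this sketch
            sportsml_uri TEXT            -- pointer to the full SportsML file
        )
        """
    )
    conn.execute("CREATE INDEX idx_play_batter ON baseball_play (batter_id)")


def insert_play(conn, inning, batter_id, rosin, sportsml_uri):
    conn.execute(
        "INSERT INTO baseball_play (inning, batter_id, rosin, sportsml_uri) "
        "VALUES (?, ?, ?, ?)",
        (inning, batter_id, rosin, sportsml_uri),
    )


def plays_for_batter(conn, batter_id):
    # Query on the indexed column, then return the full strings so a
    # RoSIN parser can do any deeper analysis.
    rows = conn.execute(
        "SELECT rosin FROM baseball_play WHERE batter_id = ? ORDER BY id",
        (batter_id,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
create_schema(conn)
insert_play(conn, 8, "santr001", "S1.2-H(NR);1-H(NR);B-3(E1)",
            "example/game.xml")  # hypothetical path
print(plays_for_batter(conn, "santr001"))
```

A query that needs something not broken out into a column, such as which fielder committed an error, would fetch the stored string and parse it on demand.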