The Uncommon Data Set
The Common Data Set is a beautiful idea. Twenty-seven years ago, three college-guide publishers (the College Board, Peterson's, and U.S. News) sat down with a group of college institutional research offices and agreed on a single template for reporting the numbers that matter about a school. Enrollment. Admissions. Retention. Tuition. Financial aid. Faculty.
The CDS Initiative still publishes that template today. It is a 47-page XLSX with 1,105 fields. It has a beautifully structured Answer Sheet tab. It is, genuinely, one of the cleanest open data standards in American higher education.
The name is “Common Data Set.”
The reality is extremely uncommon.
What we found
Schools publish their CDS in every format imaginable: fillable PDFs where every answer lives in a named form field (14% of the corpus), flattened PDFs where the form structure has been destroyed (84%), scanned images, XLSX files, DOCX files, HTML pages behind JavaScript frameworks, Box embeds, SharePoint pages, and Google Drive shares.
One school hosts their CDS on a Bepress Digital Commons page that intercepts scrapers with a 202 Accepted response and an empty body. Another has a “test draft” at a URL that is actually the real production file. Two different schools shared the same physical file via a common Google Drive link.
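Traps like the empty-202 response have to be caught before a fetch result is archived as a "document." Here is a minimal sketch of that sanity check; the function name and the 1 KB size floor are our own assumptions, not the project's actual code.

```python
def looks_like_real_document(status: int, body: bytes) -> bool:
    """Return True only when a fetch plausibly returned a CDS file."""
    if status != 200:
        # Bepress Digital Commons hands scrapers a 202 Accepted with an
        # empty body, so anything other than a plain 200 is a miss.
        return False
    # An empty or near-empty body is anti-bot residue or an interstitial
    # page, not a multi-page PDF or XLSX. (1 KB floor is an assumption.)
    return len(body) > 1024
```

A check this cheap runs on every crawl result, so the archive only ever stores bytes worth hashing.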
None of this is malicious. It's what happens when you release a beautiful canonical template into a distributed system of 800+ institutional research offices, each with their own webmaster, CMS, and IT policies. Over nearly three decades, the drift is cumulative.
What we built
collegedata.fyi is the index. We discover each school's CDS document, archive the source file immediately (SHA-addressed, preserved forever), extract the numbers into the CDS Initiative's own canonical 1,105-field schema, and expose the result as a queryable API.
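"SHA-addressed" means the storage key is derived from the file's own bytes, so identical files land at identical keys. A minimal sketch, assuming SHA-256 and a flat `archive/<digest><ext>` layout (the real bucket layout may differ):

```python
import hashlib

def archive_key(content: bytes, ext: str = ".pdf") -> str:
    """Content-addressed storage key: the SHA-256 digest of the bytes."""
    digest = hashlib.sha256(content).hexdigest()
    return f"archive/{digest}{ext}"
```

A nice side effect of content addressing: when two schools share the same physical file via one Google Drive link, both crawls resolve to the same key, so the archive deduplicates itself.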
The pipeline has five stages:
- Schema pipeline extracts the canonical field definitions from the official XLSX template
- Corpus pipeline builds the school list from IPEDS data and probes for CDS landing pages
- Discovery pipeline crawls IR pages, archives source files to storage
- Extraction pipeline routes each document to a format-specific extractor (fillable PDFs via AcroForm, flat PDFs via Docling)
- Consumer pipeline exposes everything through a public REST API
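The routing decision in stage four can be sniffed from the first bytes of the file. This is a sketch, not the pipeline's actual code: the extractor labels are placeholders, and the crude `/AcroForm` substring check stands in for a real PDF-library inspection (pypdf, Docling, etc.).

```python
def route(doc: bytes) -> str:
    """Pick a format-specific extractor from a document's magic bytes."""
    if doc.startswith(b"PK\x03\x04"):
        # XLSX and DOCX are OOXML containers, which are ZIP files.
        return "xlsx_or_docx"
    if doc.startswith(b"%PDF"):
        # Fillable PDFs keep every answer in a named AcroForm field;
        # flattened PDFs have lost that structure and need layout
        # analysis (Docling) instead.
        return "acroform" if b"/AcroForm" in doc else "docling"
    return "html_or_unknown"
```

The split matters because the two PDF paths have very different costs: AcroForm extraction is a lossless field read, while layout analysis of a flattened PDF is best-effort reconstruction.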
Open source
The entire project is open source under the MIT license. The code, the schema, the extraction pipeline, and the archived documents are all public.
- GitHub repository
- Public API
- CDS Initiative (the original template publisher)
Credits
Built on Supabase (Postgres, Edge Functions, Storage). Extraction powered by Docling for flattened PDFs. Reducto reference extracts used as a quality benchmark.
The “Common” in Common Data Set is doing a lot of work. We're doing the rest.