Archival Repositories: Where’s the Data?

“Where can we find a comprehensive list of archival repositories in the United States?” This is a question Eira Tansey (University of Cincinnati) and I (Ben Goldman, Penn State University) asked in early 2016 when we started a project to map the vulnerabilities of American archives locations to the future impacts of climate change. With the amazing help of geospatialists at Penn State (Nathan Piekielek, Geospatial Services Librarian, and Tara Mazurczyk, Ph.D. Candidate, Department of Geography), we explored how sea level rise, storm surge, temperature fluctuations and increased precipitation might affect 1,232 archival locations in the continental United States. Eira and I shared initial findings at the Research Forum of the 2017 Society of American Archivists’ Annual Conference, and submitted (with our collaborators) a manuscript for publication this past July.

While we are excited to share the results of this project with our archival colleagues and comrades, one of our lingering disappointments with this effort has been the lack of a comprehensive dataset to work with. The best option we could find came from OCLC’s ArchiveGrid (thanks to Bruce Washburn and Merrilee Proffit), which provided a useful dataset to conduct our research, but clearly did not fully reflect the size of the archival community. By contrast, IMLS’s Museum Universe data file contains over 30,000 entries. It became clear to us that in order to fully understand the future impacts of climate change on documentary heritage in the U.S., we needed better data.

Now, thanks to the Society of American Archivists Foundation, we can begin to pull together a better dataset. Eira and I were awarded a $5,000 Strategic Growth grant from SAA Foundation in May, and over the course of one year (July 2017 – June 2018) are attempting to find, aggregate, standardize and openly share a vast dataset on archival repository locations, with help from an amazing Research Assistant, Whitney Ray.

We hope to use this blog to share our progress and highlight interesting or useful information related to this effort. We welcome the wisdom and comments of our archival colleagues everywhere, so please don’t hesitate to reach out if you have any thoughts or ideas!





On April 26, I participated in the amazing Twitter conference organized by the Society of American Archivists’ Preservation Section. Below is my presentation. If you’d like to see more of the conference (and believe me, you would!), search Twitter for #presTC or visit the schedule here.

Hi #PresTC! I’m the research assistant for the @archivists_org Foundation-funded project to identify, gather, standardize, and make publicly accessible #repodata. This presentation was developed with the grant leads, Ben Goldman and Eira Tansey. 1/

There’s no comprehensive dataset of repositories in the US, which puts the archival community at risk for the unevenly distributed effects of climate change. #presTC 2/

Our goal: find repository data. #presTC 3/

From August to December 2017, we contacted archival organizations, State Historical Records Advisory Boards, state archives, and regional groups. #presTC 4/

Data came in a variety of formats, including PDFs, links to archived websites, and lots of spreadsheets. Currently we have 16,326 clean entries #presTC 5/

We’re still processing data and hope to have every state represented in our final set. #presTC 6/

Our findings show just how many archival repositories go under the radar. For example, we’ve identified 240 repositories in FL. In the original dataset, there were 22. #presTC 7/



Our “after” map includes both physical and mailing addresses (total of 243 data pts). That dot in the gulf? It’s actually a PO Box in the FL Keys. #presTC 8/

While this set is still an approximate representation of #repodata, we hope that others can repurpose it. The data can also be used to identify vulnerable repos when used w/ other climate change tools. #presTC 9/

“What would happen to Wilmington, NC, archival repositories during a Category-1 storm surge and a Category-5 storm surge? Making good use of Eira Tansey’s maps and the data Ben Goldman, Eira, and I have been collecting. Archives are the green dots.”


To compare a single repository address with potential sea level rise and flood frequency, you can use the excellent sea-level rise viewer from @noaa: #presTC 10/

Our work also confirms the suspicion that the universe of archives is much, much bigger than is typically acknowledged #presTC 11/

Many thanks to @archivists_org Foundation, @SAApreservation, and the archival groups who shared their data! To learn more, you can visit our blog. I’m around to answer questions through the week. #presTC 12/12

Data Decisions

Like many archivists, I have experienced the frustrations and satisfactions that attend the process of moving your institution’s archival data from one system to another, which typically demands a reckoning with past descriptive practices (no judgment; practices evolve) and a better understanding of modern encoding and descriptive standards. My participation in such efforts has given me a deep respect for archivists who understand the ins and outs of these descriptive frameworks and encoding standards.

Alas, I am not one of those archivists (**please keep this in mind as I try my best to explain below our understanding of archival data approaches). So in planning our grant, we spent some time consulting standards documents and peer experts to ensure that the final dataset models repository location data in a way that supports maintenance, reuse, and portability to other descriptive frameworks.

Current archival descriptive practice seems somewhat inconclusive (or unemphatic, perhaps?) when it comes to encoding repository information. Describing Archives: A Content Standard (DACS) guidelines for encoding repository information are fairly rudimentary (see 2.2), suggesting only that an address and contact information may be “desirable”. Section 5 of the International Standard for Describing Institutions with Archival Holdings (ISDIAH) provides considerable guidance on describing a repository’s name, parent institution, identifier, location, services, and mission. One noteworthy area of that section is 5.6 (Control Area), which suggests identifying the sources of your data and the dates of creation or revision (thanks to Mike Rush for pointing this out to us). ISDIAH, we learned, is now (or soon to be) considered superseded by the International Council on Archives’ Records in Contexts (RiC), which treats repositories as Agents (corporate bodies). Aspects of ISDIAH have seemingly been translated to RiC’s section 3, Properties of Entities: specifically, 3.1 (for names and identifiers), 3.7 (for types of agents), 3.10 (for contact information) and 3.19 (“Properties of Place”, which includes location information and even geographic coordinates).

For a real life example of how all this might translate to an Agent record in Encoded Archival Context (EAC), see below (thanks to Robbie Hott, who helpfully shared some data samples from the Social Networks and Archival Context (SNAC) project):

[Screenshot: sample Agent record in EAC, from the SNAC data samples]

Something not really addressed in these examples is the type of archival repository. As we’ve discussed, definitions of “archive” are hard to pin down, but it’s something we’ve been trying to give attention to as we work through the data.

Building on these descriptive approaches, we initially wanted to establish the following data on all repositories:

  • Repository Name
  • Parent Institution Name
  • Authorized Name
  • Authorized Name Source
  • Repository Identifier
  • Repository Identifier Source
  • Location Type (e.g. Mailing Address, Street Address)
  • Street Address 1
  • Street Address 2
  • City
  • State/Province
  • Zip Code
  • Country
  • Longitude
  • Latitude
  • Data Source
  • Date of Entry
  • Name of Person Recording the Entry
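To make the field list above a little more concrete, here is a minimal sketch of how one entry might be modeled in Python. The class name, attribute names, and the sample values are our own illustrative assumptions, not a published schema; only the set of fields comes from the list above.

```python
from dataclasses import dataclass, asdict, fields
from typing import Optional


@dataclass
class RepositoryEntry:
    """One row of the planned dataset; attribute names are illustrative."""
    repository_name: str
    parent_institution_name: Optional[str] = None
    authorized_name: Optional[str] = None
    authorized_name_source: Optional[str] = None
    repository_identifier: Optional[str] = None
    repository_identifier_source: Optional[str] = None
    location_type: Optional[str] = None  # e.g. "Mailing Address" or "Street Address"
    street_address_1: Optional[str] = None
    street_address_2: Optional[str] = None
    city: Optional[str] = None
    state_province: Optional[str] = None
    zip_code: Optional[str] = None  # kept as text so leading zeros survive
    country: Optional[str] = None
    longitude: Optional[float] = None
    latitude: Optional[float] = None
    data_source: Optional[str] = None
    date_of_entry: Optional[str] = None  # assumed to be an ISO 8601 date string
    recorded_by: Optional[str] = None


# Column order for a flat export follows the order of the list above.
COLUMN_NAMES = [f.name for f in fields(RepositoryEntry)]

# A made-up entry, purely for illustration.
example = RepositoryEntry(
    repository_name="Example County Historical Society",
    location_type="Street Address",
    city="Exampleville",
    state_province="OH",
    zip_code="00000",
    country="US",
)
row = asdict(example)
```

Leaving every field except the name optional reflects the reality that most source lists will only fill in a handful of these columns.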

We quickly realized that separately gathering authorized names and identifiers from a source like the Library of Congress’s MARC organization codes would be too labor intensive. Longitude and latitude are absolutely critical pieces of data, but not something that was likely to be included in existing datasets. Instead of gathering that data now, our plan is to auto-generate the geocoordinates using a service like Geocodio at the end of our project. We were also surprised (though probably should not have been) to discover that some data sources included both a mailing and street address for some repositories. In these cases, we have created two separate entries for the same repository, and created a new field in our data for “location type” to distinguish separate entries for the same repository.
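The two-entries-per-repository approach can be sketched roughly as follows. The input layout (a dict with optional "street_address" and "mailing_address" keys) and the function name are hypothetical; the source files we received used many different layouts, and the sample row is made up.

```python
def split_by_location_type(source_row):
    """Return one entry per address found in a source row.

    `source_row` is a hypothetical dict with optional "street_address" and
    "mailing_address" keys; every other key is copied to each entry.
    """
    shared = {k: v for k, v in source_row.items()
              if k not in ("street_address", "mailing_address")}
    entries = []
    for key, label in (("street_address", "Street Address"),
                       ("mailing_address", "Mailing Address")):
        address = (source_row.get(key) or "").strip()
        if address:  # skip missing or blank addresses
            entries.append({**shared, "location_type": label, "address": address})
    return entries


# A made-up source row with both kinds of address.
sample = {
    "repository_name": "Example Keys Historical Society",
    "street_address": "1 Example Street",
    "mailing_address": "P.O. Box 000",
}
entries = split_by_location_type(sample)
```

Keeping the repository name identical on both entries means the pair can still be recognized as one repository later, while the location type tells a map which point is the building.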

(Geocodio itself provides some interesting data considerations, including the concept of accuracy. For instance, it still generates coordinates for P.O. boxes, but encodes that location as a place rather than a point, and then provides an accuracy score. Geocodio also auto-generates county names, and normalizes address information, which will also come in handy for our project.)
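Because P.O. boxes end up as approximate places on a map, it may help to flag them before geocoding. The sketch below is one possible way to do that with a regular expression; it does not use Geocodio’s API, and the pattern and function name are our own assumptions that would surely miss some real-world spellings.

```python
import re

# Matches common spellings such as "P.O. Box", "PO Box", "P O Box", "Post Office Box".
PO_BOX = re.compile(r"\bP\.?\s*O\.?\s*Box\b|\bPost\s+Office\s+Box\b", re.IGNORECASE)


def looks_like_po_box(address):
    """Return True when the address text appears to contain a P.O. box."""
    return bool(PO_BOX.search(address or ""))


flags = [looks_like_po_box(a) for a in (
    "P.O. Box 000",          # made-up addresses for illustration
    "po box 12",
    "Post Office Box 9",
    "1 Example Street",
)]
```

Entries flagged this way could be given the “Mailing Address” location type, so that anyone mapping the data knows the point may not be where the records actually sit.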

In the end, our final dataset should bring together the name and location elements we’ve described here, aggregated into a single CSV file, which should lend itself to being migrated to other formats and integrated into other systems as needed. We’ll also make the final data accessible through an Open Science Framework or Github repository so that others can continue the work we’ve begun. Other data sources will come to light, or repositories themselves may potentially find errors in our data that require updating. Ongoing maintenance will be required, which is why we hope this data can be stewarded by SAA once we’re done. Regardless of the ultimate disposition, this should be data the community of archival professionals and related stakeholders can make use of in the future.
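As a rough sketch of what “aggregated into a single CSV file” could look like in practice, the example below maps differently named source columns onto one set of standard columns and writes everything to a single CSV. The header variants, column names, source labels, and sample rows are all hypothetical; only Python’s standard csv module is used.

```python
import csv
import io

# Hypothetical header variants mapped onto standard column names.
HEADER_MAP = {
    "name": "repository_name",
    "institution": "repository_name",
    "repository": "repository_name",
    "town": "city",
    "city": "city",
    "st": "state",
    "state": "state",
    "zip": "zip_code",
    "zipcode": "zip_code",
    "postal code": "zip_code",
}
COLUMNS = ["repository_name", "city", "state", "zip_code", "data_source"]


def standardize(row, data_source):
    """Map one source row onto the standard columns and record its source."""
    out = {column: "" for column in COLUMNS}
    for header, value in row.items():
        column = HEADER_MAP.get(header.strip().lower())
        if column:  # unknown headers are ignored in this sketch
            out[column] = (value or "").strip()
    out["data_source"] = data_source
    return out


def aggregate(sources):
    """Combine rows from every source into one CSV string.

    `sources` maps a source label to a list of row dicts.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for label, rows in sources.items():
        for row in rows:
            writer.writerow(standardize(row, label))
    return buffer.getvalue()


# Two made-up sources with different column headers.
combined = aggregate({
    "state_directory": [
        {"Name": "Example Archive", "Town": "Exampleville", "ST": "OH", "Zip": "00000"},
    ],
    "regional_list": [
        {"Institution": "Sample Museum", "City": "Sampleton", "State": "FL", "Postal Code": "00001"},
    ],
})
```

Recording the data source on every row follows the ISDIAH suggestion mentioned earlier, and it should make later corrections easier to trace back to where an entry came from.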

Making Archives Visible Through Maps

As we previously noted, the only existing open data set for archival repositories – OCLC’s ArchiveGrid – lacks representation of many small archives, historical societies, and other nebulously defined archives. As many of you know, inclusion in ArchiveGrid is primarily driven by having various descriptive data (MARC records, EAD finding aids, etc.) online and crawlable by OCLC. This means that repositories with professional archivists on staff and the resources to make archival description available online are over-represented in the ArchiveGrid data set. In reality, there are many archives that don’t fit this description, and are therefore literally invisible to much of the profession.

This has been frustrating to us as we pursue our work on archival vulnerability to climate change. The institutions that are most at risk from sea-level rise and climate-change-influenced disasters are also the least likely to have professional staff and sufficient resources to sustain archival collections even in “normal” times – let alone during an emergency. And yet, these are the archives that weren’t visible in our first pass at mapping repository vulnerability to climate change.

But now we’d like to show you the dramatic way in which our research project has uncovered how many archives exist – even if they aren’t putting their finding aids online.

This is the “Before” map, reflecting OCLC’s data – according to ArchiveGrid as of 2016, there are approximately 44 repositories in the state of Ohio:


Although this data is not yet final, this is our beta data set for Ohio – i.e., our “After” map. You can see a dramatic difference in how many more archives have been revealed thanks to our efforts (and especially those of Whitney, our fantastic research assistant, who has done the heavy lifting in reaching out to archival organizations to compile and clean data). According to our preliminary* data, there are well over 500 repositories in the state of Ohio.


I want to highlight that constructing archives as those repositories that participate in networked archival descriptive infrastructure tends to erase the visibility of small archives, especially those outside of major population centers. Let’s use southeastern Ohio – aka Appalachia – as an example.

The light-green counties are those that are part of the federally-defined Appalachian Regional Commission’s jurisdiction. (Clearly there are cultural constructions of Appalachia that do not fit in with these county delineations, but those aren’t as easy to find as open GIS data!)

In the “before” map, only 3 archives exist in Ohio’s Appalachian counties, and they are all associated with higher education: Marietta College, Youngstown State, and Ohio University.


But in the “after” map, we see that there are roughly 100 (100!!!!!!!!!!) archives in Ohio’s Appalachian counties. Why the massive difference? Because our efforts to get as much data as possible from local, regional and state archival organizations mean we have pulled in dozens of small historical societies, public libraries, and museums.
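A before-and-after comparison like this boils down to counting entries whose county falls inside a region. Here is a minimal sketch, assuming each entry already carries a county name (for instance from a geocoding step). The county set is an illustrative two-county subset rather than the full Appalachian Regional Commission list, and every entry other than Ohio University and Marietta College is made up.

```python
from collections import Counter

# Illustrative subset only; the full county list would come from open GIS data.
APPALACHIAN_COUNTIES = {"Athens", "Washington"}


def count_in_region(entries, counties):
    """Count entries whose county is in `counties`, grouped by county."""
    return Counter(e["county"] for e in entries if e.get("county") in counties)


before = [
    {"name": "Ohio University", "county": "Athens"},
    {"name": "Marietta College", "county": "Washington"},
    {"name": "Example University Archives", "county": "Franklin"},  # made up
]
after = before + [
    {"name": "Example Village Historical Society", "county": "Athens"},  # made up
    {"name": "Example Public Library", "county": "Washington"},          # made up
    {"name": "Example Township Museum", "county": "Athens"},             # made up
]

before_counts = count_in_region(before, APPALACHIAN_COUNTIES)
after_counts = count_in_region(after, APPALACHIAN_COUNTIES)
```

Summing the values of each counter gives the regional totals for the two maps, and the per-county breakdown shows where the newly visible archives cluster.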


We haven’t done before and after comparisons yet with other states, but I anticipate they would look very similar to what we’ve seen with Ohio. Building the first comprehensive data set of US repositories is no small task, but we think the preliminary results speak for the importance of our work.

*We say preliminary because we still have some cleaning and minor de-duplication tasks left with our data.


Archivists Seeking Data

We have reached out to a seriously large number of archival organizations/societies/consortia since beginning this effort, and our data grows by the week. However, there are a number of organizations we have yet to hear from and our mental archival map of the United States still has some state-sized holes in it. In some cases, we’ve located older online directories, but have not been able to confirm how current they are.

Besides the vagaries of email spam filters, we can think of any number of reasons that these 8 or so organizations would not have had a chance to reach out to us, and we’re mindful that archivists everywhere struggle for time and resources, while trying to meet the everyday demands of their researchers and other stakeholders. BUT…

If you are involved in any of the following organizations, or are close buddies with someone who is, please reach out to us and help us put the finishing touches on this massive data collection effort! We’d really appreciate it.

The list:

  • Association of Tribal Archives, Libraries, and Museums
  • Conference of California Historical Societies
  • Friends of the D.C. Archives
  • Indiana State Historical Records Advisory Board/ Indiana Archives and Records Administration
  • L.A. as Subject (has a directory online, but we haven’t been able to connect)
  • North Carolina Museums Council
  • Portland Area Archives (has a list of local organizations online but no listed contact information to confirm it’s up to date)
  • Society of Indiana Archivists

A Deluge of Data: Our Mid-year Update

As the remaining calendar days of 2017 dwindle, Eira, Whitney, and I have begun to take stock of our data collection efforts and start planning for the next phase of this project: the wrangling. And boy do we have a lot of data to work through.

In a previous blogpost, Whitney shared details on the outreach phase of our project, and we’d like to share some numbers on our progress. To date, we have contacted 145 archival organizations seeking any data they may have on member institutions and address information. This includes:

  • 51 State Historical Records Advisory Boards (SHRABs) and/or State Archives (including District of Columbia)
  • 11 regional archives associations (multi-state, in most cases), such as the Midwest Archives Conference, or Society of Rocky Mountain Archivists
  • 7 national archives organizations, which tended to be organized by the type of repository (e.g. the Association of Tribal Archives, Libraries, and Museums, and Archivists for Congregations of Women Religious)
  • 45 state-level archives associations (e.g. the Consortium of Iowa Archivists or the Arizona Archivists Alliance)
  • 29 metropolitan/regional archives groups (or areas generally smaller than a state), such as the Miami Valley Archivists Roundtable or Chicago Area Medical Archivists
  • 2 individual repositories

Thus far, the response has been encouraging. We have received responses with some form of data from 113 organizations, and 14 organizations directed us to other sources for directories. The remaining organizations have either told us they’re working on compiling a directory or list, or we can’t get a hold of them.

A very, very rough estimate is that we have collected data on over 34,533 archival repositories. We are certain that this number includes duplicate data for some repositories, but we won’t be able to ascertain how much overlap there is until we dig in further. Still, we are really pleased with this number, which we feel is more broadly representative, in terms of geography and repository type/size, of our professional institutions in the U.S.
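One way we might start measuring that overlap is to compare entries on a normalized key. The sketch below is not our finished method: the choice of key (name, ZIP code, and location type), the field names, and the sample entries are all assumptions for illustration.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", (text or "").lower())
    return re.sub(r"\s+", " ", text).strip()


def dedupe(entries):
    """Keep the first entry seen for each (name, ZIP, location type) key."""
    seen = set()
    unique = []
    for entry in entries:
        key = (
            normalize(entry.get("name")),
            normalize(entry.get("zip_code")),
            normalize(entry.get("location_type")),
        )
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique


# Made-up entries: the first two differ only in punctuation, case, and spacing.
entries = [
    {"name": "Example Co. Historical Society", "zip_code": "00000", "location_type": "Street Address"},
    {"name": "example co historical society ", "zip_code": "00000", "location_type": "Street Address"},
    {"name": "Example Co. Historical Society", "zip_code": "00000", "location_type": "Mailing Address"},
]
unique = dedupe(entries)
```

Including the location type in the key keeps a repository’s mailing and street entries from being collapsed into one. Exact matching on a normalized key will still miss duplicates whose names are abbreviated differently, so some manual review would remain.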


The data itself… well, it’s all over the place. And while we expect that some additional locations data could continue to trickle in, our focus will shift in January toward examining the data we have and cleaning it up.

What is an archive?

“What is an archive?” is a seemingly simple question that started gnawing at us when we realized we needed better data that transcended the large research libraries and archives represented in ArchiveGrid. Significant archival material resides outside of the stewardship of a typical institutional archive with dedicated professional staff. Reflecting on my own personal experience, I can think of several examples of “archives without archivists,” which exist outside formal archives:

  • A village public library with one box of transcribed oral histories from a community project…
  • Crates of zines kept in a local radical infoshop…
  • Marriage records kept in the priest’s office of the local parish…

None of these are within formalized archives. Do they count as archives?

This question gets at the distinction between archives as a body of records, and archives as physically-located spaces where one can go to access archival records. Since it is difficult to determine every place in the world that might have “archives, meaning a body of records” (and honestly we would probably get overwhelmed…fast, because what institution wouldn’t meet this definition to some degree?) we are primarily concentrating on “archives as physically-located spaces.”

This then raises a second question…what about archives that have material records that are not documents? As Shannon Mattern recently demonstrated, there are many spaces and institutions which have archives of dirt, ice, and rocks. We even recently learned of a Society of American Archivists workshop that is happening at the Kentucky Geological Survey’s Well-Sample and Core Library. Now, all of those places certainly do have documentary records in the form of intake and catalog records about the materials. But the “stuff” of earthly elements is itself the archive.

As project research assistant Whitney Ray noted in her recent post, our primary approach for this project is recursively identifying all relevant “archival organizations” and using that data for our data set. These by and large represent places that both self-identify as “archives” and hold archives that constitute some kind of human-created documentary records, as opposed to natural-origin materials.

Trying to externally identify archival spaces that don’t self-identify as such, or obtaining comprehensive data about natural-origin material archives, was not part of our original plans. But these questions are now coming up for us on a regular basis. We’re still describing our project as “creating a comprehensive list of archival repositories in the United States,” but it’s clear to us that we probably need to explain in our final documentation that when we say “archival repositories,” what we generally mean is “places identified as archives that contain documentary record archives.”




Dear Archivist, or, How I learned to stop worrying and cold-called the U.S. archival community

Hi! My name is Whitney, and I’m very excited to be the Research Assistant helping Ben and Eira with RepoData. Our goal is to create a standardized, centralized, and interoperable data set of archival repositories in the United States. We will use this data set to create a map that depicts possible effects of climate change on archival repositories and their particular vulnerabilities. My job is to gather information about the existence and location of archival repositories.

In early September I began to reach out to archival organizations. I used the Directory of Archival Organizations on the website of the Society of American Archivists (SAA) and the list of groups in the Regional Archivists Associations Consortium (RAACs). Using the most up-to-date contact information I could find, either on the SAA site or on the websites of the organizations, I emailed contacts for the archival organizations and RAACs. Since many overlapped, I used my best judgment in first contacting the overarching groups and then, if there appeared to be a gap in our data collection, the sub-groups.

I also began to reach out to state archivists from the Directory of State Archives and representatives from State Historical Records Advisory Boards (SHRABs). The two directories above formed the core of my outreach. However, I also reached out to groups that I found through either recommendation by the organizations or through links on their websites.

In total, I’ve contacted 111 archival groups and SHRAB affiliates. As a team, Ben, Eira, and I have collected information on about 18,000 repositories, although we suspect that a lot of these repositories are duplicate entries: more on this, and what we plan to do about it, in a different blog post.

For me, this has been a learning opportunity in outreach and research on the profession. (My apologies again to the archivist in Oklahoma City whom I emailed twice as representative for two organizations!) It’s been great to see how resourceful archivists are in getting the word out about their collections through affiliation with a group and through descriptive information on their websites.

Next I’ll normalize data fields, categorize repositories, and find latitudinal and longitudinal coordinates. Our plan includes distinguishing between mailing addresses and the physical locations of repositories, particularly since mapping them can tell different stories about vulnerabilities to climate change.
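As a small illustration of what normalizing data fields might involve, here is a sketch that tidies state and ZIP code values. The function names are hypothetical, the state lookup is only a three-state excerpt, and the ZIP handling assumes values arrive as text or whole numbers (a spreadsheet that exported ZIP codes as decimals would need extra care).

```python
# Excerpt only; a full table would cover every state and territory.
STATE_ABBREVIATIONS = {"ohio": "OH", "florida": "FL", "north carolina": "NC"}


def normalize_state(value):
    """Return a two-letter abbreviation when the value can be recognized."""
    cleaned = " ".join(value.replace(".", "").split())
    if len(cleaned) == 2 and cleaned.isalpha():
        return cleaned.upper()
    return STATE_ABBREVIATIONS.get(cleaned.lower(), cleaned)


def normalize_zip(value):
    """Keep ZIP codes as text and restore leading zeros lost to spreadsheets."""
    text = str(value).strip()
    digits = "".join(ch for ch in text if ch.isdigit())
    if 0 < len(digits) <= 5:
        return digits.zfill(5)
    if len(digits) == 9:
        return f"{digits[:5]}-{digits[5:]}"  # ZIP+4 form
    return text  # leave anything unexpected untouched for manual review


states = [normalize_state(s) for s in ("ohio", " Fl ", "N.C.", "North  Carolina")]
zips = [normalize_zip(z) for z in (2139, "45221-0033", "")]
```

Values that cannot be recognized are passed through unchanged rather than guessed at, so they can be spotted and fixed by hand.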