Completing RepoData (for now)

Over the last few months, we’ve been tying up a few loose ends with the RepoData project. These have included:

  • finalizing all the state data (https://github.com/tanseyem/RepoData)
  • working with our friends Ed Summers and Hillel Arnold to aggregate the state data into CSV and JSON files in our GitHub repository
  • creating publicly-accessible documentation about the project (https://osf.io/cft8r/)

I’m happy to announce that we’ve finally completed the last step – making all the data available in a publicly accessible map (https://arcg.is/15mC55). Don’t worry – the actual map does include Alaska and Hawaii!

All told, the Repository Data project gathered 25,000+ data points. Over 18,000 are depicted in the map linked above. We did not include mailing addresses or repositories without street addresses in the map above (since neither is geocodable).
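The map filter described above can be sketched in a few lines of Python. This is only an illustration, not the process we actually ran: the column names (`location_type`, `street_address_1`, and so on) and the sample rows are assumptions invented for the example, not the real headers or data in our files.

```python
import csv
import io

# Hypothetical sample rows; the column names are assumptions for this
# sketch and may not match the actual RepoData file headers.
SAMPLE = """\
repository_name,location_type,street_address_1,city,state
Example Historical Society,Street Address,12 Main St,Athens,OH
Example Historical Society,Mailing Address,PO Box 5,Athens,OH
Example Parish Archive,Street Address,,Marietta,OH
"""


def mappable_rows(rows):
    """Keep only street-address entries that actually have a street address."""
    return [
        row for row in rows
        if row["location_type"].strip().lower() == "street address"
        and row["street_address_1"].strip()
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
kept = mappable_rows(rows)
print(len(rows), "rows in,", len(kept), "rows kept for the map")
```

In this toy example the mailing-address entry and the entry with a blank street address are both dropped, which mirrors the two exclusions mentioned above.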

What’s next?

Ultimately we would like to ensure that someone takes over long-term stewardship of the RepoData data set, and we’d love to eventually see the data integrated into a map similar to MuseumStat (http://museumstat.org). Unfortunately, the Society of American Archivists (the original funder of the grant which allowed us to create RepoData) is not currently in a position to take over maintenance of the data set, but perhaps this might change in the future. In the meantime, we remain interested in hearing from the larger archivist community about how we might maintain this data and how you are using it.

Tracking Hurricane Florence

We’re at the height of hurricane season, and many people are rightfully concerned about the forecast associated with Hurricane Florence.

While we are still making our way through the data for the remaining states, fortunately we have data available for the current “cone of uncertainty” – Georgia, South Carolina, North Carolina and Virginia.

I made this map that shows the archives in the potential path of Hurricane Florence (at least as of Wednesday afternoon).

I pulled this data into ArcGIS Online, and added three layers of information from the National Oceanic and Atmospheric Administration’s information service. NOAA’s NowCoast system provides real time updates, so if you go back and visit this map periodically during the hurricane, the NOAA layers will look different each time. You can also toggle the layers on and off.

The three NOAA layers are as follows (screenshots are for illustrative purposes only; because the layers are “live,” they will look different depending on when you load the map):

  • Hurricane Forecast Track (explanation here). This layer updates every 10 minutes and tells you the present location of the hurricane, its forecast track, and estimated wind speeds.

Florence-map-1

  • Potential Storm Surge Flooding map (explanation here). The update times on this vary, but once a hurricane event is underway, the layer updates several times a day. According to NOAA, this layer “depicts the geographical areas where inundation from storm surge could occur along with the heights, above ground, that water could reach in those areas. These potential heights are represented with different colors based on water level: 1) Greater than 1 foot above ground (blue), 2) Greater than 3 feet above ground (yellow), 3) Greater than 6 feet above ground (orange), and 4) Greater than 9 feet above ground (red). Two versions of this graphic are provided in this map–one with a mask (depicted in gray) identifying Intertidal Zone/Estuarine Wetland areas, and another version without the mask where Intertidal Zone/Estuarine Wetland areas are symbolized with the same colors as other areas.” You have to zoom in closer than for the other layers to see this one display. The screenshot below shows Wilmington, North Carolina.

Florence-map-2

  • Long Duration Hazardous Weather (explanation here). This layer updates every 5 minutes and is the layer that tells you if a certain area (typically county-level) is under a warning, watch, or other weather advisory.

Florence-map-3

The sources for the current set of archival repository locations in VA, NC, SC, and GA were the Federation of North Carolina Historical Societies, North Carolina Digital Heritage Center, Society of North Carolina Archivists, North Carolina Historical Records Advisory Board, Mid-Atlantic Regional Archives Conference, Georgia Historical Records Advisory Council, Society of Georgia Archivists, Virginia Association of Museums, Library of Virginia, Pennsylvania Historical and Museum Commission, SAA@CUA, Historical Society of Washington, D.C., Charleston Archives, Libraries and Museum Council, and OCLC ArchiveGrid. While these sources represent a wide geographic area within these states, they unfortunately leave some gaps across much of South Carolina outside of Charleston. We have additional data for South Carolina we are currently working through as we can, but unfortunately it is somewhat messy and time-consuming to process.

These maps should be used for informational purposes only and not emergency planning. We do hope our comrades in the profession find them useful. If you have any questions or suggestions, please contact me (Eira Tansey) via email.

 

Making Our Data Publicly Available (version 0.1)

Over the weeks and months we’ve been sharing details of our effort, we’ve received many inquiries from interested people asking questions like, “Is my archive represented in your data?” or “What repositories have you found for Florida?” or “When can I view the data you’ve collected?” We even had one inquiry from a private individual who was looking for a good historical society to donate material to! All of these inquiries have been welcome, and we’re pleased to hear of any interest in our work. Now we’d like to begin the rollout of data, but some caveats are in order.

First, we’re not yet done. This is a very preliminary and imperfect initial release. Consider this first phase a Beta release, which we are sharing in conjunction with a public presentation of our work at this year’s RBMS Conference in New Orleans. This first release includes data on approximately 12,000 repository locations from 30 states, plus the District of Columbia.

Second, while much of the state data we’re sharing represents the entirety of the data we gathered from our 100+ sources, some of these states have additional sources that we have yet to incorporate. In all these cases, the unincorporated data is not structured or formatted in a way that lends itself to easy manipulation (for example, a list of organizations and their websites, which requires a fair amount of searching and browsing/cutting and pasting (oh, the glamorous data work!) to get into our spreadsheets).

We will happily share a larger set of data in the coming months as we wrap up other aspects of our work, and those releases will include additional information on our project. For now, if you’d like to review this initial release, you can visit our GitHub repository at: https://github.com/tanseyem/RepoData. While we welcome any corrections or additions you may be able to offer, we may not be able to address such feedback until after our remaining project work is done.

Below is a list of the states (and one district) we are sharing with this release:

  • Alabama
  • Arkansas
  • California
  • Colorado
  • Connecticut
  • Delaware
  • Florida
  • Georgia
  • Hawaii
  • Idaho
  • Illinois
  • Indiana
  • Kansas
  • Kentucky
  • Louisiana
  • Maryland
  • Michigan
  • Minnesota
  • Mississippi
  • Nebraska
  • North Carolina
  • Ohio
  • Oklahoma
  • Tennessee
  • Texas
  • Utah
  • Virginia
  • Washington
  • Washington, D.C.
  • West Virginia
  • Wyoming

#presTC

On April 26, I participated in the amazing Twitter conference organized by the Society of American Archivists’ Preservation Section. Below is my presentation. If you’d like to see more of the conference (and believe me, you would!), search Twitter for #presTC or visit the schedule here.

Hi #PresTC! I’m the research assistant for the @archivists_org Foundation-funded project to identify, gather, standardize, and make publicly accessible #repodata. This presentation was developed with the grant leads, Ben Goldman and Eira Tansey. 1/

There’s no comprehensive dataset of repositories in the US, which puts the archival community at risk for the unevenly distributed effects of climate change. #presTC 2/

Our goal: find repository data. #presTC 3/

From August to December 2017, we contacted archival organizations, State Historical Records Advisory Boards, state archives, and regional groups. #presTC 4/

Data came in a variety of formats, including PDFs, links to archived websites, and lots of spreadsheets. Currently we have 16,326 clean entries #presTC 5/

We’re still processing data and hope to have every state represented in our final set. #presTC 6/

Completed-ishStates_2018-04-23

Our findings show just how many archival repositories go under the radar. For example, we’ve identified 240 repositories in FL. In the original dataset, there were 22. #presTC 7/

Our “after” map includes both physical and mailing addresses (total of 243 data pts). That dot in the gulf? It’s actually a PO Box in the FL Keys. #presTC 8/

RepoData_2018_AFTER_FloridaData

While this set is still an approximate representation of #repodata, we hope that others can repurpose it. The data can also be used to identify vulnerable repos when used w/ other climate change tools. #presTC 9/

“What would happen to Wilmington, NC, archival repositories during a Category-1 storm surge and a Category-5 storm surge? Making good use of Eira Tansey’s maps and the data Ben Goldman, Eira, and I have been collecting. Archives are the green dots.”

RepoData_NorthCarolina_2018-03-16_Wilmington

To compare a single repository address with potential sea level rise and flood frequency, you can use the excellent sea-level rise viewer from @noaa: https://coast.noaa.gov/slr/ #presTC 10/

Our work also confirms the suspicion that the universe of archives is much, much bigger than is typically acknowledged #presTC 11/

Many thanks to @archivists_org Foundation, @SAApreservation, and the archival groups who shared their data! To learn more, you can visit our blog: https://repositorydata.wordpress.com/ I’m around to answer questions through the week. #presTC 12/12

Data Decisions

Like many archivists, I have experienced the frustrations and satisfactions that attend the process of moving your institution’s archival data from one system to another, which typically demands a reckoning with past descriptive practices (no judgment; practices evolve) and a better understanding of modern encoding and descriptive standards. My participation in such efforts has given me a deep respect for archivists who understand the ins and outs of these descriptive frameworks and encoding standards.

Alas, I am not one of those archivists (please keep this in mind as I try my best to explain below our understanding of archival data approaches). So in planning our grant, we spent some time consulting with standards documents and peer experts to ensure that the final dataset models repository locations data in a way that supports maintenance, reuse, and portability to other descriptive frameworks.

Current archival descriptive practice seems somewhat inconclusive (or unemphatic, perhaps?) when it comes to encoding repository information. Describing Archives: A Content Standard (DACS) guidelines for encoding repository information are fairly minimal (see 2.2), suggesting only that an address and contact information may be “desirable”. Section 5 of the International Standard for Describing Institutions with Archival Holdings (ISDIAH) provides considerable guidance on describing a repository’s name, parent institution, identifier, location, services, and mission. One noteworthy area of that section is 5.6 (Control Area), which suggests identifying the sources of your data and the dates of creation or revision (thanks to Mike Rush for pointing this out to us). ISDIAH, we learned, is now (or soon to be) considered superseded by the International Council on Archives’ Records in Contexts (RiC), which treats repositories as Agents (corporate bodies). Aspects of ISDIAH have seemingly been translated to RiC’s section 3, Properties of Entities: specifically, 3.1 (for names and identifiers), 3.7 (for types of agents), 3.10 (for contact information) and 3.19 (“Properties of Place”, which includes location information and even geographic coordinates).

For a real life example of how all this might translate to an Agent record in Encoded Archival Context (EAC), see below (thanks to Robbie Hott, who helpfully shared some data samples from the Social Networks and Archival Context (SNAC) project):

Screen Shot 2018-03-16 at 12.06.04 PM

Something not really addressed in these examples is the type of archival repository. As we’ve discussed, definitions of “archive” are hard to pin down, but it’s something we’ve been trying to give attention to as we work through the data.

Building on these descriptive approaches, we initially wanted to establish the following data on all repositories:

  • Repository Name
  • Parent Institution Name
  • Authorized Name
  • Authorized Name Source
  • Repository Identifier
  • Repository Identifier Source
  • Location Type (e.g. Mailing Address, Street Address)
  • Street Address 1
  • Street Address 2
  • City
  • State/Province
  • Zip Code
  • Country
  • Longitude
  • Latitude
  • Data Source
  • Date of Entry
  • Name of Person Recording the Entry

We quickly realized that separately gathering authorized names and identifiers from a source like the Library of Congress’s MARC organization codes would be too labor intensive. Longitude and latitude are absolutely critical pieces of data, but not something that was likely to be included in existing datasets. Instead of gathering that data now, our plan is to auto-generate the geocoordinates using a service like Geocodio at the end of our project. We were also surprised (though probably should not have been) to discover that some data sources included both a mailing and street address for some repositories. In these cases, we have created two separate entries for the same repository, and created a new field in our data for “location type” to distinguish separate entries for the same repository.

(Geocodio itself provides some interesting data considerations, including the concept of accuracy. For instance, it still generates coordinates for P.O. boxes, but encodes that location as a place rather than a point, and then provides an accuracy score. Geocodio also auto-generates county names, and normalizes address information, which will also come in handy for our project.)
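To make the “two entries, one repository” decision concrete, here is a rough Python sketch of how such records might be modeled. It is an illustration under assumptions, not our actual spreadsheet layout: the class name, field names, and the `entries_for` helper are all hypothetical, and the coordinates are simply left empty until a geocoding step fills them in.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RepositoryEntry:
    """One row per repository *location*; field names are hypothetical."""
    repository_name: str
    location_type: str            # "Street Address" or "Mailing Address"
    street_address_1: str
    city: str
    state: str
    zip_code: str
    data_source: str
    latitude: Optional[float] = None    # left empty until geocoding
    longitude: Optional[float] = None


def entries_for(name, source, city, state, zip_code,
                street=None, mailing=None) -> List[RepositoryEntry]:
    """Return one entry per address supplied, tagged by location type."""
    entries = []
    for location_type, address in (("Street Address", street),
                                   ("Mailing Address", mailing)):
        if address:
            entries.append(RepositoryEntry(
                name, location_type, address, city, state, zip_code, source))
    return entries


# A made-up repository that reported both kinds of address.
both = entries_for("Example Historical Society", "Example State Directory",
                   "Athens", "OH", "45701",
                   street="12 Main St", mailing="PO Box 5")
for entry in both:
    print(entry.location_type, "->", entry.street_address_1)
```

The sketch assumes both addresses share a city and zip code, which will not always hold in real data; a fuller model would carry those fields per address.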

In the end, our final dataset should bring together the name and location elements we’ve described here, aggregated into a single CSV file, which should lend itself to being migrated to other formats and integrated into other systems as needed. We’ll also make the final data accessible through an Open Science Framework or GitHub repository so that others can continue the work we’ve begun. Other data sources will come to light, or repositories themselves may potentially find errors in our data that require updating. Ongoing maintenance will be required, which is why we hope this data can be stewarded by SAA once we’re done. Regardless of the ultimate disposition, this should be data the community of archival professionals and related stakeholders can make use of in the future.
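As a rough illustration of what “aggregated into a single CSV file” and “migrated to other formats” could look like in practice, the Python sketch below combines per-state data and emits both CSV and JSON. Everything in it is assumed for the example: the state contents, the column names, and the function names are hypothetical, and in-memory strings stand in for real files.

```python
import csv
import io
import json

# Hypothetical per-state CSV contents, keyed by state abbreviation.
STATE_FILES = {
    "OH": "repository_name,city,state\nExample Society,Athens,OH\n",
    "FL": ("repository_name,city,state\n"
           "Example Museum,Key West,FL\n"
           "Example Library,Tampa,FL\n"),
}


def aggregate(state_files):
    """Read every state's CSV text and return one combined list of rows."""
    rows = []
    for state in sorted(state_files):
        rows.extend(csv.DictReader(io.StringIO(state_files[state])))
    return rows


def to_csv(rows):
    """Serialize rows back to CSV text, using the first row's columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


combined = aggregate(STATE_FILES)
combined_csv = to_csv(combined)
combined_json = json.dumps(combined, indent=2)
print(len(combined), "rows aggregated")
```

This only works cleanly when every state file shares the same columns, which is one reason settling on a common set of fields up front matters.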

Making Archives Visible Through Maps

As we previously noted, the only existing open data set for archival repositories – OCLC’s ArchiveGrid – lacks representation of many small archives, historical societies, and other nebulously-defined archives. As many of you know, inclusion in ArchiveGrid is primarily driven by having various descriptive data (MARC records, EAD finding aids, etc.) online and crawlable by OCLC. This means that repositories with professional archivists on staff and the resources to make archival description available online are over-represented in the ArchiveGrid data set. In reality, there are many archives that don’t fit this description, and are therefore literally invisible to much of the profession.

This has been frustrating to us as we pursue our work on archival vulnerability to climate change. The institutions that are most at risk for sea-level rise and climate-change-influenced disasters are also the least likely to have professional staff and sufficient resources to sustain archival collections even in “normal” times – let alone during an emergency. And yet, these are the archives that weren’t visible in our first pass at mapping repository vulnerability to climate change.

But now we’d like to show you the dramatic way in which our research project has uncovered how many archives exist – even if they aren’t putting their finding aids online.

This is the “Before” map, reflecting OCLC’s data – according to ArchiveGrid as of 2016, there are approximately 44 repositories in the state of Ohio:

ArchiveGrid_2016_BEFORE_OhioData

Although this data is not yet final, this is our beta data set for Ohio – i.e., our “After” map. You can see a dramatic difference in how many more archives have been revealed thanks to our efforts (and especially those of Whitney, our fantastic research assistant, who has done the heavy lifting in reaching out to archival organizations to compile and clean data). According to our preliminary* data, there are well over 500 repositories in the state of Ohio.

RepoData_2018_AFTER_OhioData

I want to highlight that constructing archives as those repositories that participate in networked archival descriptive infrastructure tends to erase the visibility of small archives, especially those outside of major population centers. Let’s use southeastern Ohio – aka Appalachia – as an example.

The light-green counties are those that are part of the federally-defined Appalachian Regional Commission’s jurisdiction. (Clearly there are cultural constructions of Appalachia that do not fit in with these county delineations, but those aren’t as easy to find as open GIS data!)

In the “before” map, only 3 archives exist in Ohio’s Appalachian counties, and they are all associated with higher education: Marietta College, Youngstown State, and Ohio University.

ArchiveGrid_2016_BEFORE_ARCcounties_OhioData

But in the after map, we see that there are roughly 100 (100!!!!!!!!!!) archives in Ohio’s Appalachian counties. Why the massive difference? Because our efforts to get as much data as possible from local, regional and state archival organizations mean we have pulled in dozens of small historical societies, public libraries, and museums.

RepoData_2018_AFTER_ARCcounties_OhioData

We haven’t done before and after comparisons yet with other states, but I anticipate they would look very similar to what we’ve seen with Ohio. Building the first comprehensive data set of US repositories is no small task, but we think the preliminary results speak for the importance of our work.

*We say preliminary because we still have some cleaning and minor de-duplication tasks left with our data.

Archivists Seeking Data

We have reached out to a seriously large number of archival organizations/societies/consortia since beginning this effort, and our data grows by the week. However, there are a number of organizations we have yet to hear from and our mental archival map of the United States still has some state-sized holes in it. In some cases, we’ve located older online directories, but have not been able to confirm how current they are.

Besides the vagaries of email spam filters, we can think of any number of reasons that these 8 or so organizations would not have had a chance to reach out to us, and we’re mindful that archivists everywhere struggle for time and resources, while trying to meet the everyday demands of their researchers and other stakeholders. BUT…

If you are involved in any of the following organizations, or are close buddies with someone who is, please reach out to us and help us put the finishing touches on this massive data collection effort! We’d really appreciate it.

The list:

  • Association of Tribal Archives, Libraries, and Museums
  • Conference of California Historical Societies
  • Friends of the D.C. Archives
  • Indiana State Historical Records Advisory Board/ Indiana Archives and Records Administration
  • L.A. as Subject (has a directory online, but we haven’t been able to connect)
  • North Carolina Museums Council
  • Portland Area Archives (has a list of local organizations online but no listed contact information to confirm it’s up to date)
  • Society of Indiana Archivists

A Deluge of Data: Our Mid-year Update

As the remaining calendar days of 2017 dwindle, Eira, Whitney, and I have begun to take stock of our data collection efforts and start planning for the next phase of this project: the wrangling. And boy do we have a lot of data to work through.

In a previous blogpost, Whitney shared details on the outreach phase of our project, and we’d like to share some numbers on our progress. To date, we have contacted 145 archival organizations seeking any data they may have on member institutions and address information. This includes:

  • 51 State Historical Records Advisory Boards (SHRABs) and/or State Archives (including District of Columbia).
  • 11 regional archives associations (multi-state, in most cases), such as the Midwest Archives Conference, or Society of Rocky Mountain Archivists
  • 7 national archives organizations, which tended to be organized by the type of repository (e.g. the Association of Tribal Archives, Libraries, and Museums, and Archivists for Congregations of Women Religious)
  • 45 state-level archives associations (e.g. the Consortium of Iowa Archivists or the Arizona Archivists Alliance)
  • 29 metropolitan/regional archives groups (or areas generally smaller than a state), such as the Miami Valley Archivists Roundtable or Chicago Area Medical Archivists
  • 2 individual repositories

Thus far, the response has been encouraging. We have received responses with some form of data from 113 organizations, and 14 organizations directed us to other sources for directories. The remaining organizations have either told us they’re working on compiling a directory or list, or we can’t get a hold of them.

A very, very rough estimate is that we have collected data on over 34,533 archival repositories. We are certain that this number includes duplicate data for some repositories, but we won’t be able to ascertain how much overlap there is until we dig into it further. Still, we are really pleased with this number, which we feel is more broadly representative, in terms of geography and repository type/size, of our professional institutions in the U.S.
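For a sense of what checking for overlap might involve, here is a minimal Python sketch of one possible approach: treat two rows as duplicates when their name, address, city, and state match after light normalization. This is an assumption about how one could do it, not a description of our cleaning process, and the column names and sample rows are made up.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(rows):
    """Keep the first row seen for each normalized name/address/city/state."""
    seen = set()
    unique = []
    for row in rows:
        key = (
            normalize(row["repository_name"]),
            normalize(row["street_address_1"]),
            normalize(row["city"]),
            row["state"].strip().upper(),
        )
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical rows: the first two differ only in case and punctuation.
rows = [
    {"repository_name": "Example Historical Society",
     "street_address_1": "12 Main St.", "city": "Athens", "state": "OH"},
    {"repository_name": "example historical society ",
     "street_address_1": "12 Main St", "city": "Athens", "state": "oh"},
    {"repository_name": "Example Historical Society",
     "street_address_1": "PO Box 5", "city": "Athens", "state": "OH"},
]
unique = deduplicate(rows)
print(len(rows), "rows in,", len(unique), "after de-duplication")
```

Exact matching like this will miss harder cases, such as the same repository listed under a slightly different name by two sources, so a real pass would still need manual review.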

 

The data itself… well, it’s all over the place. And while we expect that some additional locations data could continue to trickle in, our focus will shift in January toward examining the data we have and cleaning it up.

What is an archive?

“What is an archive?” is a seemingly simple question that started gnawing at us when we realized we needed better data that transcended the large research libraries and archives represented in ArchiveGrid. Significant archival material resides outside of the stewardship of a typical institutional archive with dedicated professional staff. Reflecting on my own personal experience, I can think of several examples of “archives without archivists,” which exist outside formal archives:

  • A village public library with one box of transcribed oral histories from a community project…
  • Crates of zines kept in a local radical infoshop…
  • Marriage records kept in the priest’s office of the local parish…

None of these are within formalized archives. Do they count as archives?

This question gets at the distinction between archives as a body of records, and archives as physically-located spaces where one can go to access archival records. Since it is difficult to determine every place in the world that might have “archives, meaning a body of records” (and honestly we would probably get overwhelmed…fast, because what institution wouldn’t meet this definition to some degree?) we are primarily concentrating on “archives as physically-located spaces.”

This then raises a second question…what about archives that have material records that are not documents? As Shannon Mattern recently demonstrated, there are many spaces and institutions which have archives of dirt, ice, and rocks. We even recently learned of a Society of American Archivists workshop that is happening at the Kentucky Geological Survey’s Well-Sample and Core Library. Now, all of those places certainly do have documentary records in the form of intake and catalog records about the materials. But the “stuff” of earthly elements is itself the archive.

As project research assistant Whitney Ray noted in her recent post, our primary approach for this project is recursively identifying all relevant “archival organizations” and using that data for our data set. These by and large represent places that both self-identify as “archives” and hold archives that constitute some kind of human-created documentary records, as opposed to natural-origin materials.

Trying to externally identify archival spaces that don’t self-identify as such, or obtaining comprehensive data about natural-origin material archives, was not part of our original plans. But these questions are now coming up for us on a regular basis. We’re still describing our project as “creating a comprehensive list of archival repositories in the United States,” but it’s clear to us that we probably need to explain in our final documentation that when we say “archival repositories,” what we generally mean is “places identified as archives that contain documentary record archives.”