GitHub Archive Program FAQ
What public repositories are getting archived?
On February 2, 2020 at 2 pm PT, we will begin snapshotting all of GitHub’s public repositories that have been active within recent months. Additionally, a team of chosen experts and advisors will identify important inactive projects to be added to the archive. To ensure your repository is included, update your repository, clean up your README, and push a commit sometime before February 2.
How do I create something new to ensure it’s added to the archive?
Create a new public repository, and add the software or content you’d like to have archived. Any new, free, public repository with recent commits in the months before February 2nd will be added to the GitHub Arctic Code Vault and available in partner archives through the GitHub Archive Program.
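As a rough illustration of the criteria above (public, with recent commits before the snapshot), here is a minimal Python sketch. The FAQ does not say how many months count as "recent", so the 90-day window and the helper name `looks_eligible` are assumptions made for this example, not part of the program's actual selection logic. The `private` and `pushed_at` fields follow the shape of repository metadata returned by the GitHub REST API.

```python
from datetime import datetime, timedelta, timezone

# Snapshot time from the FAQ: February 2, 2020 at 2 pm PT (PST is UTC-8).
SNAPSHOT = datetime(2020, 2, 2, 22, 0, tzinfo=timezone.utc)


def looks_eligible(repo: dict, window_days: int = 90) -> bool:
    """Guess whether a repository meets the stated criteria.

    `repo` is a dict with `private` (bool) and `pushed_at` (ISO 8601 UTC
    string). `window_days` is a hypothetical stand-in for "recent months".
    """
    # Only public repositories are archived; treat missing data as private.
    if repo.get("private", True):
        return False
    pushed = datetime.strptime(repo["pushed_at"], "%Y-%m-%dT%H:%M:%SZ")
    pushed = pushed.replace(tzinfo=timezone.utc)
    age = SNAPSHOT - pushed
    # The last push must fall inside the window and not after the snapshot.
    return timedelta(0) <= age <= timedelta(days=window_days)
```

The metadata for a real repository could be fetched from the REST API endpoint for a single repository and passed to this function unchanged. Inactive projects picked by the advisors are outside what this check covers.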
What does it mean for my project to be archived?
There are multiple layers to the GitHub Archive Program. First, we have community efforts like GH Archive and GH Torrent that provide “hot storage,” archiving GitHub events (like new Issues, updated pull requests, and much more) in near real time. Then, we work with partners such as the Internet Archive and the Software Heritage Foundation. These layers of “warm storage” crawl every public repository, complete with its history, issues, pull requests, and other metadata, archive them, and update them on the interval required by their layer (days, months, or years).
In addition to the warm layers, your repository will be stored in the GitHub Arctic Code Vault at the Arctic World Archive in Svalbard. The GitHub Arctic Code Vault stores the latest revision of your project’s default branch (potentially excluding large binary files depending on the overall size of the repository) at the time of capture. This “cold storage” is designed to last for 1,000 years.
What if I want to opt out?
GitHub will archive only public repositories, so opting out is as simple as making your repository private (which is free for all users).
Can I access my repository as part of each archive?
For the warm storage, you can access the archived copies of your repositories by going to the GitHub Archive Program site and following the content from each partner.
Due to the nature of the GitHub Arctic Code Vault and other cold storage, it is not possible for the public to access the code stored there. However, if you go to the GitHub Archive Program website once the data is stored, we will tell you how to figure out where your data is stored.
How do you handle repositories with dependencies, like libraries? What about build systems? Will you archive everything needed to build my repository?
We will archive the code at the state of HEAD on the default branch of your repository. If you include your dependencies within your repository, those will be included (with the exception of large binary files). The Tech Tree (see below) will also describe the importance of dependencies and how to locate dependencies within the various languages.
If your project’s dependencies are open source projects on GitHub, they will be automatically stored in the same way as your project (see question 2); otherwise, you need to add them into your repository or create a mirror on GitHub.
What happens if I find a bug or vulnerability in my project after it has been archived?
For the hot and warm archives, your code will be updated by each archive to include any fixes on their next refresh cycle. However, due to the nature of cold archives, your repository will be stored there in the state it is in on February 2nd, 2020.
Are the repositories in the GitHub Arctic Code Vault at the Arctic World Archive ever updated?
We plan to evaluate the program, and the state of the art of archival technology, every five years. Depending on the results of each evaluation, we may then decide to take another snapshot of GitHub’s public code and archive it in cold storage.
If I delete my repositories from GitHub, will they eventually be deleted from all warm storage partners?
Keeping a historic view is an important part of each archive. If you have a concern about your repository continuing to be a part of the archive, please contact the archives. For the GitHub Arctic Code Vault, we are unable to remove data that has already been stored.
Can you explain how GDPR works for repositories in the cold or warm storage, for example for personal data in the git commit message (author name and email)? Have you reviewed this with your legal team and cleared all open source projects of any liabilities?
The cold storage contains only a snapshot of the code, so individual commit messages are not captured, although each repository also has a list of contributors appended to it. Warm storage contains more thorough information, but archives have a special legal status under GDPR which protects them. GitHub’s legal team has reviewed the Archive Program.
Can I visit GitHub Arctic Code Vault at the Arctic World Archive?
Unfortunately, in order to maintain the security of the archive, the Arctic World Archive is unable to provide tours. Mine No. 3, in which the AWA is set, provides tours which can be booked by visiting visitsvalbard.com. We invite you to watch our video announcing the GitHub Arctic Code Vault for more information.
It is possible to tour a coal mine, Mine 3 in Longyearbyen, Svalbard; see visitsvalbard.com. When production was stopped in Mine 3 in November 1996, all the machinery and equipment were left behind in the facility. You can expect an authentic tour, walking 250m into one of the main tunnels of the mine and seeing the origins of the first seed bank in one of the side tunnels. It is not possible to visit the Global Seed Vault or the Arctic World Archive.
How is the Arctic World Archive protected against climate change, natural disasters, or outside access?
The Arctic World Archive is housed in a decommissioned mine 104 meters above sea level. It is 350 meters deep inside permafrost, built with an incline and decline within the mine tunnel such that water will naturally flow out of the mine. The archive chamber has been reinforced with steel rebar and does not require electricity to operate. The mine is still being maintained and secured by Store Norske in cooperation with the Norwegian Government and the Arctic World Archive. The mine itself has multiple levels of physical access control before getting to the archive, which has additional physical protections.
Why does GitHub need an archive program? Doesn’t GitHub have multiple data centers and protection against downtime and server crashes?
GitHub’s backup strategy is meant to protect your day-to-day work and code within GitHub. This includes multiple levels of data centers, redundancy, and backup systems to keep your data safe and secure. The goal of the GitHub Archive Program is to provide access for academic research and humanitarian interests in the state of public software, both in the near term via the warm layer and in the long term via the cold layer.
What is the “Tech Tree”? Why is the “Tech Tree” necessary?
The Tech Tree gives an overview of the archive, describes the structure of the archive, how to use it and extract projects, directories, files, and data formats, and how to use dependencies. It is essentially a quick start into software development and computing bundled with a user guide for the archive. It will also include elements of the Long Now Foundation’s “Manual for Civilization.”
To build the Tech Tree, we are working with open source community maintainers, archivists, archeologists, anthropologists, sociologists, and more from professional and academic backgrounds. We’re excited to share more about how you can get involved in January 2020.