Restoring Archived Files from Amazon Glacier

The Problem

A few weeks ago I needed to restore a large number of files from Amazon’s then-new S3 archive service (Glacier). These objects (Amazon’s word for files) are not standard Glacier objects; they are S3 objects that have been transitioned to long-term storage. You can read more about Glacier here.

The Other Problem

Normally you could restore these objects in the AWS console, but because the service was new, Amazon hadn’t considered the use case of restoring files in bulk across the many directories presented in the console. Yes, I know S3 doesn’t have directories, only objects. However, the AWS management console presents objects as if they were in folders: every slash in an object name is translated into a directory that can be traversed and scrolled through. The web interface is awkward and slow, and it returns to the top of the listing every time you navigate in or out of a folder, which made restoring these files by hand a multi-month proposition. The console was out of the question, so that meant writing a script against the Amazon Web Services API.

The Solution

I had two options for writing a script to restore these objects:

  1. Use the Java SDK provided by Amazon
  2. Write something myself, since the Amazon Ruby SDKs were out of date and the existing Ruby libraries had not yet been updated to work with S3 archived objects and didn’t offer a way to make an arbitrary POST request (that I could tell).

So I set out to use the standard S3 API v2 (which I was using in the app I was working on) to restore these objects by sending a signed POST request to object_name?restore for each object. Below is the script I cranked out during a sleepless state of delirium.
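To make the signing step concrete, here is a minimal sketch of how the Authorization header for one restore request can be built. It follows the same string-to-sign layout as the full script in this post; the helper name restore_authorization, the bucket, the key, and the credentials are placeholders chosen for illustration, not anything Amazon defines.

```ruby
require 'base64'
require 'digest/md5'
require 'openssl'
require 'time'

# Build the Authorization header value for a POST /<key>?restore request.
# All arguments are supplied by the caller; nothing here talks to the network.
def restore_authorization(bucket, key, access_key_id, secret_access_key, body, date)
  content_md5 = Digest::MD5.base64digest(body)
  # Verb, Content-MD5, Content-Type, Date, then the resource including the bucket
  string_to_sign = [
    'POST',
    content_md5,
    'text/xml',
    date,
    "/#{bucket}/#{key}?restore"
  ].join("\n")
  hmac = OpenSSL::HMAC.digest(OpenSSL::Digest.new('sha1'), secret_access_key, string_to_sign)
  "AWS #{access_key_id}:#{Base64.strict_encode64(hmac)}"
end

# Example call with made-up values
body = "<RestoreRequest>\n  <Days>7</Days>\n</RestoreRequest>"
puts restore_authorization('example-bucket', 'sub/folder/object_1.ext',
                           'EXAMPLE_KEY_ID', 'example-secret', body, Time.now.httpdate)
```

The same Date and Content-MD5 values that go into the string to sign have to be sent as request headers, otherwise the signature will not match.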

Usage:

Create a file called files_to_restore.txt in the same directory as the glacier_restore.rb file shown in the gist below. List the objects to restore, one per line, with no leading slash and no bucket name:

sub/folder/object_1.ext
another/sub/folder/object_2.ext
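If your list of objects comes from another tool, it may need cleaning up before it matches this format. The following is a rough sketch, not part of the original script: normalize_keys is a hypothetical helper, and it assumes that a leading "bucket-name/" segment is the bucket rather than a real folder in your object names.

```ruby
# Normalize a raw listing: drop blank lines, remove leading slashes,
# and remove a leading "<bucket>/" segment if one is present.
def normalize_keys(lines, bucket)
  lines.map(&:chomp).reject(&:empty?).map do |line|
    key = line.sub(%r{\A/+}, '')
    key.start_with?("#{bucket}/") ? key[(bucket.length + 1)..-1] : key
  end
end

# Example with made-up input lines
raw = ["/my-bucket/sub/folder/object_1.ext\n", "another/sub/folder/object_2.ext\n", "\n"]
puts normalize_keys(raw, 'my-bucket')
```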

Run:

ruby glacier_restore.rb > out.txt 2>err.txt

or, to resume a stopped job:

ruby glacier_restore.rb >> out.txt 2>>err.txt
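Note that the script always reads the whole of files_to_restore.txt, so a resumed run sends the requests for already-handled objects again. If you would rather skip them, something like the sketch below could trim the list first. It assumes out.txt holds one handled object name per line, which is what the script prints to stdout; remaining_keys and remaining.txt are hypothetical names.

```ruby
# Return the object names from `requested` that do not appear in `done`,
# preserving the original order.
def remaining_keys(requested, done)
  finished = done.map(&:chomp).reject(&:empty?)
  requested.map(&:chomp).reject(&:empty?) - finished
end

# Only touch the file system when both input files are present
if File.exist?('files_to_restore.txt') && File.exist?('out.txt')
  todo = remaining_keys(File.readlines('files_to_restore.txt'), File.readlines('out.txt'))
  File.write('remaining.txt', todo.map { |key| "#{key}\n" }.join)
end
```

You would then copy remaining.txt over files_to_restore.txt before running the resume command.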

tl;dr

This script can be used to restore Amazon S3 objects that were archived using the lifecycle feature of Amazon S3.

Disclaimer

This is an ugly piece of code that I found useful for a short period of time. Hence I do not intend to maintain this script. I share it only in the hopes that restoring files from S3 archive is less painful for you than it was for me. I used this script to restore tens of thousands of objects before Amazon was able to step in and take over with a script that they had written themselves. Best of luck!

License

Copyright © 2012 Sascha Winter

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Further Reading:

See the RESTObjectPOSTrestore and restoring-objects pages in the Amazon S3 documentation for more info.

#!/usr/bin/env ruby
require 'base64'
require 'openssl'
require 'digest/md5'
require 'net/http'
require 'uri'
require 'time'

DEBUG = false

class GlacierRestore
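  def initialize
    # Replace every [[...]] placeholder with your own values before running.
    @request_params = '?restore'
    @bucket = '[[bucket name]]'
    @host = "#{@bucket}.s3.amazonaws.com"
    @access_key_id = '[[access key id]]'
    @secret_access_key = '[[your secret access key]]'
    # [[num_days]] is the number of days sent in the restore request.
    @post_body = "<RestoreRequest>\n  <Days>[[num_days]]</Days>\n</RestoreRequest>"
    # Sent as the Content-MD5 header and included in the string to sign.
    @md5 = Digest::MD5.base64digest(@post_body)
    @content_type = 'text/xml'
  end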

  def restore files
    files.each do |file|
      restore_file file.chomp
    end
  end

  def restore_file file_name
    @date = Time.now.httpdate
    @canonicalized_resource = "/#{file_name}#{@request_params}"
    string_to_sign = "POST\n#{@md5}\n#{@content_type}\n#{@date}\n/#{@bucket}#{@canonicalized_resource}"
    @signature = signature string_to_sign

    uri = URI.parse("http://#{@host}")
    http = Net::HTTP.new(uri.host, uri.port)
    request = Net::HTTP::Post.new("http://#{@host}#{@canonicalized_resource}")
    request.add_field('Host', @host)
    request.add_field('Authorization', "AWS #{@access_key_id}:#{@signature}")
    request.add_field('Content-Type', @content_type)
    request.add_field('Content-Length', @post_body.bytesize.to_s)
    request.add_field('Date', @date)
    request.add_field('Content-MD5', @md5)
    request.body = @post_body
    response = http.request(request)

    divider = "\n===========\n"
    if DEBUG
      $stderr.puts "Headers:\n "
      $stderr.puts "#{request.to_hash}"
      $stderr.puts divider
      $stderr.puts "Date:\n'#{@date}'"
      $stderr.puts divider
      $stderr.puts "Object name:\n'#{file_name}'"
      $stderr.puts divider
      $stderr.puts "Canonicalized Resource:\n'#{@canonicalized_resource}'"
      $stderr.puts divider
      $stderr.puts "String to sign:\n'#{string_to_sign}'"
      $stderr.puts divider
      $stderr.puts "POST body:\n'#{@post_body}'"
      $stderr.puts divider
      $stderr.puts "Response body:\n'#{response.body}'"
      $stderr.puts divider
      $stderr.puts "Response message:\n'#{response.message}'"
      $stderr.puts divider
      $stderr.puts "Response code:\n'#{response.code}'"
      $stderr.puts divider
      $stderr.puts "Signature:\n'#{@signature}'"
    else
      $stderr.puts "#{response.code} #{response.message} #{response.body.gsub(/<.*?>/, ' ').gsub(/ +/, ' ')}"
    end
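    # Treat 200, 202, and 409 as handled and echo the object name to stdout,
    # so that out.txt ends up listing every object that needs no further work.
    if [200, 202, 409].include?(response.code.to_i)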
      $stdout.puts file_name
    end
  end

  # Base64-encoded HMAC-SHA1 of the string to sign, keyed with the secret access key
  def signature string_to_sign
    Base64.strict_encode64(
      OpenSSL::HMAC.digest(
        OpenSSL::Digest.new('sha1'),
        @secret_access_key, string_to_sign)
    )
  end
end

if __FILE__ == $0
  glacier = GlacierRestore.new
  files = File.read('files_to_restore.txt').each_line.to_a
  $stderr.puts "Restoring #{files.count} files"
  #if you only want to parse some of the lines
  #files = files[1..10]
  glacier.restore(files)
end