AWS Storage Option: S3

The AWS storage whitepaper is the best source on storage options.  I also recommend reading this post, which gives a quick overview of the storage options in AWS.

S3

Amazon S3 supports concurrent read and write access to data by many separate clients or application threads.  S3 uses a universal namespace; that is, bucket names must be unique globally.  You are charged for storage, requests, tags, cross-region replication, and Transfer Acceleration.  You can write, read, and delete objects containing from zero bytes to 5 TB of data.  

Data model: S3 is a simple key-value store.  The key is the name of the object; keys can be sorted lexicographically (for example, log files named by date sort chronologically).  The value is the data itself, a sequence of bytes.  For each object stored in a bucket, Amazon S3 maintains a set of system metadata, such as the object creation date, the configured storage class, server-side encryption status, and object size (content-length).  Data consistency:
  • Read-after-write consistency for PUTs of new objects – as soon as a new object is put, it is immediately consistent, i.e., it can be read right away
  • Eventual consistency for overwrite PUTs and DELETEs – changes are not immediately visible when an object is updated or deleted; they take some time to propagate
  • Updates are atomic – a read returns either the old or the new state of the object, never a partial or intermediate state
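
A minimal sketch of these guarantees using Python and boto3 (the bucket and key names are hypothetical):

    import boto3

    s3 = boto3.client('s3')
    bucket, key = 'my-example-bucket', 'logs/2019-01-01.log'  # hypothetical names

    # PUT of a new object: read-after-write consistency applies,
    # so the GET below can safely follow immediately.
    s3.put_object(Bucket=bucket, Key=key, Body=b'hello, s3')
    body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()

    # An overwrite PUT (or a DELETE) is only eventually consistent:
    # a GET shortly afterwards may still return the old contents.
    s3.put_object(Bucket=bucket, Key=key, Body=b'updated contents')
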
Amazon S3 offers a range of storage classes designed for different use cases including the following:

  • S3 Standard, for general-purpose storage of frequently accessed data. Data is stored across multiple devices and facilities within a region
  • S3 Standard-Infrequent Access (Standard-IA), for long-lived but less frequently accessed data
  • S3 One Zone-Infrequent Access (One Zone-IA). Unlike other Amazon S3 storage classes, which store data in a minimum of three Availability Zones (AZs), S3 One Zone-IA stores data in a single AZ.  Data will be lost if that Availability Zone is destroyed
  • Glacier, a secure, durable, and extremely low-cost storage class for data archiving.  Its data retrieval options (a retrieval sketch follows this list):
    • Standard – access your data within just a few hours (3–5 hours)
    • Bulk – Glacier’s lowest-cost retrieval option, enabling you to retrieve large amounts of data, even petabytes, inexpensively in a day. Bulk retrievals typically complete within 5–12 hours
    • Expedited – quickly access your data when occasional urgent requests for a subset of archives are required. For all but the largest archives (250 MB+), data accessed using Expedited retrievals is typically made available within 1–5 minutes
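
A sketch of how a retrieval tier is chosen in practice, using Python and boto3 (the bucket name, key, and retention days are hypothetical):

    import boto3

    s3 = boto3.client('s3')

    # Initiate a Bulk retrieval (lowest cost, typically 5-12 hours);
    # 'Standard' and 'Expedited' are the other Tier options.
    s3.restore_object(
        Bucket='my-archive-bucket',          # hypothetical
        Key='backups/2018/archive.tar.gz',   # hypothetical
        RestoreRequest={
            'Days': 7,  # keep the restored copy available for 7 days
            'GlacierJobParameters': {'Tier': 'Bulk'},
        },
    )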

Lifecycle Rules: 

Lifecycle rules define how Amazon S3 manages objects; for example: transition to Standard-IA after 30 days, transition to Glacier after 60 days, permanently delete after 90 days (a configuration sketch follows the list below).  
  • Can be used in conjunction with versioning (both current and previous versions)
  • Automate transition to tiered storage (S3-IA, Glacier)
  • Expire objects based on retention needs or clean up incomplete multipart uploads
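
The 30/60/90-day example above could be expressed roughly as the following boto3 sketch (the bucket name and prefix are hypothetical):

    import boto3

    s3 = boto3.client('s3')

    s3.put_bucket_lifecycle_configuration(
        Bucket='my-example-bucket',  # hypothetical
        LifecycleConfiguration={
            'Rules': [{
                'ID': 'tier-then-expire',
                'Filter': {'Prefix': 'logs/'},  # hypothetical prefix
                'Status': 'Enabled',
                'Transitions': [
                    {'Days': 30, 'StorageClass': 'STANDARD_IA'},
                    {'Days': 60, 'StorageClass': 'GLACIER'},
                ],
                'Expiration': {'Days': 90},
                # Clean up incomplete multipart uploads after a week.
                'AbortIncompleteMultipartUpload': {'DaysAfterInitiation': 7},
            }]
        },
    )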

Cost: 

  • Storage pricing
  • Request pricing – charged per request (PUT, COPY, POST, LIST, GET); rates vary by storage class
  • Storage Management pricing: S3 Inventory, Storage class Analysis, and Object Tagging
  • Data Transfer pricing: Transfer out from S3 to Internet and Transfer out from S3 to other AWS services
  • Transfer Acceleration pricing: Transfer IN to S3 from internet, Transfer OUT from S3 to internet, and Transfer between S3 and another AWS region
  • Cross Region Replication pricing

Storage Management

Four S3 features that will give you detailed insights into your storage and your access patterns:
  • S3 Analytics – Analyze the storage and retrieval patterns for your objects and use the results to choose the most appropriate storage class. You can inspect the results of the analysis within the S3 Console, or load them into your favourite BI tool and dive deep. 
  • S3 Object Tagging – The tags can be used to manage and control access, set up S3 Lifecycle policies, customize the S3 Analytics, and filter the CloudWatch metrics. 
  • S3 Inventory – With S3 Inventory, you can now arrange to receive daily or weekly inventory reports for any of your buckets. You can use a prefix to filter the report and you can choose to include optional fields such as size, storage class, and replication status. Reports can be sent to an S3 bucket in your account or (with proper permission settings) in another account. 
  • S3 CloudWatch Metrics – S3 can now publish storage, request, and data transfer metrics to CloudWatch. The storage metrics are reported daily and are available at no extra cost. The request and data transfer metrics are available at one-minute intervals and are billed at the standard CloudWatch rate. The metrics are available within the S3 and CloudWatch consoles.

To make it easier to take advantage of S3’s different storage classes without completely understanding your access patterns, AWS has launched S3 Intelligent-Tiering.  
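
Opting an object into Intelligent-Tiering is just a storage-class choice at write time; a minimal boto3 sketch (bucket and key names are hypothetical):

    import boto3

    s3 = boto3.client('s3')

    # S3 then moves the object between frequent- and infrequent-access
    # tiers automatically, based on observed access patterns.
    s3.put_object(
        Bucket='my-example-bucket',      # hypothetical
        Key='data/clickstream.parquet',  # hypothetical
        Body=b'...',
        StorageClass='INTELLIGENT_TIERING',
    )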

S3 logging

Server access logging provides detailed records for the requests that are made to a bucket. Server access logs are useful for many applications. For example, access log information can be useful in security and access audits. It can also help you learn about your customer base and understand your Amazon S3 bill.

CloudTrail, in contrast, is very API-focused.  CloudTrail log files contain one or more log entries; an event represents a single request from any source and includes information about the requested action, the date and time of the action, request parameters, and so on.  CloudTrail log files are not an ordered stack trace of the public API calls, so entries do not appear in any specific order.
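
Server access logging is enabled per bucket.  A minimal boto3 sketch, assuming a separate, already-existing target bucket that grants the S3 log delivery group write access (all names are hypothetical):

    import boto3

    s3 = boto3.client('s3')

    s3.put_bucket_logging(
        Bucket='my-example-bucket',  # hypothetical source bucket
        BucketLoggingStatus={
            'LoggingEnabled': {
                'TargetBucket': 'my-log-bucket',  # hypothetical target bucket
                'TargetPrefix': 'access-logs/my-example-bucket/',
            }
        },
    )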

Versioning

If versioning is enabled on a bucket, the bucket can hold two objects with the same key but different version IDs, such as photo.gif (version 111111) and photo.gif (version 121212).  On a simple delete of an object, Amazon S3 inserts a delete marker, which becomes the current object version.  The delete marker makes Amazon S3 behave as if the object had been deleted, but the object is not actually deleted.  To restore the object, delete the delete marker.  To permanently delete versioned objects, you must use DELETE Object versionId.
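
The delete-marker behaviour can be seen in a short boto3 sketch (bucket and key names are hypothetical):

    import boto3

    s3 = boto3.client('s3')
    bucket, key = 'my-example-bucket', 'photo.gif'  # hypothetical

    # Enable versioning on the bucket.
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={'Status': 'Enabled'},
    )

    # A simple DELETE inserts a delete marker; the object versions remain.
    marker = s3.delete_object(Bucket=bucket, Key=key)

    # "Restore" the object by deleting the delete marker itself.
    s3.delete_object(Bucket=bucket, Key=key, VersionId=marker['VersionId'])

    # Permanent deletion requires naming a specific version ID, e.g.:
    # s3.delete_object(Bucket=bucket, Key=key, VersionId='111111')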

Security

  • Manage access to Amazon S3 by writing access policies that grant other AWS accounts and users permission to perform resource operations
    • Resource-based policies:
      • ACL (each bucket and object has an ACL, an XML document, associated with it)
      • Bucket policy (expressed in JSON; see the sketch after this list)
    • User policies – create IAM users, groups, and roles and attach access policies to them, granting them access to Amazon S3
  • Protect Amazon S3 data at rest by using server-side encryption, in which you request Amazon S3 to encrypt an object before it is written to disks in its data centres and to decrypt it when you download the object
    • Amazon S3 managed keys (SSE-S3): each object is encrypted with a unique key employing strong multi-factor encryption. As an additional safeguard, S3 encrypts the key itself with a master key (envelope key) that it regularly rotates. Amazon S3 server-side encryption uses one of the strongest block ciphers available, 256-bit Advanced Encryption Standard (AES-256), to encrypt your data
    • AWS KMS managed keys (SSE-KMS) – similar to SSE-S3, with additional benefits: an audit trail of when the key is used, separate permissions for the envelope key, and the option for clients to manage keys themselves
    • Customer-provided keys (SSE-C) – clients manage the keys and AWS manages encryption and decryption
  • Use client-side encryption, in which data is encrypted by client and uploaded to S3
  • Protect the data in transit by using SSL
  • Optionally enable MFA Delete for a bucket (multi-factor authentication is then required to permanently delete object versions)
  • Enable access logging 
  • Use versioning to preserve, retrieve, and restore every version of every object
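
To make two of these points concrete, here is a hedged boto3 sketch: an upload with SSE-KMS requested, and a bucket policy (JSON, as noted above) that protects data in transit by denying non-SSL requests.  The bucket name, key, and KMS key alias are hypothetical:

    import json
    import boto3

    s3 = boto3.client('s3')
    bucket = 'my-example-bucket'  # hypothetical

    # Server-side encryption with an AWS KMS managed key (SSE-KMS).
    s3.put_object(
        Bucket=bucket,
        Key='secret/report.pdf',         # hypothetical
        Body=b'...',
        ServerSideEncryption='aws:kms',
        SSEKMSKeyId='alias/my-app-key',  # hypothetical key alias
    )

    # Bucket policy: deny any request that does not use SSL.
    policy = {
        'Version': '2012-10-17',
        'Statement': [{
            'Sid': 'DenyInsecureTransport',
            'Effect': 'Deny',
            'Principal': '*',
            'Action': 's3:*',
            'Resource': [f'arn:aws:s3:::{bucket}',
                         f'arn:aws:s3:::{bucket}/*'],
            'Condition': {'Bool': {'aws:SecureTransport': 'false'}},
        }],
    }
    s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))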

Performance

  • Choose a region: consider the bucket’s proximity to future clients (human users as well as AWS services). Access to S3 from EC2 in the same region is designed to be fast
  • Choose an object key: add randomness to the beginning of the key name. Keys are stored in lexicographic order, so random prefixes spread the request load across index partitions
  • Optimizing PUTs: to improve the upload performance of large objects (typically over 100 MB), Amazon S3 offers multipart upload, which uploads a single object as a set of parts in parallel. After all parts of the object are uploaded, Amazon S3 assembles the parts and creates the object (see the sketch after this list)
  • Optimizing GET:
    • Access S3 using multiple threads, multiple applications, or multiple clients concurrently
    • The most obvious optimization when reading objects from S3 is to put Amazon CloudFront in front of it
    • Amazon S3 supports the BitTorrent protocol so that developers can save costs when distributing content at high scale. The costs of client/server distribution increase linearly as the number of users downloading objects increases; BitTorrent addresses this problem by recruiting the very clients that are downloading the object as distributors themselves
  • Amazon S3 Transfer Acceleration enables fast, easy, and secure transfer of files over long distances between your client and your Amazon S3 bucket. It leverages Amazon CloudFront’s globally distributed edge locations to route traffic to your Amazon S3 bucket over an Amazon-optimized network path. To get started, first enable Transfer Acceleration on a bucket, then modify your Amazon S3 PUT and GET requests to use the s3-accelerate endpoint domain name (bucketname.s3-accelerate.amazonaws.com). Users across the world then upload to the edge location nearest to them instead of directly to the S3 bucket; Amazon has better communication links between its edge locations and its data centres (see the sketch after this list)
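
A sketch of the PUT optimizations above: boto3’s transfer manager performs multipart upload automatically above a size threshold, and Transfer Acceleration is a bucket setting plus an endpoint switch (file, bucket, and key names are hypothetical):

    import boto3
    from boto3.s3.transfer import TransferConfig
    from botocore.config import Config

    s3 = boto3.client('s3')
    bucket = 'my-example-bucket'  # hypothetical

    # Multipart upload: files over 100 MB are split and uploaded as
    # up to 8 parts in parallel, then assembled by S3.
    cfg = TransferConfig(multipart_threshold=100 * 1024 * 1024,
                         max_concurrency=8)
    s3.upload_file('backup.tar.gz', bucket, 'backups/backup.tar.gz',
                   Config=cfg)

    # Transfer Acceleration: enable it on the bucket, then send
    # requests through the s3-accelerate endpoint.
    s3.put_bucket_accelerate_configuration(
        Bucket=bucket,
        AccelerateConfiguration={'Status': 'Enabled'},
    )
    s3_accel = boto3.client(
        's3', config=Config(s3={'use_accelerate_endpoint': True}))
    s3_accel.upload_file('backup.tar.gz', bucket, 'backups/backup.tar.gz')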

CloudSearch

To speed up access to relevant data, many developers pair Amazon S3 with a search engine such as Amazon CloudSearch or a database such as Amazon DynamoDB or Amazon RDS. In these scenarios, Amazon S3 stores the actual information, and the search engine or database serves as the repository for associated metadata.
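
A hedged sketch of that pattern, storing the object itself in S3 and its queryable metadata in DynamoDB (the bucket, table, and attribute names are hypothetical):

    import boto3

    s3 = boto3.client('s3')
    table = boto3.resource('dynamodb').Table('media-metadata')  # hypothetical

    key = 'videos/cat.mp4'  # hypothetical
    s3.put_object(Bucket='my-media-bucket', Key=key, Body=b'...')

    # The database holds the searchable metadata; S3 holds the bytes.
    table.put_item(Item={
        'object_key': key,
        'title': 'cat video',
        'duration_seconds': 42,
    })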

Replication

Cross-region replication enables automatic, asynchronous copying of objects across buckets in different AWS regions.  You cannot replicate to multiple destination buckets or use daisy chaining.  To replicate objects to multiple destination buckets, or to destination buckets in the same region as the source bucket, customers must spin up custom compute resources to manage and execute the replication.  To help customers more proactively monitor the replication status of their Amazon S3 objects, AWS offers the Cross-Region Replication Monitor (CRR Monitor) solution.
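
Cross-region replication needs versioning enabled on both buckets and an IAM role that S3 can assume; a minimal boto3 sketch with hypothetical names:

    import boto3

    s3 = boto3.client('s3')

    # Source and destination buckets must both have versioning enabled.
    s3.put_bucket_replication(
        Bucket='my-source-bucket',  # hypothetical source bucket
        ReplicationConfiguration={
            'Role': 'arn:aws:iam::123456789012:role/s3-crr-role',  # hypothetical
            'Rules': [{
                'Prefix': '',  # replicate the whole bucket
                'Status': 'Enabled',
                'Destination': {
                    # hypothetical bucket in a different region
                    'Bucket': 'arn:aws:s3:::my-replica-bucket',
                },
            }],
        },
    )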

Use Cases

  • Store and distribute static content and media
  • Host entire static websites (see the sketch after this list)
  • Data store for computation and large scale analytics such as financial transactions, clickstream analytics, and media transcoding
  • Backup & Archival of data
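
For the static website use case above, hosting is a bucket-level configuration; a minimal boto3 sketch (the bucket and document names are hypothetical, and the objects must be publicly readable):

    import boto3

    s3 = boto3.client('s3')

    s3.put_bucket_website(
        Bucket='my-site-bucket',  # hypothetical
        WebsiteConfiguration={
            'IndexDocument': {'Suffix': 'index.html'},
            'ErrorDocument': {'Key': 'error.html'},
        },
    )
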
Amazon S3 doesn’t suit all storage situations, however. For some storage needs (for example, a file system or rapidly changing data) you should consider other AWS storage options. 

Read Next on EBS here
