Notes on @Scale Conference 2017

It's been a longtime goal of mine to attend an @Scale conference, and I finally got the opportunity to go this past August. I had an incredible time meeting extremely smart people and learning about the technologies and practices being used at a massive scale at companies like Google, Facebook, and Amazon. You can find my notes below on the talks I attended.


Overview

  • Data @ Scale: LogDevice, A file structured log system (Facebook)
  • ML @ Scale: GPUs and Deep Learning (NVIDIA)
  • Data @ Scale: Serverless at Scale with DynamoDB and Lambda (Amazon)
  • ML @ Scale: Trends and Developments in Deep Learning (Google)

Data @ Scale: LogDevice, A file structured log system

by Mark Marchukov, Facebook

Summary

Facebook's Mark Marchukov will talk about Facebook's work to build a durable and highly available sequential distributed storage system that can handle hardware failures and sustain consistent delivery at exceptionally high ingest rates.

LogDevice

  • A distributed data store designed for logs
  • Storing and serving logs is a storage problem
  • Log is a simpler abstraction than file
    • Simple and high quality of service
  • What is a log?
    • A systematic recording of events, observations, or measurements
  • The log abstraction allows capturing records durably
  • To impose an order on some sequence of events
    • wide applicability
  • Logs are ubiquitous
    • Replication logs, write-ahead logs, transaction logs
  • Logs are like files
    • Record-oriented, append-only, trimmable files
    • The only way to write records into log is by appending
    • Once appended, the log is immutable
    • Records are assigned log sequence numbers when appended
    • Monotonically increasing
  • Read by Log Sequence Number (LSN)
  • Drop oldest records
    • Space occupied by them can be reused
  • Target use cases at Facebook
    • Reliable communication between services
    • Services need to exchange data
      • Owners of services don’t need to solve various reliability issues
    • Write-ahead logging
      • Durability
    • State Replication
      • Maintaining replicas of a large distributed data store or index of data store
    • Journaling of actions for later execution
      • Take actions and write them to a log in sequence
      • Read later and execute

Use Case: Graph Index

  • State replication use case
  • Maintaining secondary indexes on a big distributed data store
  • DB updates → Index Update service → Transform, re-shard → LogDevice → {Index 1, Index 2, Index 3}
  • Each index can read the log at its own pace and quickly apply updates
  • Can recover by copying from another replica
  • < 10ms
  • All updates must be delivered within single-digit milliseconds
  • Once a user takes an action on the site and then queries the index, that action must already be reflected
  • Low end-to-end delivery latency

Use Case: ScribeX

  • Reliable and durable communication
  • Latest generation of Facebook’s Scribe service
    • A reliable fire and forget communication service that links most of Facebook's other services
    • Fire and forget API
    • Highly available service
      • Expect messages to be delivered
  • Producers write into categories
  • Consumers read from categories
  • Most important consumer is FB’s data warehouse
    • Stores various user-generated events
  • No data loss
  • Scribed: Local daemon that runs where producers run
    • Accepts messages even if the box is disconnected from the rest of the pipeline
  • ScribeX routers: Map categories to logs
    • Pass messages to LogDevice writers
  • ztail reads these messages as records
    • uncompresses, batches, delivers to consumers
  • > 1 TBps of data

  • Performance requirements
    • Communication use cases have high throughput
    • State replication has low E2E latency requirements
    • Write-ahead logging needs low append latency
    • Everyone wants highly reliable delivery
  • LogDevice is designed to be tunable per use case
  • Things common among use cases
    • High write availability
    • Durability
    • Ability to replay a day or more worth of backlog
  • If they ever lose their state because of operator error, want to rebuild it
    • Can rebuild by getting snapshot and applying historical data from logs

  • Consistency Requirements
    • Want repeatable reads
    • Deliver records in the order of their sequence number
    • No ordering requirement or guarantees across logs

  • Workload Expectations
    • Sequential writing and reading per log
    • Expect sequential reading from logs
    • Jumping around within a log is not a typical use case
    • Often spikes in the write rate on individual logs
      • 10X-100X spikes in the write rate on the individual logs
    • Many concurrent writers per log
      • Not exclusive
    • Massive occasional backlog reading
    • LogDevice designed to handle billions of logs per cluster
      • In practice, thousands to hundreds of thousands for a single use case
      • Not one, not billions: those are very different problems

How it's delivered at TBps scale

  • Logs are simpler than files
    • Can use this to our advantage
  • POSIX files are write anywhere
  • Logs are append only
  • POSIX files: contiguous byte numbering
  • Logs: can have gaps between record numbers
  • POSIX file: space reclaimed by deleting files
  • Logs: Space reclaimed by trimming away old records
    • Often by a retention-policy that is time-based and space-based
    • Can delete a lot of data in big batch
  • POSIX files: designed for random access
  • Logs: Assume sequential reading and writing

LogDevice API

  • append(log id, payload) → log sequence number (LSN)
  • startReading(log id, from LSN, until LSN = MAX)
    • get client and call this method to start reading
  • read(count, records_out, gap_out)
    • gaps are exceptional conditions (e.g. data loss, reading records before the trim point of the log)
    • can call this multiple times on many different logs
    • transient subscriptions
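
To make the semantics above concrete, here is a minimal, runnable in-memory sketch of that API surface (append returns an LSN, reads go by LSN, trimmed ranges show up as gaps). It is an illustration only, not Facebook's actual client library, and the class and method names are my own.

```python
# In-memory mock of the LogDevice API semantics described above.
class InMemoryLog:
    def __init__(self):
        self.records = {}        # LSN -> payload
        self.next_lsn = 1
        self.trim_point = 0      # LSNs <= trim_point have been reclaimed

    def append(self, payload):
        """Append-only: returns a monotonically increasing LSN."""
        lsn = self.next_lsn
        self.records[lsn] = payload
        self.next_lsn += 1
        return lsn

    def trim(self, up_to_lsn):
        """Reclaim space in bulk by dropping the oldest records."""
        for lsn in list(self.records):
            if lsn <= up_to_lsn:
                del self.records[lsn]
        self.trim_point = max(self.trim_point, up_to_lsn)

    def read(self, from_lsn, count):
        """Deliver records in LSN order; report trimmed ranges as gaps."""
        out, gaps = [], []
        lsn = from_lsn
        while len(out) < count and lsn < self.next_lsn:
            if lsn in self.records:
                out.append((lsn, self.records[lsn]))
            elif lsn <= self.trim_point:
                gaps.append(("TRIM", lsn))
            lsn += 1
        return out, gaps

log = InMemoryLog()
first = log.append(b"foo")                 # e.g. append "foo" to log 1001
second = log.append(b"bar")
log.trim(first)                            # drop the oldest record in bulk
records, gaps = log.read(from_lsn=1, count=10)
print(records)                             # [(2, b"bar")]
print(gaps)                                # [("TRIM", 1)]
```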

  • Design Principles
    • Separate sequencing and storage
    • Maximize placement options
    • Fully distributed metadata
      • Generally one of the hardest problems in storage
      • Want to avoid SPOF in metadata management
  • Write/read path of LogDevice
    • Writer sends a new append request, e.g. append “foo” to log 1001
    • The client picks a box in the cluster and sends the append request to it

The Sequencer

  • Job is to issue sequence number for records
  • One sequencer per log
  • Next sequence number to assign (e.g. <3:5>)
    • Write availability optimization
    • Want to issue sequence numbers that were never used before
    • A quick way to achieve this is by bumping up the first number (the epoch)
    • Use ZooKeeper as a persistent counter service for every log
    • ZooKeeper guarantees the number never regresses
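
A hedged sketch of the idea: the LSN is an (epoch, offset) pair, and the epoch comes from a persistent counter that is bumped on every sequencer bring-up so old numbers are never reused. The PersistentCounter below is a stand-in for the ZooKeeper counter, and all names are mine.

```python
import itertools

class PersistentCounter:
    """Stand-in for the per-log ZooKeeper counter (never regresses)."""
    def __init__(self):
        self._value = 0
    def increment(self):
        self._value += 1
        return self._value

class Sequencer:
    def __init__(self, log_id, counter):
        self.log_id = log_id
        self.epoch = counter.increment()          # bump epoch on every bring-up
        self.offsets = itertools.count(1)
    def next_lsn(self):
        return (self.epoch, next(self.offsets))   # e.g. <3:5>

counter = PersistentCounter()
seq = Sequencer(1001, counter)
print(seq.next_lsn())                 # (1, 1)
seq2 = Sequencer(1001, counter)       # after failover: new epoch, no reuse
print(seq2.next_lsn())                # (2, 1)
```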

The Appender

  • An appender object is created
  • Appender gets sequence number and increments
  • If another writer wants to append, then another Appender object is created for it
  • Non-deterministic copy placement
    • Record copies are stored on randomly selected nodes, subject to constraints

Record Format

  • Log ID | LSN | Copyset | Timestamp | Payload
  • Copyset: List of nodes where copies of records will be stored
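
As a sketch, the record fields listed above could be modeled like this (the real serialization format is internal to LogDevice; this is only for illustration):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Record:
    log_id: int                # which log the record belongs to
    lsn: Tuple[int, int]       # (epoch, offset) sequence number
    copyset: List[str]         # nodes where copies of the record are stored
    timestamp_ms: int          # append time
    payload: bytes             # opaque application data

rec = Record(log_id=1001, lsn=(3, 5), copyset=["N1", "N7", "N12"],
             timestamp_ms=1503612345678, payload=b"foo")
```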

The nodeset of a log

  • The set of nodes on which records of that log may be placed
  • Nodesets for different logs may overlap
  • Nodeset can be altered (e.g. expanded) at any time

Copysets

  • Selected at random
  • Within the nodeset of the log
  • Selected across multiple hierarchical failure domains
    • The failure of any data center will not damage r/w availability of the log
  • On failure or timeout
    • Copyset is reselected
    • Upon success, LSN is reported to writer
  • When all the nodes acknowledge a successful store, success status reported back to the writer, along with LSN assigned to record
  • When a reader wants to read record of log starting from LSN
    • Contacts entire nodeset of the log
    • Nodeset is stored in a metadata log
      • Metadata log not changed often
  • LogDevice client library puts records in order before delivering them to application
  • Data loss reporting
    • A “for-free” feature
    • Made possible by a relatively dense numbering space
    • If record is missing in the sequence, is reported as data loss to the application (to the gap_out parameter)
  • Efficiency Features
    • Client and server-side batching and zstd compression
    • Server-side copyset filtering for single-copy delivery
      • Avoids delivering redundant copies of the same record by filtering on the storage nodes
      • Storage nodes reach consensus so that only one of them delivers each record
    • Copyset stickiness
      • Don’t change copyset until it has to (because of failure or certain number of bytes stored)
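
A small sketch of the placement idea above, assuming racks as the failure domain and a replication factor of 3; the constraint check is simplified compared to the hierarchical domains described in the talk, and all names are illustrative.

```python
import random

def pick_copyset(nodeset, rack_of, r=3, max_tries=100):
    """nodeset: list of node ids; rack_of: node id -> failure domain."""
    for _ in range(max_tries):
        copyset = random.sample(nodeset, r)               # non-deterministic placement
        if len({rack_of[n] for n in copyset}) > 1:        # spans more than one domain
            return copyset
    raise RuntimeError("could not satisfy placement constraints")

nodeset = ["N1", "N2", "N3", "N4", "N5", "N6"]
rack_of = {"N1": "rackA", "N2": "rackA", "N3": "rackB",
           "N4": "rackB", "N5": "rackC", "N6": "rackC"}
print(pick_copyset(nodeset, rack_of))
```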

LogsDB

  • A write-optimized local log store based on RocksDB (Facebook’s KV data store)
  • Local log store use cases
  • High IO efficiency with 100K-1M logs
  • Mostly sequential IO patterns on HDD
  • Low read-and-write amplification
    • Not repeating same writes over and over again
  • Why write-optimized datastore?
    • Many logs are read immediately after being written
    • If we optimized for reads instead, this may not keep up
    • Acceptable as long as read/write amplification is kept in check
  • LogsDB is a time-partitioned collection of RocksDB databases with a directory on top (see the sketch after this list)
    • Records stored as KV pairs
    • New records are inserted into an ordered memtable
      • The most recent RocksDB instance
      • An in-memory ordered data structure
    • Takes the in-memory data structure and converts it into a file to persist it
      • Frees the memory so the process can repeat
    • When time comes to read, each of the files contains records of many logs, all sorted
    • When read, need to merge the files (i.e. merge sort)
    • Compaction
      • Take files and mergesort and get one file
    • LogsDB uses partitioning
      • Once an instance fills up, it is left alone and a new one is created
      • End up with several RocksDB instances, uncompacted
      • Easy to tell which partition to read from
  • Space reclamation in LogsDB
    • Done in bulk by dropping oldest partition
    • If there is one retention policy across logs, this is very easy
  • Failure Handling
    • In big distributed clusters, nodes will fail
    • A gossip protocol
      • Detection time of a few seconds
    • Write availability is restored within a second after failure detection
  • Upon bring-up, a new sequencer for a log performs log recovery
  • Storage node failures trigger rebuilding
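
Here is the promised sketch of the LogsDB layout: time-ordered partitions (RocksDB instances in the real system, plain sorted lists here) keyed by (log id, LSN), with reads merge-sorting across partitions and space reclaimed by dropping the oldest partition. Everything below is illustrative, not LogDevice code.

```python
import heapq

class LogsDBSketch:
    def __init__(self, partition_size=1000):
        self.partitions = [[]]          # each partition: sorted (log_id, lsn, payload)
        self.partition_size = partition_size

    def put(self, log_id, lsn, payload):
        if len(self.partitions[-1]) >= self.partition_size:
            self.partitions.append([])  # full partition is left alone; start a new one
        # writes land in the newest partition; keep it ordered
        self.partitions[-1].append((log_id, lsn, payload))
        self.partitions[-1].sort()

    def read(self, log_id, from_lsn):
        # each partition holds records of many logs, sorted; merge-sort across them
        per_partition = ([r for r in p if r[0] == log_id and r[1] >= from_lsn]
                         for p in self.partitions)
        return list(heapq.merge(*per_partition))

    def drop_oldest_partition(self):
        if len(self.partitions) > 1:
            self.partitions.pop(0)      # bulk space reclamation
```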

Rebuilding

  • Restoring the replication factor of records after a node or drive failure
  • Durability
    • If you lose some copy of the record, need to quickly re-replicate
    • So we don’t end up with 0 copies
  • All storage nodes in the cluster participate in rebuilding
    • all send and all receive
    • In a short time, copies get re-replicated
  • Many-to-many rebuilding
    • Quickly fixing under-replication in big store clusters
    • Not limited by the capacity of a single node
    • Rebuilds at 5-10 GBps
    • Fully distributed
    • Coordinated by a replicated state machine (RSM) running on a special internal log: the event log
  • Disaggregated Clusters
    • Attempt to reduce total cost by running on different cluster types
    • Sequencers and appenders run on compute-heavy (CPU) machines, separate from storage nodes
  • Very high volume logs
    • Through the separation of appenders and sequencers
    • LogDevice architecture allows achieving practically unlimited reads/writes on a single log
    • The only thing that has to run on a single box is the sequencer
    • Everything else can run anywhere in the cluster
  • Can use CPU/network of entire cluster to orchestrate events

Server-side Filtering

  • Flexible partitioning of the record stream
  • Writer sets the key
  • Reader subscribes based on a key
  • Only the records with the key that the reader is interested in are delivered
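
In essence (a trivial sketch, not the actual implementation): the writer attaches a key to each record and the storage node delivers only records matching the key a reader registered, instead of shipping everything to the client.

```python
def filter_records(records, wanted_key):
    """records: iterable of (key, payload); deliver only matching ones."""
    return [(k, p) for k, p in records if k == wanted_key]

records = [("us-west", b"a"), ("eu", b"b"), ("us-west", b"c")]
print(filter_records(records, "us-west"))   # [("us-west", b"a"), ("us-west", b"c")]
```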



Machine Learning @ Scale: GPUs and Deep Learning

by Robert Ober, Chief Platform Architect, NVIDIA

Summary

In the last year, GPUs plus deep learning have gone from a hot topic to large-scale production deployment in major data centers. That's because deep learning works, and the evolution of GPUs has made them a great fit for deep learning and inference. Neural nets, frameworks, and GPU architectures have changed significantly in the last year as well, allowing better solutions to be created more quickly and in more places, moving from niche applications to the mainstream. It also allows them to be used in real time for more industrial automation and human interaction roles. We talk about GPU architecture and framework evolution, scaling out and scaling up training and performance, real-time inference improvements, security plus VM isolation and management, and overall deep learning flow improvements to make development and deployment more DevOps-friendly.

  • Deep Learning affecting security, IoT, industrial automation
  • Inference
    • Put into production
  • In production, training is a process
    • Take a neural network, do training
  • Optimized for CUDA stack, built on top of GPUs
  • GPUs have grown with the advancement
  • In production, a lot of containerization
    • Makes it very easy to put into production

Deep Learning Process

  • Take one input
  • Do an inference on it through the NN
  • Compare the prediction/inference to what the result should be
  • Take the delta and back-propagate through the network, and repeat
  • The real pivot to production is being able to deploy a deep NN efficiently
  • TensorRT is able to take the output and do compaction, optimization, and deployment
    • Results in very high performance and low latency
    • Saves money, because able to do so many more inferences so quickly
  • Daily process of training every night, deploying a new neural net each day
  • “AI factory business” as an industry
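
A toy numpy sketch of that loop for a single sigmoid neuron: run the input forward, compare the prediction to the label, back-propagate the delta, repeat. It only shows the shape of the process, not a real network.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))                       # inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float)           # what the result should be
w, b = np.zeros(4), 0.0
lr = 0.5

for epoch in range(100):
    z = X @ w + b
    pred = 1.0 / (1.0 + np.exp(-z))                 # inference / forward pass
    delta = pred - y                                # compare prediction to ground truth
    w -= lr * (X.T @ delta) / len(y)                # back-propagate the delta
    b -= lr * delta.mean()

print("training accuracy:", ((pred > 0.5) == y).mean())
```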

  • Top-1 Accuracy
    • First best prediction out of a neural net
  • In general, networks are getting bigger
    • Gives better accuracy
    • Drive for more accuracy and more precision in neural networks
  • Users want precision and accuracy
  • 50,000,000 parameter network
    • Inception ResNet V2
  • The compute behind each one is matrix multiplies
    • 40 giga-ops per inference
    • Multiply by millions of training runs → a very large number (see the back-of-envelope calculation after this list)
  • 4 billion parameters
  • Jeff Dean from Google:
    • 8-billion-parameter networks
  • Some publications describe networks with 140 billion parameters
  • Huge growth in computational consumption
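
The back-of-envelope calculation, with the 40 giga-ops figure from the talk; the dataset size, epoch count, and backward-pass factor are assumptions of mine just to show the scale.

```python
gops_per_inference = 40e9     # ~40 giga-ops per forward pass (from the talk)
training_examples  = 1.2e6    # e.g. an ImageNet-sized dataset (assumption)
epochs             = 90       # typical training schedule (assumption)
backward_factor    = 3        # backward pass roughly 2x the forward pass (assumption)

total_ops = gops_per_inference * training_examples * epochs * backward_factor
print(f"~{total_ops:.1e} ops for one training run")   # ~1.3e19 ops
```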

GPUs: Architected for Deep Learning

  • Volta Architecture: Tesla V100
  • Builds on the designs of the Pascal generation
  • Deep Learning is about 5 years old as a discipline
    • Incredibly young
  • Volta improved specifically for Deep Learning
  • State of the Art 2016: Matrix Math
    • Convolution:
      • 155M Parameters
      • Neurons
      • Layers
      • Iterations
      • 40 giga-ops (GOPs) per inference

  • Matrix Multiplication
    • Every feature, shader, attribute, etc.
  • Neural nets have multi-dimensional very large matrices
  • Tensor Cores:
    • Compressed models, compound operations → bigger batch, faster execution, lower power
    • Unbelievable increase in throughput
  • Double precision, integer precision, tensor core
  • Pool multiple GPUs and allow them to act as one GPU for Deep Learning
  • 21 billion transistors, 815 sq mm.
    • Largest chip you can make
  • Most large chips are mostly memory, but this one is mostly compute
    • Most powerful chip ever made
  • 80 SM, 5120 CUDA Cores, 640 Tensor Cores
  • 16 GB HBM2, 900 GB/s HBM2, 300 GB/s NVLink
  • Enables bigger, better, compute intense networks
  • 5x performance improvement over the last year

GPU Inference in Production

  • Real-time interaction, high throughput of inferences, and saving money

Deep Learning Inference Use Cases: Real Time

  • Ad placement, search, recommendation systems, speech, NLP, Smart Cities, IoT and industrial automation, inappropriate content detection
  • Smart cities
    • Many video cameras in China, not enough people to watch them all
  • All use cases come back to making money
  • The de facto benchmark is ResNet-50

Data Center GPUs

  • V100 DGX-1, HGX-1, V100, V100 FHHL, P4

REDUX

  • Research → Production
  • Bigger, complex, better, computationally intense
  • Volta = Architected for DL Efficiency + Performance: SW + HW
  • Performance → Bigger Networks
  • Inference = $ : but need faster, more compute
    • Where revenue is generated
    • Need fast and low latency
  • Platform + HGX-1 Reference → Multiple Styles / Use Cases



Data @ Scale: Serverless at Scale with Amazon DynamoDB and Lambda

by Krishnan Seshadrinathan, Engineering Leader, Amazon

Summary

DynamoDB is a fully managed NoSQL database service that provides high throughput at low latency with seamless scalability. The service is the backbone for many Internet applications, handling trillions of requests daily. The scale of data that applications have to manage continues to grow rapidly, making it a challenge to manage systems and respond to events in real time. This talk will be a deep dive into the challenges of building the DynamoDB Streams feature, which provides a time-ordered sequence of item changes on DynamoDB tables, and leveraging it with AWS Lambda to reimagine large-scale applications for the cloud.

Serverless Computing

  • Build and run applications without thinking about servers
  • Developers shouldn’t need to provision and manage servers
  • The cloud provided a higher level of abstraction
  • Even with cloud infrastructure, still need to think about how to scale
  • Serverless Ecosystem
    • Abstracts underlying infrastructure
    • AWS Lambda, Amazon S3, Amazon DynamoDB
  • AWS Lambda lets you run code without provisioning or managing servers
    • Scaling happens on a per request basis
    • Monitors requests and continually scales up and down
    • Infrastructure deals with fault tolerance
    • Only pay for what you consume

Run Code in Response to Events

  • Lambda provides neat and simple mechanism to process events
  • Decouples large systems, can write in a much simpler way
  • Many AWS sources now configured that can trigger events for you
    • e.g. S3 can trigger if something happens
  • Lambda functions are small, self-contained pieces of code
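
A hedged sketch of such a function, using the standard Python Lambda handler signature and the S3 event shape; process_object is a hypothetical stand-in for real work.

```python
import urllib.parse

def process_object(bucket, key):
    print(f"would process s3://{bucket}/{key}")    # stand-in for real work

def lambda_handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # react to the event, e.g. resize an image or index a document
        process_object(bucket, key)
    return {"processed": len(event.get("Records", []))}
```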

DynamoDB

  • Fully managed NoSQL Database
  • High durability and high throughput
  • Scale seamlessly
  • Single digit millisecond latency

Amazon’s Path to DynamoDB

  • Reflected in many startups’ evolution toward DynamoDB
  • Historically, thousands of services were using relational databases
  • DynamoDB was built and first used by internal services at Amazon

DynamoDB

  • A managed service, AWS takes care of hard plumbing
  • Building a distributed system is hard
    • Infrastructure fails
    • No bounds on message transmission
    • Messages lost, duplicated, reordered…
  • Low latency serving 1 trillion+ request/day
  • Materialized Views
    • Use Case: Load DynamoDB data into Redshift or Elasticsearch
  • Cross Region Replication
    • Naive, dual-write approach
      • Increased latency due to dual-write
      • Maintaining correctness is hard
        • Write to first database and client crashes
        • State of two databases are inconsistent now
    • Queue approach
      • Queue writes up and propagate into two databases
      • Challenge #1: if you write to the queue and then serve a read, you must look through both the queue and the DB
      • Challenge #2: Can no longer support conditional puts
      • Challenge #3: As you scale, common to have 100K writes per second
        • Must now partition the queue → must also partition the system

  • Key Requirements
    • Provide a changelog of updates to DynamoDB table
    • Tenets
      • High availability and scalability
      • Durability
      • Ordered and de-duplicated
      • No impact to DynamoDB table performance

DynamoDB Streams

  • Launched 2015
  • Provides an ordered changelog for DynamoDB table
  • Highly available and supports high throughput (100K/sec)
  • Replicated across multiple AZs and available for 24 hours
  • Integrated natively with Lambda

Use Case: Real World Application

  • Emergency response application
  • User sends message: “help”
  • Real-time aggregation by zipcode
  • Scale to 100x traffic during emergencies
  • Don’t overpay for peak
  • Traditional Design
    • Application servers
    • Database servers
    • Significant portion of effort is spent on scaling
    • Challenges
      • Real Time aggregation
      • Scaling
      • Time spent on infrastructure vs. focus on outcome
  • Serverless architecture of Lambda naturally fits into microservices model
  • Turn on Streams for the table
    • Every time there is an API request, the Stream is invoked
    • Stream can sanitize, process the data
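
A hedged sketch of that wiring: a Lambda function subscribed to the table's DynamoDB Stream, bumping a per-zipcode counter for each new "help" message. The table and attribute names are assumptions for illustration; the stream record shape and boto3 calls are standard.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
counts = dynamodb.Table("HelpCountsByZipcode")     # hypothetical aggregation table

def lambda_handler(event, context):
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        new_image = record["dynamodb"]["NewImage"]
        if new_image.get("message", {}).get("S") != "help":
            continue
        zipcode = new_image["zipcode"]["S"]
        # real-time aggregation: atomically bump the per-zipcode counter
        counts.update_item(
            Key={"zipcode": zipcode},
            UpdateExpression="ADD message_count :one",
            ExpressionAttributeValues={":one": 1},
        )
```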

API Gateway

  • Create endpoints
  • Support a number of APIs
  • Define Resources and methods invoked on the resources



Machine Learning @ Scale: Trends and Developments in Deep Learning

by Rajat Monga, Engineering Director, Google Tensorflow

Summary

Deep learning is making a great impact across products at Google and in the world at large. As Google pushes the limits of AI and deep learning, research is underway in many areas. With integration into many Google products, this research is improving the lives of billions of people. Open source tools like TensorFlow and open publications put the latest deep learning research at the fingertips of engineers around the world. This talk begins by exploring what has enabled this field to evolve rapidly over the last few years. It also will cover some of the leading research advances and current trends that point to a promising future, and the algorithms that make it possible.

Deep Neural Network

  • Cat or Dog prediction neural network
  • Input layer → activated neurons → output layer
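
For illustration, a minimal tf.keras version of such a network; the layer sizes and the 64x64 input shape are my assumptions, not the model from the talk.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(64, 64, 3)),   # input layer (pixels)
    tf.keras.layers.Dense(128, activation="relu"),      # activated neurons
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),     # output layer: P(dog)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(images, labels, epochs=5)   # images: N x 64 x 64 x 3, labels: 0=cat, 1=dog
```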

Neural Networks

  • 80s and 90s: Very useful for small number of tasks
  • Over last few years, deep neural networks are starting to do better than other approaches

Need to build the right tools

  • What do you want in a ML system
    • Ease of expression
      • For lots of crazy ML ideas/algorithms
    • Scalability
      • Can run experiments quickly
    • Portability
      • Can run on wide variety of platforms
    • Reproducibility
      • Easy to share and reproduce research
    • Production Readiness
      • Go from research to real products

TensorFlow

  • Launched a year and a half ago
  • Goals
    • Establish common platform for ML
    • Best in the world for both research and production use
    • Bringing ML to everyone
  • Initial release in November 2015
  • Platforms supported
    • CPU, GPU, TPU, Cloud TPU, Android, iOS
  • Languages supported
    • Python, R, C++, Java, Go, Haskell

Experiment Turnaround Time and Research Productivity

  • Minutes, hours: Interactive research, instant gratification
  • 1-4 days: Tolerable, interactivity replaced by running many experiments in parallel
  • 1-4 weeks: High value experiments only, progress stalls
  • > 1 month: Not even tried

XLA: A TensorFlow Compiler

  • Generates code for CPUs, GPUs, TPUs
  • Optimized

Deep Learning at Google

Speech Recognition

  • Acoustic Input → Deep Recurrent Neural Network → Text output (“How cold is it outside?”)
  • Reduced word errors by more than 30%

Google Photos

  • Labels images in photo album (e.g. [glacier] in image)

Medical Imaging

  • Use similar model for detecting diabetic retinopathy
  • Models perform better than or on par with experts

Better Language Understanding

  • RankBrain: a deep neural network for search ranking

Smart Reply

  • Received emails get auto-suggested replies

Sequence-to-Sequence Model: Machine Translation

  • Encoder reads the source sentence; decoder generates the target sentence

Google Neural Machine Translation Model

  • One model replica: one machine with 8 GPUs

Image Captioning
Ranking

Memorization + Generalization

  • Wide + Deep: Generalization + memorizing exceptions
    • Combine the power of memorization and generalization (see the sketch after this list)
  • Most models are trained from scratch for a single task, need lots of data
  • Current Challenges
    • Data continues to be a challenge
    • Compute
    • Experts
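
Returning to Wide + Deep: TensorFlow ships this pattern as a canned estimator that combines a linear ("wide", memorization) part with a DNN ("deep", generalization) part. The feature names and sizes below are illustrative assumptions of mine, not a production configuration.

```python
import tensorflow as tf

user_id = tf.feature_column.categorical_column_with_hash_bucket("user_id", 100000)
item_id = tf.feature_column.categorical_column_with_hash_bucket("item_id", 100000)
age = tf.feature_column.numeric_column("age")

model = tf.estimator.DNNLinearCombinedClassifier(
    # wide part: memorize specific (user, item) co-occurrences
    linear_feature_columns=[tf.feature_column.crossed_column(
        ["user_id", "item_id"], hash_bucket_size=1000000)],
    # deep part: generalize through learned embeddings
    dnn_feature_columns=[
        tf.feature_column.embedding_column(user_id, dimension=16),
        tf.feature_column.embedding_column(item_id, dimension=16),
        age,
    ],
    dnn_hidden_units=[128, 64],
)
# model.train(input_fn=...)   # input_fn yields feature dicts and labels
```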

Multilingual Neural Machine Translation System

  • Able to translate between pairs it has never seen
  • Bigger Models, but sparsely activated
    • Human brain does not light up all the time
  • Per-Example Routing
    • For each example, can just use one part of network

Tensor Processing Unit

  • Focused on making predictions
  • Custom Google-designed chip for neural net computations

Cloud TPU

  • 2nd-gen TPU
  • Up to 180 teraflops
  • 64 GB of ultra-high bandwidth memory
  • Allows training and inference

Large-Scale Neural Machine Translation

  • 24 hours to train on 32 GPUs
  • 6 hours to train on 1/8th of a TPU pod

Neural Networks

  • Expected to continue to improve and be better than other approaches

Example Queries

  • "Describe this video in Spanish"
  • "Which of these eye images show symptoms of diabetic retinopathy?"

Summaries taken from the @Scale Conference 2017 page.



-Steven Dao